There is a growing need to deploy machine learning across a wide array of tasks and hardware platforms. Such deployment scenarios pose multiple challenges, including identifying a model architecture that achieves suitable predictive accuracy, and optimizing the model (for example, with TVM) to satisfy hardware-specific systems constraints such as inference latency.
In this talk, I will describe our work on solving these two problems jointly, an approach we call direct search. I will introduce a novel direct search method called SONAR, which interleaves the two search processes to jointly optimize for predictive accuracy and inference latency. To keep the direct search efficient, SONAR applies early stopping to both search processes.
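To make the interleaving idea concrete, here is a minimal, purely illustrative sketch: candidates are evaluated under a growing evaluation budget for both accuracy (e.g. partial training) and latency (e.g. partial autotuning), and a candidate is pruned early once its joint score clearly trails the incumbent. The function `sonar_style_search`, the proxy functions `acc_fn`/`lat_fn`, and the toy joint objective are all hypothetical assumptions for illustration, not the published SONAR algorithm.

```python
def sonar_style_search(candidates, acc_fn, lat_fn, steps=4, slack=0.9):
    """Illustrative sketch (NOT the actual SONAR method): interleave
    accuracy estimation and latency tuning for each candidate, and
    early-stop candidates that fall clearly below the best joint
    score seen so far."""
    best = None  # (joint_score, candidate)
    for cand in candidates:
        pruned = False
        acc, inv_lat = 0.0, 0.0
        for step in range(1, steps + 1):
            # Partial-fidelity accuracy estimate (e.g. a few training epochs).
            acc = acc_fn(cand, step)
            # Partially tuned latency estimate (e.g. a few autotuning trials).
            inv_lat = 1.0 / lat_fn(cand, step)
            score = acc * inv_lat  # toy joint objective: accuracy per unit latency
            # Early stopping: prune if well below the incumbent's score.
            if best is not None and score < slack * best[0]:
                pruned = True
                break
        if not pruned:
            final = acc * inv_lat
            if best is None or final > best[0]:
                best = (final, cand)
    return best

# Toy setup: three hypothetical architectures with fixed true accuracy
# and latency (ms); estimates sharpen as the step budget grows.
BASE = {"small": (0.70, 1.0), "medium": (0.90, 2.0), "large": (0.95, 5.0)}

def acc_fn(cand, step):
    return BASE[cand][0] * step / 4      # accuracy proxy improves with budget

def lat_fn(cand, step):
    return BASE[cand][1] * (2 - step / 4)  # latency estimate tightens with budget

best = sonar_style_search(["small", "medium", "large"], acc_fn, lat_fn)
```

In this toy run, the first candidate is evaluated to completion and becomes the incumbent; later candidates whose partial joint score lags far behind are pruned after a single step, which is the efficiency win that early stopping provides.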