We will present a new approach to scheduling machine learning models called ‘cascading’. Cascading is a form of inter-operator scheduling that can significantly reduce the working memory requirements of a model, allowing big models to run on tiny devices. It can also improve performance on memory-bound processors such as NPUs. In this talk, we’ll cover the technique and the results we’ve seen so far.
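As a rough intuition for the idea (this is a toy sketch, not the actual implementation — the operator names and stripe size here are illustrative), cascading computes a chain of operators stripe by stripe instead of materialising each full intermediate tensor, so peak working memory drops from the size of a whole intermediate to the size of one stripe:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def scale(x):
    return x * 2

x = np.arange(-8.0, 8.0)

# Layer-by-layer scheduling: the entire intermediate tensor is
# materialised before the next operator runs.
intermediate = relu(x)          # working memory ~= full tensor
out_layered = scale(intermediate)

# Cascaded scheduling: run both operators over one stripe at a time,
# so working memory is only ever one stripe of the intermediate.
STRIPE = 4
out_cascaded = np.concatenate(
    [scale(relu(x[i:i + STRIPE])) for i in range(0, len(x), STRIPE)]
)

# Both schedules produce identical results; only the memory
# footprint of the intermediate differs.
assert np.array_equal(out_layered, out_cascaded)
```

Real operators such as convolutions complicate this picture (stripes overlap and must account for receptive fields), which is part of what makes the scheduling problem interesting.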
This session is broken into two parts: a 20-minute talk followed by a 10-minute community breakout session.