TVM has made it much easier to port deep learning frameworks to new hardware backends, but developers still need to fine-tune performance for their specific hardware. This requires identifying performance bottlenecks, which in turn requires reliable profiling tools. The debug graph executor is very useful for coarse-grained, function-level information. In most cases, however, that is not sufficient, and developers are left looking for finer-grained, loop-level information. We are working on a TIR pass that instruments the code with a new builtin which can be used to enable profiling. The profiling mechanism we have developed enables both function- and loop-level profiling and allows hardware targets to collect processor-specific performance metrics. This talk will describe the design and implementation of this generalized profiling mechanism. We will also present the techniques used to reduce profiling overhead on storage-constrained embedded processors.
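To make the idea concrete, here is a minimal plain-Python sketch of what function- and loop-level instrumentation looks like conceptually. This is not the TVM TIR pass or builtin described in the talk; the `RegionProfiler` class and region names are purely illustrative, standing in for the start/stop profiling calls such a pass would insert around functions and loops.

```python
import time
from collections import defaultdict

class RegionProfiler:
    """Collects elapsed time and hit counts per named region
    (a region stands for a function body or a loop nest)."""
    def __init__(self):
        self.elapsed = defaultdict(float)
        self.counts = defaultdict(int)

    def region(self, name):
        profiler = self

        class _Region:
            def __enter__(self):
                self.start = time.perf_counter()

            def __exit__(self, *exc):
                profiler.elapsed[name] += time.perf_counter() - self.start
                profiler.counts[name] += 1
                return False

        return _Region()

def matmul_instrumented(a, b, prof):
    """Matrix multiply with the kind of function-level and loop-level
    profiling regions an instrumentation pass might emit."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    with prof.region("matmul"):              # function-level region
        for i in range(n):
            with prof.region("matmul/row"):  # loop-level region
                for j in range(m):
                    out[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return out

prof = RegionProfiler()
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul_instrumented(a, b, prof)
```

After running, `prof.counts` shows one hit for the function region and one hit per outer-loop iteration for the loop region, which is the kind of breakdown that lets developers attribute time to individual loops rather than whole operators.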
This session is split into two parts: a 20-minute talk and a 10-minute community discussion.