We have enhanced VTA to enable a larger design space and greater workload support.
Fully pipelined ALU and GEMM units have been added to VTA tsim, and memory width can now range from 8 to 64 bytes. The ISA encoding has been relaxed to support larger scratchpads. New instructions have been added: 8-bit ALU multiplication for depthwise convolution, load with variable pad values for max pooling, and a clip opcode for common ResNet patterns. Support for additional layers and improved double buffering are also included.
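To illustrate why a configurable pad value matters for max pooling, here is a minimal pure-Python sketch. It does not model VTA's ISA or tsim. The helper names (`load_with_pad`, `max_pool_1d`, `clip`) and the signed 8-bit range are assumptions made for illustration. The point is that zero padding corrupts the maximum when all inputs are negative, whereas padding with the smallest representable value leaves the result unchanged. The `clip` helper shows the clamp that a single clip opcode would presumably replace a separate min and max for.

```python
INT8_MIN, INT8_MAX = -128, 127  # assumed signed 8-bit range, for illustration only


def load_with_pad(row, pad, pad_value):
    """Model a padded load: surround `row` with `pad` copies of `pad_value`."""
    return [pad_value] * pad + list(row) + [pad_value] * pad


def max_pool_1d(row, window, stride=1):
    """Take the max over each sliding window of a 1-D row."""
    return [max(row[i:i + window])
            for i in range(0, len(row) - window + 1, stride)]


def clip(values, lo, hi):
    """Clamp every value into [lo, hi] (one min and one max per element)."""
    return [min(max(v, lo), hi) for v in values]


row = [-5, -3, -7, -2]  # hypothetical all-negative activations

# Zero padding leaks a spurious 0 into the border windows.
zero_padded = max_pool_1d(load_with_pad(row, 1, 0), window=3)

# Padding with the minimum value is neutral under max.
min_padded = max_pool_1d(load_with_pad(row, 1, INT8_MIN), window=3)

print(zero_padded)  # [0, -3, -2, 0]
print(min_padded)   # [-3, -3, -2, -2]
print(clip([-200, 5, 300], INT8_MIN, INT8_MAX))  # [-128, 5, 127]
```

With a fixed zero pad, the border outputs would be wrong for any window whose real inputs are all negative, which is why the pad value has to be selectable per load.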
Fully pipelined ALU and GEMM units alone yield a large performance gain: ResNet-18 runs in ~4.9x fewer cycles with minimal area change. Configurations with a further ~11.5x decrease in cycle count, at a cost of ~12x more area, are possible. Tens of points on the area-performance Pareto curve are shown, balancing execution unit sizing, memory width, and scratchpad size. Finally, VTA can now run MobileNet 1.0 and all ResNet layers.
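As a rough sketch of how an area-performance Pareto curve can be extracted from a set of evaluated configurations, the following Python keeps only the points that no other point beats on both area and cycle count. The configuration names and numbers are hypothetical placeholders, not the measured results, and `pareto_front` is an illustrative helper rather than part of any VTA tooling.

```python
def pareto_front(points):
    """Return configs not dominated in (area, cycles); lower is better for both.

    `points` is an iterable of (name, area, cycles) tuples.
    """
    front = []
    best_cycles = float("inf")
    # Visit configs from smallest to largest area; a config joins the front
    # only if it is strictly faster than every smaller-or-equal-area config.
    for name, area, cycles in sorted(points, key=lambda p: (p[1], p[2])):
        if cycles < best_cycles:
            front.append((name, area, cycles))
            best_cycles = cycles
    return front


# Hypothetical (name, relative area, relative cycles) values for illustration.
configs = [
    ("a", 1.0, 100.0),
    ("b", 1.1, 20.0),
    ("c", 2.0, 25.0),   # dominated by "b": larger and slower
    ("d", 4.0, 8.0),
    ("e", 12.0, 2.0),
    ("f", 12.5, 3.0),   # dominated by "e": larger and slower
]

front = pareto_front(configs)
print([name for name, _, _ in front])  # ['a', 'b', 'd', 'e']
```

Sorting by area first means a single pass suffices, since each candidate only needs to be compared against the best cycle count seen so far.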
All features are available in open-source forks, and some have already been upstreamed.