I keep track of approximately 50 companies that are building deep neural network (DNN) accelerator chips, and I have been criticized for not having a complete list. While there is no doubt that the market for deep learning chips is growing, and will keep growing, by leaps and bounds, it is hard to imagine more than a dozen viable players with staying power. Clearly the segment will go through consolidation, and unfortunately many vendors will not survive.
Marketing 101 has taught us that superior performance alone is not sufficient to win in a competitive environment. Product positioning, differentiation, and promotion are the factors that set winners apart. While a superior architecture, tight integration with hierarchical memory, and advanced process nodes are critical to having an edge in the market, there are many other areas that chip vendors can exploit to differentiate themselves.
This leads us to the concept of “approximation” as a means of product differentiation. In a nutshell, one can view approximation as doing more with the resources at hand or doing the same with fewer resources. Finding ways to eliminate redundant connections, weights, and neurons in DNNs, in addition to using the most efficient quantization (numerical representation format) schemes, can go a long way toward minimizing the power dissipation, latency, die area, and memory footprint of inference accelerators. The term “DNN approximation” is a catch-all phrase encompassing all of the above. To be clear, DNN approximation is less desirable in training, since typically nothing is spared to maximize the accuracy of a DNN during training. By contrast, using aggressive approximation techniques in inference makes a lot of sense, since low power dissipation, latency, cost, and memory footprint are huge factors in inference settings, especially in edge applications.
To appreciate the benefits of approximation, consider the following example. The original LeNet-5 model for the MNIST (handwritten digit) classification task required nearly 700k arithmetic operations per classification. Fast forward about a decade and a half, and the VGG16 model for classifying ImageNet required nearly 35G arithmetic operations per classification, roughly 50,000 times as many. The good news is that image classification models are getting better; the bad news is that their computational cost will continue to grow by leaps and bounds.
So, what does approximation have to do with product differentiation? I am firmly convinced that providing tools, specialized hardware, and various hooks and handles to help customers exploit the most sensible approximation strategies can be a potent differentiation strategy for inference chip vendors. In short, help your customers implement the most effective approximation techniques. A few ideas come to mind:
Accelerator chip vendors can provide tools that examine a specific customer's DNN implementation and propose the most effective approximation strategies for that implementation.
DNN approximation is a thriving research area, and keeping customers abreast of the latest findings would undoubtedly be a great service. Why not offer a research alert to help customers stay ahead of the curve?
What if vendors provided seamless conversion tools that take an existing DNN implementation and generate optimized, approximated equivalents?
What if devices maintained a log of the least and most used nodes of the DNN, enabling users to further optimize their specific implementation?
While most vendors I have talked to are fully aware of approximation techniques, not all are dedicating sufficient time and resources to exploiting them.
In general, DNN approximation algorithms can be classified into two major categories: quantization and weight reduction. Quantization techniques attempt to reduce the precision of the weights and other DNN parameters. Weight reduction algorithms, on the other hand, attempt to remove redundancies, which can lead to dramatic simplifications. Below is a brief summary of the main approximation concepts.
1. Quantization
The numerical representation format (quantization) of a DNN's weights and parameters has a profound impact on die size, power dissipation, and latency. Floating-point representation is the most flexible format but comes at a steep price in resource use, power, and latency. Fixed-point representations have shown great promise, with minimal impact on accuracy. A tremendous amount of research is underway on other quantization techniques that improve efficiency further. Logarithmic quantization, binarization, and adaptive quantization are just a few examples that have shown promise.
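To make the fixed-point idea concrete, here is a minimal Python sketch. It assumes a symmetric scheme in which all weights share a single scale factor and are stored as signed 8-bit integers; the function names, the bit width, and the sample weights are my own illustrative choices, not any vendor's actual implementation.

```python
def quantize_fixed_point(weights, num_bits=8):
    """Map floats to signed integers that share one scale factor (symmetric scheme)."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights, for illustration only.
weights = [0.42, -1.27, 0.003, 0.9, -0.51]
q, scale = quantize_fixed_point(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now occupies one byte instead of four, and the rounding error per weight is bounded by half of one quantization step (scale / 2). Notice that the smallest weight collapses to zero, which is exactly the kind of accuracy trade-off that has to be evaluated for each network.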
2. Weight Reduction
Weight reduction is an extensive topic. The following are just a few methods to accomplish this goal.
Pruning
Pruning simply removes redundant connections and/or neurons from a DNN and can lead to sizable savings in computational cost. There are many pruning strategies, but the one most commonly used today is iterative. The process starts with normal DNN training, followed by pruning, followed by retraining. The objective of the retraining is to force the remaining connections to compensate for the accuracy lost to pruning. This cycle can be repeated several times to achieve maximum compression (savings).
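The pruning step of that cycle can be sketched in a few lines of Python. This sketch assumes magnitude-based pruning (dropping the weights with the smallest absolute values), which is only one of several possible criteria, and it leaves out the retraining step, since that requires a full training loop. The function name and the sample weights are hypothetical.

```python
def magnitude_prune(weights, mask, fraction):
    """Zero out the given fraction of still-active weights with the smallest magnitude."""
    active = [i for i, keep in enumerate(mask) if keep]
    n_prune = int(len(active) * fraction)
    # Indices of the smallest-magnitude weights among those still active.
    to_prune = sorted(active, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = False
    pruned = [w if keep else 0.0 for w, keep in zip(weights, new_mask)]
    return pruned, new_mask


# Hypothetical weights of a single layer, for illustration only.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
mask = [True] * len(weights)

for round_index in range(2):
    weights, mask = magnitude_prune(weights, mask, fraction=0.5)
    # A real pipeline would retrain here, updating only the weights whose
    # mask entry is True so they can compensate for the removed ones.

sparsity = mask.count(False) / len(mask)
print(weights, sparsity)
```

Keeping an explicit mask is a design choice that makes it easy to ensure pruned connections stay at zero during any later retraining.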
Weight Sharing
Typically, large DNNs contain groups of connections (buckets) that have the same or similar weights. By assigning a single centroid weight to all the elements in a bucket, the multiplication operation can be replaced by a simple table lookup, leading to a dramatic reduction in computational complexity. Table lookups are faster and more efficient than the actual multiplication. Such DNNs are typically trained normally, and clustering algorithms are then used to define the buckets of connections. Once the buckets are established and the shared weights are assigned, training is repeated to adjust the weights and compensate for the inaccuracies introduced by weight sharing.
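Here is a minimal Python sketch of the bucketing idea. It assumes a plain one-dimensional k-means as the clustering algorithm (the text above does not commit to a particular one) and skips the final retraining step. All names and sample values are hypothetical. The second function shows one way shared weights can save multiplications: inputs that fall in the same bucket are summed first, so a dot product needs only one multiplication per bucket.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster scalar weights into k buckets; assumes k >= 2."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly across the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iterations):
        assignments = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]


def shared_dot(inputs, codebook, indices):
    """Dot product with shared weights: accumulate inputs per bucket, then multiply once."""
    sums = [0.0] * len(codebook)
    for x, idx in zip(inputs, indices):
        sums[idx] += x
    return sum(c * s for c, s in zip(codebook, sums))


# Hypothetical weights and inputs, for illustration only.
weights = [0.11, 0.09, 0.10, -0.52, -0.48, -0.50, 0.95, 1.05, 1.00]
inputs = [1.0, 2.0, 3.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0]

codebook, indices = kmeans_1d(weights, k=3)
shared_weights = [codebook[i] for i in indices]   # the table lookup
print(codebook, indices, shared_dot(inputs, codebook, indices))
```

With three buckets, each weight is stored as a 2-bit index into the codebook rather than a full-precision value, which is where the memory savings come from.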
Simplification of Activation Function
There is an activation function associated with every neuron (network node), meaning that a complex network with millions of nodes has to evaluate as many activation functions. One can think of an activation function as the computational block that converts the result of the multiply-accumulate operation into a range-bound value (e.g. 0 to 1 or -1 to +1). Many common activation functions, such as sigmoid and tanh, involve exponentials and are therefore computationally intensive. Finding ways to simplify them can go a long way toward saving computational cycles. One of the most promising methods is to replace a nonlinear activation function with a piecewise linear approximation.
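As an illustration, the sigmoid can be approximated with three straight-line segments, a form often called a "hard sigmoid". The Python sketch below assumes a slope of 0.25 for the middle segment (the slope of the true sigmoid at zero); other slopes are also used in practice, so treat the constants as one possible choice rather than the definitive one.

```python
import math


def sigmoid(x):
    """Reference sigmoid, which needs an exponential per evaluation."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    """Three-segment piecewise linear approximation: 0, then 0.25*x + 0.5, then 1."""
    return max(0.0, min(1.0, 0.25 * x + 0.5))


# Measure the worst-case gap between the two on a grid from -8 to 8.
grid = [i / 100.0 for i in range(-800, 801)]
max_error = max(abs(sigmoid(x) - hard_sigmoid(x)) for x in grid)
print(max_error)
```

The approximation needs only one multiply, one add, and two comparisons, with no exponential. The price is a deviation from the true sigmoid that is largest near the points where the segments meet; adding more segments reduces that gap at the cost of more comparisons.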
I hope you have benefited from this issue. Please forward it to others if you find value in this content. I always welcome feedback.