Chapter 15: Optimizing for Latency

First Make Sure It Matters

Comment out the call to tflite::MicroInterpreter::Invoke() and compare the overall run time with and without it to see how much time is spent in the interpreter. If the difference is not significant, the model is not your bottleneck and you should focus on optimizing other parts of your application.
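
A minimal sketch of that check, assuming your tflite::MicroInterpreter is already set up with tensors allocated; Millis() stands in for whatever millisecond tick counter your platform provides, and the helper name and run count are arbitrary choices, not part of the library:

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"

// Hypothetical placeholder: returns a millisecond tick count from whatever
// timer your platform provides.
uint32_t Millis();

// Runs the model kRunCount times and returns the average loop time in ms.
// `interpreter` is assumed to be an already-initialized MicroInterpreter.
uint32_t MeasureAverageLoopTimeMs(tflite::MicroInterpreter& interpreter) {
  constexpr int kRunCount = 100;
  const uint32_t start_ms = Millis();
  for (int i = 0; i < kRunCount; ++i) {
    // Fill interpreter.input(0) here, as your application normally would.

    // Comment out the next line and rerun: if the average barely changes,
    // the interpreter is not where your latency is going.
    interpreter.Invoke();

    // Read interpreter.output(0) here, as your application normally would.
  }
  return (Millis() - start_ms) / kRunCount;
}
```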


Efficient Neural Networks

  1. MobileNet:
    • MobileNetV1: Utilizes depthwise separable convolutions to build a lightweight deep neural network.
    • MobileNetV2: Introduces inverted residuals and linear bottlenecks to improve upon V1.
    • MobileNetV3: Employs a combination of hardware-aware network architecture search (NAS) and complementary search techniques to further enhance performance.

  2. SqueezeNet:
    • Employs squeeze and expand modules to significantly reduce the number of parameters without sacrificing accuracy.

  3. ShuffleNet:
    • Utilizes pointwise group convolution and channel shuffle to reduce computational cost while maintaining accuracy.
    • ShuffleNet V2: An improvement over the original, aimed at reducing memory access cost and model complexity.

  4. EfficientNet:
    • Systematically scales all dimensions of the architecture (width, depth, resolution) using a compound coefficient.
    • EfficientNetV2: Incorporates training and architecture improvements for faster training and better efficiency.

  5. Xception:
    • Extends the idea of depthwise separable convolutions with an entry flow, middle flow, and exit flow, improving upon the concepts in MobileNet.

  6. DenseNet:
    • Features densely connected convolutional networks where each layer is connected to every other layer in a feed-forward fashion, improving parameter efficiency.

  7. ENet (Efficient Neural Network):
    • Designed for tasks that require real-time processing, such as semantic segmentation, particularly in mobile applications.

  8. NASNet:
    • Employs Neural Architecture Search (NAS) to automatically generate model architectures that are both performant and efficient.

  9. MnasNet:
    • Uses NAS to optimize for both accuracy and mobile latency.

  10. GhostNet:
    • Introduces "ghost" modules that generate more feature maps from cheap operations, improving efficiency.

  11. PeleeNet:
    • A real-time object detection system that runs at high speed on mobile devices with low computational power.

  12. CondenseNet:
    • Uses learned group convolutions, increasing the efficiency of deep neural networks by reducing redundancy.

FLOPs (Floating Point Operations)

FLOPs, or floating-point operations, count the individual floating-point calculations a piece of work requires. The term is easy to confuse with FLOPS (with a capital S), which measures how many such operations a processor can perform per second. In machine learning, FLOPs are widely used as a metric for the computational cost of a model, just as they are for other numerically heavy workloads such as scientific computing and simulation.

Here's a breakdown of the term:

  • Floating Point: This refers to numbers that have a decimal point (floating point numbers), as opposed to integers (whole numbers). Operations on floating-point numbers are often more computationally intensive than integer operations.

  • Operation: This typically refers to mathematical operations such as addition, subtraction, multiplication, and division.

When it comes to measuring computational performance in the context of neural networks and deep learning, FLOPs can be a useful metric to gauge the complexity of a model. For example:

  • Model Complexity: The number of FLOPs required for a single forward pass indicates the computational complexity of a neural network model. Models with more layers and larger input sizes generally have a higher number of FLOPs.

  • Inference and Training: FLOPs can be used to estimate how fast a neural network can run (inference) and how long it might take to train. A model with fewer FLOPs can often run inference more quickly, making it suitable for real-time applications.

  • Energy Consumption: There's often a correlation between the number of FLOPs and the energy consumption of the model, which is an important consideration for battery-powered devices.

When people talk about reducing the number of FLOPs in a neural network, they're usually looking to make the network faster and less resource-intensive. This is where efficient network architectures (like MobileNet or SqueezeNet) come into play; they are designed to maintain high accuracy while significantly reducing the number of FLOPs required per inference.
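
As a rough illustration of how a FLOP count feeds into latency thinking (the layer sizes and the sustained-throughput figure below are made-up assumptions, not measurements of any particular chip), dividing a model's FLOPs by the operations per second a processor can actually sustain gives a crude lower bound on inference time:

```cpp
#include <cstdio>

int main() {
  // Hypothetical fully connected layer: each output needs one multiply and
  // one add per input, so roughly 2 * inputs * outputs FLOPs.
  const long long inputs = 256;
  const long long outputs = 64;
  const long long fc_flops = 2 * inputs * outputs;

  // Assumed sustained throughput of a small microcontroller, in FLOPs per
  // second. Real numbers depend heavily on the chip and the kernels used.
  const double sustained_flops_per_second = 50e6;

  const double est_seconds =
      static_cast<double>(fc_flops) / sustained_flops_per_second;
  std::printf("~%lld FLOPs -> rough lower bound of %.3f ms\n",
              fc_flops, est_seconds * 1e3);
  return 0;
}
```

Real kernels rarely reach peak throughput, so treat the result as a floor rather than a prediction.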


Depthwise Separable Convolutions

Depthwise separable convolutions are a type of convolutional operation that breaks down the standard convolution into two separate layers, aiming to reduce the computational cost and the number of parameters. This technique is used in neural network architectures to make them lighter and faster, which is especially beneficial for devices with limited computational power, such as mobile devices. It’s commonly seen in efficient models like MobileNets.

The standard convolutional operation combines both filtering and combining steps in a single process, applying a filter to the input data and combining the results to form a single output. Depthwise separable convolutions divide this into two layers:

  1. Depthwise Convolution: In this step, a single filter is applied to each input channel (depth). If the input has 32 channels, you'll have 32 filters, and each one works on its respective channel. This operation is responsible for filtering the inputs.

  2. Pointwise Convolution: After the depthwise convolution, a 1x1 convolution is applied. It takes the output of the depthwise convolution and combines the channels to create new features. This step is responsible for combining the outputs from the depthwise convolution.

The main advantages of depthwise separable convolutions are:

  • Reduced Complexity: They require significantly fewer parameters and computations compared to standard convolutions, making the network faster and lighter.

  • Efficiency: This efficiency makes depthwise separable convolutions particularly useful for mobile and embedded vision applications, where computing resources are constrained.

  • Less Overfitting: With fewer parameters, there’s a lower risk of overfitting, especially when training data is limited.

To illustrate the reduction in computational cost, consider a standard convolutional layer with an input of size Df x Df x M and a filter of size Dk x Dk x M x N, where Df is the spatial dimension of the input feature map, Dk is the kernel size, M is the number of input channels, and N is the number of output channels. The standard convolutional layer requires Dk x Dk x M x N x Df x Df FLOPs.

In contrast, a depthwise separable convolution requires Dk x Dk x M x Df x Df FLOPs for the depthwise step and M x N x Df x Df for the pointwise step, for a total of Dk x Dk x M x Df x Df + M x N x Df x Df. Relative to the standard convolution this is a reduction of roughly 1/N + 1/(Dk x Dk), which is substantial for typical values of N and Dk.
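
Plugging made-up values for Df, Dk, M, and N into these formulas shows the size of the saving; the ratio of the two costs works out to roughly 1/N + 1/(Dk x Dk):

```cpp
#include <cstdio>

int main() {
  // Made-up example dimensions: feature map Df x Df with M input channels,
  // Dk x Dk kernels, and N output channels.
  const long long Df = 56, Dk = 3, M = 64, N = 128;

  // Standard convolution: Dk * Dk * M * N * Df * Df multiply-adds.
  const long long standard = Dk * Dk * M * N * Df * Df;

  // Depthwise separable convolution: depthwise step plus 1x1 pointwise step.
  const long long depthwise = Dk * Dk * M * Df * Df;
  const long long pointwise = M * N * Df * Df;
  const long long separable = depthwise + pointwise;

  std::printf("standard:  %lld\n", standard);
  std::printf("separable: %lld (%.1f%% of standard)\n", separable,
              100.0 * separable / standard);
  // The ratio approaches 1/N + 1/(Dk*Dk), about 0.119 for these values.
  return 0;
}
```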

Because of their efficiency, depthwise separable convolutions are a cornerstone of models designed for real-time or edge device applications where computational resources are limited.


Quantization

  1. Robustness to Precision Loss: Deep learning models are generally robust to a significant reduction in numerical precision during inference, thanks to their ability to learn from noisy data and focus on important patterns.

  2. 32-bit vs. 8-bit Representations: While 32-bit floating-point representations are standard, most inference tasks do not require this level of precision. It's common to use 16-bit for training and 8-bit for inference without a noticeable loss in accuracy.

  3. Efficiency in Embedded Systems: Using 8-bit representations is beneficial for embedded systems as they often have hardware that efficiently supports 8-bit operations, common in signal processing.

  4. Quantization Process: Converting a model to use 8-bit values involves determining the correct scale factors for weights (which is straightforward) and activations (which is more complex due to variable output ranges); see the sketch after this list.

  5. TensorFlow's Approach: TensorFlow has streamlined the quantization process, integrating it into the model conversion from TensorFlow to TensorFlow Lite format.

  6. Post-training Weight Quantization: This method quantizes weights to 8-bits post-training, reducing model size and offering some performance benefits, but still requires floating-point operations for activations.

  7. Post-training Integer Quantization: Preferred for the embedded use cases discussed, this method allows models to run without any floating-point operations, which is beneficial for latency and resource-constrained devices.

  8. Activation Layer Ranges: To perform integer quantization, example input data is needed during export to estimate the range of activation layer outputs.

  9. Optimization for New Hardware: For new devices, you may need to optimize operations, like Conv2D, to benefit from hardware-specific instructions.

  10. Legacy Quantization Approaches: uint8 kernels are an older quantization method; use int8 for current work.
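
As a simplified sketch of the affine mapping that int8 quantization is built on (this is not the actual TensorFlow Lite conversion path, which is handled by the converter tooling; the function names and the example range are illustrative only), a scale and zero point derived from an observed min/max range map real values onto int8:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

struct QuantParams {
  float scale;
  int zero_point;
};

// Derive a scale and zero point that map [min_val, max_val] onto int8's
// [-128, 127] range. Activation ranges like this are what the representative
// example data is used to estimate during export.
QuantParams ChooseParams(float min_val, float max_val) {
  min_val = std::min(min_val, 0.0f);  // the range must include zero
  max_val = std::max(max_val, 0.0f);
  const float scale = (max_val - min_val) / 255.0f;
  const int zero_point = std::clamp(
      static_cast<int>(std::round(-128.0f - min_val / scale)), -128, 127);
  return {scale, zero_point};
}

int8_t Quantize(float x, QuantParams p) {
  const int q = static_cast<int>(std::round(x / p.scale)) + p.zero_point;
  return static_cast<int8_t>(std::clamp(q, -128, 127));
}

float Dequantize(int8_t q, QuantParams p) {
  return p.scale * (q - p.zero_point);
}

int main() {
  // Made-up activation range, as if estimated from example inputs.
  const QuantParams p = ChooseParams(-6.0f, 6.0f);
  const float x = 1.234f;
  const int8_t q = Quantize(x, p);
  std::printf("scale=%f zero_point=%d  %f -> %d -> %f\n",
              p.scale, p.zero_point, x, static_cast<int>(q), Dequantize(q, p));
  return 0;
}
```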


Performance Profiling

Understanding how long each part of your code takes is crucial for optimization but can be challenging on embedded systems.

  1. Blinky: Use an LED on a development board to time code execution manually with an external stopwatch for durations over half a second.

  2. Shotgun profiling: Temporarily comment out pieces of code to gauge their impact on execution time, effective due to the lack of data-dependent branches in neural network code.

  3. Debug logging: Although feasible, this method can be time-consuming and may introduce additional latency due to communication with a host computer.

  4. Logic analyzer: For more precise timing, toggle GPIO pins on/off and measure with an external logic analyzer, though this can be costly and requires hardware setup.

  5. Timer: Utilize a precise timer to log start and end times around code sections of interest (see the sketch after this list), though it requires manual setup for each section and is specific to the chip used.

  6. Profiler: The most efficient option, if available, uses external tools that provide detailed insights into which functions or lines of code are time-consuming.
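
A minimal sketch of the timer approach (ScopedTimer and GetMicroseconds() are illustrative names; the host-side chrono stand-in would be replaced by a read of your chip's hardware timer or cycle counter on a real board):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Host-side stand-in so the sketch runs anywhere; on a real board, replace
// this with a read of your chip's hardware timer or cycle counter.
uint32_t GetMicroseconds() {
  using namespace std::chrono;
  return static_cast<uint32_t>(
      duration_cast<microseconds>(steady_clock::now().time_since_epoch())
          .count());
}

// Scoped timer: records the start time when constructed and prints the
// elapsed time when it goes out of scope, so wrapping a section of code in
// braces is enough to time it.
class ScopedTimer {
 public:
  explicit ScopedTimer(const char* label)
      : label_(label), start_us_(GetMicroseconds()) {}
  ~ScopedTimer() {
    const uint32_t elapsed_us = GetMicroseconds() - start_us_;
    // Swap printf for whatever logging facility your target supports.
    std::printf("%s took %lu us\n", label_,
                static_cast<unsigned long>(elapsed_us));
  }

 private:
  const char* label_;
  uint32_t start_us_;
};

int main() {
  // Example usage: time a section of interest, e.g. the block around Invoke().
  {
    ScopedTimer timer("busy loop");
    volatile int sink = 0;
    for (int i = 0; i < 1000000; ++i) sink = sink + i;
  }
  return 0;
}
```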