Optimizing TensorFlow models with Quantization Techniques

September 18, 2020

Deep Learning models are great at solving extremely complex tasks efficiently but this superpower comes at a cost.

Because of their large number of parameters, these models typically have a big memory footprint and are also slow at inference (prediction) time.

Slow and heavy models are a problem at deployment time, where we want maximum performance with minimum hardware requirements.

Quantization techniques (supported by both TensorFlow and PyTorch) are designed to solve this problem. They aim to provide smaller and faster models while keeping accuracy almost unchanged.

The rest of the article is divided into the following parts, where we will walk through model quantization step by step:

  1. What is quantization?
  2. Quantization aware training
  3. Post-training quantization
  4. Post-training quantization techniques
  5. Evaluation
  6. Summary

1. What is quantization?

Deep learning models are usually trained with FP-32 (32-bit floating-point) tensors. Because of the large number of parameters, the resulting model is big in size and slow at inference.

Quantization is a technique that aims at storing tensors at lower bit-widths than their original size by performing various mathematical computations.

Typically, deep learning frameworks (TensorFlow and PyTorch) represent tensors with floating-point (FP-32) data types. A quantization technique might restrict FP-32 tensors to 8-bit integers (this kind of quantization falls under integer quantization).

The resulting quantized model is around 4x smaller in size, and hardware support for INT-8 computation typically makes it 2 to 4 times faster than FP-32 compute.
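
To build intuition, the mapping from float values to 8-bit integers is usually an affine transform defined by a scale and a zero point. Below is a minimal NumPy sketch of that idea; the variable names are purely illustrative and not part of any TensorFlow API.

import numpy as np

# Illustrative affine quantization of a float32 tensor to int8.
weights = np.random.randn(4, 4).astype(np.float32)

# Choose a scale and zero point that map [min, max] onto the int8 range [-128, 127].
qmin, qmax = -128, 127
w_min, w_max = weights.min(), weights.max()
scale = (w_max - w_min) / (qmax - qmin)
zero_point = int(round(qmin - w_min / scale))

# Quantize (float32 -> int8), then dequantize back to float32 and measure the error.
quantized = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.int8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale
print("max absolute error:", np.abs(weights - dequantized).max())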

Another way of optimizing models is quantization-aware training, which simulates quantization during the training itself. TensorFlow and PyTorch both support quantization-aware training, and it generally gives better model accuracy.

In this article, we will discuss how to quantize TensorFlow-based models. Let’s see how it works.


2. Quantization aware training

The Keras API in TensorFlow supports training-time quantization for both Sequential and Functional models. This technique often produces better accuracy than post-training quantization methods.

With the default settings, the API shrinks the model size by about 4x and reduces CPU latency by 1.5-4x. The following table shows the performance of SOTA models trained with this technique:

Quantization-aware training results on SOTA models | Image Source

As quantization-aware training is still experimental, not all deep learning layers are supported at the moment, and there is no backward-compatibility guarantee as of today.
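
The rough shape of the API looks like the following. This is only a sketch, assuming the separate tensorflow-model-optimization package is installed and that you already have a trained Keras model called model (like the one we build in section 4) whose layers are all supported:

import tensorflow_model_optimization as tfmot

# Wrap the Keras model so that quantization is simulated during training.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

# Re-compile and fine-tune for a few epochs, then convert with TFLiteConverter as usual.
q_aware_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
q_aware_model.fit(X_train, Y_train, batch_size=32, epochs=2, validation_data=(X_test, Y_test))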


3. Post-training quantization

In most cases, the deep learning model is trained with FP-32 tensors and later converted to INT-8 (or float-16) in order to get a smaller and faster model for deployment.

Post-training quantization is more stable than quantization-aware training (its APIs are not experimental) and easier to use. The following decision tree can help you decide which technique would be best for your model:

Decision tree for post-training quantization techniques | Image Source

In post-training quantization, we train the deep learning model normally and save the weights. The saved model is then converted to the TFLite format and quantized. Going further, we will train a small example model and apply three different post-training optimization techniques to it.


4. Post-training quantization techniques

To understand post-training quantization techniques, let’s train a deep learning model first and then try to optimize it.

In this exercise, we will use the handwritten digits dataset from scikit-learn and train a small classifier. The following code loads the dataset and prepares it for model training.

import pandas as pd
import numpy as np
import tensorflow
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
%matplotlib inline

#loading dataset
digits = load_digits()
print(digits.data.shape)

images = digits['images']
labels = digits['target']
print (images.shape, labels.shape)

#Splitting Data
X_train, X_test, y_train, y_test = train_test_split(images, labels, test_size=0.25, random_state=42)
X_train = np.expand_dims(X_train, axis=-1)
X_test = np.expand_dims(X_test, axis=-1)
print (X_train.shape, X_test.shape, y_train.shape, y_test.shape)

#Encoding Labels
def get_encoded_labels(target):
    output=np.zeros((len(target),10))
    for ix, value in enumerate(target):
        output[ix][target[ix]] = 1
    return output
Y_train = get_encoded_labels(y_train)
Y_test = get_encoded_labels(y_test)
print (Y_train.shape, Y_test.shape)
Out[1]: (1797, 64)
        (1797, 8, 8) (1797,)
        (1347, 8, 8, 1) (450, 8, 8, 1) (1347,) (450,)
        (1347, 10) (450, 10)

Here is a small classifier for our task; it takes an 8×8 image as input and outputs a probability distribution over the digits 0 to 9.

from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense
from tensorflow.keras.models import Model

input_layer = Input(shape=(8, 8, 1))
layer = Conv2D(64, (3,3), activation='relu')(input_layer)
layer = Conv2D(32, (3,3), activation='relu')(layer)
layer = Conv2D(32, (3,3), activation='relu')(layer)
layer = Flatten()(layer)
features = Dense(32, activation='relu')(layer)
output = Dense(10, activation='softmax')(features)

model = Model(inputs=input_layer, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "functional_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_4 (InputLayer)         [(None, 8, 8, 1)]         0         
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 6, 6, 64)          640       
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 4, 4, 32)          18464     
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 2, 2, 32)          9248      
_________________________________________________________________
flatten_3 (Flatten)          (None, 128)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 32)                4128      
_________________________________________________________________
dense_7 (Dense)              (None, 10)                330       
=================================================================
Total params: 32,810
Trainable params: 32,810
Non-trainable params: 0
_________________________________________________________________

Let’s train our model:

model.fit(X_train, Y_train, batch_size=32, epochs=10, validation_data=(X_test, Y_test))
Epoch 1/10
43/43 [==============================] - 0s 4ms/step - loss: 0.0365 - accuracy: 0.9903 - val_loss: 0.1204 - val_accuracy: 0.9644
Epoch 2/10
43/43 [==============================] - 0s 4ms/step - loss: 0.0763 - accuracy: 0.9829 - val_loss: 0.0968 - val_accuracy: 0.9711
Epoch 3/10
43/43 [==============================] - 0s 4ms/step - loss: 0.0134 - accuracy: 0.9985 - val_loss: 0.0859 - val_accuracy: 0.9800
Epoch 4/10
43/43 [==============================] - 0s 4ms/step - loss: 0.0089 - accuracy: 0.9985 - val_loss: 0.0898 - val_accuracy: 0.9778
Epoch 5/10
43/43 [==============================] - 0s 3ms/step - loss: 0.0092 - accuracy: 0.9985 - val_loss: 0.0755 - val_accuracy: 0.9822
Epoch 6/10
43/43 [==============================] - 0s 4ms/step - loss: 0.0031 - accuracy: 1.0000 - val_loss: 0.0715 - val_accuracy: 0.9844
Epoch 7/10
43/43 [==============================] - 0s 3ms/step - loss: 0.0024 - accuracy: 1.0000 - val_loss: 0.0715 - val_accuracy: 0.9844
Epoch 8/10
43/43 [==============================] - 0s 3ms/step - loss: 0.0019 - accuracy: 1.0000 - val_loss: 0.0699 - val_accuracy: 0.9844
Epoch 9/10
43/43 [==============================] - 0s 4ms/step - loss: 0.0018 - accuracy: 1.0000 - val_loss: 0.0693 - val_accuracy: 0.9844
Epoch 10/10
43/43 [==============================] - 0s 3ms/step - loss: 0.0016 - accuracy: 1.0000 - val_loss: 0.0707 - val_accuracy: 0.9844

Saving the model and checking the test accuracy for the digit recognition task:

# despite the ".pb" name, this writes a TensorFlow SavedModel directory
model.save("saved_model.pb")

def get_test_accuracy(predictions, target):
    correct = 0
    for ix, pred in enumerate(predictions):
        true_value = target[ix]
        if pred[true_value] == max(pred):
            correct += 1
    return correct*100/len(target)

predictions = model.predict(X_test)
get_test_accuracy(predictions, y_test)
Out[1]: 98.44444444444444

There are three post-training quantization techniques available for TensorFlow-based deep learning models. We will apply and test all three on the saved model.

List of post-training quantization techniques:

  • Dynamic range quantization
  • Full Integer quantization
  • Float-16 quantization

a. Dynamic range quantization

This is the simplest post-training quantization technique. This works by statically quantizing the weights of the model from floating point to 8-bit integers.

The activations are always stored in floating point. For operations that support quantized kernels, the activations are quantized to 8-bits of precision dynamically prior to processing and are de-quantized to float precision after processing.

Depending on the model being converted, this can give a speedup over pure floating-point computation. The following Python code applies the dynamic range optimization technique:

import tensorflow as tf

# load the SavedModel directory and apply the default (dynamic range) optimization
converter = tf.lite.TFLiteConverter.from_saved_model("./saved_model.pb/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
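
The convert() call returns the serialized TFLite flatbuffer as bytes. You will usually write it to disk and can check the size reduction directly; the filename below is just a placeholder:

import os

# write the quantized flatbuffer to disk (filename is arbitrary) and report its size
with open("dynamic_range_model.tflite", "wb") as f:
    f.write(tflite_quant_model)
print("quantized model size (KB):", round(os.path.getsize("dynamic_range_model.tflite") / 1024, 1))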

b. Full Integer quantization

Full integer quantization further improves model size and latency by converting all the model mathematics into integer-only calculations. This technique produces optimized models that are compatible with integer-only hardware devices and accelerators.

To calibrate the dynamic range of the activations, a few real sample inputs are passed through the model during conversion. The representative_dataset_gen function in the following code supplies these samples to the converter.

There are two ways to perform full integer quantization:-

  1. Integer with float fallback
  2. Integer only

i) Integer with float fallback (using default float input/output)

This method performs full integer quantization while keeping the input/output tensors in float-32. However, the resulting model won’t be compatible with integer-only devices/accelerators.

Here is how you can perform integer-with-float-fallback quantization:

import tensorflow as tf

num_calibration_steps = 1
converter = tf.lite.TFLiteConverter.from_saved_model("./saved_model.pb/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# yields a few representative input batches so the converter can calibrate activation ranges
def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        input_data = [X_test[:10].astype('float32')]
        yield input_data

converter.representative_dataset = representative_dataset_gen
tflite_quant_model2 = converter.convert()

ii) Integer only

To make the resulting model compatible with integer-only devices, we can enforce full integer quantization for all operations, including the input and output.

The quantized model will then only accept int8 inputs, and the output layer will also produce int8 values. Here is sample code to apply integer-only quantization to your deep learning model.

import tensorflow as tf
num_calibration_steps=1
converter = tf.lite.TFLiteConverter.from_saved_model("./saved_model.pb/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        input_data = [X_test[:10].astype('float32')]
        yield input_data
        
converter.representative_dataset = representative_dataset_gen
# restrict the converter to TFLite's int8 builtin operations
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8  # or tf.uint8
converter.inference_output_type = tf.int8  # or tf.uint8
tflite_quant_model2 = converter.convert()

If any part of the model is not yet supported for integer quantization, this conversion will throw an error.

Note: The inference_input_type and inference_output_type attributes are only supported in TensorFlow version 2.3.0 and above.
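
To double-check that the conversion really produced integer input and output tensors, you can inspect the converted model with the TFLite interpreter. This is just a small sanity-check sketch that loads the flatbuffer straight from memory:

import tensorflow as tf

# load the integer-only model directly from the in-memory flatbuffer
interpreter = tf.lite.Interpreter(model_content=tflite_quant_model2)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# both dtypes should now be int8; the (scale, zero_point) pair is needed to
# quantize float inputs before calling invoke()
print(input_details['dtype'], output_details['dtype'])
print(input_details['quantization'])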


c. Float-16 quantization

Float-16 quantization reduces the model size by converting the model weights from FP-32 to FP-16 numbers. This technique roughly halves the model size and results in minimal accuracy loss.

This approach does not reduce latency as much as the other techniques, though, and it is not optimized for CPU-based inference. The following code applies the FP-16 quantization technique to an already trained model:

import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("./saved_model.pb/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model3 = converter.convert()
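
If you write each converted model to its own .tflite file (as shown earlier for the dynamic range model), you can compare their on-disk footprints directly before moving on to the evaluation; the filenames here are only placeholders:

import os

# assumes each converter output was written to its own .tflite file beforehand
for name in ["dynamic_range_model.tflite", "full_integer_model.tflite", "float16_model.tflite"]:
    if os.path.exists(name):
        print(name, round(os.path.getsize(name) / 1024, 1), "KB")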

5. Evaluation

The post-training quantization techniques discussed above were applied to our previously trained classification model. All of them reduce the model’s memory footprint from about 440 KB to below 80 KB, and the dynamic-range and float-16 variants also cut latency roughly in half.

The following table shows the experiment results on our example model. All results are based on model inference over the 450 test images.

Technique                        | Accuracy (from → to, %) | Latency (from → to, ms) | Model size (from → to, KB) | Hardware
Dynamic range quantization       | 98.44 → 98.44           | 75 → 30                 | 440 → 45                   | CPU
Full integer (float fallback)    | 98.44 → 98.44           | 75 → 460                | 440 → 43                   | CPU
Full integer (integer only)      | 98.44 → 26.12           | 75 → 480                | 440 → 41                   | CPU
Float-16 quantization            | 98.44 → 98.44           | 75 → 31                 | 440 → 72                   | CPU

Experimental results

The full integer (integer-only) quantization technique shows a big accuracy drop, while the other techniques perform equally well in our little experiment.

Inference with quantized model

The quantized models are in the TFLite format. The following Python code loads a TFLite model into memory.

import tensorflow as tf
interpreter = tf.lite.Interpreter(model_path="./tflite_quant_model_f16q.hdf5")

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

interpreter.allocate_tensors()

print(input_details)
print(output_details)
[{'name': 'input_4', 'index': 0, 'shape': array([1, 8, 8, 1], dtype=int32), 'shape_signature': array([-1,  8,  8,  1], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}]

[{'name': 'Identity', 'index': 18, 'shape': array([ 1, 10], dtype=int32), 'shape_signature': array([-1, 10], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}]

You can then pass the input data as follows, after converting it to float-32 format.

predictions = []
for img in X_test:  
    interpreter.set_tensor(input_details[0]['index'], [img.astype('float32')])
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    predictions.append(output_data[0])
predictions = np.array(predictions)
get_test_accuracy(predictions, y_test)
Out[1]: 98.44444444444444

Below are the latency and accuracy results for post-training quantization and quantization-aware training on a few models from the TensorFlow official website. All latency numbers are measured on Pixel 2 devices using a single big core CPU.

TensorFlow experiments on SOTA models | Image Source

6. Summary

In this article, we discussed various model quantization techniques aimed at making your deep learning TensorFlow/Keras based model smaller and faster for deployment.

We also saw this on an example model, where model size and latency improved significantly without any significant drop in accuracy (except for the integer-only variant).

In practice, model accuracy might drop a little with these optimization techniques when they are applied to a bigger and deeper model.

You might like to read next:

Deep Learning with PyTorch: Introduction

Thanks for reading! Hope you have enjoyed the article. Do let me know your thoughts by commenting below. See you in the next article 🙂

Reference: https://www.tensorflow.org/lite/performance/model_optimization