1D-CNN based Fully Convolutional Model for Handwriting Recognition

By | September 24, 2020

Handwriting recognition, also known as HTR (Handwritten Text Recognition), is a machine learning task that aims to give machines the ability to read human handwriting from real-world documents (images).

Traditional Optical Character Recognition (OCR) systems are trained to handle the variations and font styles in machine-printed text (from documents/images), and they work well in practice (Tesseract is one example). Handwriting recognition, on the other hand, is a more challenging task because handwriting varies so much from person to person.

Photo by John Jennings from Unsplash | Image Source

Recent progress in deep learning has led to the development of efficient OCR/HTR solutions. Although these models perform remarkably well in practice, they aren't easy to train, understand, and deploy because of the following limitations:

  1. They require a huge amount of labeled training data.
  2. Because of their large number of trainable parameters, they are hard to train and slow at inference.
  3. Because they are slow, they need costly hardware to be useful in real-time applications.
  4. The models are complex and difficult to scale (stacked LSTMs, complex attention layers).

In this article, we will talk about a novel deep learning architecture (EASTER) that addresses the challenges listed above to some extent. This architecture is fast, scalable, simple, and more efficient than many complex alternatives for OCR and HTR.

The EASTER model uses only one-dimensional convolutional layers for HTR and OCR.

EASTER model results from the original paper | Image Source

Link to the original paper:

EASTER: Efficient and Scalable Text Recognizer


Here is what this article covers about the EASTER model:

  1. EASTER Overview
  2. 1D-CNN on images? Really? how?
  3. EASTER Model Architecture
  4. OCR/HTR Capability with zero Training Data
  5. Results
  6. Summary

EASTER Overview

EASTER (Efficient and Scalable Text Recognizer) is a fully convolutional architecture that uses only 1-D convolutional layers in the encoder and adds a CTC (Connectionist Temporal Classification) decoder at the end.

EASTER offers a new way of framing and efficiently solving OCR/HTR tasks with only 1-D convolutional layers.

Here are a few important points about the EASTER architecture:

  1. Fully convolutional architecture that can be trained in parallel on GPUs.
  2. Only 1-D convolutional layers, so it is faster and has fewer parameters.
  3. Works well even when training data is limited.
  4. No complex layers (easy to understand).
  5. Works well for line-level OCR/HTR tasks.

In addition to the EASTER architecture, the paper also presents a synthetic data generation pipeline with an augmentation setup. That means you can train your own OCR/HTR system without any manually labeled training data.

Now the question is: how do you apply one-dimensional convolutions to a two-dimensional image? This is a valid question, and the next section answers it.


1D-CNN on images? Really? how?

Consider an input image of size 600 × 50 (W × H), as shown in the figure below.

If you draw any vertical line through this image, it will cut only a single character (unless it falls in white space), whereas a horizontal line will probably cut through all the characters.

In other words, along the height of the image you find the properties of a single character, while along the width you find different characters as you move from left to right.

1-D Convolutional filter movement in EASTER model

So the width can be treated as a time dimension: as you move along it, you encounter successive characters, while the height represents the properties of a character at a given time step.

A one-dimensional filter with kernel size 3 is a filter that spans 3 pixels in the time dimension (along the width) and covers the full height of 50 pixels (H). In other words, a filter of kernel size 3 has dimensions 3×50 (or 3×H), just as a 1-D CNN works on word embeddings in NLP.

As shown in the figure above, the red rectangular box is a 1-D convolutional filter that scans the full height of the image as it moves along the time dimension (the width) from left to right. Each scan captures information about the observed character (or part of a character).

This information is finally passed to a softmax layer, which gives a probability distribution over all possible characters for each time step along the width. This probability distribution is then passed to the CTC decoding layer to generate the final output sequence.
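To make the filter movement concrete, here is a minimal pure-Python sketch of one 1-D filter sliding along the width of a 600 × 50 image. It is only an illustration: the function name `conv1d_over_width`, the all-ones image, and the all-ones weights are assumptions made for this example, and a real model would use a deep learning framework with many learned filters.

```python
def conv1d_over_width(image, kernel):
    """Slide a (K x H) filter along the width of an image.

    image:  list of W columns, each a list of H pixel values
    kernel: list of K columns, each a list of H weights
    Returns W - K + 1 responses (no padding, stride 1).
    """
    k = len(kernel)
    height = len(image[0])
    outputs = []
    for start in range(len(image) - k + 1):
        total = 0.0
        # The filter covers K adjacent columns and the full height at once.
        for dx in range(k):
            for y in range(height):
                total += kernel[dx][y] * image[start + dx][y]
        outputs.append(total)
    return outputs


W, H, K = 600, 50, 3
image = [[1.0] * H for _ in range(W)]   # dummy all-ones image (assumption)
kernel = [[1.0] * H for _ in range(K)]  # dummy all-ones 3 x 50 filter (assumption)

responses = conv1d_over_width(image, kernel)
print(len(responses))  # 598 positions along the width
print(responses[0])    # 150.0, i.e. 3 * 50 weights applied to ones
```

Each filter produces one value per position along the width, so the height is consumed like a channel dimension and only the time axis remains.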
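The last two steps can be sketched in a few lines. The example below applies a softmax per time step and then runs greedy (best-path) CTC decoding, which collapses repeated symbols and drops the blank. The tiny vocabulary, the blank at index 0, and the logits are made up for illustration, and this is plain greedy decoding, not the weighted CTC setup described in the paper.

```python
import math


def softmax(scores):
    """Turn raw scores for one time step into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ctc_greedy_decode(probs, vocab, blank=0):
    """Pick the most likely symbol per time step, merge repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in probs]
    chars = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)


vocab = ["-", "a", "c", "t"]  # index 0 is the CTC blank (hypothetical vocabulary)
logits = [
    [0.1, 0.2, 3.0, 0.1],  # c
    [0.1, 0.2, 3.0, 0.1],  # c again (merged with the previous step)
    [3.0, 0.1, 0.2, 0.1],  # blank
    [0.1, 3.0, 0.2, 0.1],  # a
    [0.1, 0.2, 0.1, 3.0],  # t
    [0.1, 0.2, 0.1, 3.0],  # t again (merged)
]
probs = [softmax(row) for row in logits]
decoded = ctc_greedy_decode(probs, vocab)
print(decoded)  # cat
```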


EASTER Model Architecture

The EASTER model architecture is quite simple: it uses only 1-D convolutional layers for OCR and HTR.

The EASTER encoder consists of multiple stacked 1-D convolutional layers whose kernel size increases with the depth of the model. The effectiveness of stacked 1-D convolutional networks on sequence-to-sequence tasks has already been demonstrated in ASR (Automatic Speech Recognition).

Easter Block

The basic structure of an EASTER block is shown in the figure below. Each block has multiple repeating sub-blocks. Each sub-block is made up of four ordered components:

  1. 1-D Convolutional layer
  2. Batch-Normalization layer
  3. Activation layer (ReLU)
  4. A Dropout layer
EASTER Block | Handwriting Recognition
Easter Sub-block | Image Source
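The order of the four components can be sketched as a small forward pass in plain Python. This is a simplified illustration, not the paper's implementation: the batch normalization here has no learned scale or shift and uses statistics from a single sample, and the sizes, dropout rate, and random weights are all assumptions chosen for the example.

```python
import random


def conv1d_same(x, weights, bias):
    """1-D convolution with zero 'same' padding and stride 1.

    x:       T time steps, each a list of C_in values
    weights: C_out filters, each of shape K x C_in (K odd)
    bias:    C_out values
    """
    k = len(weights[0])
    c_in = len(x[0])
    pad = k // 2
    zero = [0.0] * c_in
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for t in range(len(x)):
        window = padded[t:t + k]
        out.append([
            b + sum(w[i][c] * window[i][c] for i in range(k) for c in range(c_in))
            for w, b in zip(weights, bias)
        ])
    return out


def batch_norm(x, eps=1e-5):
    """Normalize each channel to zero mean and unit variance over time."""
    t, channels = len(x), len(x[0])
    means = [sum(step[c] for step in x) / t for c in range(channels)]
    variances = [sum((step[c] - means[c]) ** 2 for step in x) / t
                 for c in range(channels)]
    return [[(step[c] - means[c]) / (variances[c] + eps) ** 0.5
             for c in range(channels)] for step in x]


def relu(x):
    return [[max(0.0, v) for v in step] for step in x]


def dropout(x, rate, training, rng):
    """Inverted dropout: active only during training."""
    if not training:
        return x
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in step] for step in x]


def sub_block(x, weights, bias, rate=0.2, training=False, rng=None):
    """Conv -> BatchNorm -> ReLU -> Dropout, in that order."""
    rng = rng or random.Random(0)
    return dropout(relu(batch_norm(conv1d_same(x, weights, bias))), rate, training, rng)


rng = random.Random(0)
T, C_IN, C_OUT, K = 8, 4, 2, 3  # hypothetical sizes
x = [[rng.uniform(-1, 1) for _ in range(C_IN)] for _ in range(T)]
weights = [[[rng.uniform(-1, 1) for _ in range(C_IN)] for _ in range(K)]
           for _ in range(C_OUT)]
bias = [0.0] * C_OUT

y = sub_block(x, weights, bias)
print(len(y), len(y[0]))  # 8 2: same number of time steps, C_OUT channels
```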

Final Encoder

The overall encoder is a stack of multiple repeating EASTER blocks (discussed in the previous section). Apart from the repeating blocks, there are four extra 1-D convolutional blocks in the overall architecture, as shown in the figure below.

Preprocessing Block (Downsampling block)

This is the first block of the model. It contains two 1-D convolutional layers with a stride of 2 and downsamples the original width of the image to width/4. Apart from the stride, all other components of its sub-blocks are the same as those discussed above.
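The width/4 figure follows from applying a stride of 2 twice. The short sketch below works through the arithmetic for a 600-pixel-wide input. It assumes "same" padding, which the article does not state, and the helper name `conv_output_width` is made up for this example; the "valid" branch is included only for comparison.

```python
import math


def conv_output_width(width, stride, padding="same", kernel_size=3):
    """Output length of a 1-D convolution along the width."""
    if padding == "same":
        return math.ceil(width / stride)
    # "valid": no padding, the filter must fit entirely inside the input
    return (width - kernel_size) // stride + 1


width = 600
for _ in range(2):  # two stride-2 layers in the preprocessing block
    width = conv_output_width(width, stride=2)
print(width)  # 150, i.e. 600 / 4
```

So a 600-pixel-wide line image yields 150 time steps for the rest of the encoder and for the CTC decoder.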

Post-Processing Blocks

There are three post-processing blocks at the end of the encoder. The first is a dilated 1-D convolutional block with a dilation of 2. The second is a normal 1-D convolutional block. The third is a 1-D convolutional block whose number of filters equals the number of possible outcomes (the model vocabulary length), followed by a softmax activation layer. The output of this layer is passed to the CTC decoder.
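Dilation widens what a filter sees without adding weights: a kernel of size k with dilation d spans d·(k − 1) + 1 input steps. The sketch below computes that span and the receptive field of a small stack of layers. The layer list is hypothetical and chosen only to show the effect of stride and dilation; it does not reproduce the kernel sizes from the paper.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of input steps spanned by one dilated filter."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers):
    """Receptive field (in input steps) of a stack of 1-D conv layers.

    layers: list of (kernel_size, stride, dilation) tuples, first layer first.
    """
    field, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        field += (kernel_size - 1) * dilation * jump
        jump *= stride  # later layers step over more input pixels
    return field


# Hypothetical stack: two stride-2 layers, then one layer with dilation 2.
layers = [(3, 2, 1), (3, 2, 1), (3, 1, 2)]

print(effective_kernel_size(3, 2))  # 5
print(receptive_field(layers))      # 23
```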

Easter Encoder | Image Source

CTC Decoder

The EASTER encoder passes the output probability distribution of the encoded sequence to a CTC decoder for decoding.

To map the predicted output characters to the final output sequence, the EASTER model uses a weighted CTC decoder. According to the paper, this weighted CTC decoder helps the model converge faster and gives better results than vanilla CTC when training data is limited.

The configuration of this weighted CTC is described in detail in the paper.

3×3 Architecture Variant

EASTER 3X3: A 14-layer variant can be constructed using the table shown below. This is a very shallow and simple architecture with just 1M parameters, yet it is very effective for OCR/HTR.

EASTER 3X3 |Image Source

This model can be easily scaled to increase performance and capacity. In the experiments shown in the paper, a 5×3 variant achieves state-of-the-art performance on HTR and OCR tasks.


OCR/HTR Capability with zero Training Data

In addition to a novel architecture, the EASTER paper also describes ways to synthetically generate training data for both machine-printed text and handwriting recognition tasks.

Using these methods (described in the paper), you can train an optical character recognition (OCR) system or a handwriting recognition (HTR) system of your own without any labeled data, because the configurable data generator shown in the paper prepares the synthetic labeled training dataset for you.

The following figure shows some synthetically generated samples from the paper. They look very realistic.

Synthetically Generated Samples | Image Source

Results

The paper reports strong results on the IAM offline line recognition task. The experiments on handwriting recognition show that the EASTER model works well even when training data is limited.

The handwriting recognition results of EASTER are compared with Google's paper 'A Scalable Handwritten Text Recognition System' (aka GRCL), in which the authors show good handwritten line recognition results with a limited training dataset. The EASTER model outperforms GRCL even with fewer training samples, as shown in the table below.

Handwriting Recognition results on IAM offline test-dataset | Image Source

EASTER also shows SOTA results on scene text recognition (machine-printed) tasks, without any augmentations and with greedy-search decoding (no language-model decoding).

Here is a screenshot of the model's results on handwritten as well as machine-printed tasks, taken from the paper:

EASTER model for handwriting recognition and machine printed OCR
EASTER model results on handwriting recognition | Image Source

Summary

In this article, we discussed a novel, fully convolutional (1-D convolutions only), end-to-end OCR/HTR pipeline that is simple, fast, efficient, and scalable.

In addition to the architecture, we learned how a 1-D convolutional filter works on an image to be recognized.

Finally, we discussed the synthetic data generation pipeline along with the recognition results shown in the original paper.

For more details, read the original paper, which has a detailed explanation of everything we have touched on in this article.

Thanks for reading! Hope this article was helpful for you. Kindly let me know your feedback through the comments. See you in the next article.


References

  1. EASTER paper: https://arxiv.org/pdf/2008.07839.pdf
  2. GRCL paper: https://arxiv.org/pdf/1904.09150.pdf
