Boosting your Sequence Generation Performance with ‘Beam Search + Language model’ decoding

July 20, 2020

The when, why, and how of beam search and language model decoding

Deep learning has proved to be an ideal choice for image processing, audio analysis, and natural language processing (NLP) tasks, and has shown outstanding results. Neural network-based architectures are very good at understanding complex patterns as well as generating meaningful and realistic data. Although deep learning-based solutions are generally very effective, it is never a bad idea to use better post-processing (decoding) techniques, such as beam search decoding and language model decoding, to make predictions even more accurate.

Complex problems such as neural machine translation (NMT), image caption generation (ICG), and speech recognition (ASR) are very much solvable today using deep learning. These problems are categorized as sequence generation problems: given an input, the model learns to generate a text sequence. If you look at the research papers showing state-of-the-art (SOTA) results on these tasks, you will probably find their solutions using a beam search decoder fused with a language model to boost the results. Let's learn more about these decoding techniques with examples.

Going further, we will define an example sequence generation problem and explore post-processing (decoding) techniques. We will start with the greedy search decoding technique and then introduce beam search decoding fused with a language model to further improve the overall results.

Example Problem

Consider an English text generation problem, and suppose we have already trained a model for it. Depending on the nature of the problem or the solution strategy, the model might be a character-level model or a word-level model. A character-level text generator produces text by predicting one character at a time. Similarly, a word-level text generator predicts one word at a time, and several such predicted words form a sequence.

Assume we have trained a character-level model that generates text by predicting one character at a time. This setup applies to many problems, such as speech-to-text, optical character recognition, and image caption generation. Such problems are usually solved using an encoder-decoder architecture, as shown in the figure below. The encoder takes an input (audio, image, or text) and produces an encoded context vector. The decoder then uses that context vector to generate the output sequence by predicting one token (character/word) at a time.

[Figure: encoder-decoder (LSTM autoencoder) architecture. Source: https://stackoverflow.com/questions/45977990/tensorflow-how-to-embed-float-sequences-to-fixed-size-vectors]

Since our model is character-level, it will generate a probability distribution over all possible characters for each token in the output sequence. In other words, for each token (character) the model predicts, it outputs a probability array of length 26 (one entry per English letter, a to z), where each probability indicates how likely that character is to be the output token.

For a predicted sequence of length 10 (characters), our model's output would look something like this:

output_shape = 10 * 26

output = [[0.1, 0.01, 0.03, ... ... ... (len-26)],
          [0.1, 0.002, 0.6, ... ... ... (len-26)],
            ..          ..       ..         ..
          [0.9, 0.01, 0.01, ... ... ... (len-26)]]
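
Each row of this matrix is a probability distribution over the character vocabulary. As a minimal sketch, here is how one such row relates to the characters; the index_to_char mapping and the random probabilities below are assumptions made up purely for illustration, not part of any specific model:

import numpy as np

# hypothetical vocabulary: the 26 lowercase English letters
index_to_char = {i: chr(ord('a') + i) for i in range(26)}

# one prediction row of length 26 (random values normalized to sum to 1)
prediction = np.random.rand(26)
prediction = prediction / prediction.sum()

# the most likely character for this time step
best_index = int(np.argmax(prediction))
print(index_to_char[best_index], prediction[best_index])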

To convert this probabilistic output into a readable form (English text), we need a decoding algorithm. The simplest decoder is a greedy search decoder. Let's write a greedy search decoder first, and then a more complex one called a beam search decoder.

  • Note: A sequence generation problem doesn't necessarily involve text generation; it could produce any kind of sequence. We are using a text generation problem as the example because we are also going to talk about language modeling in this post.

Greedy search decoder

The simplest way to decode the model's predictions is to take the most likely token at every step. A greedy decoder, also called a best-path decoder, picks the token (character or word) with the highest probability in each prediction and concatenates all the predicted tokens to get the final output sequence. This is a simple and fast way to get the output sequence, but it may not always give you the best one.

Here is how you write a greedy decoder for your model in Python —

import numpy as np

def greedy_search_decoder(predictions):

    # select the token with the maximum probability at each time step
    output_sequence = [np.argmax(prediction) for prediction in predictions]

    # store the probability of each selected token
    token_probabilities = [np.max(prediction) for prediction in predictions]

    # multiply the individual token probabilities to get the overall sequence probability
    sequence_probability = np.prod(token_probabilities)

    return output_sequence, sequence_probability
    
model_prediction = [[0.1, 0.7, 0.1, 0.1],
                    [0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.1, 0.6, 0.2],
                    [0.1, 0.1, 0.1, 0.7],
                    [0.4, 0.3, 0.2, 0.1]]
                    
greedy_search_decoder(model_prediction)

[Out]: ([1, 0, 2, 3, 0], 0.08231999999999998)

Here the shape of the model's prediction is 5*4, which means the model is trying to generate a sequence of length five. Since there are only four different tokens in the vocabulary, the model predicts a probability distribution of size four for each token in the sequence. As per the definition, the greedy decoder builds its output by choosing the most probable token at each time step.

Beam search decoder

Beam search decoding is another popular way of decoding model predictions, and it leads to better results than greedy search decoding in almost all cases. Unlike the greedy decoder, it doesn't just consider the most probable token at each prediction; it keeps the top k tokens with the highest probabilities (where k is called the beam width or beam size). Although beam search gives better results, it makes the overall pipeline slower because it is computationally more expensive.

Thus it doesn't give you just one output sequence; it gives you k different output sequences along with their likelihoods (probabilities). Unlike the greedy decoder, where we had only a single output sequence, we have k different output sequences here, and there is a very good chance that one of these k sequences is the correct output.

algorithm

For the first time step, choose the k tokens with the highest probabilities instead of just one (as in greedy decoding). For each subsequent time step, append every token of the current prediction to each of the previously kept k sequences and compute the probability of each new sequence. Then keep only the top k of these new sequences based on their scores and move to the next time step. Repeat this process until the last token, and finally return the k sequences along with their corresponding probability values.

Tip: word-beam-search is another variant of the beam search decoding technique that restricts the output sequences to dictionary words only. It performs better than vanilla beam search in most cases.

Writing a beam search decoder in Python:

import numpy as np
import math

def beam_search_decoder(predictions, top_k = 3):
    #start with an empty sequence with zero score
    output_sequences = [([], 0)]
    
    #looping through all the predictions
    for token_probs in predictions:
        new_sequences = []
        
        #append new tokens to old sequences and re-score
        for old_seq, old_score in output_sequences:
            for char_index in range(len(token_probs)):
                new_seq = old_seq + [char_index]
                #considering log-likelihood for scoring
                new_score = old_score + math.log(token_probs[char_index])
                new_sequences.append((new_seq, new_score))
                
        #sort all new sequences in decreasing order of their score
        output_sequences = sorted(new_sequences, key = lambda val: val[1], reverse = True)
        
        #select top-k based on score 
        # *Note- best sequence is with the highest score
        output_sequences = output_sequences[:top_k]
        
    return output_sequences
    

model_prediction = [[0.1, 0.7, 0.1, 0.1],
                    [0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.1, 0.6, 0.2],
                    [0.1, 0.1, 0.1, 0.7],
                    [0.4, 0.3, 0.2, 0.1]]
                    
beam_search_decoder(model_prediction, top_k = 5)

[Out] : [([1, 0, 2, 3, 0], -2.497141187456343),
         ([1, 0, 2, 3, 1], -2.784823259908124),
         ([1, 0, 2, 3, 2], -3.1902883680162883),
         ([1, 0, 3, 3, 0], -3.595753476124453),
         ([1, 0, 2, 3, 3], -3.8834355485762337)]

why log-likelihood?

This is because regular probabilities cause problems for longer sequences. For example, consider a sequence of length 100 generated by the model, where the probability of each token is 0.10. The probability of the output sequence would then be the product of all these token probabilities:

Probability of a sequence with 100 tokens:
P(sequence) = p(t1) * p(t2) * p(t3) * ... * p(t100)
            = 0.1 * 0.1 * 0.1 * ... * 0.1
            = 10 ^ (-100)
# an extremely small number

As we can see, this is an extremely small number. Floating-point arithmetic may fail to represent or compare such tiny numbers, which can cause underflow. This is why we calculate log-probabilities instead of regular probabilities. Note that in the beam search code we are adding log-probabilities to compute the score (the new_score line), and since the logarithm of a probability (0 < p < 1.0) is always negative, the scores are negative; we therefore choose the top k sequences with the highest (least negative) log-scores.

The probability of a sequence with N tokens:

P(seq) = p(t1) * p(t2) * p(t3) * ... * p(tN)

Taking the log on both sides (the log-likelihood):

log(P(seq)) = log(p(t1) * p(t2) * p(t3) * ... * p(tN))
log(P(seq)) = log(p(t1)) + log(p(t2)) + ... + log(p(tN))

The logarithm of a number < 1.0 is always negative, so the log-likelihood of the sequence is negative.

Taking the log on both sides converts the multiplication into a summation, so computing the log-likelihood instead of the raw probability is faster and numerically more stable. Since the logarithm of a number < 1.0 is always negative, our sequence score will be negative. We can convert this score (log-likelihood) back to the original probability by exponentiating it. For the purpose of finding the k best sequences with beam search, we only need to compare the scores of candidate sequences, so the log-likelihood works just fine.
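
Here is a small sketch of the difference in practice; the numbers are made up purely for illustration:

import math

token_probs = [0.1] * 100    # 100 tokens, each with probability 0.1

# naive multiplication: 1e-100 is still representable, but slightly longer
# sequences (or smaller probabilities) push the product all the way to 0.0
product = 1.0
for p in token_probs:
    product *= p
print(product)               # ~1e-100

# log-space: the score stays a comfortable, easily comparable number
log_score = sum(math.log(p) for p in token_probs)
print(log_score)             # ~ -230.26

# the original probability can be recovered by exponentiating (when it doesn't underflow)
print(math.exp(log_score))   # ~1e-100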

++ Language model

Now we know that the beam search decoder gives us k different output sequences instead of one, and there is a good chance that one of these sequences is the correct output. But we don't yet have a way to identify which of these output sequences is the best. This is where a language model comes into the picture.

what is a language model?

A language model is a statistical representation of a given language. Given the previously occurring words of a sentence, it should be able to guess the next word. In other words, a language model is something that can determine the probability of a given text sequence (sentence) whose tokens (characters/words) are drawn from a fixed vocabulary. So, basically, it can score your sentence: a good score means that the sentence is contextually correct and belongs to the given language.

Language models are usually trained on a very large text corpus and learn the context (co-occurrence) of words in sentences. A language model does not necessarily have to be word-level; we can train a language model on characters as well. A character-level language model is supposed to guess the next character in a sequence given the previous few characters, and it can likewise give you the probability (or likelihood) of a given sentence.
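
As a toy illustration, even a character-level bigram model can assign a likelihood to a sentence. The tiny corpus and the add-one smoothing below are assumptions made for this sketch only; real language models are trained on far larger corpora:

import math
from collections import defaultdict

def train_char_bigram_lm(corpus):
    # count how often each character follows another
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        for prev, curr in zip(sentence, sentence[1:]):
            counts[prev][curr] += 1
    return counts

def score(counts, text, vocab_size=27):
    # log-likelihood of `text` under the bigram model (with add-one smoothing)
    log_prob = 0.0
    for prev, curr in zip(text, text[1:]):
        total = sum(counts[prev].values())
        log_prob += math.log((counts[prev][curr] + 1) / (total + vocab_size))
    return log_prob

corpus = ["speech recognition is a hard problem",
          "language models score text sequences"]
lm = train_char_bigram_lm(corpus)

print(score(lm, "speech recognition is a hard problem"))   # higher (less negative) score
print(score(lm, "speech recogniiion is a hald problem"))   # lower score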

language model to score text sequences

Consider a scenario where a deep learning model is trying to generate some text (for example, converting speech to text), and the resulting probability distribution over characters is sent to a beam search decoder with a beam width of 5. Assume the following are the top 5 sequences generated by beam search decoding:

Generated Sequences                      Probability (model)
1. speech recogniiion is a hald problem           0.41
2. speech recog niiion isa hard problem           0.39
3. speech recognition is a hard problem           0.37
4. spe ech recogniion is a hard problem           0.32
5. spe ech recogni tion is ahard problem          0.29

If we simply picked the sequence with the highest model probability, we would output the first sequence, since it has the highest probability (0.41) according to our model. But clearly (3) is the correct output, despite its slightly lower probability.

To find the best sequence among these 5, we use the language model to check the likelihood of each sequence under the English language. If our language model is well trained, it should give a high probability (score) to the third sequence, as it is the only one that is fully correct English.

Assume the following are the likelihoods of these 5 sequences according to our language model:

Generated Sequences                          P(model)  P(LM) 
1. speech recogniiion is a hald problem         0.41    0.20
2. speech recog niiion isa hard problem         0.39    0.40
3. speech recognition is a hard problem         0.37    0.70
4. spe ech recogniion is a hard problem         0.32    0.35
5. spe ech recogni tion is ahard problem        0.29    0.47

Now, to choose the best output sequence, we consider a weighted sum of both probabilities:

        P(final) = Alpha * P(model) + Beta * P(L.M.)

The values of the constants Alpha and Beta are decided by tuning them on a validation dataset: we choose the values that give the best results on the validation set and then use those values to evaluate the model on the test set.
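
As a minimal sketch, the rescoring step could look like the snippet below. Since the beam search decoder above returns log-scores, the weighted sum is taken over log-scores here; the Alpha/Beta values, the beam scores, and the language model scores are hypothetical numbers chosen only to illustrate the idea:

def rescore_beams(beams, lm_scores, alpha=0.7, beta=0.3):
    # beams     : list of (sequence, model_log_score), e.g. the beam_search_decoder output
    # lm_scores : list of language-model log-scores, one per beam
    rescored = [(seq, alpha * model_score + beta * lm_score)
                for (seq, model_score), lm_score in zip(beams, lm_scores)]
    # optionally divide each score by len(seq) to avoid favouring shorter sequences
    return max(rescored, key=lambda item: item[1])

# beams as produced by the beam_search_decoder above (model log-scores)
beams = [([1, 0, 2, 3, 0], -2.49), ([1, 0, 2, 3, 1], -2.78), ([1, 0, 2, 3, 2], -3.19)]

# hypothetical language-model log-scores, one per candidate
lm_scores = [-4.2, -1.5, -3.8]

print(rescore_beams(beams, lm_scores))   # the second beam wins after rescoring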

*Note: Sometimes the sequence length is also taken into account when calculating the final score of a sequence. If we don't normalize the scores by sequence length, the model will prefer shorter sequences, because the score of a longer sequence (a product of more probability values, each p < 1.0) will always be smaller.

Tip: Using a word-level language model is observed to give better results in most cases. To be able to score with a word-level language model, the output sequences must be restricted to dictionary words only. A word-beam-search algorithm is therefore used to produce the restricted beams, and those beams are then re-scored with the language model to decide the final output.

PS: In research papers, a very large beam size (~1000–2000) is often used when applying beam search. It delivers better results, but at the cost of speed, as inference becomes quite slow.

A few research papers that use beam search decoding and a language model to improve their results:

  1. Sequence to Sequence Learning with Neural Networks (https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)
  2. Listen, Attend and Spell (https://arxiv.org/pdf/1508.01211.pdf)
  3. Show and Tell (https://arxiv.org/pdf/1411.4555.pdf)
  4. SEQUENCER (https://arxiv.org/pdf/1901.01808.pdf)

Image Sources:

  1. (matrix image) https://www.shutterstock.com/image-vector/digital-binary-code-matrix-backgrounddata-flood-1341680024
  2. (LSTM autoencoder) https://stackoverflow.com/questions/45977990/tensorflow-how-to-embed-float-sequences-to-fixed-size-vectors

That'll be all!

Thanks for reading. Don't forget to share your thoughts with me.