Key Evaluation Techniques for LLMs

September 15, 2024

Large Language Models, or LLMs for short, have attracted a lot of attention in the past few years due to their impressive capabilities. As a result, there is a growing demand to integrate LLMs into production applications. Before putting them into production, it is critical to design proper evaluation strategies to ensure they deliver a positive business impact. This article highlights some important evaluation techniques for LLMs.

Evaluating LLMs is a challenging task due to their complex nature. The way they generate outputs is quite unintuitive, and it is often very hard to tell whether the generated outputs are factually correct (even though they are grammatically correct most of the time). Because their output cannot be blindly trusted, we cannot simply drop them into our production applications or products.

So, it is really important to understand the key evaluation techniques for LLMs, and to design the evaluation strategy best suited to your goals and your unique business use case. Keeping this in mind, let's dive into the world of LLM evaluation.

This article covers the following key topics about LLM evaluation techniques:

  • A brief history of Large Language Models
  • Key evaluation techniques: Human Evaluation, Programmatic Evaluation and AutoRater Evaluation
  • Key metrics for LLM evaluation

Let’s get started.


1. History of Large Language Models

1.1 Language Models Before 2017

Before 2017, language models were relatively small and their capabilities were limited. It was common practice to develop a single model to solve a single, well-defined problem such as sentiment analysis, text classification, or named entity recognition (NER). ML practitioners focused on keeping models small and simple, and often applied a number of task-specific feature engineering techniques to get the best results on that particular task.

Evaluating such small language models was not very difficult, as the task they were trained to solve was well defined (or already known). Usually, a holdout test dataset was available to check whether the model produced the expected results. If it did not, practitioners could tweak their feature engineering, their model architecture, or both, until they got the best results on the holdout dataset.

Things changed significantly after 2017.


1.2 Language Models After 2017

In 2017, a research paper from Google titled "Attention Is All You Need" introduced the now-famous Transformer architecture, and things changed drastically (the figure below depicts the core Transformer architecture, as shown in the original paper). If you are interested in the low-level details of this architecture, the paper itself is really well written and worth reading.

Figure: The Transformer architecture, taken from the original paper "Attention Is All You Need".

The advent of Transformers led to the development of Large Language Models (LLMs). Today, we have models with billions of parameters capable of solving complex language tasks, and almost all of them use the Transformer architecture as their key building block.

Also, check out my introductory articles on GenAI:
1. A Gentle Introduction to Large Language Models
2. Beginner Friendly Introduction to GenAI and Its Applications

Today, we have Instruction Tuned (IT) Large Language Models. These models are very advanced generative LLMs with the following key capabilities:

  • Ability to accept very long inputs
  • Ability to solve a wide range of language tasks
  • Support for multiple languages
  • Multi-modality
  • Performance that matches or surpasses humans on some benchmarks

Now, let’s understand some key things about these models.


1.3 Latest LLMs are Mysterious

When it comes to integrating AI models into business applications or products, the key considerations usually are safety, security and performance. Because current LLMs are black-box models (due to their training and generation process), their behavior can sometimes be quite surprising. Additionally, these LLMs are extremely sensitive to prompting and other input changes.

Given these surprising behaviors, we must be really careful when deploying such models in production, as their true capabilities and performance are not known in advance. It therefore becomes extremely important to design and apply the right evaluation strategy for the problem you are solving.

The next section describes the key evaluation techniques for LLMs.


2. Key Evaluation Techniques for LLMs

LLM evaluation techniques are a set of metrics and benchmarks used to assess the performance of LLMs on a variety of tasks. Some key benefits of evaluating LLMs are as follows:

  • Quantifying how well the model performs on your target task before it reaches users
  • Catching safety, security and quality issues early
  • Comparing different models, prompts or configurations objectively
  • Monitoring for regressions as models and prompts change over time

LLM evaluation techniques can be broadly categorized into the following three categories:

  • Human Evaluation
  • Programmatic Evaluation
  • AutoRater Evaluation

Each of these evaluation techniques has its own pros and cons. Together, they form an overall LLM evaluation toolkit. In the next sections, we will learn about these methods in more detail.


3. Human Evaluation

In this technique, human evaluators judge the outputs of an LLM against a rubric or against the outputs of another LLM. This technique is very popular and can be applied to the evaluation of almost any task. However, human evaluation is slow, expensive and often requires skilled evaluators.

3.1 How does human evaluation work?

The idea is to generate a set of model outputs and show them to human raters for evaluation. Despite its drawbacks (it is slow and costly), this technique is still considered the gold standard for evaluating LLMs. It is especially useful for generative models, where programmatic evaluation may not be reliable for every task.

There are different platforms available for human evaluation, and quite a few third-party companies sell human evaluation as a service.

Human evaluation is not always perfect; it comes with its own challenges. Some of the key challenges with human evaluation are:

  • Subjectivity: different raters can disagree on the same output, so clear rubrics and rater training are needed
  • Cost and speed: collecting human judgments is expensive and slow compared to automated methods
  • Scale: it is difficult to keep up with large volumes of outputs or frequent model updates using humans alone
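Because rater disagreement is such a common issue, it is often worth measuring it directly. The following is a minimal sketch (with made-up rating data) that quantifies agreement between two raters using Cohen's kappa from scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two human raters for the same eight model outputs
# (1 = acceptable, 0 = not acceptable).
rater_a = [1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1]

# Cohen's kappa measures agreement beyond chance:
# 1.0 is perfect agreement, 0.0 is chance-level agreement.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Inter-rater agreement (Cohen's kappa): {kappa:.2f}")
```

A low kappa usually signals that the rubric needs to be tightened or the raters need more calibration, rather than a problem with the model itself.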

Let’s learn about the programmatic evaluation techniques now.


4. Programmatic Evaluation

In this technique, as the name suggests, the model output is evaluated using an automated script. Programmatic evaluation requires a golden test dataset with expected outputs. Because the evaluation is done by a script, it is quite fast, but curating the golden dataset can take a lot of time and effort.

Programmatic evaluation is a good choice when the structure of the output is well defined and the expected outcome is already known. These methods often use classic ML evaluation metrics (such as F1 score, precision and recall) to calculate model performance. Some example scenarios where programmatic evaluation can be applied are:

  • Classification tasks (for example, sentiment analysis) where each input has a known label
  • Extractive question answering where the expected answer string is known in advance
  • Structured output generation (for example, JSON with required fields) that can be validated automatically

You need to be really careful while curating the golden dataset: make sure the labels are absolutely correct and that there is no data leakage between the training and test sets. The test dataset should also cover as many different scenarios (types of inputs) as possible.
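As a minimal sketch of how a programmatic check might look (the `generate_answer` function and the golden examples below are hypothetical placeholders for your own model call and dataset), an exact-match evaluation over a golden dataset can be written as a simple script:

```python
# Minimal sketch of programmatic evaluation against a golden dataset.
def generate_answer(question: str) -> str:
    # Hypothetical placeholder: replace this with a real call to your LLM.
    return "Paris" if "France" in question else "4"

# Hypothetical golden dataset: each item pairs an input with its expected output.
golden_dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
]

def exact_match_accuracy(dataset) -> float:
    correct = 0
    for example in dataset:
        prediction = generate_answer(example["question"]).strip().lower()
        if prediction == example["expected"].strip().lower():
            correct += 1
    return correct / len(dataset)

print(f"Exact-match accuracy: {exact_match_accuracy(golden_dataset):.2%}")
```

Real pipelines usually add more forgiving comparisons (normalization, regex checks, numeric tolerance) on top of plain exact match.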

Next, let’s learn about the AutoRater technique.

If you are interested in learning more about generative learning and Generative Adversarial Networks, do check out my book:

https://www.amazon.com/GAN-Book-Generative-Adversarial-TensorFlow2-ebook/dp/B0CR8C725C


5. AutoRater Evaluation Tool

AutoRaters are automated systems or algorithms that assess the performance of models. (Note that these are different from programmatic evaluations, where the evaluation script is very simple and just compares the model results with known results.)

The AutoRater technique is frequently used for evaluating LLMs. An AutoRater can be an AI model itself, trained to mimic human raters on tasks such as translation, creativity analysis or safety analysis.

As AutoRaters are sometimes LLMs themselves, it is critical to make sure that the AutoRater faithfully reflects the human raters it is trying to mimic.

AutoRaters are a good option for evaluating the performance of an LLM in situations where we want to assess the model for qualities such as:

  • Translation quality
  • Creativity and overall writing quality
  • Safety and toxicity
  • How well the response follows the given instructions

In summary, while AutoRater evaluations offer efficient and scalable methods for assessing large language models, they are often complemented by human evaluations to ensure a comprehensive understanding of the model performance.
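As a hedged illustration of the idea (not any specific product's API), an LLM-as-judge AutoRater can be sketched as follows, where `call_judge_llm` is a hypothetical stand-in for a call to whatever judge model you actually use:

```python
# Sketch of an LLM-as-judge AutoRater. `call_judge_llm` is a hypothetical
# stand-in for a real API call to your judge model.
RUBRIC_PROMPT = """You are a strict evaluator. Rate the RESPONSE to the PROMPT
on a scale of 1-5 for helpfulness and safety. Reply with a single integer.

PROMPT: {prompt}
RESPONSE: {response}
Rating:"""

def call_judge_llm(judge_input: str) -> str:
    # Placeholder: replace with a real call to your judge model's API.
    return "4"

def autorate(prompt: str, response: str) -> int:
    judge_input = RUBRIC_PROMPT.format(prompt=prompt, response=response)
    raw_rating = call_judge_llm(judge_input)
    return int(raw_rating.strip())

score = autorate("Summarize the article.", "Here is a short summary ...")
print(f"AutoRater score: {score}")
```

In practice, the judge prompt, the rating scale and the parsing of the judge's reply all need to be validated against human ratings before the AutoRater's scores can be trusted.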

In the next section, we will learn about some frequently used metrics for LLM evaluation.


6. Key Metrics for LLM Evaluation

In this section, we will learn about some frequently used evaluation metrics for LLMs along with their task specific use cases.

6.1 Accuracy, AUC, F1, Precision, Recall

Metrics such as accuracy, F1-score, precision and recall are often used for evaluating classification tasks, where the model's top-ranked predictions are compared against already known labels. Typical examples are sentiment analysis, image classification and so on.
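For instance, assuming gold labels and model predictions have already been collected as lists (the label data below is made up), these metrics can be computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold labels and model predictions for a sentiment task.
y_true = ["positive", "negative", "positive", "neutral", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "negative"]

accuracy = accuracy_score(y_true, y_pred)
# Macro-averaging treats every class equally, regardless of its frequency.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}")
```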

6.2 Toxicity

Measures the toxicity of the generated text. The toxicity of an LLM is usually evaluated using another model (an AutoRater) that is specifically trained to identify toxic content. Common toxicity evaluation tools are Perspective API and Detoxify.
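For example, the open-source Detoxify library ships pretrained toxicity classifiers; a minimal sketch of scoring a piece of model output looks like this (the model weights are downloaded on first use):

```python
# Requires: pip install detoxify
from detoxify import Detoxify

# Load a pretrained toxicity classifier and score a piece of generated text.
scores = Detoxify("original").predict("You are a wonderful person.")

# `scores` is a dict of probabilities for categories such as toxicity,
# insult and threat; low values indicate benign text.
for label, value in scores.items():
    print(f"{label}: {value:.3f}")
```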

6.3 BLEU

Bilingual Evaluation Understudy, or BLEU, measures the n-gram overlap between a model's translation and one or more reference translations. BLEU can be computed per sentence, but the standard practice is to report a corpus-level score aggregated over the full evaluation dataset (a sample computation is shown below). Nowadays, BLEURT is often preferred over BLEU; we will learn about BLEURT next.
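As an example, corpus-level BLEU can be computed with the sacrebleu library (the translations below are made-up placeholders):

```python
# Requires: pip install sacrebleu
import sacrebleu

# Hypothetical model translations and one reference translation per sentence.
hypotheses = ["the cat sat on the mat", "there is a book on the table"]
references = [["the cat is sitting on the mat", "a book is on the table"]]

# corpus_bleu aggregates n-gram statistics over the whole evaluation set.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"Corpus BLEU: {bleu.score:.2f}")
```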

6.4 BLEURT

Measures whether the text generated by the model conveys the same meaning as the reference text. BLEURT scores are produced by a learned regression model trained on human ratings data. It is a good metric for evaluating language translations.

6.5 SQuAD

Stanford Question Answering Dataset, or SQuAD for short, is a reading comprehension dataset. It contains questions about a set of Wikipedia articles, where each answer is either a segment of text from the corresponding article or, in some cases, not present at all. SQuAD is a very popular benchmark for evaluating LLMs on question answering use cases where the answer lives in an already known knowledge base; model answers are typically scored with Exact Match (EM) and token-level F1.
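A simplified sketch of these two SQuAD-style metrics is shown below (the official evaluation script adds extra answer normalization, which is omitted here):

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    # 1.0 if the normalized strings match exactly, else 0.0.
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    # Token-level F1 between the predicted and gold answer strings.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                          # 1.0
print(f"{token_f1('the Eiffel Tower', 'Eiffel Tower'):.2f}")  # 0.80
```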

6.6 ROUGE

Recall-Oriented Understudy for Gisting Evaluation, or ROUGE, is useful for evaluating text generation tasks like text summarization. ROUGE has several variants, but the basic idea is to measure the overlap between the text generated by the model and the reference text.

Some popular variants of ROUGE are:

  • ROUGE-N: Checks the common n-grams between the model-generated text and the reference text. ROUGE-1 counts the number of shared unigrams, ROUGE-2 counts the number of shared bigrams, and so on. ROUGE-2 is a good metric for evaluating summarization.
  • SP-ROUGE-N: ROUGE-N applied to text that has been tokenized with the SentencePiece (SP) tokenizer.
  • ROUGE-L: Considers the Longest Common Subsequence (LCS) between the model output and the reference text when measuring similarity.

ROUGE is a popular metric for evaluating summarization and machine translation tasks.
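As an example, ROUGE-1, ROUGE-2 and ROUGE-L can be computed with the rouge-score package (the reference and candidate texts below are made up):

```python
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "the quick brown fox jumps over the lazy dog"
candidate = "a quick brown fox jumps over a lazy dog"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
# score(target, prediction) returns precision, recall and F1 for each variant.
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure:.2f}")
```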


In this article, we discussed a brief history of the evolution of LLMs and the importance of choosing the right evaluation strategy before putting them into production.

We learned about three key evaluation techniques for LLMs: Human Evaluation, Programmatic Evaluation and AutoRater Evaluation. Finally, we listed some key evaluation metrics for LLMs along with their main use cases.

I hope this article, Key Evaluation Techniques for LLMs, was helpful in understanding how to evaluate LLMs. Please let me know your thoughts by commenting below.

See you in the next article!


Read Next>>>

  1. Faster Training of Large Language Models with Parallelization
  2. A Gentle Introduction to Large Language Models
  3. Beginner Friendly Introduction to GenAI and Its Applications
  4. How Does a Generative Learning Model Work?
  5. Building Blocks of Deep Generative Models
  6. Generative Learning and its Differences from the Discriminative Learning
  7. Image Synthesis using Pixel CNN based Autoregressive Generative Models
  8. What are Autoregressive Generative Models?
  9. Best Practices for training stable GANs