Faster Training of Large Language Models with Parallelization

December 4, 2024


Introduction

The story began in 2017, when researchers from Google introduced the famous Transformer architecture in their paper “Attention Is All You Need”. Transformers turned out to be highly effective for tasks such as language translation, question answering, and other sequence-to-sequence problems. The key features that make the Transformer architecture unique and effective are positional encoding and self-attention.

Large Language Models, or LLMs, also implement a Transformer architecture. These models are usually very large and need a huge amount of text data to train. Their size plays a big role in their success, but it also makes them difficult to train.

LLMs typically have an enormous number of trainable parameters, sometimes reaching hundreds of billions. This makes it challenging to fit them onto a single accelerator chip for training.

In this article, you will explore the techniques that help speed up these models through effective parallelization. Following are the key topics covered in this article —

  1. What is Parallelism?
  2. Speeding Up LLM Training with Parallelization
  3. Which parallelism is suitable for you?
  4. References

Let’s get started.


What is Parallelism?

In computer programming, a large task can be broken down into smaller tasks that can run at the same time, speeding up the overall process. This approach is called parallel programming.

Similarly, in Deep Learning (DL), you can speed up the training of large models by splitting the work across multiple machines or accelerators like GPUs and TPUs. Parallelism in DL is mainly achieved through the following two key methods:

  1. Data Parallelism
  2. Model Parallelism

Let’s explore these concepts further.

Data Parallelism

In this approach, the training data is split into smaller parts and distributed across multiple machines or processors. Each machine trains an identical copy of the deep learning model using its portion of the data. The parameters learned by all machines are regularly combined to update the final model.

Model Parallelism

Here, the model itself is divided into smaller sections, with each part assigned to a different machine or processor. Each machine trains only its specific part of the model. The outputs from all parts are then combined to form the complete, final model.

Which technique should I use for training LLMs?

→ Since LLMs are very large and do not always fit on a single accelerator chip, model parallelism can help!

→ Data parallelism and model parallelism can also be combined to make training even faster.

Don’t Overlook the Challenge of “Combining Results”

When training a model using any parallelism technique, you’ll need to merge the learnings (or parameters) from different machines or processors to create the final trained model. This merging process often involves summing up or averaging the results from each machine. To do this, you must establish a way for machines to share information, which can slow down the process and become a bottleneck in training. The frequency of this communication depends on how the parallelism is implemented. Therefore, it’s crucial to account for this communication overhead when designing your parallel training strategy.
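To make the communication step concrete, here is a minimal PyTorch sketch of gradient averaging with an all-reduce. It assumes a distributed process group has already been initialized (for example via torchrun) and that each worker has just called loss.backward(); the function name average_gradients is purely an illustrative choice:

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average every parameter's gradient across all workers (one all-reduce per tensor)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient across all workers, then divide by the worker
            # count to get the average that every replica will apply.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```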

Now that you have the basic idea of parallelizing DL models, let's jump into some technical aspects of parallelizing the training of Large Language Models (LLMs).


Speeding Up LLM Training with Parallelization

Training large language models (LLMs) can take an extremely long time because of their huge size and the vast amount of data they require for pre-training. If done on a single processor, it could take weeks or even months, which is highly impractical. To save time, it’s essential to use parallelization techniques to speed up the training process.

Here are some effective methods to train LLMs more efficiently:

  • Using Data Parallelism (DP)
  • Using Pipeline Parallelism (PP)
  • Applying Tensor Parallelism (TP)
  • Using Zero Redundancy Optimizer (ZeRO)

Let’s understand these techniques one-by-one with examples.


Using Data Parallelism (DP)

In this approach, the training data is divided and distributed among multiple processors or machines. Each processor runs its own copy of the same model in parallel. After completing a training step, all processors share and synchronize their results to ensure the learning is updated collectively. This method works best when the entire model can fit into the memory of a single processor.

Take a look at Figure 1. There are three processors, each with a GPU accelerator and a copy of the same model. The training dataset is also divided into three parts. Each processor takes one batch from its partition of the dataset and trains its own model copy for one step, while the other two processors do the same in parallel.

After completing their individual training step, all processors send their gradients to a central parameter server. The parameter server calculates the average gradient, updates the final model weights, and shares the updated weights with all processors. This completes one training step.

Figure 1: Data parallelism technique for training deep learning models | Image by Author
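As a rough sketch of data parallelism in code, the snippet below uses PyTorch's DistributedDataParallel (DDP) with a toy model and dataset, both of which are placeholders. Note that DDP averages gradients with an all-reduce instead of the central parameter server shown in Figure 1, but the core idea of identical replicas trained on different data shards is the same:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Launch with: torchrun --nproc_per_node=3 ddp_example.py
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank}")

    # Toy dataset and model, purely illustrative.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)            # each process sees its own shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = DDP(torch.nn.Linear(32, 1).to(device), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for x, y in loader:
        x, y = x.to(device), y.to(device)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()                              # DDP averages gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```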

Now let's understand pipeline parallelism.


Using Pipeline Parallelism

Pipeline parallelism (PP) is essentially a type of model parallelism, where we distribute the layers of the neural network across processing devices (such as GPUs or TPUs). For example, a large model consisting of 100 layers can be distributed across 4 GPUs where each GPU handles about 25 consecutive layers. See the following illustration (Figure 2).

Figure 2: Pipeline Parallelism | A 100 layer neural network on 4 GPUs | Image by Author

As per Figure 2, the data moves from layer-1 to layer-25 on GPU-0. However, when it reaches layer-25 and needs to move to layer-26, it travels from GPU-0 to GPU-1 through a communication channel. If both GPUs are on the same machine, the data transfer is fast. But if GPU-0 and GPU-1 are on different machines, the data transfer cost can increase significantly. In such cases, the communication channel can become a bottleneck in pipeline parallelism.
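The layer split of Figure 2 can be sketched in PyTorch as shown below. This is a simplified, single-process illustration using a hypothetical 100-layer MLP rather than a real Transformer, and it shows only the naive layer placement; true pipeline parallelism additionally splits each batch into micro-batches (as in GPipe-style schedulers) so that all GPUs stay busy:

```python
import torch
import torch.nn as nn

class ShardedMLP(nn.Module):
    """A 100-layer toy network whose layers are spread over 4 GPUs (as in Figure 2)."""

    def __init__(self, hidden: int = 512, layers: int = 100, gpus: int = 4):
        super().__init__()
        per_gpu = layers // gpus                       # 25 consecutive layers per GPU
        self.stages = nn.ModuleList()
        for g in range(gpus):
            stage = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(per_gpu)])
            self.stages.append(stage.to(f"cuda:{g}"))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for g, stage in enumerate(self.stages):
            x = stage(x.to(f"cuda:{g}"))               # activations hop from GPU to GPU here
        return x

model = ShardedMLP()
output = model(torch.randn(8, 512))                    # final output lives on cuda:3
```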

Now, let’s explore Tensor parallelism.


Applying Tensor Parallelism

Tensor parallelism (TP) can also be used to fit a large DL model across multiple GPUs. In this method, each GPU handles only a part of the tensor. Essentially, the large matrix calculations are divided among the GPUs, and the results are then combined to form the final tensor. The following paragraph explains TP with an example.

Take a look at Figure 3 and assume that A is the input tensor and B is the model weights tensor. The matrix multiplication can be broken down as shown in the figure. Specifically, the weights tensor B can be divided into two parts, B1 and B2, which can be stored on separate processors.

Each processor will then perform matrix multiplication on their slice of the tensor and calculate the results at the same time. Basically, processor 1 will give C1, and processor 2 will give C2. These results, C1 and C2, can be transferred from the processors and combined to form the final output tensor C when needed.

Figure 3: Tensor Parallelism — splitting matrix multiplications using Tensor slices | Image by Author
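A tiny numerical sketch of this column-wise split is shown below. It runs on a single device purely for illustration; in a real tensor-parallel setup, B1 and B2 would live on different GPUs and the final concatenation would be a cross-device gather:

```python
import torch

A = torch.randn(4, 8)             # input activations (batch x hidden)
B = torch.randn(8, 6)             # weight matrix to be sharded column-wise

B1, B2 = B[:, :3], B[:, 3:]       # slice held by "processor 1" / "processor 2"

C1 = A @ B1                       # each processor multiplies A by its own slice
C2 = A @ B2

C = torch.cat([C1, C2], dim=1)    # gather step: combine the partial outputs
assert torch.allclose(C, A @ B)   # identical to the unsharded multiplication
```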

This technique allows you to spread your large model across multiple GPUs. Now, let’s move on to the next approach.


Using Zero Redundancy Optimizer (ZeRO)

Data parallelism and model parallelism have some known limitations:

  • Data parallelism doesn’t reduce the memory footprint of the model. Large DL models with billions of parameters are difficult to fit into a single GPU with limited memory.
  • Model parallelism doesn’t scale well beyond a single node because the communication becomes an overhead.

ZeRO was developed by a Microsoft Research team to overcome the limitations of data parallelism and model parallelism. Unlike vanilla data-parallel training, ZeRO doesn’t keep exact replicas of all model states on every processor. Instead, it optimizes the data parallelism approach by partitioning the model states (parameters, gradients, and optimizer states) across the data-parallel processors. Because ZeRO reduces per-device memory usage, it allows you to fit very large models across multiple data-parallel processors, each with limited memory.

Following are the three cumulative stages in which ZeRO is commonly implemented:

  1. Optimizer State Partitioning (OS): reduces memory consumption by 4x and doesn’t add any extra communication overhead to the existing data-parallel setup.
  2. OS + Gradient Partitioning (GP): combining OS with GP reduces the memory requirements by 8x. It also adds no communication overhead over the previous setup.
  3. OS + GP + Parameter Partitioning: this setup can achieve a memory reduction linear in the degree of data parallelism. For example, if your data-parallel setup uses 32 GPUs, you get a 32x reduction in the memory of each processor. This setup, however, increases the communication volume by about 50%.

The memory consumption of the three approaches described above is illustrated in Figure 4, which is taken from the original research paper, “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, published in 2020 by the Microsoft Research team.

Figure 4: ZeRO memory consumption illustration as per the original research paper.

BTW — What is DeepSpeed?

DeepSpeed is a deep learning optimization library developed by Microsoft that makes training large deep learning models faster and more efficient. It lets you train models up to 10x larger, up to 10x faster, with minimal code changes.
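As a hedged sketch, the snippet below shows how ZeRO might be enabled through a DeepSpeed configuration. The model, batch size, and optimizer settings are placeholders, so check the DeepSpeed documentation for the exact options that apply to your setup:

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)    # stand-in for your (much larger) LLM

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,                    # 1: OS, 2: OS + GP, 3: OS + GP + parameter partitioning
    },
}

# Launch with the deepspeed launcher (e.g. `deepspeed train.py`) so that the
# distributed workers are set up for you.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```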


Which parallelism is suitable for you?

The best parallelism strategy for your use case depends on various factors, such as model size, the number of available machines, and the number of accelerators (or GPUs) per machine. Let’s look at a few scenarios to understand the possibilities.

Single Machine — Single GPU

  • Model fits in the GPU: In this case, you can just go ahead and train your model normally.
  • Model doesn’t fit in the GPU: You can use ZeRO + CPU offloading.

Single Machine — Multiple GPUs

  • Model fits in a single GPU: You can use data parallelism. Optionally, you can augment it with ZeRO, though whether it provides a speedup depends on your setup and configuration.
  • Model doesn’t fit in a single GPU: You have multiple options: pipeline parallelism (PP), ZeRO-augmented data parallelism, and tensor parallelism (TP). If the largest layer of the model doesn’t fit in a single GPU, you can combine PP with TP or use ZeRO.

Multiple Machines — Multiple GPUs

  • If inter-node connectivity is fast: ZeRO is the easiest option to implement, as it doesn’t require many code changes. Alternatively, you can implement PP+DP+TP, but this setup requires significant changes to the model.
  • If inter-node connectivity is slow: If you are low on GPU memory and have to use a multi-node setup, you can combine DP+PP+TP+ZeRO all together.

Thanks for reading this article. I hope it was helpful. Please do share your feedback. See you in the next article!!

Let’s connect over LinkedIn.


References

  1. Vaswani, A., et al. (2017). “Attention Is All You Need.”
  2. Rajbhandari, S., et al. (2020). “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.”
