Background
Today's AI models are growing ever larger, driving demand for advanced hardware and clusters of computers to train them efficiently. HPC Pack can simplify that model training work for you.
PyTorch Distributed Data Parallel (aka DDP)
To run distributed model training, you need a distributed training framework, and the choice depends on the framework your model is built with. In this article, I'll show you how to proceed with PyTorch in HPC Pack.
PyTorch offers several methods for distributed training. Among these, Distributed Data Parallel (DDP) is widely preferred for its simplicity and the minimal code changes it requires on top of your existing single-machine training code, as the sketch below illustrates.
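Here is a minimal sketch (not the sample code used later in this article) of what that change typically looks like; the simple Linear layer stands in for your own model, and LOCAL_RANK is an environment variable set by the torchrun launcher used further below:

```python
# Minimal DDP sketch: wrap an existing model so gradients sync across processes.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # NCCL is the usual backend for NVIDIA GPUs
local_rank = int(os.environ["LOCAL_RANK"])       # set by the torchrun launcher
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)  # stand-in for your own model
ddp_model = DDP(model, device_ids=[local_rank])   # gradients are synchronized automatically
# ...train ddp_model exactly as you would train "model" on a single GPU...
dist.destroy_process_group()
```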
Set up an HPC Pack cluster for AI model training
You can set up an HPC Pack cluster using your local computers or Virtual Machines (VMs) on Azure. Just ensure that these computers are equipped with GPUs (in this article, we'll be using NVIDIA GPUs).
Typically, each GPU runs one process of a distributed training job. So, if you have two computers (aka nodes in a computer cluster), each equipped with four GPUs, you get 2 * 4 = 8 parallel processes for a single model training run. This configuration can potentially cut the training time to about 1/8 of single-process training, minus some overhead for synchronizing data between the processes.
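When the job is launched with torchrun (shown later in this article), each of those processes can discover its place in the job through standard environment variables. A small sketch of how one process sees the 2-node, 4-GPU layout:

```python
# Sketch: how one of the 8 processes sees the 2-node x 4-GPU layout.
# These environment variables are set by the torchrun launcher.
import os

world_size = int(os.environ["WORLD_SIZE"])   # total processes: 2 nodes * 4 GPUs = 8
rank       = int(os.environ["RANK"])         # 0..7, unique across the whole job
local_rank = int(os.environ["LOCAL_RANK"])   # 0..3, unique within a node; maps to one GPU

print(f"process {rank} of {world_size}, using GPU {local_rank} on this node")
```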
Create an HPC Pack cluster from an ARM template
For simplicity, you can deploy a new HPC Pack cluster on Azure from the ARM templates on GitHub.
Select the template "Single head node cluster for Linux workloads" and click "Deploy to Azure"

Refer to the Pre-Requisites for how to create and upload a certificate for use with HPC Pack.
Please note:
You should select a Compute Node Image marked with "HPC". This indicates that GPU drivers are pre-installed in the image. Otherwise you would have to install the GPU drivers manually on each compute node later, which can be challenging due to the complexity of GPU driver installation. More about HPC images can be found here.

You should select a Compute Node VM Size with GPUs, that is, an N-series VM size.

Install PyTorch on compute nodes
On each compute node, install PyTorch with the command
pip3 install torch torchvision torchaudio
Tip: you can leverage HPC Pack's "Run Command" feature to run a command across a set of cluster nodes in parallel.
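If you prefer a command line, HPC Pack's clusrun utility on the head node can do the same. A sketch, assuming your Linux compute nodes are in a node group named LinuxNodes (the group name is an assumption, adjust it to your cluster):

```
clusrun /nodegroup:LinuxNodes pip3 install torch torchvision torchaudio
```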
Set up a shared directory
Before you can run a training job, you need a shared directory that can be accessed by all the compute nodes. The directory is used for training code and data (both input data set and output trained model).
You can set up an SMB shared directory on the head node and then mount it on each compute node with CIFS, like this:
- On the head node, make a directory app under %CCP_DATA%\SpoolDir, which is already shared as CcpSpoolDir by HPC Pack by default.
- On each compute node, mount the app directory like
  sudo mkdir /app
  sudo mount -t cifs //<your head node name>/CcpSpoolDir/app /app -o vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777
  NOTE:
  - The password option can be omitted in an interactive shell; you will be prompted for it in that case.
  - dir_mode and file_mode are set to 0777 so that any Linux user can read and write the share. More restricted permissions are possible, but more complicated to configure.
- Optionally, make the mount permanent by adding a line to /etc/fstab like
  //<your head node name>/CcpSpoolDir/app /app cifs vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777 0 2
  Here the password is required.
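A quick optional check from a compute node that the share is mounted and writable (the test file name here is just an example):

```
df -h /app
touch /app/write_test && rm /app/write_test
```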
Run a training job
Suppose we now have two Linux compute nodes, each with four NVIDIA V100 GPUs, PyTorch installed on each node, and the shared directory "app" set up. Now we can start the training work.
Here I'm using a simple toy model built on PyTorch DDP. You can get the code on GitHub.
Download the following files into the shared directory %CCP_DATA%\SpoolDir\app on the head node:
- neural_network.py
- operations.py
- run_ddp.py
Then create a job with Node as the resource unit and two nodes for the job, like

And explicitly specify two nodes with GPUs, like

Then add job tasks, like

The command lines of the tasks are all the same, like
python3 -m torch.distributed.run --nnodes=<the number of compute nodes> --nproc_per_node=<the processes on each node> --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=<a node name>:29400 /app/run_ddp.py
- nnodes specifies the number of compute nodes for your training job.
- nproc_per_node specifies the number of processes on each compute node. It cannot exceed the number of GPUs on a node; that is, one GPU can run at most one process.
- rdzv_endpoint specifies the name and a port of a node that acts as the rendezvous. Any node in the training job can serve this role.
- "/app/run_ddp.py" is the path to your training code file. Remember that /app is the directory shared from the head node and mounted on each compute node.
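For the two four-GPU nodes in this example, and assuming one of the compute nodes is named node1 (a placeholder name), the concrete command would be:

```
python3 -m torch.distributed.run --nnodes=2 --nproc_per_node=4 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=node1:29400 /app/run_ddp.py
```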
Submit the job and wait for the result. You can view the running tasks, like

Note that the Results pane shows truncated output if it's too long.
That's all. I hope you get the idea and that HPC Pack can speed up your training work.