Optimizing Elastic Deep Learning in GPU Clusters with AdaptDL for PyTorch
Authors: Aurick Qiao, Henry Guo, Qirong Ho — Petuum CASL Team
Reduce cost by 3x in the cloud and improve GPU usage in shared clusters for deep learning training (GitHub).
Deep learning models can be expensive and time-consuming to train. A language model such as BERT needs over 2000 GPU-hours to train, while computer vision models such as ResNet and VGG require hundreds of GPU-hours at the least. With today’s cloud compute costs, the training bill can run to four or five figures in US dollars!
Given the time and dollar costs, organizations pool computing resources into shared clusters, where multiple users can each submit multiple training jobs. This not only cuts down wasted spend from over-provisioning, but also lets users train large models within days or even hours rather than weeks on a single workstation, by leveraging idle resources.
But shared clusters present challenges of their own. To address them, the CASL team at Petuum created AdaptDL, which simplifies and accelerates distributed training on shared clusters.
AdaptDL monitors training job performance in real time, and elastically re-scales resources (GPUs, compute instances) while jobs are running. For each training job, AdaptDL automatically tunes the batch size, learning rate, and gradient accumulation. In the cloud (e.g. AWS), AdaptDL can auto-scale the number of provisioned Spot Instances. We’ve seen shared-cluster training jobs at Petuum and our partners complete 2–3x faster on average, at 3x lower cost in AWS using Spot Instances!
Shared Cluster Challenges and Drawbacks
AdaptDL was built to address some of the challenges and drawbacks that arise when resources are shared across a cluster. Typical challenges of shared-cluster usage include:
- Resource allocation pitfalls — Sharing a cluster requires careful planning of which job gets which GPUs. For example, it’s much faster to train a model using GPUs on the same machine instead of across different machines, and two different distributed jobs should not be assigned GPUs on the same machine because they will compete for network bandwidth.
- Uneven training speed and scalability — Choosing the right allocation of GPUs for a training job requires continuous monitoring of the job’s training speed and scalability, which changes over time. Larger batch sizes can be used when approaching convergence, so it’s best to use fewer GPUs at the start, and more GPUs towards the end.
- Training configurations — Some important training configurations are usually selected knowing which GPUs are available ahead of time, which is not always a reasonable assumption in shared clusters. For example, the batch size and, by extension, the learning rate are typically decided based on the number of GPUs. Similarly, gradient accumulation may be used to overcome network bottlenecks if it is known that the GPUs will be on different machines.
- Fairness and availability — During periods of high usage, some users may need to wait for other jobs to complete. However, user incentives are not always aligned with the “public good” of the cluster. One user might want to double the GPUs for their job to get an incremental improvement in training time, when those extra GPUs could be better allocated to a different user’s job.
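To make the training-configuration challenge concrete, here is a minimal sketch of how a fixed per-GPU batch size and a learning rate that scales linearly with the batch size tie the hyperparameters to the GPU allocation. The function name, the default values, and the linear scaling heuristic are illustrative assumptions, not part of AdaptDL.

```python
def derive_config(num_gpus, per_gpu_batch=32, base_lr=0.1, base_batch=256):
    """Derive the global batch size and learning rate from a GPU count.

    Assumes a fixed per-GPU batch size and the common linear scaling
    heuristic (learning rate proportional to the global batch size).
    All default values are hypothetical.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    global_batch = per_gpu_batch * num_gpus
    learning_rate = base_lr * global_batch / base_batch
    return global_batch, learning_rate


# The same script yields different hyperparameters for each allocation,
# so a configuration tuned for 8 GPUs is stale if the job receives only 2.
for gpus in (2, 8):
    batch, lr = derive_config(gpus)
    print(f"{gpus} GPUs -> global batch {batch}, learning rate {lr:.4f}")
```

In a shared cluster the GPU count is not known ahead of time and can change mid-run, which is why hard-coding either value is fragile.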
AdaptDL Solves Many of the Shared Cluster Challenges and Drawbacks
AdaptDL is a resource-adaptive deep learning (DL) training and scheduling framework that can help address many of the challenges and drawbacks identified above. It enables you to:
- More efficiently utilize shared GPU clusters — For each training job, AdaptDL learns how it performs with different allocations of GPUs. Using this learned knowledge, the AdaptDL scheduler allocates GPUs to multiple jobs in a way that is both fair and efficient. AdaptDL elastically re-allocates GPUs as new jobs start and old jobs finish, and as its knowledge of each job’s performance characteristics improves over time.
- Reduce the dollar cost of training in the cloud — AdaptDL can provision the right number of GPU instances in the cloud to avoid wasting money. When larger batch sizes can be used later in training, AdaptDL automatically expands the cluster. Spot instances in AWS can also be used to drastically reduce the cost of renting cloud resources, and AdaptDL will automatically handle spot instance evictions.
- Easily realize the benefits of large-batch training — Using large batch sizes can be a great way to accelerate training on many GPUs, but it can also be challenging to apply. Depending on the model, using a too-large batch size can result in a longer training time due to decreased statistical efficiency, while using a too-small batch size can fail to efficiently utilize the GPUs. AdaptDL helps to automatically select a batch size that is “just right”. This functionality can be leveraged in shared clusters, cloud environments, and on your own local machines.
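The trade-off behind a "just right" batch size can be sketched as maximizing the product of system throughput and statistical efficiency. The toy model below is an assumption made for illustration: the efficiency formula follows the general gradient-noise-scale form (efficiency falls as the batch size grows past the noise scale), the throughput model is a hypothetical fixed per-step overhead plus a per-sample cost, and none of the function names or constants come from AdaptDL.

```python
def statistical_efficiency(batch_size, noise_scale, base_batch_size):
    """Training progress per sample, relative to the base batch size.

    Equals 1.0 at the base batch size and decreases for larger batches.
    """
    return (noise_scale + base_batch_size) / (noise_scale + batch_size)


def throughput(batch_size, step_overhead_s, per_sample_s):
    """Samples per second under a hypothetical linear step-time model."""
    return batch_size / (step_overhead_s + per_sample_s * batch_size)


def best_batch_size(candidates, noise_scale, base_batch_size=64,
                    step_overhead_s=0.1, per_sample_s=0.001):
    """Pick the candidate that maximizes throughput times efficiency."""
    def goodput(batch_size):
        return (throughput(batch_size, step_overhead_s, per_sample_s)
                * statistical_efficiency(batch_size, noise_scale,
                                         base_batch_size))
    return max(candidates, key=goodput)


candidates = [64, 128, 256, 512, 1024, 2048]
# A larger noise scale (assumed here to occur later in training)
# shifts the best choice toward a larger batch size.
early = best_batch_size(candidates, noise_scale=1600)
late = best_batch_size(candidates, noise_scale=6400)
print("early:", early, "late:", late)
```

Under these assumed constants, neither the smallest nor the largest candidate wins: small batches waste time on per-step overhead, while very large ones lose statistical efficiency.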
If you are interested in further technical details, AdaptDL’s scheduling algorithm is based on our research paper which will appear at OSDI’21!
AdaptDL can be used in two modes.
- Cluster scheduling lets multiple jobs run on a Kubernetes cluster. Using the adaptdl Python library, your PyTorch training code will work together with the AdaptDL scheduler to automatically use the best number of GPUs and training batch size.
- Standalone training lets you train your model using adaptive batch sizes and learning rates in any cluster or on your local multi-GPU machine. AdaptDL automatically figures out when a larger batch size can be used to speed up your model training.
Training with the AdaptDL Python Library
The adaptdl Python library makes it easy to write PyTorch training code that is elastic with automatic adaptive batch sizes and learning rate scaling.
For instance, starting with the PyTorch MNIST example (link), you just need to modify a few lines of code, as follows.
AdaptDL provides a similar interface to native PyTorch distributed data-parallelism so that existing distributed training code can be easily modified.
First, replace uses of torch.utils.data.DataLoader with adaptdl.torch.AdaptiveDataLoader.
AdaptiveDataLoader automatically uses the best batch size during training according to your program’s throughput and statistical efficiency. It also saves its state when taking a checkpoint, so that training can resume from where it left off after restarting.
train_loader.autoscale_batch_size(1024) tells AdaptDL to automatically select the most efficient batch size for training, up to a maximum global batch size of 1024 summed across all training processes.
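Because the limit applies to the global batch size, each process only handles a share of it. The sketch below shows one way a global batch size could be divided into a per-process batch size and a number of gradient accumulation steps, given a per-GPU memory limit. The helper `split_global_batch` and its parameters are hypothetical; AdaptDL's actual selection logic also accounts for measured throughput and efficiency and may differ.

```python
import math


def split_global_batch(global_batch, num_replicas, max_local_batch):
    """Split a global batch into (local_batch, accum_steps) per process.

    local_batch * num_replicas * accum_steps never exceeds global_batch,
    and local_batch never exceeds max_local_batch (a stand-in for the
    largest batch that fits in GPU memory).
    """
    per_replica = global_batch // num_replicas
    if per_replica < 1 or max_local_batch < 1:
        raise ValueError("batch too small for this many replicas")
    # Accumulate gradients over several micro-batches when one
    # micro-batch cannot hold the whole per-replica share.
    accum_steps = math.ceil(per_replica / max_local_batch)
    local_batch = per_replica // accum_steps
    return local_batch, accum_steps


for replicas in (1, 4, 16):
    local, accum = split_global_batch(1024, replicas, max_local_batch=128)
    print(f"{replicas} processes: local batch {local}, {accum} accumulation steps")
```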
Next, wrap the model with adaptdl.torch.AdaptiveDataParallel. AdaptiveDataParallel measures the Gradient Noise Scale during training which is used to calculate statistical efficiency. When the batch size is changed, AdaptiveDataParallel will automatically adjust the learning rate according to a learning rate scaling rule. By default, AdaptiveDataParallel uses AdaScale, which provides good performance across a variety of tasks.
During checkpoints, AdaptiveDataParallel automatically saves the model parameters, optimizer state, and LR scheduler state, and restores them after restarting.
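The idea behind a learning rate scaling rule such as AdaScale can be sketched with its gain ratio, which interpolates between no scaling and linear scaling depending on how noisy the gradients are. This is a simplified scalar form written from the description in the AdaScale paper; the function names are assumptions, and AdaptDL's implementation estimates these quantities from running statistics across replicas and may differ in detail.

```python
def adascale_gain(grad_variance, grad_sqnorm, scale):
    """Gain ratio for a batch that is `scale` times the base batch size.

    grad_variance: gradient variance at the base batch size.
    grad_sqnorm:   squared norm of the true (expected) gradient.
    The result lies between 1 (gradient dominated) and `scale`
    (noise dominated). At least one of the two inputs must be positive.
    """
    if scale < 1:
        raise ValueError("scale must be at least 1")
    return (grad_variance + grad_sqnorm) / (grad_variance / scale + grad_sqnorm)


def scaled_learning_rate(base_lr, grad_variance, grad_sqnorm, scale):
    """Learning rate after applying the gain ratio to the base rate."""
    return base_lr * adascale_gain(grad_variance, grad_sqnorm, scale)


# Noise-free gradients gain nothing from a larger batch, while
# noise-dominated gradients recover linear scaling.
print(adascale_gain(0.0, 1.0, scale=8))
print(adascale_gain(1.0, 0.0, scale=8))
print(adascale_gain(1.0, 1.0, scale=4))
```

The contrast with a fixed linear rule is that the multiplier adapts as the measured gradient statistics change during training.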
With these changes, you can run your training code on your local machine or in a distributed cluster (see instructions)! AdaptDL selects the right batch size and learning rate for faster distributed training, and automatically performs gradient accumulation to overcome bottlenecks in the network. The following graph shows the training times for YOLOv3 on two machines each with two GPUs.
Without AdaptDL, choosing a batch size that is too small results in longer training times due to under-utilizing the GPUs. On the other hand, choosing a batch size that is too large also results in longer training times due to needing more epochs to converge. AdaptDL eliminates the need to choose a fixed batch size, and automatically achieves better training performance!
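Why gradient accumulation helps with network bottlenecks can be seen from a simple step-time model: gradients are synchronized only once per optimizer step, so accumulating over several micro-batches amortizes the synchronization cost. The timings below are hypothetical, and the model assumes that computation and communication do not overlap.

```python
def samples_per_second(local_batch, num_replicas, accum_steps,
                       compute_s, sync_s):
    """Throughput when gradients are synchronized once per optimizer step.

    compute_s: time for one forward/backward pass on a micro-batch.
    sync_s:    time for one gradient synchronization over the network.
    """
    samples = local_batch * num_replicas * accum_steps
    step_time = accum_steps * compute_s + sync_s
    return samples / step_time


# Hypothetical cluster where synchronization is slower than compute.
no_accum = samples_per_second(64, 4, accum_steps=1, compute_s=0.1, sync_s=0.3)
with_accum = samples_per_second(64, 4, accum_steps=4, compute_s=0.1, sync_s=0.3)
print(f"without accumulation: {no_accum:.0f} samples/s")
print(f"with 4 accumulation steps: {with_accum:.0f} samples/s")
```

Accumulation also enlarges the effective batch size, so the throughput gain has to be weighed against the statistical efficiency of that larger batch.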
Cluster Management with the AdaptDL Scheduler
The AdaptDL scheduler brings even greater benefits for training in shared clusters by automatically deciding how many GPUs your training job uses. Leveraging elasticity, your job will expand to use additional GPUs when the cluster congestion is low and shrink to use fewer GPUs (rather than being suspended) when the cluster congestion is high. The AdaptDL scheduler also offers additional functionalities such as defragmenting the cluster to avoid network contention between different jobs, and adjustable fairness between competing jobs.
AdaptDL can achieve very efficient utilization of shared clusters due to co-adaptation between the scheduler and each training job. When a job can efficiently use larger batch sizes, AdaptDL automatically diverts more GPUs towards that job to accelerate training. On the other hand, when only smaller batch sizes can be used, additional GPUs will be more efficiently allocated to other jobs.
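The intuition of diverting GPUs toward the jobs that can use them best can be illustrated with a simple greedy allocator. This is a deliberately simplified stand-in, not the AdaptDL scheduling algorithm: it ignores fairness weights, placement across machines, and restart costs, and the speedup curves are made up for the example.

```python
def allocate_gpus(speedup_fns, total_gpus):
    """Greedily assign GPUs one at a time to the job with the largest
    marginal speedup. Every job keeps at least one GPU.

    speedup_fns maps a job name to a function giving that job's speedup
    on n GPUs relative to a single GPU.
    """
    if len(speedup_fns) > total_gpus:
        raise ValueError("not enough GPUs for one per job")
    alloc = {job: 1 for job in speedup_fns}

    def marginal_gain(job):
        fn = speedup_fns[job]
        return fn(alloc[job] + 1) - fn(alloc[job])

    for _ in range(total_gpus - len(alloc)):
        best = max(alloc, key=marginal_gain)
        alloc[best] += 1
    return alloc


# Hypothetical speedup curves: one job scales well, the other poorly.
jobs = {
    "scales_well": lambda n: n ** 0.9,
    "scales_poorly": lambda n: n ** 0.3,
}
allocation = allocate_gpus(jobs, total_gpus=8)
print(allocation)
```

Because each job's speedup curve changes as its efficient batch size changes, rerunning an allocation like this over time is what makes the scheduling elastic.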
The AdaptDL scheduler can be installed on any Kubernetes instance with a single Helm command.
Once the AdaptDL scheduler is installed, you can submit training jobs using the adaptdl CLI (see instructions). Your job will start out using a single GPU, and restart a few times using different numbers of GPUs while AdaptDL figures out the best number of GPUs to use! No matter how many GPUs, AdaptDL will always use the most efficient batch size and scale the learning rate accordingly.
Thus by using AdaptDL, we have observed our PyTorch training jobs complete 2–3x faster in shared clusters! Additionally, the AdaptDL scheduler enables usage of AWS spot instances which can be 3x cheaper (see instructions here). Finally, AdaptDL can be leveraged with NNI to accelerate hyper-parameter tuning workloads (AdaptDL + NNI Post).
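As a back-of-the-envelope illustration of the cost claim, the arithmetic below uses the 2000 GPU-hour figure quoted earlier for BERT together with a hypothetical on-demand price of $3.00 per GPU-hour. The price is an assumption chosen for the example, not a quoted AWS rate.

```python
def training_cost(gpu_hours, price_per_gpu_hour, discount_factor=1.0):
    """Dollar cost of a job; discount_factor=3.0 means 3x cheaper."""
    return gpu_hours * price_per_gpu_hour / discount_factor


on_demand = training_cost(2000, 3.00)                    # hypothetical price
spot = training_cost(2000, 3.00, discount_factor=3.0)    # 3x cheaper
print(f"on-demand: ${on_demand:,.0f}, spot: ${spot:,.0f}")
```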
We’re improving AdaptDL — our top priorities include better monitoring, an easy-to-use GUI, and tighter integrations with other OSS projects including our CASL family. We’ll be announcing these features as soon as they’re ready at https://casl-project.ai.
If you are interested in AdaptDL, we are currently seeking users and contributors! Let us know your use cases and which features are important to you (by opening a new issue here). We are happy to collaborate with you to make them happen!
CASL provides a unified toolkit for composable, automatic, and scalable machine learning systems, including distributed training, resource-adaptive scheduling, hyperparameter tuning, and compositional model construction. CASL consists of many powerful open-source components that are built to work in unison or to be used individually for specific tasks, providing flexibility and ease of use.
Thanks for reading! Please visit the CASL website to stay up to date on CASL and AdaptDL announcements: https://www.casl-project.ai. If you’re interested in working professionally on CASL, visit our careers page at Petuum!