3 Ways You Save By Optimizing ML Compute

Author

Petuum Team

Most large AI teams that we talked to are well on the way to optimizing their consumption of compute resources for ML. For smaller organizations, we’ve found that it is critical to think about the future of scaling. The scaling needs of ML come swiftly and bring with them the opportunity for massive productivity growth and savings, along with the danger of sudden cost overruns.

There are three ‘savings’ you make by optimizing ML compute: time, money, and the environment.

1 — Save Time

Saving time may be the single most important thing an AI Team can do to improve its chances of success with stakeholders. The average ML project takes anywhere from 4 to 6 months. A host of factors contribute to long lead times, three of which are most closely tied to compute optimization:

Training time

Some teams train models within 15 minutes, but chain dozens of steps together. Others run through gigabytes of data over the course of weeks of training. Problems compound when you have multiple users jockeying for spots on the GPUs. We think the scheduling and acceleration solutions for these are obvious, but the fact is that most teams don’t have the time to even consider the time wasted.

Idle time

We believe that AI team members are some of the most valuable people in your organization. Yet we hear again and again of ML Engineers and Data Scientists putting jobs into motion and then sitting idle. Most people move on to the next task while waiting for training or tuning to complete, but the switching costs add up to a less productive (and more stressful) day. With faster compute, ML Practitioners move through workflows faster. This compounds for a better experience for your team, and better outcomes for the organization.

Lost time

AI Teams sometimes fail to see ML as R&D rather than regular development, so the impact of ‘going back to the drawing board’ can be devastating for OKRs because it is counted as ‘lost time’. Moving expectations from standardized development goals to a research mentality may help in the short term, but it just shifts the pain to the business side.

We believe that resource efficiency can greatly reduce the time lost by AI Teams in experimentation. Faster trials mean faster iterations, which can mean the difference between success and failure of an entire ML project.

2 — Save Money

The burgeoning field of FinOps has a natural and strong intersection with the management of ML workloads, and for good reason. Training, tuning, and serving of ML models can be surprisingly expensive, even for simple projects. If not effectively managed, costs can easily exceed the budget for ML operations.

Compute instance cost

It is no surprise to anyone that the primary cost constraint for AI teams is the computational capacity consumed by ML workflows. For cloud-only businesses, this is the cost of instances on the major cloud providers, which tends to increase so rapidly that it can be a surprise to the budget owner.

For the lucky teams that have on-prem GPUs, the capitalized cost of the hardware can be hard to amortize without a strong demonstration of value from utilization. Inefficient utilization of hardware can make the purchase of expensive (and increasingly rare) GPU systems seem like a bad choice when quarterly budgets come around.
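To make the utilization point concrete, here is a minimal sketch of how the amortized cost of an on-prem GPU grows as utilization drops. The function name, the purchase price, and the three-year depreciation period are all hypothetical values chosen for illustration, not figures from any vendor or customer.

```python
def cost_per_utilized_gpu_hour(purchase_price, lifetime_years, utilization):
    """Amortized hardware cost per GPU-hour that actually does useful work.

    utilization is the fraction of wall-clock hours the GPU is busy (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    total_hours = lifetime_years * 365 * 24
    return purchase_price / (total_hours * utilization)


# Hypothetical figures for illustration only.
PRICE = 10_000.0  # assumed purchase price per GPU, in dollars
YEARS = 3         # assumed depreciation period, in years

for util in (0.9, 0.5, 0.2):
    cost = cost_per_utilized_gpu_hour(PRICE, YEARS, util)
    print(f"{util:.0%} utilized: ${cost:.2f} per useful GPU-hour")
```

The same hardware looks several times more expensive per useful hour when it sits idle most of the day, which is the number a budget owner tends to notice.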

Efficient resource scheduling

Apart from the actual cost per instance, poor choice of instance per job can be the cause of major expenses that otherwise go unnoticed. For example, we’ve heard from various teams that leverage GPUs for simpler experimental NLP tasks that could be completed on CPUs. The opportunity cost of this GPU capture (for example, the value of business-critical retraining jobs sitting on the sidelines) is harder to measure than the obvious instance cost for a job that could have been run on a cheaper system. Choosing the right cloud instance or on-prem node for the right job at the right time can save thousands in monthly budget for an AI Team.
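As a rough sketch of what ‘the right instance for the right job’ can mean in practice, the snippet below picks the instance with the lowest total cost (hourly price times estimated runtime) for each job. The instance names, prices, and runtime estimates are hypothetical, and a real scheduler would also weigh queue time and the opportunity cost described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    hourly_cost: float  # dollars per hour (hypothetical prices)


@dataclass(frozen=True)
class Job:
    name: str
    est_hours: dict  # instance name -> estimated runtime in hours


def cheapest_instance(job, instances):
    """Return the instance that minimizes hourly_cost * estimated runtime."""
    candidates = [i for i in instances if i.name in job.est_hours]
    if not candidates:
        raise ValueError(f"no runtime estimate for job {job.name!r}")
    return min(candidates, key=lambda i: i.hourly_cost * job.est_hours[i.name])


# Hypothetical catalog and jobs, for illustration only.
catalog = [Instance("cpu-node", 0.40), Instance("gpu-node", 3.00)]
small_nlp = Job("small-nlp-experiment", {"cpu-node": 2.0, "gpu-node": 0.5})
retraining = Job("critical-retraining", {"cpu-node": 40.0, "gpu-node": 2.0})

for job in (small_nlp, retraining):
    print(job.name, "->", cheapest_instance(job, catalog).name)
```

With these made-up numbers, the small experiment is cheaper on the CPU node even though it runs longer, while the retraining job is cheaper on the GPU node, which stays free for it.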

Surging and capping

Sometimes the most frustrating compute aspect of FinOps is the potential for unpredictability. We talked to one team that unintentionally surged monthly spend from $5K to $100K+ on an inferencing system due to an unexpected demand spike. Stories like these are common when the team has not designed MLOps with FinOps in mind. Infrastructure-oriented planning can help prevent whiplash in the IT teams and prevent unexpected errors.
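One simple way to picture capping is a check that compares month-to-date spend, and a straight-line projection of it, against a monthly budget. This is only a sketch under stated assumptions: the function names, the linear projection, and the three status labels are hypothetical, and real spend data would come from your cloud provider's billing export.

```python
def projected_monthly_spend(spend_to_date, day_of_month, days_in_month=30):
    """Straight-line projection of month-end spend from spend so far."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must be between 1 and days_in_month")
    return spend_to_date / day_of_month * days_in_month


def budget_status(spend_to_date, day_of_month, monthly_cap, days_in_month=30):
    """Return 'over_cap', 'warn', or 'ok' for the current month."""
    if spend_to_date >= monthly_cap:
        return "over_cap"  # cap already reached: throttle or escalate
    projected = projected_monthly_spend(spend_to_date, day_of_month, days_in_month)
    if projected > monthly_cap:
        return "warn"      # on track to exceed the cap before month end
    return "ok"


# Hypothetical $5K monthly budget, checked on day 10 of the month.
for spend in (1_500.0, 2_500.0, 20_000.0):
    print(f"${spend:,.0f} spent by day 10: {budget_status(spend, 10, 5_000.0)}")
```

A check like this, run daily, would flag a demand spike within a day or two rather than at the end of the billing cycle.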

3 — Save the Environment

It is estimated that the energy cost of training MegatronLM was around 20,000 kWh, twice the annual energy usage of the average American home. On a smaller scale, you can take the power draw of a single V100 (around 250 watts, or about 0.25 kWh for every hour it runs) and use it to estimate the carbon footprint of every job run by your team.
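The per-job estimate can be sketched in a few lines: energy is power times time, and emissions are energy times the carbon intensity of your electricity. The 250-watt figure comes from the text above; the grid intensity of 0.4 kg of CO2 per kWh is a placeholder assumption that you should replace with the value for your region or provider. The sketch also ignores CPU, memory, and cooling overhead, so treat the result as a lower bound.

```python
def job_energy_kwh(power_watts, hours, num_gpus=1):
    """Energy used by the GPUs alone, in kilowatt-hours."""
    return power_watts * hours * num_gpus / 1000.0


def job_emissions_kg(energy_kwh, kg_co2_per_kwh):
    """CO2 emissions in kilograms for a given energy use and grid intensity."""
    return energy_kwh * kg_co2_per_kwh


GPU_WATTS = 250.0     # approximate draw of one GPU, as quoted in the text
GRID_INTENSITY = 0.4  # kg CO2 per kWh: placeholder assumption, varies by region

# One GPU for one hour.
single = job_energy_kwh(GPU_WATTS, hours=1)
print(f"1 GPU x 1 h: {single:.2f} kWh")

# A hypothetical job: 8 GPUs for 24 hours.
energy = job_energy_kwh(GPU_WATTS, hours=24, num_gpus=8)
emissions = job_emissions_kg(energy, GRID_INTENSITY)
print(f"8 GPUs x 24 h: {energy:.1f} kWh, about {emissions:.1f} kg CO2")
```

Logging these two numbers next to each job's runtime is usually enough to see which workflows dominate a team's footprint.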

The carbon footprint of training, tuning, and experimenting on a single NLP pipeline is roughly 7x the emissions of an average human being over a single year. And the emissions involved in training a single transformer with neural architecture search can be over 600,000 pounds of CO2, roughly equivalent to that produced by five cars over their lifetimes. <https://arxiv.org/pdf/1906.02243.pdf>

Making the most of your compute resources in these ML tasks is not simply a matter of improved time to value and reduced cost — it is also a critical element of becoming a carbon neutral AI team.

Our Approach

At Petuum, we developed the AI Operating System with a focus on abstracting infrastructure management. The AI OS integrates efficient automated resource scheduling algorithms in every job that gets executed through the platform.

The open-source AdaptDL project, which made waves at OSDI with the best paper last year, and our forthcoming tuning and NLP-centric optimization systems are the cornerstones of the composable platform for managing ML projects. In our internal experiments and work with customers, we see between 2x and 10x resource savings in training, and up to 20x speedups in the tuning portions of project workflows.

Find out more at Petuum.com
