4 Ways to Scale MLOps Through Better Infrastructure Orchestration
This blog post is a continuation of our series 5 Reasons Why AI Fails to Scale. To see the last blog in this series, click here.
If you ask the leader of an AI team about their most critical areas of focus, they may mention stakeholder concerns, managing costs, or retaining superstar talent. But they will often also mention the trouble with infrastructure orchestration, which directly impacts all of those issues, and more.
In fact, when asked why an ML project had to be scrapped, AI team leaders most commonly answer 'lack of infrastructure.' Infrastructure is also the most commonly cited pain point slowing down the AI roadmap¹.
In our experience, the secret to unlocking the power of scale in AI workloads is at the infrastructure abstraction layer: the space where system admins with Kubernetes grievances butt heads with ML engineers with rapid and bespoke deployment needs, while data scientists are left on hold requesting resources or new software. We believe that if you get infrastructure orchestration right, there are a few critical enablers of scale that can be unlocked. Here are four of these scale-enablers that we’ve discovered at Petuum that result from elevating infrastructure orchestration:
1 — Productivity gains in resource intensive tasks
The most easily measured outcome of organizing your AI team around optimized infrastructure orchestration is in the value per unit cost of the infrastructure. ML workloads are generally more resource-intensive than other uses of infrastructure, and AI teams begin to appear as growing cost centers to CIOs concerned about budgets.
By effectively organizing data and AI around a Kubernetes architecture, AI teams can take advantage of efficient cluster-level resource management, automatic scaling, and all of the benefits of containerization. These efficiencies by themselves can yield much more effective utilization of resources, which are easily tracked and monitored through a single pane of glass system like Petuum AI OS.
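As a flavor of the cluster-level accounting this enables, here is a minimal sketch in plain Python. The pod data and function names are invented for illustration; this is not the Kubernetes or Petuum AI OS API, just the arithmetic a single-pane-of-glass view performs over resource requests.

```python
# Toy utilization accounting over hypothetical pod resource requests.
# Names and data are illustrative, not a real Kubernetes or Petuum API.
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    gpus_requested: int

def cluster_gpu_utilization(pods, total_gpus):
    """Fraction of the cluster's GPUs currently requested by scheduled pods."""
    requested = sum(p.gpus_requested for p in pods)
    return min(requested / total_gpus, 1.0)

pods = [Pod("train-bert", 4), Pod("tune-resnet", 2), Pod("notebook", 0)]
print(cluster_gpu_utilization(pods, total_gpus=8))  # 0.75
```

Tracking this number per team and per job class is often the first step in turning "AI is a cost center" conversations into concrete utilization targets.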
Resource schedulers like AdaptDL optimize resource productivity specifically for ML and deep learning workloads. For instance, a multi-node distributed training job can be completed in a tenth of the time while maintaining accuracy.
2 — Heavy collaboration and cross-domain expertise
The primary reason AI teams may not be using Kubernetes to its full potential is difficulty — effective Kubernetes orchestration requires talent and expertise. For most ML engineers, the opportunity cost of mastering Kubernetes is too high. Meanwhile, the admins, HPC controllers, and other engineers with the requisite mastery of containerization often lack the knowledge of ML systems and their unique intricacies needed to optimize them effectively.
Hiring or training an MLOps engineer is usually worth the investment, but even when the budget is justified, it can take a while to pay off. This is where an abstraction layer built for ML engineers can be helpful. With the Petuum AI OS, data scientists, software engineers, and others without an understanding of Docker are able to spin up new infrastructure graphs within minutes, complete with bespoke application deployments and integrations.
3 — Platform-based integration needs heavy customization
For teams that are able to get the right people with the right tools to begin infrastructure orchestration, the first difficulties with scale begin at the point of application and data integration. A simple system involving external data sources, a knowledge graph, web APIs, and an ML experimentation system with repositories and tuning and training apparatus, can take a great deal of effort to set up, let alone optimize.
When the project begins to grow and add new software and access points, the effort in integration goes up rapidly. Even with Kubernetes workflow orchestration management systems like Kubeflow, the bespoke integration needs required by ML projects can be frustratingly hard.
Some tools can improve productivity in these integrations, and cloud platforms in particular tend to offer smoother integration experiences across their AI-focused suites. However, production-level deployment of customer-facing AI applications tends to require capabilities that are rarely on par with what closed platforms provide.
We’ve found that with an extensible and composable platform for managing containerized applications and the flows between them, integration difficulty can be mostly eliminated for engineers. Instead of rewriting the integration code for AI applications, engineers can simply adjust a configuration file representing a graph of the entire infrastructure. The Petuum AI OS takes this “infrastructure-as-graph” approach to trivialize interoperability across software. Attaching a new component to the system, for example a TensorBoard for monitoring, is as simple as loading a default-configured application from a library and connecting it to a mounted file volume.
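The idea can be sketched with a small hypothetical configuration schema (invented here for illustration, not the actual Petuum AI OS format): the infrastructure is a graph of named nodes plus edges, and attaching a monitoring component is just adding a node and one connection.

```python
# Hypothetical "infrastructure-as-graph" config; the schema is invented
# for illustration and is not the actual Petuum AI OS format.
infra = {
    "nodes": {
        "trainer":  {"image": "pytorch/pytorch:latest"},
        "logs-vol": {"kind": "volume", "path": "/mnt/logs"},
    },
    "edges": [("trainer", "logs-vol")],
}

def attach(graph, name, spec, connect_to):
    """Add a component to the graph and wire it to an existing node."""
    assert connect_to in graph["nodes"], f"unknown node: {connect_to}"
    graph["nodes"][name] = spec
    graph["edges"].append((name, connect_to))
    return graph

# Attach a TensorBoard that reads from the mounted log volume.
attach(infra, "tensorboard",
       {"image": "tensorflow/tensorflow:latest", "port": 6006},
       connect_to="logs-vol")
print(sorted(infra["nodes"]))  # ['logs-vol', 'tensorboard', 'trainer']
```

Because the change is a pure config edit, it can be reviewed, versioned, and rolled back like any other code, with no bespoke glue to rewrite.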
4 — Modular and reusable systems for continuous improvement
As an AI project continues to scale, the need to revisit the infrastructure layer grows, as does the effort required by an MLOps Engineer to make sure that operations continue to run smoothly. Existing AI applications need to be improved for automation, monitoring, or more human-in-the-loop processes. Even a well-built system needs to be upgraded every year or two, simply because of the rate of innovation in the AI space. Without having a highly modular platform for infrastructure abstraction, engineers need to spend significant effort in rewriting and testing for system upgrades, which result in downtimes.
Without modularity in the application infrastructure, there is also limited ability to reuse the work done for a successful project. Every new iteration of a system, such as a customization for a new type of customer, comes with overhead.
Using a platform built on the principles of composability means that the orchestrated systems used by teams can be copied over as entire applications, tweaked with new flows and different applications, and deployed with almost no overhead. With the Petuum Platform, these infrastructure graphs are not limited by the Kubernetes environment, and can be deployed anywhere. For instance, we’ve written a complete infrastructure for an NLP system on a local machine and sent it to a customer to deploy without needing to make any adjustments for environment.
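The reuse pattern can be sketched in the same spirit, again with an invented graph schema rather than the real Petuum format: clone the whole system's config and swap only the component that differs for the new customer.

```python
# Reusing an orchestrated system by deep-copying its (hypothetical)
# graph config and swapping one component; schema invented for illustration.
import copy

base = {
    "nodes": {"api": {"image": "nlp-serve:1.0"}, "model": {"name": "sentiment"}},
    "edges": [("api", "model")],
}

# Clone the entire system for a new customer, then swap only the model.
customer = copy.deepcopy(base)
customer["nodes"]["model"] = {"name": "ner"}

print(base["nodes"]["model"]["name"], customer["nodes"]["model"]["name"])
# -> sentiment ner
```

A deep copy matters here: a shallow copy would share the nested node dicts, so editing the customer's model would silently mutate the original system as well.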
To learn more about how the Petuum Platform can help your organization better address your infrastructure orchestration needs through our Kubernetes abstraction, visit our website at www.petuum.com and apply for our private beta!
¹ CometML 2021 ML Practitioner Survey