Achieve Highly Scalable AI by Treating it Like it is Drunk
By Victor Thu
Blasphemy! How can artificial intelligence (AI) be drunk!? We deployed AI so it can help us scale faster, not be drunk on the job!
Interestingly, that’s what led to Petuum’s breakthrough in scaling machine learning and artificial intelligence.
Let’s say we set up a race between two people, one sober and one drunk. The race is to climb to the top of a mountain.
No surprise, the sober person finished the task very quickly. All he did was look to the top of the mountain and climb straight up.
The drunk person, however, took a more scenic route. He first looked up the mountain and started heading toward the top. Along the way, he wandered off in different directions. In moments of clarity, he remembered his original task, looked up to find the top of the mountain, and continued on his path. This repeated several times until he reached the top.
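The drunk climber behaves a lot like noisy, or "stochastic," optimization. Here is a minimal sketch of that idea (my own illustration, not Petuum's actual algorithm): each step follows the true downhill direction of a simple function plus a random wobble, and the walk still reaches the goal.

```python
import random

def drunken_descent(x0=10.0, steps=500, lr=0.05, noise=2.0, seed=0):
    """Noisy descent on f(x) = x**2; the minimum (the 'mountaintop') is x = 0."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        grad = 2 * x                         # true direction toward the goal
        wobble = rng.uniform(-noise, noise)  # the drunken wandering
        x -= lr * (grad + wobble)            # step: mostly right, partly random
    return x

final_x = drunken_descent()  # ends up close to 0 despite the wandering
```

Every individual step can be off course, yet the accumulated progress dominates the noise, which is the intuition behind tolerating imprecision early in training.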
Now, I’m sure you are thinking that’s a very dumb way to win a race. And from a traditional software and IT standpoint, you would be right!
In the traditional software and IT world, we need to make sure that each sequential step in a process is correct before moving on to the next. If there’s an error, it can propagate and be repeated over and over, potentially costing the business a significant amount of money!
Furthermore, traditional scaling follows a standard calculation: typically, the more hardware resources you throw at your application, the better it performs and scales.
Many people apply this outdated way of thinking to scaling artificial intelligence and machine learning (AI/ML) as well, believing that AI/ML applications follow the same paradigms as traditional software and IT. This assumption is what causes many of the problems in AI/ML scaling.
In a sequential world, when any particular machine becomes a bottleneck, it slows the entire machine learning (ML) process to a crawl. That is why, in many instances, it doesn’t matter how many more compute or GPU resources you throw at the problem: there’s no gain in training performance.
What our founders, led by Dr. Eric Xing, uncovered is that AI/ML doesn’t need to be precise early on. In fact, it is much more important to complete the training phase than to ensure accuracy in the early stages.
Put another way, the traditional model is akin to a mountain relay race with 100 racers, each relying on the previous racer to solve a math problem before the next can proceed; if a single racer is stuck on his or her problem, the entire team is delayed.
In the drunken-person version of the race, each of the 100 racers attempts to solve his or her part of the math problem without waiting for information from the others. It might seem slow and maddening at first, but no single racer is slowed down by another.
By distributing training across the entire server cluster without worrying about early results, AI/ML training is never bogged down by a single misbehaving machine. As a result, unlike other platforms, we never run into a scalability plateau.
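The straggler effect can be seen with a toy back-of-the-envelope simulation (illustrative only, not Petuum's implementation). Assume four workers with hypothetical per-update costs, one of them slow: under synchronous training every global step waits for the slowest worker, while under asynchronous training each worker contributes updates at its own pace.

```python
def updates_in(budget, step_costs, synchronous):
    """Count model updates completed within a time budget.

    step_costs: hypothetical time per update for each worker.
    """
    if synchronous:
        # One global step costs as much as the slowest worker's step,
        # and all workers commit their update together at that pace.
        step_time = max(step_costs)
        return len(step_costs) * int(budget // step_time)
    # Asynchronously, every worker pushes updates independently.
    return sum(int(budget // c) for c in step_costs)

step_costs = [1.0, 1.0, 1.0, 5.0]  # the 5.0-cost worker is the straggler
sync_updates = updates_in(100, step_costs, synchronous=True)    # 4 workers x 20 steps = 80
async_updates = updates_in(100, step_costs, synchronous=False)  # 100 + 100 + 100 + 20 = 320
```

In this toy setting the asynchronous cluster completes four times as many updates in the same time budget, because the three fast workers never wait for the straggler.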
The accuracy of the model will eventually converge. In fact, this approach not only outperforms previous methodologies, but also allows us to cluster much cheaper commodity hardware to do work that used to require expensive, high-end machines.
This important paradigm shift allows Petuum to redefine how AI/ML can be done in a much more effective and efficient manner. In this case, the drunk person concept is exactly the breakthrough that will accelerate the future of AI!