Perpetuum Mobile: AI Infrastructure

Author

Petuum Team

In 2010, Dr. Eric Xing took a sabbatical from his tenured position at Carnegie Mellon University teaching computer science to be a visiting professor at Facebook. This chance to see his world from another angle would lead him to reevaluate his work on the development of artificial intelligence.

Eric got his first look at the way industry giants use artificial intelligence (AI) and machine learning (ML) technology, both internally and in their products. Though the large-scale adoption of AI was ambitious and impressive, he saw a problem — implementation was hindered by the pre-industrial nature of the technology.

Eric saw a need for the industrialization of AI and ML technology. Instead of each company developing niche AI solutions internally, there could be a standardized, customizable infrastructure of pick-and-choose AI components. This would allow the technology to be mass-produced and marketed, dramatically lowering the cost of adoption. Where only tech industry giants had enjoyed access to AI and the scarce talent needed to build it, companies of any size and in any industry could now benefit from it.

To create an AI infrastructure that could be deployed at a company-wide, even industry-wide, level, Eric needed to scale up from programs run on individual machines by several orders of magnitude. He needed to hide the complexity of the software and make it accessible to the average programmer. And he needed to get the machines to work in harmony, the way all the instruments in a symphony orchestra work together to produce a unified piece of music.

One way to do this is with parallel computing, where a program is run on many machines at once instead of on a single expensive, specialized workstation. However, the traditional approach to parallel computing became inefficient at such large scale and produced discordant work, since the computers spent more time coordinating with each other than building a solution. Eric believed that the machines needed a conductor and a musical score, something that told them when and how to harmonize.

The solution wouldn’t be found at a corporation, where the focus is less on research and development and more on one-off AI production. So Eric took this problem with him back to academia, where he and his CMU students and colleagues attacked the task of scaling up AI and ML technology for enterprise by isolating the problem and experimenting with new parallelizing and scheduling theories. The team sought a way to synchronize machines en masse — should they exchange information periodically, synchronously, or asynchronously? None of the existing approaches were sufficient, so the team continued to innovate programmatically and conceptually.

After more than a year of study and trial, Eric’s team made an important conceptual leap: AI programs are not the same as conventional deterministic programs such as database programs, so they shouldn’t be structured like them. Where conventional programs are transactional, executing a fixed series of steps in order, AI programs should solve mathematical problems as people do, by iterating and learning and, when necessary, absorbing errors and taking unexpected turns to get to the answer. This departure from classical computing theory led Eric and his team to create a new approach to synchronizing machines: the stale synchronous parallel (SSP) parameter server.

The SSP parameter server would allow machines to harmonize through a bounded asynchronous technique. After the memory of multiple machines was virtualized and united on a shared server, the window of synchrony would be bounded, limiting how far any worker machine could drift from the others between exchanges with the master “conductor” machine. Compared with fully synchronous exchanges between parallelized machines, this allowed a dramatic improvement in processing speed; compared with unbounded asynchronous exchanges, it avoided the mistakes that arise when machines work from badly outdated information.
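The bounded-staleness idea can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not Petuum's implementation: the names `SSPClock`, `can_start`, `finish_iteration`, and `simulate` are hypothetical, time is modeled as discrete ticks, and no real parameters or network exchanges are involved. It shows only the rule that a worker may start a new iteration while it is no more than a fixed number of iterations ahead of the slowest worker.

```python
class SSPClock:
    """Toy bookkeeping for a bounded-staleness rule (hypothetical names)."""

    def __init__(self, num_workers, staleness):
        if num_workers < 1 or staleness < 0:
            raise ValueError("need at least one worker and staleness >= 0")
        self.staleness = staleness
        # clocks[w] counts the iterations worker w has completed so far.
        self.clocks = [0] * num_workers

    def can_start(self, worker):
        # A worker may begin another iteration only while it is at most
        # `staleness` completed iterations ahead of the slowest worker.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def finish_iteration(self, worker):
        self.clocks[worker] += 1


def simulate(periods, staleness, ticks):
    """Run a toy schedule in which worker w is ready every periods[w] ticks.

    Returns (final clocks, number of times a ready worker had to wait).
    """
    ssp = SSPClock(len(periods), staleness)
    waits = 0
    for t in range(ticks):
        ready = [w for w, p in enumerate(periods) if t % p == 0]
        # Permission is decided from the clocks as they stand at the
        # start of the tick, before any worker finishes its iteration.
        allowed = [w for w in ready if ssp.can_start(w)]
        waits += len(ready) - len(allowed)
        for w in allowed:
            ssp.finish_iteration(w)
    return ssp.clocks, waits


if __name__ == "__main__":
    # One fast worker (ready every tick) and one slow worker (every 3 ticks).
    for s in (0, 2, 100):
        clocks, waits = simulate([1, 3], staleness=s, ticks=12)
        print(f"staleness={s}: clocks={clocks}, waits={waits}")
```

In this toy setup, a staleness of 0 behaves like lockstep synchronization, where the fast worker spends most of its time waiting. A very large staleness behaves like unbounded asynchrony, where nobody waits but the fast worker runs far ahead of the slow one. Intermediate values trade some waiting for a guaranteed limit on how far apart the workers can drift.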

The model system worked like a beautifully conducted symphony. Machines came to be unified and coordinated automatically where before they had been dissonant and discordant. This realization led Eric to connect his team’s AI infrastructure model to his passion for music through its nomenclature.

One of the models his team attempted to apply to their synchrony problem was called the “structured asynchronous dynamic scheduler”, often shortened to “strads”. This term, “strads”, is also a nickname for violins made by the Stradivari family, among the finest string instrument makers in history. Eric immediately thought of his favorite violin piece by Niccolò Paganini and a symphonic piece by Johann Strauss II, both of which are perpetuum mobile compositions: pieces characterized by a continuous, repetitive stream of notes that weave in and out in rapid harmony. The music that is produced nears the edge of cacophony, but at the same time it appears simple and clean because of an expert conductor’s orchestration.

This was exactly Eric’s vision for how machines should synchronize to empower AI programs at scale — by working in perpetuum mobile, perpetual motion. With a parameter server conducting the machines and telling them when to harmonize, AI programs could be run across countless computers at once, continuously learning, iterating, and exchanging information with their server.

With this breakthrough in AI implementation for enterprise, the team of CMU researchers and students led by Eric became a company with a mission: to construct an infrastructure that would facilitate the wide-scale adoption of AI. And, by improving efficiency and ease of use while decreasing costs, they aimed to remove the barriers to AI implementation, making AI accessible to every industry and business.

Ready to move forward, Eric took another sabbatical from CMU to fully explore these discoveries. For the new company’s name, he borrowed from the term that so succinctly encapsulated the perpetually moving harmonic interface he planned to build, perpetuum mobile, and called it Petuum.

