- Elon Musk just put a whole bunch of Nvidia chips to work.
- He said on Monday his company, xAI, brought its AI training cluster, Colossus, online.
- Built with Nvidia's H100 GPUs, the cluster could help Musk play catch-up with Meta.
Elon Musk might be preoccupied right now with Brazil's Supreme Court and its decision to ban X, but he isn't letting that stop him from pushing forward with his AI ambitions.
On Monday, the billionaire said xAI — the company he launched in July 2023 — had brought a massive new training cluster of chips online over the weekend, claiming it represented "the most powerful AI training system in the world."
The system, dubbed "Colossus," was built at a site in Memphis using 100,000 chips from Nvidia, specifically its H100 GPUs. According to Musk, the cluster was built in 122 days and will "double in size" in a few months as more GPUs are added.
Though Musk previously confirmed the size of the cluster in July, bringing it online marks a key step forward for his AI ambitions and, critically, allows him to play catch-up with Silicon Valley nemesis Mark Zuckerberg.
Like the Meta chief's, Musk's ambitions — to turn xAI into a company that advances "our collective understanding of the universe" with its Grok chatbot — depend on high-performance GPUs, which provide the computing power required for powerful AI models.
These haven't exactly been easy to come by, nor have they been cheap.
The hype generated around AI since the release of ChatGPT in late 2022 has left companies scrambling for Nvidia GPUs, with shortages stemming from frenzied demand and supply constraints. In some instances, they have been sold for upward of $40,000.
That said, these barriers to access haven't stopped companies from securing a supply of GPUs in any way they can and putting them to work to edge ahead of rivals.
Llama vs Grok
Nathan Benaich, the founder and a general partner at Air Street Capital, has been tracking the number of H100 GPUs acquired by tech companies. He puts Meta's total at 350,000 and xAI's at 100,000. Tesla, one of Musk's other companies, has 35,000.
Earlier this year, Zuckerberg said that Meta would have a massive stockpile of 600,000 GPUs by the end of the year, with some 350,000 of those GPUs being Nvidia's H100s.
Others, like Microsoft, OpenAI, and Amazon, haven't disclosed the size of their H100 stockpiles.
Meta hasn't disclosed exactly how many GPUs Zuckerberg has secured from his 600,000 target and how many have been put to use. However, in a research paper published in July, Meta noted that the largest version of its Llama 3 large language model had been trained on 16,000 H100 GPUs. In March, the company also announced "a major investment in Meta's AI future" with two 24,000 GPU clusters to support the development of Llama 3.
This suggests that xAI's latest training cluster, with its 100,000 H100 GPUs, is far bigger than the cluster used to train Meta's largest AI model as of July.
The scale of the feat hasn't been lost on the industry.
On X, a post from Nvidia's data center account in response to Musk read: "Exciting to see Colossus, the world's largest GPU #supercomputer, come online in record time."
xAI cofounder Greg Yang, meanwhile, had a more colorful response to the news, riffing on a song by American rapper Tyga.
Shaun Maguire, partner at venture capital firm Sequoia, wrote on X that the xAI team now "has access to the world's most powerful training cluster" to build the next version of its Grok chatbot. He added: "In the last few weeks Grok-2 catapulted to being roughly at parity with the state-of-the-art models."
But, as with most AI companies, there are big question marks over commercializing the technology. "It's impressive xAI has been able to raise so much with Elon and make progress, but their product strategy remains unclear," Benaich told Business Insider.
Back in July, Musk said the next version of Grok — after training on 100,000 H100s — "should be really something special."
We'll find out soon enough how competitive it makes him with Zuckerberg on AI.