- All kinds of startups are rushing into the AI inference market.
- Inference market competition may lower the price of AI, benefiting builders but challenging clouds.
- Not all startups will survive the period of "chaos" to come.
Jared Quincy Davis and his AI computing startup Foundry sell inference. They don't make chips or build large language models. Foundry has a unique method of making cloud computing more efficient. Instead of selling its technology to cloud providers, the Foundry team decided to become one and use its tech to operate a more efficient cloud.
Once companies building an AI product have trained their models and know they perform, they want ease, speed, and value whenever generating outputs. Inference-as-a-service providers like Foundry aim to simplify the process of generating those outputs.
Foundry offers training and fine-tuning, too, as many cloud providers do, but these days, it seems like anyone with an AI compute-boosting technology is attempting to monetize by selling inference — or more specifically, tokens, the base unit of data in AI.
Cerebras sells inference too. The company's core expertise is designing chips for training and inference, but it recently started selling the latter as a service. So does Groq, a chip company founded by two former Googlers who recognized early that inference would command the larger share of AI computing. SambaNova Systems, another hardware platform, also sells inference as a service.
Companies like Lambda, CoreWeave, Together AI, and Crusoe, all close partners of Nvidia, run data centers suited specifically to AI workloads and offer inference services. And then there are the hyperscalers like AWS and Microsoft Azure.
With so many companies specializing in inference, expectations are growing that the cost of inference is about to drop off a cliff.
"Part of the reason inference is a little commoditizable is customers are kind of paying for tokens at the end of the day," Davis told Business Insider.
The current market for inference is kind of like the electricity market, Davis said. There are a ton of niche sources you can access if you actually shop around, but not everyone does. Most people just want to flip the light switch and have it work.
But there is plenty of nuance to sift through for those willing. For some customers, speed is of the utmost importance, and speed itself has distinctions: time to first token, tokens per second, and total job completion time. Different kinds of inference workloads also lend themselves to different computing setups.
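A rough sketch of how those speed metrics interact, with entirely hypothetical numbers (not vendor benchmarks): time to first token dominates short responses, while tokens-per-second throughput dominates long ones.

```python
# Illustrative only: how time-to-first-token (TTFT) and streaming
# throughput combine into end-to-end latency. All figures are made up.

def total_latency(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """End-to-end response time: TTFT plus time to stream the rest."""
    return ttft_s + (output_tokens - 1) / tokens_per_s

# A low-TTFT provider vs. a high-throughput provider on a long response:
fast_start = total_latency(ttft_s=0.2, tokens_per_s=50, output_tokens=500)   # ~10.2 s
fast_stream = total_latency(ttft_s=1.5, tokens_per_s=200, output_tokens=500)  # ~4.0 s

# On a short response the ranking flips, because TTFT dominates:
fast_start_short = total_latency(0.2, 50, 20)   # ~0.58 s
fast_stream_short = total_latency(1.5, 200, 20)  # ~1.6 s
```

Which provider "wins" depends on the workload, which is part of the nuance customers rarely shop for.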
Energy efficiency of the underlying hardware and networking is a big determinant of cost. And cost in inference computing is even more important than in training, Groq cofounder Jonathan Ross recently told BI. Training is an overhead cost, while inference is an operating cost.
Zoom out from all of the intricacies, and inference is becoming the commodity of the AI age.
"Some companies just want output and they don't care about infrastructure," Mitesh Agrawal, head of cloud for Lambda, told BI.
Commoditizing AI
Lambda is in the early stages of an inference-as-a-service offering, but Agrawal said the company is going about it carefully, focusing on holistic computing services rather than just tokens.
Inference profit margins can vary widely, Agrawal said. With general compute — where the customer rents fixed capacity — the margins are easier to manage. When you're charging for usage or input and output of a model, the return is less predictable.
Organizing multiple users across a finite number of servers takes finesse. Whether or not the cost of operating the hardware is actually covered with room for profit comes down to how well that organization is done, Agrawal explained.
So why would neoclouds offer the riskier service?
Agrawal said it's about getting potential customers in the door. Inference-as-a-service customers can turn into more traditional compute customers, and as the slate of competitors grows, relationships and history grow in importance.
Lambda's financial models assume that price cuts are coming soon as more players enter the inference space and chips become more efficient.
A race to the bottom?
How fast demand for inference is growing is up for debate, but Nvidia CEO Jensen Huang has said on multiple occasions that new models, like OpenAI's o1, require more compute to generate the same number of responses because they run multiple models to check their own work, or "reason." Accuracy, it turns out, requires more compute.
Inference loads are poised to grow, but service providers still anticipate a drop in price from the influx of new players. Davis isn't worried though.
He recalled the Jevons paradox, the economic principle that a drop in price or an increase in efficiency leads to more total consumption, like when you widen a highway and traffic gets worse.
"If I make something 10 times cheaper, people won't spend 10 times less, nor will they even hold their budgets the same. They'll spend more," Davis said. "That makes sense because what are you doing when you make something 10 times cheaper, you're making the ROI better."
In other words, "it turns out, when you make inference cheaper, people decide to do a lot more inference," Davis said.
The ride ahead could be "bumpy" though, and not all players are likely to survive the moments of mismatch between supply and demand.
"As my old boss at Intel Andy Grove used to say, 'Let chaos reign, and then reign in the chaos'," said Sriram Viswanathan, founding managing partner at Celesta Capital and investor in SambaNova Systems.
He agrees the next few years will be wildly competitive for inference providers, but he believes the winners will be decided on merit.
"The core innovation can't be in the go-to-market, but in the performance and power of the underlying architecture," Viswanathan said.
Many of the companies selling tokens to break into the AI market aspire to more. The chip designers eventually want to sell chips to hyperscalers rather than inference to AI startups. The ultimate version of Foundry's tech is bigger too.
"If we do our job, right, you know, we will be a core part of how every GPU runs," Davis said. All roads, it seems, run through inference.
Hugh Langley contributed reporting.
Got a tip or an insight to share? Contact Senior Reporter Emma Cosgrove at [email protected] or use the secure messaging app Signal: 443-333-9088