- Companies are leveraging AI agents to execute multi-step tasks.
- Agents are now used for everything from email management to complex engineering.
- But researchers say agent errors are prevalent and compound the more steps they take.
Silicon Valley is brimming with optimism about AI agents.
In basic terms, the technology can solve problems, execute tasks, and grow smarter as it learns from its environment. Agents are like virtual assistants, something most workers dream of having. Workers are already using them to book flights, collect data, summarize reports, and even make decisions.
But agents are far from perfect. Errors and hallucinations are still commonplace, and they compound the more steps an agent takes.
Companies are now using agents to automate elaborate, multi-step tasks, and new tools have emerged to make that possible. Regie AI uses “auto-pilot sales agents” to automatically find leads, draft personalized emails, and follow up with buyers. Cognition AI makes an agent called Devin that carries out complex engineering tasks. Big Four professional services firm PwC unveiled “agent OS,” a platform that makes it easier for agents to communicate with one another to execute tasks.
But the more steps an agent takes to complete a task, the more likely its error rate (the percentage of incorrect outputs at each step) is to derail the outcome. Some agent processes can run to 100 steps or more, according to Patronus AI, a startup that helps companies evaluate and optimize AI technology.
Patronus AI measured the risk and revenue loss caused by the mistakes of AI agents. Its findings confirm a familiar truth — with great power comes great responsibility.
"An error at any step can derail the entire task. The more steps involved, the higher the chance something goes wrong by the end," the company wrote on its blog. It built a statistical model that found that an agent with a 1% error rate per step can compound to a 63% chance of error by the 100th step.
Scale AI growth lead Quintin Au said error rates are much higher in the wild.
"Currently, every time an AI performs an action, there's roughly a 20% chance of error (this is how LLMs work, we can't expect 100% accuracy)," he wrote in a post on LinkedIn last year. "If an agent needs to complete 5 actions to finish a task, there's only a 32% chance it gets every step right."
DeepMind CEO Demis Hassabis, speaking at a recent event, said to think of agent error rates like "compound interest," according to Computer Weekly. By the time an agent works through the 5,000 steps a real-world task can require, compounding errors could leave its output no better than random.
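That intuition is easy to check with the formula above: even a 0.1% error rate per step leaves under a 1% chance of a fully correct 5,000-step run, since 0.999^5000 ≈ 0.007.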
"In the real world, you don't have perfect information," Hassabis said at the event, according to Computer Weekly. "There's hidden information that we don't know about, so we need AI models that are able to understand the world around us."
The higher probability of failure for AI agents means that companies are at greater risk of losing their end customers.
The good news is that guardrails — filters, rules, and tools that can be used to identify and remove inaccurate content — can help mitigate error rates. Small improvements "can yield outsized reductions in error probability," Patronus AI said in its post.
Patronus AI CEO Anand Kannappan told BI that guardrails can be as simple as additional checks to ensure agents don't fail while they're operating. They can "prevent the agent from continuing or kind of ask the agent to retry," he said.
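As a rough illustration of what such a check might look like, here is a minimal Python sketch; the function and its arguments are hypothetical stand-ins, not Patronus AI's actual tooling:

```python
# Hypothetical sketch of the kind of guardrail Kannappan describes: validate
# each step's output and retry before errors propagate to later steps.
# `step` and `validate` are placeholder callables, not a real agent API.

def run_step_with_guardrail(step, validate, max_retries=2):
    """Run one agent step, retrying when a validation check flags the output."""
    for _ in range(max_retries + 1):
        output = step()
        if validate(output):  # e.g., a schema check, rule filter, or fact check
            return output
    # Halt rather than hand a bad output to the next step, where the
    # error would compound.
    raise RuntimeError("Step failed validation after retries; halting task")
```

The compounding math explains why this pays off: under the model above, cutting the per-step error rate from 1% to 0.1% drops the 100-step failure chance from about 63% to roughly 10%.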
"That's why it's so important to measure performance carefully and holistically," Douwe Kiela, an advisor to Patronus AI and cofounder of Contextual AI, told BI in a LinkedIn message.