Large language models revolutionized AI. LLM agents are what’s next
AI agents built on large language models control the path to solving a complex problem. They can typically act on feedback to refine their plan of action, a capability that can improve performance and help them accomplish more sophisticated tasks.
LLMs have quietly transitioned from stand-alone assistants to nearly autonomous agents capable of outsourcing tasks and adjusting their behavior in response to inputs from their environment or themselves. These new quasi-agents can now run applications to search the web or solve math problems, and even act on user prompts to check and correct their work.
“The agent is breaking out of chat, and helping you take on tasks that are getting more and more complex,” said Maya Murad, a manager in product incubation at IBM Research. “That opens up a whole new UX paradigm.”
On their own, LLMs struggle with basic math, logic, and any question requiring knowledge that goes beyond what they learned during training. But with an engineered workflow, LLMs can now call on external tools and APIs to compensate for their weak spots and act on prompts to review their work. This ability to select and coordinate tools through function calling, and to critique and adjust their own behavior, is among the hallmarks of today’s shift toward LLM-based AI agents.
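To make the idea concrete, here is a minimal sketch of a tool-calling loop in Python. It is illustrative only, not IBM’s implementation: the `llm` function is a hypothetical stand-in for any chat model that can return either plain text or a structured tool call, and the calculator shows how an external function can cover a model’s weak arithmetic.

```python
# A minimal, illustrative sketch of function calling. `llm` is a hypothetical
# placeholder for a chat model that can return plain text or a JSON tool call.
import json

def calculator(expression: str) -> str:
    """Toy math tool covering one of the model's known weak spots."""
    return str(eval(expression, {"__builtins__": {}}))  # demo only; never eval untrusted input in production

TOOLS = {"calculator": calculator}

def llm(prompt: str) -> str:
    # Placeholder: a real model decides whether to answer directly or emit a
    # structured tool call. This stub hard-codes both branches for illustration.
    if "Tool result:" in prompt:
        return "17 multiplied by 4 is 68."
    return json.dumps({"tool": "calculator", "arguments": {"expression": "17 * 4"}})

def answer(prompt: str) -> str:
    reply = llm(prompt)
    call = json.loads(reply)                           # the model asked for a tool
    result = TOOLS[call["tool"]](**call["arguments"])  # the surrounding code runs it
    return llm(f"{prompt}\nTool result: {result}")     # the model folds the result into its answer

print(answer("What is 17 * 4?"))
```

In a production system the tool call would come back through the model provider’s function-calling interface rather than raw JSON, but the control flow is the same: the model requests a tool, the workflow executes it, and the result is fed back for the final answer.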
LLMs are also increasingly capable of planning and acting, mirroring how we humans reason through problems and refine our approach. They are rapidly getting better at analyzing a task and formulating a plan to accomplish it, often after repeated loops of self-critique. Rather than being told explicitly how to solve a problem, the next generation of LLMs will be tasked with figuring it out on their own.
To handle these increasing demands, engineers are adding modules to LLMs to boost their memory, planning, and tool-calling abilities. A team led by researchers at the University of California, Berkeley has described this shift from monolithic models to multi-component systems as compound AI.
Whether these compound systems rise to the level of full-fledged agents is currently the subject of heated debate. Meanwhile, researchers are experimenting with adding multiple LLMs to the mix. The hope is that multiple agents can build on and review each other’s work to come up with solutions that neither alone could have devised.
We haven’t fully entered the world of AI agents just yet, said Ruchir Puri, chief scientist at IBM Research, but we are getting close. “Move over ChatGPT,” Puri likes to say. “The agents are coming to town.”
Our brains swing between two modes of thought: a fast, automatic track, and a slow, deliberative track, according to Daniel Kahneman, the psychologist who wrote the best-seller, Thinking, Fast and Slow. These alternating cognitive states, said Murad, are a good way of thinking about AI systems.
Some AI systems can be trained to “think” fast and follow a fixed workflow. Others can be trained to “think” slow and follow a more variable workflow, giving them the ability to plan, reflect on their actions, and self-correct. In designing the system, the engineer decides how much control the agent should ideally have over its behavior.
“I like to think of it as a sliding scale of autonomy,” said Murad. “For narrow, well-defined problems, a programmatic approach could be more efficient. But if I have a complex task, or a spectrum of queries, a dynamic agentic workflow could be helpful because it’s more adaptable.”
Let’s say you want to know how many vacation days you have left, the type of question that a stand-alone LLM has trouble answering because of its static knowledge base. Today’s LLMs can simply make an API call to your company’s HR database to get the relevant facts, call up a calculator to do the math, and insert the correct number in its generated response, using a fixed workflow called retrieval-augmented generation, or RAG.
RAG allows the agent to think fast. The path to answering the question has been tightly scripted by an engineer. Summarizing a meeting transcript or translating a long document from one language to another are other examples of fixed workflows.
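A toy version of that vacation-days workflow might look like the following. The `hr_api` and `generate` functions are hypothetical placeholders for a real HR endpoint and a real language model; the point is that every step in the path was scripted in advance.

```python
# Fixed, RAG-style workflow: retrieve the facts, do the math, generate the answer.
# `hr_api` and `generate` are hypothetical placeholders used only for illustration.

def hr_api(employee_id: str) -> dict:
    # Retrieval step: in practice, an API call to the company's HR database.
    return {"allowance": 25, "days_taken": 11}

def generate(facts: str) -> str:
    # Generation step: a real LLM would phrase the response itself,
    # with the retrieved facts inserted into its prompt.
    return f"Based on your records, {facts}"

def vacation_days_left(employee_id: str) -> str:
    record = hr_api(employee_id)                            # retrieve
    remaining = record["allowance"] - record["days_taken"]  # calculate
    return generate(f"you have {remaining} vacation days left this year.")  # augment and generate

# The path never varies: the engineer, not the model, decided every step.
print(vacation_days_left("E1234"))
```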
The ability to call on tools does not in itself make an LLM an agent, said Murad. If the workflow you design gives the LLM freedom to decide which tool to apply, you now have an agent. “You’ve added variability to the system,” she said. “The solution path isn’t predetermined.”
At the opposite end of the spectrum are dynamic workflows that allow the agent to think slow and tackle more complex problems one step at a time. The engineer designs an open-ended workflow that gives the agent leeway to find the best solution.
Let’s say you want to plan a vacation based on your budget and the long-range weather forecast for a particular region. The AI agent needs to analyze the problem, break it into sub-tasks, and come up with a plan for fetching the relevant information. It must then evaluate the options, make a recommendation, and potentially revise its decision based on your feedback. The agent could also consult a log of past conversations stored in memory to offer a more personalized interaction.
A dynamic workflow gives the LLM more flexibility. Other applications include asking a software engineering agent to fix a bug and run unit tests to evaluate its proposed patch, or asking a site reliability engineering agent to quickly diagnose and resolve an IT incident before the network crashes.
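A bare-bones sketch of such a dynamic, agentic loop is shown below. The `plan`, `act`, and `critique` helpers are hypothetical placeholders for calls to a model; what matters is that the steps come from the model’s own plan and the loop repeats until the agent judges the result good enough.

```python
# Plan, act, critique, revise: a minimal agentic loop with hypothetical helpers.

def plan(goal: str) -> list[str]:
    # In a real agent, the model decomposes the goal into sub-tasks.
    return [f"research options for: {goal}", "check the long-range forecast", "compare against the budget"]

def act(step: str) -> str:
    # Each sub-task might call a search API, a database, or another model.
    return f"result of '{step}'"

def critique(goal: str, results: list[str]) -> bool:
    # Self-correction: a real agent would ask the model, or the user,
    # whether the collected results are good enough to make a recommendation.
    return len(results) >= 3

def solve(goal: str, max_rounds: int = 3) -> list[str]:
    results: list[str] = []
    for _ in range(max_rounds):          # cap the iterations so the loop cannot run forever
        for step in plan(goal):          # the model, not the engineer, chooses the steps
            results.append(act(step))
        if critique(goal, results):      # stop when the agent is satisfied, otherwise re-plan
            break
    return results

print(solve("a week-long trip that fits a $2,000 budget"))
```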
Today’s leaderboards are full of examples of LLM-based agents outperforming stand-alone LLMs on specialized tasks, from searching the web to coding, in part because engineered workflows often include intensive evaluation of results.
Building an AI workflow around an LLM has become faster and cheaper, in many cases, than gathering more data to train up bigger models. This is exciting, wrote the team at Berkeley that coined the term compound AI, “because it means leading AI results can be achieved through clever engineering, not just by scaling up training.”
Adding more agents to a system can amplify the effect. In a multi-agent system, one model may farm out sub-tasks to other models specialized in language translation or code-writing, for example, or calculate an answer and have a second model check its work.
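In code, that delegation pattern can be sketched roughly as follows. Each specialist and the reviewer stand in for a separately prompted model; this reflects no particular framework.

```python
# A rough sketch of multi-agent delegation: a coordinator hands sub-tasks to
# specialist models and a reviewer checks the draft before it is accepted.
# All of the functions are hypothetical placeholders for model calls.

def translator(text: str) -> str:
    return f"[translated] {text}"

def coder(spec: str) -> str:
    return f"# code implementing: {spec}"

def reviewer(draft: str) -> bool:
    # A second model critiques the first model's work.
    return draft.strip() != ""

SPECIALISTS = {"translate": translator, "write_code": coder}

def coordinator(task: str, kind: str, max_attempts: int = 2) -> str:
    draft = ""
    for attempt in range(max_attempts):      # bounded retries avoid endless review loops
        draft = SPECIALISTS[kind](task)      # delegate to the right specialist
        if reviewer(draft):                  # accept only work that passes review
            return draft
        task = f"revise (attempt {attempt + 1}): {task}"
    return draft

print(coordinator("sort a list of invoices by date", "write_code"))
```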
But engineered workflows that give greater autonomy to one or more models come at a price. They are computationally expensive and can be slow at inference time. Like humans, multi-agent systems can get stuck in endless loops when they can’t agree on an answer. And the more agents that are involved, the harder it becomes for the coordinating agent to sort through the team’s comments and come to a decision, a bit like an overworked air-traffic controller.
Several technical challenges remain before LLM assistants can be transformed into bona fide agents. One is a larger context window. This is the amount of text LLMs can hold in a kind of short-term memory as they process and respond to a request. The context window must be long enough to fit the user’s instructions, dialogue between agents, and their ultimate solution. Another challenge is the ability to plan and change course based on feedback from the environment and user — what we humans think of as reasoning about the external world.
Researchers have made rapid progress extending LLM context length and augmenting their memory in other ways. They are also improving LLMs’ reasoning capabilities through orchestration, or creating dynamic workflows that help the agent decompose the task into sub-tasks, and formulate, test, and execute a plan.
Other challenges involve addressing some of the inherent weaknesses of LLMs themselves. They tend to hallucinate, or spout falsehoods and irrelevant facts, when they don’t know the answer. They can be tricked, through a malicious prompt, into divulging confidential information. They are also prone to regurgitating biases, hate speech, and personal data that may have inadvertently slipped into their training data.
Giving LLMs more freedom to interact with the outside world has the potential to magnify these risks. Take a task like code writing. “So many things could go wrong when you give an agent the power to create and run code as part of the path to answering a query,” said Murad. “You could end up deleting your entire file system. Or outputting proprietary information.”
These risks can be contained, she said, by executing code in a secure sandbox, installing security guardrails, running adversarial tests, known as red teaming, and enforcing company policies about how and where to share data.
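As a rough illustration of the first of those safeguards, agent-generated code can be executed in a separate, constrained process rather than in the agent’s own environment. The snippet below is only a sketch; production sandboxes typically rely on containers or virtual machines with much stronger isolation.

```python
# Run model-generated code in a separate process with a time limit and a
# stripped-down environment. A sketch, not a hardened sandbox.
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: int = 5) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],   # -I runs Python in isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,              # bound how long the code may run
            env={},                         # do not leak environment variables or secrets
        )
        return result.stdout or result.stderr
    except subprocess.TimeoutExpired:
        return "terminated: exceeded the time limit"
    finally:
        os.unlink(path)

print(run_untrusted("print(sum(range(10)))"))
```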
The risks for each AI-agent use-case will need to be thought through and addressed. “Before the agents really get down to business, engineers have more work to do,” Murad said.
At IBM Research, engineers have improved the function-calling and reasoning capabilities of IBM’s open-source Granite models. They also continue to work on methods for evaluating LLMs for bias and safety. Trust and transparency will only become more important as humans delegate more control to LLMs.
Some of this work has already attracted notice. IBM’s Granite models recently received one of the top scores on the Stanford University Foundation Model Transparency Index. And IBM’s LLM coding agent, Agent-101, recently made the top 10 on SWE-bench Lite, a popular software programming benchmark hosted at Princeton University.
While most of IBM’s competitors used two or more models to complete the task, IBM did it with just one — a GPT model that researchers hope to replace soon with an open IBM Granite code model. “Our goal is to put an AI agent in the open so that other researchers can replicate our results,” said Avi Sil, an IBM researcher leading the team behind Agent-101.