[Research Roundup] Can AI Agents Actually Run Organisations?
Progress with LLM-powered AI agents has been incredible and they can already deliver real value, but how close are we to a world where AI agents work alongside us in organisations?
Introduction
The potential of AI Agent automation in business operations is well established. As mentioned in a recent newsletter, organisations have years of work ahead of them just to extract the benefits of GPT-4, let alone more recent and more powerful models. This matches my direct experience working with organisations to implement AI Agent automation: while the transformation potential is significant, successful implementation requires careful, systematic work to integrate these technologies into existing business processes.

What we are seeing is automation of very well-defined activities, with guardrails in place to ensure safety and trustworthiness. That work is then carefully scaled up as organisations gain confidence, and it is usually accompanied by plenty of human-in-the-loop activity.
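To make that pattern concrete, here is a minimal sketch of a guardrail with a human-in-the-loop gate. It is purely illustrative and every name in it is invented, not taken from any specific framework; real implementations vary widely, but the shape is the same: only pre-approved, well-defined actions are executed automatically, and anything else is routed to a person.

```python
# Minimal, illustrative sketch of a guardrail plus human-in-the-loop gate.
# Every name here is hypothetical; it is not taken from any real framework.

ALLOWED_ACTIONS = {"draft_reply", "update_ticket", "summarise_document"}

def escalate_to_human(task: str, action: str) -> str:
    # In a real system this would open a review-queue item, not just print.
    print(f"[needs human review] task={task!r} proposed action={action!r}")
    return "escalated"

def run_step(task: str, proposed_action: str) -> str:
    """Execute only pre-approved, well-defined actions; route the rest to a person."""
    if proposed_action not in ALLOWED_ACTIONS:
        return escalate_to_human(task, proposed_action)
    print(f"[executing] {proposed_action} for task {task!r}")  # audit trail
    return "done"

# Example: a routine action runs, an out-of-scope one is escalated.
run_step("reply to customer #112", "draft_reply")
run_step("reply to customer #112", "issue_refund")
```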
Could we be more ambitious? Can one create a system composed of multiple independent agents fulfilling different high-level roles (e.g. developer, accountant, sales person, CFO and so on) that would effectively collaborate and cooperate to run significant aspects of an organisation? What are the challenges that have been uncovered so far through research?
To answer the question I turned to the academic literature, looking at experiments and benchmarks that researchers have set up. The assumption is that an analysis of existing, recent studies can give us a good sense of where things stand. Granted, AI development moves at such breakneck speed that even research that is just a couple of months old can be rendered obsolete, but it should still give us a useful indication of the problems to watch out for and the features and capabilities that newer models will need in order to solve these challenges.
Literature Research Methodology
The approach I took was to scan arXiv.org for recent research on the topic, looking for work that specifically studied whether multiple independent AI agents, fulfilling different high-level roles, could collaborate to solve more complex problems with a social element (i.e. collaborating with other high-level AI agents and/or humans). This is different from building a single LLM-powered application in which multiple lower-level agents, such as a coding agent and a web-browsing agent, collaborate to solve a single task. The work I looked for focussed on the less well-defined types of work that humans do, where you could see (perhaps by squinting a bit) the outlines of an autonomous AI agent working alongside humans.
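For anyone who wants to reproduce this kind of scan, the sketch below shows roughly how it can be automated against the public arXiv export API using the feedparser package. The query terms are illustrative rather than the exact ones I used.

```python
# Rough sketch of scanning arXiv for recent work on multi-agent LLM systems.
# The query terms below are illustrative, not the exact ones behind this roundup.
import urllib.parse
import feedparser  # pip install feedparser

query = 'all:"multi-agent" AND all:"large language model"'
url = (
    "http://export.arxiv.org/api/query?"
    + urllib.parse.urlencode({
        "search_query": query,
        "start": 0,
        "max_results": 25,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    })
)

feed = feedparser.parse(url)
for entry in feed.entries:
    # Print submission date, title and link for a quick first pass.
    print(entry.published[:10], entry.title.replace("\n", " "), entry.link)
```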
You can find a list of all the papers I reviewed in the Appendix. It is worth saying that the volume of work on this higher-level, social ability of AI agents is nowhere near that on single, lower-level tasks; it is certainly an area where an aspiring researcher could spend more time.
I used NotebookLM to collate the papers, generate summaries, draw some initial conclusions and extract specific pieces of information. My actual academic days are long gone, so I have the luxury of being a bit less formal in how I collate, review and reference work, and while NotebookLM sped up the process immensely, I did double-check all the results.
Outcomes
Across these papers, the general consensus is that while multi-agent systems powered by LLMs show significant promise, they are not yet fully ready to autonomously handle the complexities of real-world scenarios.
The challenges include:
Inability to fully automate tasks in professional settings
Perhaps the most relevant research is contained in a paper introducing a benchmark aptly titled TheAgentCompany, so we will spend a bit more time on this one and then move on to some of the other research.
The paper provides a direct assessment of LLM agents' capabilities in a simulated professional environment. The authors found that even the most capable model tested (Claude 3.5 Sonnet) could only autonomously complete 24% of the tasks. This highlights a significant gap between the current abilities of LLMs and the requirements for full automation of professional tasks.
The tasks in TheAgentCompany were diverse, realistic and professional, including software engineering, project management, financial analysis and other typical business tasks. The fact that agents struggled across these various domains indicates that the limitations are not confined to one specific type of task.
The agents had access to tools like web browsers, code editors and Linux terminals, yet still struggled to complete tasks, showing that access to necessary tools is not enough to ensure automation.
It is worth saying that while the authors claim the tasks are realistic, in reality they are significantly simplified, assuming a sort of idealised organisation that always hands its employees nicely defined tasks and outcomes.
The paper specifically identifies the following challenges that hinder full automation:
Lack of common sense and domain background knowledge. For example, an agent failed to understand that a '.docx' file was a Microsoft Word file and not a plain text file.
Lack of social skills that are required in workplace environments. For example, an agent did not follow up with a colleague after being given a referral.
Incompetence in browsing complex web interfaces. For instance, agents struggled with pop-ups on the intranet platform, which prevented them from downloading files.
Most damningly, the agents deceived themselves by creating 'shortcuts' that omitted the hard part of a task. For example, an agent renamed a colleague in order to avoid having to ask the right person the right question.
The authors conclude that there is a "big gap" between current AI agent capabilities and the requirements for autonomously performing most jobs a human worker would do, even in the simplified benchmarking environment. This makes it clear that LLMs are not yet ready for full task automation in professional settings.
Additional related research supports the conclusions of the TheAgentCompany paper.
Limitations in multi-turn interactions and reasoning
For example, the work in the AI Hospital project found that LLMs struggle to maintain diagnostic accuracy over multiple turns of conversation. They also noted that LLMs face difficulties in posing relevant questions, eliciting crucial symptoms, and recommending appropriate medical examinations, which are all essential for effective multi-turn diagnostic reasoning.
Difficulties in simulating complex social dynamics
In SocialBench, while models showed some ability to mimic individual social behaviours, they struggled to adapt to group dynamics, often failing to embody neutral or negative social preferences, which are vital in diverse social settings. This points to the difficulty of capturing the full range of human social expression, which goes beyond simply generating positive responses or performing simple, individual tasks; the agents struggle to show a full range of responses to different kinds of group dynamics.
Issues with information asymmetry and effective communication
The iAgents framework highlights the difficulties in simulating social interactions when information is not shared equally among agents. Agents often fail to collaborate effectively due to the lack of information or the inability to share it appropriately. This inability to navigate information asymmetry is a key aspect of real-world social dynamics, as people often operate with incomplete knowledge.
The iAgents paper also notes that agents can be overly influenced by information they have learned during pre-training. This suggests that agents have difficulty differentiating between information provided in the simulation and pre-existing knowledge. This introduces challenges to the simulation of social interactions as agents do not always reason according to the information available to them during the simulation.
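The difficulty is easier to see with a toy setup: each agent only "sees" its own user's private facts plus whatever messages it has received, so any task that spans both users forces an explicit exchange. The sketch below is not the iAgents implementation, just an illustration of that structure, with a placeholder llm() function standing in for a real model call.

```python
# Toy illustration of information asymmetry between two agents.
# This is NOT the iAgents framework; llm() is a placeholder for a real model call.
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    # Placeholder: a real system would call a language model here.
    return f"(model response to: {prompt[:60]}...)"

@dataclass
class Agent:
    name: str
    private_facts: list[str]                      # only this agent can see these
    inbox: list[str] = field(default_factory=list)

    def answer(self, question: str) -> str:
        # The prompt contains ONLY this agent's own facts and received messages,
        # never the other agent's private context.
        context = "\n".join(self.private_facts + self.inbox)
        return llm(f"Context:\n{context}\n\nQuestion: {question}")

    def send(self, other: "Agent", message: str) -> None:
        other.inbox.append(f"from {self.name}: {message}")

alice = Agent("alice_agent", ["Alice is free on Thursday afternoon."])
bob = Agent("bob_agent", ["Bob is free on Thursday afternoon and Friday morning."])

# Neither agent can schedule a meeting alone; they must exchange what they know.
alice.send(bob, "Alice is only free on Thursday afternoon.")
print(bob.answer("When can Alice and Bob meet?"))
```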
Challenges with planning and task execution in complex environments
The Plancraft dataset demonstrates that even with a clear goal, LLM agents have difficulty with tasks requiring multiple steps. The dataset requires agents to perform a sequence of actions to craft items, and LLMs often struggle with this type of multi-step reasoning. The need for a sequence of actions, also called planning, is a challenge for these models.
Furthermore, LLM agents often struggle to determine if a task is impossible, leading to wasted effort and resources. The Plancraft benchmark includes a set of intentionally impossible tasks, and it assesses an agent’s ability to recognize when a task cannot be completed. This indicates a problem with planning that includes a component of assessing feasibility of a given task.
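As a rough illustration of what assessing feasibility looks like as a metric (this is not the actual Plancraft harness; the task format and scoring below are invented for illustration), an agent is credited when it correctly declares an impossible task impossible and marked down when it keeps planning against one.

```python
# Illustrative scoring of feasibility judgements over a mixed task set.
# This is not the Plancraft evaluation code; the task format here is invented.

tasks = [
    {"goal": "craft wooden_pickaxe", "solvable": True},
    {"goal": "craft diamond_sword with no diamonds in inventory", "solvable": False},
]

def agent_verdict(goal: str) -> str:
    # Placeholder for the agent's judgement: returns "plan" or "impossible".
    return "plan"  # a naive agent that never declares a task impossible

def feasibility_score(tasks) -> float:
    """Fraction of tasks where the agent's solvable/impossible call was correct."""
    correct = 0
    for task in tasks:
        said_impossible = agent_verdict(task["goal"]) == "impossible"
        if said_impossible == (not task["solvable"]):
            correct += 1
    return correct / len(tasks)

print(feasibility_score(tasks))  # 0.5: it plans fine but never spots impossibility
```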
Conclusions
The research paints a clear and sobering picture of where we stand with AI agent automation and collaboration in professional settings. While organizations are successfully implementing narrow, well-defined automated tasks with appropriate guardrails and human oversight, the vision of multiple AI agents autonomously collaborating to run significant aspects of an organization remains largely aspirational.
The evidence is particularly telling - when even the most advanced AI models can only autonomously complete 24% of professional tasks in controlled testing environments, it signals that we are still far from achieving meaningful autonomous collaboration between AI agents in real-world organizational settings. The challenges uncovered are fundamental rather than merely technical, from the lack of common sense and domain knowledge to the inability to handle complex social interactions and multi-step planning.
Perhaps most concerning is the observation that when faced with complex tasks, AI agents may resort to creating deceptive shortcuts rather than properly completing the assigned work. This suggests that even as we improve their technical capabilities, we need to address more fundamental issues around reasoning and task comprehension.
For organizations considering AI automation, these findings suggest that the most productive approach remains focusing on well-defined, specific tasks with clear boundaries and appropriate human oversight. While the potential for more complex AI agent collaboration exists, the current state of the technology demands a measured, incremental approach to implementation. Understanding these limitations is crucial for both researchers and organizations as they work to advance the field and implement practical solutions.
OpenAI is promising more work on AI agents this year, so it will be interesting to see what they can offer, and NVIDIA has just released new models (Nemotron) that are more focussed on the needs of “agentic” AI. It will be worth revisiting this work in a few months to see how the benchmarks and frameworks examined here fare with better underlying LLM capabilities.
Appendix - Sources
ResearchTown: Simulator of Human Research Community
23 Dec 2024, https://arxiv.org/pdf/2412.17767
This paper introduces ResearchTown, a multi-agent framework that simulates a human research community and facilitates collaborative research activities.
While the authors demonstrate that ResearchTown can provide a realistic simulation of research activity, they also note that outputs are intentionally incomplete, requiring further development by the user. The framework aims to enhance but not replace human researchers, implying the system is not ready to autonomously handle the complexities of the research process.
Plancraft: A Benchmark for Evaluating Planning Capabilities of Large Language Models in Complex Game Environments
30 Dec 2024, https://arxiv.org/pdf/2412.21033
This paper presents Plancraft, which evaluates LLM planning capabilities in a multi-modal game environment, testing both task accuracy and the ability to judge task feasibility.
The authors found that larger models perform better but are still challenged by multi-modal inputs, indicating that the models are not yet adept at bridging multi-modal inputs and decision-making in planning tasks. Additionally, even the best models still struggle with the integration of specialised actions.
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
18 Dec 2024, https://arxiv.org/pdf/2412.14161
This paper introduces TheAgentCompany, a benchmark using a simulated company environment to assess LLM agents on real-world professional tasks.
The authors find that even the most competitive models can only complete 24% of tasks autonomously. The authors argue that current state-of-the-art agents fail to solve the majority of tasks, suggesting a significant gap in their ability to autonomously perform most jobs.
AI Hospital: A Multi-Agent Simulation Platform for Medical Interactive Consultation
28 Jun 2024, https://arxiv.org/pdf/2402.09742
This paper introduces the AI Hospital framework, which simulates medical interactions between a doctor (an LLM) and various NPCs like patients and examiners. The authors use this framework to evaluate LLMs' performance in clinical diagnosis.
The paper highlights that while LLMs have made progress in medical question answering, they still exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. The authors suggest that current LLMs have not fully grasped effective multi-turn diagnostic strategies. The conclusion is that LLMs are not yet fully ready for complex clinical scenarios requiring multi-turn reasoning and interaction.
SocialBench: Evaluating Social Intelligence of Large Language Models
20 Mar 2024, https://arxiv.org/pdf/2403.13679
This paper introduces SocialBench, a benchmark to evaluate the social intelligence of LLMs using role-playing conversational agents. The benchmark assesses individual and group-level interactions.
The authors find that while role-playing agents show satisfactory performance at the individual level, their social interaction capabilities at the group level remain deficient. This suggests that, while progress has been made, LLMs are not yet fully capable of handling the complexities of social dynamics in multi-agent scenarios.
iAgents: Simulating Human-Like Interactive Agents Using Large Language Models
21 Jun 2024, https://arxiv.org/pdf/2406.14928
This paper introduces the concept of iAgents, which are designed to address information asymmetry in multi-agent systems. It focuses on agents proactively exchanging information to complete collaborative tasks on behalf of multiple users.
The authors demonstrate the potential of iAgents in complex social networks but also reveal that current LLMs still face challenges in solving cooperation problems under information asymmetry. They highlight that while agents can achieve some success in information exchange, significant challenges remain, especially with more complex tasks and larger social networks.