[Framework Review] Hugging Face's smolagents Framework
Reviewing Hugging Face's smolagents framework: what view of agency it has, what is interesting about it, and where it might go next.
At the very end of 2024 (literally Dec 31st) Hugging Face released smolagents - “a simple library to build AI agents”. As is quite obvious from the name, this is all about simplicity. Given that Hugging Face already had a library to support “agentic” work (Transformers Agents), it’s interesting to dive into this library, figure out what their motivations are, what it says about what AI Agents are, and when you might want to consider using it.
Why was smolagents created?
This is Hugging Face’s successor to Transformers Agents, a library that was getting updates all through 2024. I assume that the learnings from that library led the HF team to feel the need to start fresh. It looks like smolagents also captures HF’s conviction that having LLMs write the code to execute functions, rather than generating JSON that is then used to execute functions, is the way to go.
The following seem to be the high-level reasons:
Re-focus on simplicity: The logic for agents is contained within a relatively small amount of code, aiming to keep it easily understandable.
First-class support for Code Agents: smolagents prioritises agents that write their actions in code. Code is seen as the most effective way to express actions a computer can perform, offering advantages over JSON-like snippets in terms of composability, object management, generality and its presence in LLM training data.
Hub Integrations: smolagents allows for sharing and loading tools to and from the Hugging Face Hub, and it is expected that this functionality will expand.
Support for any LLM: It supports various models, including those hosted on the Hub, as well as models from providers such as OpenAI and Anthropic (a minimal usage sketch follows below).
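To make the simplicity claim concrete, here is a minimal sketch based on the launch-era API. The class names (HfApiModel, LiteLLMModel, DuckDuckGoSearchTool) are taken from the announcement-period examples and may differ in later releases:

```python
# Minimal smolagents sketch, launch-era API; names may have changed since.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# A code agent with one tool, backed by a model hosted on the Hugging Face Hub.
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("How many seconds would it take for a leopard at full speed to run through Pont des Arts?")

# Swapping providers is meant to be a one-line change, e.g. via LiteLLM:
# from smolagents import LiteLLMModel
# agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=LiteLLMModel(model_id="gpt-4o"))
```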
Agent Design Perspective: smolagents’ view on Agency and Autonomy
For every framework we analyse at AgentsDecoded we look for a deeper understanding of how AI Agents can be thought of and engineered. Our reference is our own conceptual framework for understanding agency and autonomy that provides a consistent set of terms to fall back on.
It’s interesting to track the evolution of HF’s thinking around agents.
From the May 13th, 2024 blog post announcing Transformers Agents we have agents defined as:
(..) an agent, which is just a program driven by an LLM. The agent is empowered by tools to help it perform actions. When the agent needs a specific skill to solve a particular problem, it relies on an appropriate tool from its toolbox.
A quite simple definition, which in today’s “discourse” (i.e. people throwing random definitions around on Substack, LinkedIn, X, Bluesky, etc.) would not quite cut the mustard.
Fast forward to Dec 31st, 2024 and the introduction of smolagents. The broad agent definition remains largely the same:
AI Agents are programs where LLM outputs control the workflow.
but it is further refined into a star system of agent levels. This refinement is indicative of a wider move in the market to tighten up the definition. Whereas six months ago people were happy to call essentially anything an agent, now the message is “watch out - not all agents are equal”.
Here's a breakdown of the star ratings from HF:
☆ (Zero Stars): At this level, the LLM output has no impact on the program flow. This is described as a "Simple processor", and an example is the pattern process_llm_output(llm_response).
★ (One Star): Here, the LLM output determines the basic control flow. This is referred to as a "Router", and an example pattern is if llm_decision(): path_a() else: path_b(). The LLM makes a basic decision that changes which path the program takes.
★★ (Two Stars): In this case, the LLM output determines the function execution. This is described as a "Tool call", and an example pattern is run_function(llm_chosen_tool, llm_chosen_args). The LLM chooses which tool (function) to run, and what arguments to provide to it.
★★★ (Three Stars): This level represents a multi-step agent where the LLM output controls iteration and program continuation. The pattern here is while llm_should_continue(): execute_next_step(). The LLM decides, based on its observations, whether to continue looping through further steps (a runnable toy version of this loop follows after the list).
★★★ (Three Stars): This rating also applies when one agentic workflow can start another agentic workflow (not sure why this is not a higher rating, but it is not my rating system!), which is described as "Multi-Agent". The example pattern is if llm_trigger(): execute_agent(). The LLM decides when to trigger another agent.
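To make the three-star pattern tangible, here is a self-contained toy in Python. fake_llm is a scripted stand-in for a real model call; nothing here is smolagents’ actual implementation:

```python
# Toy sketch of the three-star "multi-step agent" loop. The LLM (here a
# scripted stub) decides whether to keep going, not the programmer.
from typing import List

SCRIPT = iter(["yes", "search the web", "yes", "summarise results", "no"])

def fake_llm(prompt: str) -> str:
    """Scripted stand-in; a real agent would call a model here."""
    return next(SCRIPT)

def llm_should_continue(memory: List[str]) -> bool:
    return fake_llm(f"Given {memory}, should we continue? yes/no") == "yes"

def execute_next_step(memory: List[str]) -> str:
    step = fake_llm(f"Given {memory}, what is the next step?")
    print(f"executing: {step}")
    return step

def run(task: str) -> List[str]:
    memory = [task]
    # This while loop is the whole difference between a tool call and a
    # multi-step agent: control of iteration sits with the LLM.
    while llm_should_continue(memory):
        memory.append(execute_next_step(memory))
    return memory

print(run("Find the population of Lisbon"))
```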
Analysis
While I can appreciate the pragmatic view of the different levels, things only really start getting interesting at two stars, because that is when we start to potentially represent goals explicitly. Using the AgentsDecoded framework, at zero and one stars we are dealing with what we call passive agents. These are software programs that are ascribed a goal by the user but don’t have any internal representation of that goal. At two stars we can start thinking of self-directed active agents: the software program is assigned a goal and needs to figure out how to accomplish it. The simple function-calling agent has reduced self-direction, since it can only request that some other system execute a function, while the three-star agent has an explicit representation of a goal and can write its own code to effect change in the environment.
I don’t agree that having a single agent write its own code merits the same star level as having a multi-agent system. That is a fundamentally different type of system where issues of trust, shared knowledge and observability become far more important.
Overall, the challenge that smolagents faces, from a definition perspective, is that it ties things too specifically to how something is done (i.e. how we accomplish something in code) and does not take a broader perspective on agents as goal-driven systems.
When to use smolagents (for Builders)
smolagents makes two important bets:
Keep it simple, stupid. Instead of coming with a bunch of opinions about how you should design agent-based systems, it focusses on the minimum required and leaves the rest to you as a builder. This is a double-edged sword. If you have a good idea of what the rest should be, this is a great starting point. However, if you don’t know what you are missing, you may be in for some interesting (and costly) discoveries along the way!
Writing code directly is better than writing JSON instructions. This is a big deal for smolagents. Instead of using JSON or text-based descriptions, the agent writes actions directly in code, typically Python. This is considered a significant advantage over other methods of defining actions.
The latter point is the more consequential one, so here’s a breakdown of what it means and why it matters:
Advantages of Code-Based Actions:
Representation in Training Data: LLMs are already trained on vast amounts of code, making them better suited to generating actions in code rather than in other formats like JSON.
Composability: Code can be nested and reused, similar to defining functions in programming. This contrasts with JSON, where such nesting and reusability are much harder to implement.
Object Management: Storing the outputs of actions (like an image from an image generator tool) is more easily managed with code. In JSON, this is very difficult.
Generality: Code is designed to express any action a computer can perform. This makes it a highly general and flexible way to define actions.
Contrast with JSON-based Actions: In contrast to code-based actions, other agents, such as the ReactJsonAgent in the Transformers Agents 2.0 framework, generate actions in JSON format. While JSON is a common way to represent data, smolagents argues that it’s not ideal for representing executable actions, due to the limitations in the areas described above.
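To see the difference concretely, here is an illustrative toy. The tool names (image_generator, captioner) are made up, and this is not smolagents’ internal representation of either style:

```python
# JSON-style action: one rigid call per round-trip. The orchestrator must
# execute it, hold the resulting object, and invent some handle/ID scheme
# so the *next* JSON action can refer back to it.
action = {"tool": "image_generator", "args": {"prompt": "a red fox"}}

# Code-style action: the LLM emits a snippet in which intermediate outputs
# are ordinary variables, so chaining and composing tools is trivial.
def image_generator(prompt: str) -> bytes:   # stub tool
    return f"<image of {prompt}>".encode()

def captioner(image: bytes) -> str:          # stub tool
    return f"caption for {image.decode()}"

generated = '''
image = image_generator(prompt="a red fox")
caption = captioner(image)   # object management is just a variable
result = caption.upper()     # composability: ordinary Python
'''
namespace = {"image_generator": image_generator, "captioner": captioner}
exec(generated, namespace)   # a real framework would sandbox this step
print(namespace["result"])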
While the arguments are compelling, I don’t think the jury is in on this yet. As ever, as a builder you need to apply nuance and look at your specific context. I can imagine multiple situations where a layer of decoupling between the definition of what to do and the actual execution is useful, especially when dealing with more complex organisational systems. Nevertheless, it is certainly a worthwhile argument to make and explore. To help explore it, here is an outline of some of the disadvantages of this approach.
Disadvantages of Code-Based Actions
Complexity for Less Powerful LLMs: As the smolagents authors note, less powerful LLMs may struggle to generate good code. Code-based actions rely on the LLM having a strong ability to generate correct and executable code, which may not always be the case, especially with smaller or less capable models. In contrast, JSON structures are simpler to generate, making them more suitable for LLMs that are less proficient at coding.
Potential for Errors and Security Risks: While smolagents uses sandboxed environments for security, the generation and execution of code introduce potential risks and error points that are not present when using simpler JSON or text-based actions. If the LLM generates code with errors, it can cause the agent to malfunction. Furthermore, the code could contain unintended or malicious actions if not properly sandboxed and validated.
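As a deliberately naive illustration of why execution needs guarding (this is not smolagents’ actual sandboxing mechanism, and restricting builtins alone is not a real security boundary):

```python
# Naive sketch: run untrusted generated code with a stripped-down builtins
# table. Real deployments need a proper interpreter-level or remote sandbox;
# determined code can escape restrictions like this via introspection.
SAFE_BUILTINS = {"len": len, "range": range, "print": print}

def run_untrusted(code: str) -> None:
    # No __import__ in builtins, so the snippet cannot import os, open files, etc.
    exec(code, {"__builtins__": SAFE_BUILTINS})

run_untrusted("print(len(range(10)))")          # fine
try:
    run_untrusted("import os; os.remove('x')")  # blocked
except ImportError as err:
    print("blocked:", err)
```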
Increased Cognitive Load on the LLM: Generating code, as opposed to JSON, could place a higher cognitive load on the LLM, which could lead to slower performance, particularly with complex tasks or less capable models.
Debugging and Transparency: While smolagents emphasizes simplicity, debugging issues with code-based actions can be more difficult than debugging JSON-based actions. Errors in the generated code could be harder to diagnose and fix compared to simpler action formats such as JSON, requiring more expertise and effort.
Limited Flexibility for Non-Programmers: While code provides generality, it also requires some degree of coding knowledge. This might make it less accessible for users without programming skills, potentially limiting the usage of code-based agents for non-programmers.
As you can see, while having LLMs write and execute code is exciting, it does come at a cost.
What smolagents adds to our understanding of Agent Systems (for Thinkers)
For me the underlying framework is not really interesting from an agent systems perspective and, to be fair, it is not supposed to be! This is smolagents after all, not bigandcomplexagents with lots of opinions. The most interesting aspect is the code execution vs task definition part. If I draw some parallels to the three layers of agency I defined in a previous post, where you have emergent agency (coming from the LLM), structured agency (coming from a control framework) and perceived agency (the actual interface), what smolagents is doing is pushing more of the agency into the emergent part. It is relinquishing control and allowing the LLM to actually write and execute the code.
What smolagents means for organisations (for Product Owners, CxOs)
I have to admit a certain level of bias here. I work with organisations in regulated environments, and I just cannot imagine these clients, who already have difficulty accepting basic LLM functionality, signing up to an agent framework where the agents write and execute their own code. In addition, there is a big step from what smolagents currently does to everything you need for an agent framework running a production-level application. As such, it is hard for me to see this as anything other than an interesting experimental approach.
Having said that, I can imagine that some organisations with a bigger appetite for risk and their own specific knowledge and opinions around what else they want to build on top of smolagents could adopt this library as a great kernel from which to start.