<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Agents Decoded]]></title><description><![CDATA[AI Agent academic, tech and business reviews and analysis - for builders and thinkers. ]]></description><link>https://www.agentsdecoded.com</link><image><url>https://substackcdn.com/image/fetch/$s_!hDWs!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db99ed-e99b-43b2-bcd5-035e248384e9_1024x1024.png</url><title>Agents Decoded</title><link>https://www.agentsdecoded.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 11:16:49 GMT</lastBuildDate><atom:link href="https://www.agentsdecoded.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Istos srl]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[agentsdecoded@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[agentsdecoded@substack.com]]></itunes:email><itunes:name><![CDATA[Ronald Ashri]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ronald Ashri]]></itunes:author><googleplay:owner><![CDATA[agentsdecoded@substack.com]]></googleplay:owner><googleplay:email><![CDATA[agentsdecoded@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ronald Ashri]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[CIAO👋 Core Components: Orchestration — Coordinating Conversations, Interfaces, and Agents]]></title><description><![CDATA[Orchestration requirements come from two places: the problem you're solving and the architecture you chose. 
Understanding both is essential to getting coordination right.]]></description><link>https://www.agentsdecoded.com/p/ciao-core-components-orchestration</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ciao-core-components-orchestration</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Tue, 31 Mar 2026 12:46:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ed1Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><p>This article is part of a series exploring what an architectural framework for AI-powered applications would look like. We called it <em><a href="https://www.agentsdecoded.com/p/ciao-a-unifying-architectural-framework">CIAO&#128075; </a>, capturing four core components: Conversations, Interfaces, Agents and Orchestration.</em> Previous instalments have covered <em><a href="https://www.agentsdecoded.com/p/ciao-core-components-conversations">Conversations</a></em>, <em><a href="https://www.agentsdecoded.com/p/ciao-core-components-interfaces">Interfaces</a></em>, and <em><a href="https://www.agentsdecoded.com/p/ciao-core-components-agents-how-to">Agents</a>.</em></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ed1Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ed1Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Ed1Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Ed1Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Ed1Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ed1Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156799,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/192696025?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Ed1Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Ed1Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Ed1Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Ed1Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fade66650-8185-4ba5-8e4d-13a8bebb3e2e_2560x1440.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>With Conversations, Interfaces, and Agents each explored in depth, we turn to the component that connects them. Orchestration often gets treated as implementation wiring, the thing you figure out after the interesting design decisions have been made. That can lead to significant challenges later, when coordination gaps surface in production rather than in design.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Agents Decoded! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kRms!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kRms!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 424w, https://substackcdn.com/image/fetch/$s_!kRms!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 848w, https://substackcdn.com/image/fetch/$s_!kRms!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 1272w, https://substackcdn.com/image/fetch/$s_!kRms!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kRms!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png" width="1456" 
height="733" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64668932-6851-453b-ab63-e106d17b4269_1497x754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:733,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kRms!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 424w, https://substackcdn.com/image/fetch/$s_!kRms!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 848w, https://substackcdn.com/image/fetch/$s_!kRms!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 1272w, https://substackcdn.com/image/fetch/$s_!kRms!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Orchestration is a distinct architectural concern with its own requirements, its own design space, and its own failure modes. 
In the same way that Conversations merit deliberate design rather than defaulting to whatever message structure an API offers, Orchestration merits deliberate analysis rather than emerging as a byproduct of agent and interface decisions.</p><p>A working definition: </p><blockquote><p><strong>Orchestration is the runtime coordination that determines what happens next, which component handles it, and how information flows between them.</strong></p></blockquote><h2>Two Sources of Orchestration Requirements</h2><p>Following the same principle that guides the <a href="https://www.agentsdecoded.com/p/ciao-core-components-agents-how-to">Agents</a> article, namely that we need to understand the problem before reaching for solutions, we can identify two distinct sources of orchestration requirements.</p><p><strong>Domain-derived requirements</strong> are constraints that exist in the problem space before any architecture decisions are made. They are discovered through domain analysis, not created through design. 
These include:</p><ul><li><p><strong>Sequencing constraints</strong>: dependencies inherent in the domain (fraud assessment must precede payout authorisation)</p></li><li><p><strong>Consistency requirements</strong>: how aligned shared state must be across participants</p></li><li><p><strong>Accountability chains</strong>: traceability of decisions to responsible parties</p></li><li><p><strong>Human involvement requirements</strong>: mandated checkpoints and approvals</p></li><li><p><strong>Temporal constraints</strong>: how fast coordination itself must happen</p></li><li><p><strong>Recovery requirements</strong>: what must happen when something fails</p></li><li><p><strong>Observability requirements</strong>: what must be visible, when, and to whom</p></li><li><p><strong>Cross-component policy coherence</strong>: how tightly C, I, and A policies must align</p></li></ul><p>These are the non-negotiable elements (unless, that is, one decides to reach for a core organisational re-design!). They set the floor for any orchestration approach.</p><p><strong>Architecture-induced requirements</strong> emerge from design decisions made across Conversations, Interfaces, and Agents. Decomposing work across multiple agents creates handoff and routing requirements. Supporting multiple interface modalities creates translation and synchronisation needs. Structuring conversations with explicit context introduces state management overhead. None of these coordination challenges exist in the problem domain. They are consequences of architectural choices.</p><p>This separation gives architects a useful design heuristic: <strong>if architecture-induced orchestration requirements grow disproportionate to domain-derived ones, the architecture may be more complex than the problem demands.</strong></p><h3>Making this concrete</h3><p>Consider an insurance claims processing application. 
The domain-derived orchestration requirements are substantial and exist regardless of how you build the system. Fraud assessment must complete before payout authorisation. The regulator requires a decision audit trail. Human approval is mandated above a monetary threshold. Certain document types must be verified before assessment can begin.</p><p>Now layer in architectural decisions. You choose three specialised agents: one for document intake, one for assessment, one for payout. You support both voice and text interfaces for the claimant. You maintain separate conversation threads for the customer and the adjuster.</p><p>Each of these decisions creates additional orchestration requirements. The three agents need handoff protocols. The two interface modalities need synchronisation so that a claimant who starts on the phone and switches to a web form does not lose context. The separate conversation threads need state coordination so the adjuster&#8217;s view reflects what the customer has provided.</p><p>The domain-derived requirements are the cost of the problem. The architecture-induced requirements are the cost of the solution. Both are real, but only the second category is within your direct control to reduce.</p><h2>A Taxonomy of Orchestration Approaches</h2><p>With requirements understood, we can examine the design space for meeting them. The primary dimension is <strong>locus of control</strong>, what decides what happens next.</p><p><strong>Zero orchestration</strong> applies when there is a single prompt, a single model call, and a single response. In essence, there is almost nothing to coordinate. This is appropriate when domain-derived requirements are minimal and the application involves one agent, one interface, and one conversation. A document Q&amp;A tool that takes a question and returns an answer is a good example. There is no sequencing, no handoff, no state to manage across components. 
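</p><p>As a sketch of how little there is to coordinate in this case, a zero-orchestration tool can be a single function. The <code>call_model</code> function below is a hypothetical stand-in for whatever model client you use, not a specific API:</p>

```python
# Zero orchestration: one prompt, one model call, one response.
# `call_model` is a hypothetical placeholder for a real LLM client call;
# there is no routing, no handoff, and no cross-component state.

def call_model(prompt: str) -> str:
    # Stand-in for an actual model invocation (e.g. a provider SDK call).
    return f"[model answer for prompt of {len(prompt)} chars]"

def answer_question(document: str, question: str) -> str:
    # The entire "coordination" is assembling one prompt and returning one answer.
    prompt = f"Document:\n{document}\n\nQuestion: {question}"
    return call_model(prompt)
```

<p>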
If your requirements genuinely fit here, adding orchestration adds complexity without value.</p><p><strong>Model-intrinsic orchestration</strong> lets the model itself reason about task decomposition, sequencing, tool selection, and delegation. The model creates its own coordination logic at inference time. This is powerful and flexible, but the coordination reasoning is embedded in model behaviour. That has implications: observability is harder because the model&#8217;s routing decisions are not declared in advance. Accountability is harder because the path through the system is emergent rather than defined.</p><p><strong>Framework-imposed orchestration</strong> defines coordination structures externally. State machines, directed graphs, explicit routing rules. Agents operate within defined nodes. This offers predictability and auditability. You can inspect the coordination logic without running the system. You can verify that compliance requirements are met structurally rather than hoping the model respects them. The cost is upfront design effort and reduced flexibility.</p><p><strong>Hybrid orchestration</strong> combines the two. External structure defines phases, boundaries, and guardrails. Within those boundaries, model-intrinsic reasoning handles coordination. The insurance claims example might use framework-imposed orchestration for the overall process flow (intake &#8594; assessment &#8594; approval &#8594; payout) while allowing model-intrinsic reasoning within the assessment phase to determine which checks to run and in what order.</p><p>Two secondary dimensions further characterise the design space. </p><p><strong>Static vs dynamic topology</strong>: is the coordination pattern fixed at design time or can it adapt at runtime? 
</p><p><strong>Centralised vs distributed</strong>: is there a single orchestrating component or do multiple components negotiate coordination among themselves?</p><p>The right position on these dimensions is driven by requirements. High accountability, observability, and temporal constraints favour more explicit, centralised, and static approaches. Lower requirements on those dimensions create room for model-intrinsic, distributed, and dynamic coordination. Now, keep in mind that this is not a maturity ladder. Zero orchestration should not be considered primitive and hybrid is not necessarily advanced. Each fits different requirement profiles.</p><h2>Orchestration, Governance, and Policies</h2><p>Three concepts that are closely related but architecturally distinct.</p><p><strong>Orchestration</strong> is operational. It is the runtime mechanism that routes work, manages state, sequences steps, and handles failures.</p><p><strong>Governance</strong> is normative. It defines the boundaries within which orchestration operates: what is permissible, who is accountable, how compliance is verified. Governance should remain stable even when orchestration implementations change. You should be able to rework your coordination approach without renegotiating your governance model. If you cannot, the two are too tightly coupled.</p><p><strong>Policies</strong> operate at multiple levels within a CIAO application. Conversation policies govern dialogue interaction, who can speak, what can be said, what actions are permitted. These are owned and enforced by the Conversation component. Agent constraints define autonomy boundaries and permitted actions, owned by each Agent. Orchestration policies govern coordination behaviour, routing rules, retry strategies, escalation triggers, owned by the Orchestration component. 
Domain policies capture business rules and regulatory requirements, defined through governance and enforced across components.</p><p>The architectural principle: each component enforces its own policies, but orchestration needs visibility into the policies of other components to coordinate coherently. It cannot route a task to an agent whose autonomy level does not permit that kind of decision. It cannot initiate a conversation that violates a conversation policy. Governance is the layer that ensures consistency across these policy boundaries.</p><h2>Orchestration&#8217;s Relationship to C, I, and A</h2><p>Orchestration has a bidirectional relationship with each of the other three components, and this is what distinguishes it within the framework.</p><p><strong>Conversations</strong> generate events and state changes that inform orchestration decisions. A customer providing a key piece of information in conversation might trigger orchestration to route that information to an assessment agent. Orchestration also routes information into and across conversations, and may initiate or close conversations based on workflow needs.</p><p><strong>Interfaces</strong> produce turns that orchestration must process and route. Orchestration determines what gets rendered back and through which interface, particularly when an application supports multiple modalities. A voice interface and a text interface presenting the same conversation require orchestration to manage the translation and synchronisation.</p><p><strong>Agents</strong> receive work assignments through orchestration and return results. But the relationship is not purely top-down. Agents&#8217; autonomy levels, capabilities, and constraints shape what orchestration can ask of them. An agent with limited autonomy cannot be assigned open-ended reasoning tasks. 
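</p><p>That first constraint can be enforced structurally rather than left to model judgement. A minimal sketch, with illustrative names only (<code>Agent</code>, <code>Task</code>, <code>route</code> and the autonomy scale are assumptions, not a prescribed API):</p>

```python
# Sketch: orchestration consults an agent's declared autonomy level before
# routing work to it. All names here are illustrative assumptions.
from dataclasses import dataclass

AUTONOMY_LEVELS = ["none", "limited", "supervised", "high"]  # ordered low to high

@dataclass
class Agent:
    name: str
    autonomy: str  # one of AUTONOMY_LEVELS

@dataclass
class Task:
    description: str
    required_autonomy: str  # minimum autonomy the task demands

def route(task: Task, agents: list[Agent]) -> Agent:
    """Return the first agent whose declared autonomy meets the task's floor."""
    needed = AUTONOMY_LEVELS.index(task.required_autonomy)
    for agent in agents:
        if AUTONOMY_LEVELS.index(agent.autonomy) >= needed:
            return agent
    # Refusal is structural: no amount of model persuasion can route past it.
    raise ValueError(f"no agent is permitted to handle: {task.description}")
```

<p>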
An agent with high autonomy may complete work without checking back, which orchestration needs to account for.</p><p>This bidirectional dependency across all three components is what makes orchestration the connective tissue of a CIAO application, and why its requirements emerge from two sources rather than one.</p><h2>For Builders</h2><p>When designing orchestration for an AI-powered application:</p><ol><li><p><strong>Start by mapping domain-derived orchestration requirements.</strong> These are your non-negotiable constraints. They exist before you write a line of code and they will outlast any specific implementation.</p></li></ol><ol start="2"><li><p><strong>Track architecture-induced requirements as you design.</strong> As you make decisions across Conversations, Interfaces, and Agents, note the coordination overhead each decision creates. Ask whether it is proportionate to the problem you are solving.</p></li></ol><ol start="3"><li><p><strong>Let your accountability and observability requirements guide locus of control.</strong> If regulators need to audit decision paths, framework-imposed or hybrid orchestration is likely necessary. If you are building an internal tool with low stakes and high tolerance for variability, model-intrinsic orchestration may be sufficient.</p></li></ol><ol start="4"><li><p><strong>Be explicit about which policies live at which layer.</strong> Ambiguity about who enforces what leads to gaps, usually discovered in production. Document the policy boundaries between Conversations, Agents, and Orchestration.</p></li></ol><ol start="5"><li><p><strong>Test governance independence.</strong> Can you change your orchestration implementation without changing your governance model? If you cannot, the two may be too tightly coupled. The goal is operational flexibility within normative stability.</p></li></ol><div><hr></div><p>Orchestration requirements come from two sources: the problem domain and the architecture you choose. 
Keeping that distinction clear is the most useful thing an architect can do early. If your coordination overhead is primarily architecture-induced, that is a signal to simplify. If it is primarily domain-derived, the problem genuinely requires sophisticated orchestration and you should invest accordingly.</p><p>The right amount of orchestration is the minimum that satisfies both sources of requirements. No less, because the domain constraints are non-negotiable. No more, because every additional coordination mechanism is complexity you must maintain, debug, and explain.</p><div><hr></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Agents Decoded! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[CIAO👋 Core Components: Agents - How to Design AI Agents by Starting with the Problem, Not the Technology]]></title><description><![CDATA[We're building sophisticated systems on shaky conceptual foundations - separating problem space from solution space helps bring order.]]></description><link>https://www.agentsdecoded.com/p/ciao-core-components-agents-how-to</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ciao-core-components-agents-how-to</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Wed, 06 Aug 2025 07:18:32 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!GIRS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>CIAO&#128075; is a unifying architectural framework for AI-powered applications that organizes systems into four core components: Conversations (managing dialogue state), Interfaces (handling human interaction), Agents (automated decision-makers), and Orchestration (coordinating everything).</em></p><p><em>The framework prioritizes "what" applications need to do (like understanding natural language or reasoning about actions) over "how" they're implemented (specific technologies like LLMs), allowing systems to adapt as technologies evolve.</em></p><p><em>It's designed for the new generation of software that is conversation-centric, handles multiple input/output types, and features increasing levels of automated decision-making.</em></p><p>Check out the <a href="https://www.agentsdecoded.com/p/ciao-a-unifying-architectural-framework">introduction</a>, our deep dive into <a href="https://www.agentsdecoded.com/p/ciao-core-components-conversations">Conversations</a>, <a href="https://www.agentsdecoded.com/p/ciao-core-components-interfaces">Interfaces</a> or read on here for Agents.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GIRS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!GIRS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GIRS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GIRS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GIRS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GIRS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146020,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/170153651?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GIRS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GIRS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GIRS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GIRS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd34485f-bb2b-4ead-9b6b-b72999795873_2560x1440.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" 
fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Introduction</h1><p>The explosion of interest in AI agents has created an unexpected problem. In the rush to build "agentic" systems, we've adopted the terminology without developing the structured thinking to go with it. Every AI-powered application is now described as having "agents," but ask ten developers what makes something an agent and you'll get eleven different answers.</p><p>This rapid adoption has led to confusion that slows down development and complicates communication. Teams struggle to articulate what their agents actually do versus what they need them to do. Architecture decisions are made based on hype or FOMO rather than systematic analysis. </p><div class="pullquote"><p>We're building sophisticated systems on shaky conceptual foundations.</p></div><p>Agents went from a relatively quiet corner of AI research to the forefront of how we describe LLM-powered applications in less than two years. But in our eagerness to build, we've skipped the crucial step of developing a coherent framework for thinking about them. In this article, we continue fixing that by establishing a structured approach to understanding Agents as a core component of your AI-powered application.</p><h1>Why is the term agent useful?</h1><p>Let&#8217;s start here. Why do we even need to use this term? We&#8217;ve been describing applications (automated or not) without referring to agents for decades. What&#8217;s changed?</p><p>For me the core utility comes from having a better abstraction to describe your application. 
Using the right abstraction brings clarity, leads to more robust designs, improves communication across the team and speeds up development. We are clearly building a new type of software, with levels of autonomy that were not possible before, so using a new abstraction is entirely sensible. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b31o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b31o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png 424w, https://substackcdn.com/image/fetch/$s_!b31o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png 848w, https://substackcdn.com/image/fetch/$s_!b31o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png 1272w, https://substackcdn.com/image/fetch/$s_!b31o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b31o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png" width="1456" height="643" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:643,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:218758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/170153651?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b31o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png 424w, https://substackcdn.com/image/fetch/$s_!b31o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png 848w, https://substackcdn.com/image/fetch/$s_!b31o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png 1272w, https://substackcdn.com/image/fetch/$s_!b31o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38d5f196-5bdd-404b-a98a-650fdcbb2b0e_2136x944.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The evolution of software programming abstractions from procedural to agent-based programming. </figcaption></figure></div><p>Of course, it is not enough to just say we are using the abstraction of agents. We then need to build up a set of concepts and the relationships between those concepts that give us a coherent way of describing and comparing a range of different solutions. </p><p>As we mentioned in the intro, a core tenet of &#128075;CIAO is that we clearly distinguish the what from the how. We need to distinguish the problem space from the solution space. The problem space is defined by what you are trying to solve. The solution space is defined by the various options in terms of how you are trying to solve it. 
</p><p>Let&#8217;s get going!</p><h1>Designing agents starting from the problem space</h1><p>Separating <strong>what must be solved</strong> from <strong>how we choose to solve it</strong> remains one of the simplest, most useful lenses for AI system design. By first cataloguing the <em>agent-independent</em> properties of a task and its environment, we can judge inherent difficulty before deciding whether to deploy a rules engine, a model-based planner, or a large language model. Conversely, listing <em>agent-dependent</em> properties clarifies exactly which architectural knobs we can turn to meet those demands.</p><h2><strong>Agent-Independent (Problem-Space) Dimensions</strong></h2><p>These characteristics exist before any particular agent shows up:</p><ul><li><p><strong>Task / Goal Characteristics</strong> &#8211; What is the scope, the domain, and the baseline complexity?</p><ul><li><p><em>Example: route-planning for a delivery van versus playing chess.</em></p></li></ul></li><li><p><strong>Goal Verifiability</strong> &#8211; Can success be measured objectively, or does it rely on subjective judgment?</p><ul><li><p><em>Check-mate is clear-cut; &#8220;best vacation plan&#8221; is not.</em></p></li></ul></li><li><p><strong>Environment Properties</strong> &#8211; How observable is the world? Is it deterministic or stochastic? Static or dynamic? Discrete or continuous?</p><ul><li><p><em>A chess board is fully observable and static; a stock market is partially observable and stochastic.</em></p></li></ul></li><li><p><strong>Multi-Agent Context</strong> &#8211; Are we in a multi-agent setting? 
Is that setting cooperative, competitive, or mixed?</p><ul><li><p><em>Self-driving fleets cooperate; high-frequency traders compete.</em></p></li></ul></li><li><p><strong>Baseline Temporal Constraints</strong> &#8211; What deadlines does the external world impose?</p><ul><li><p><em>The exchange&#8217;s one-second trading tick leaves little room for deliberation.</em></p></li></ul></li><li><p><strong>Intrinsic Safety &amp; Alignment Risk</strong> &#8211; How much harm could failure cause?</p><ul><li><p><em>Incorrect medical-dose planning carries far more risk than mis-classifying a meme.</em></p></li></ul></li><li><p><strong>Required Steps/Plan Complexity</strong> &#8211; Does achieving the goal follow a fixed sequence, or require dynamic adaptation?</p><ul><li><p>Assembling furniture follows instructions; navigating rush-hour traffic requires constant replanning.</p></li></ul></li></ul><p>These dimensions help us understand the fundamental difficulty of a problem before we even consider what kind of agent architecture to deploy. 
A task that scores high on stochasticity, has low goal verifiability, operates in a competitive multi-agent environment, and carries significant safety risks presents inherent challenges that no agent architecture can simply wish away.</p><h2><strong>Agent-Dependent (Solution-Space) Dimensions</strong></h2><p>Once we understand the problem space, we can examine the architectural choices that determine how well an agent might handle it:</p><p><strong>Learning Capabilities</strong> &#8211; Can the agent improve through experience, or is its behavior fixed?</p><ul><li><p>From simple parameter tuning to sophisticated meta-learning and continual learning without forgetting.</p></li></ul><p><strong>Autonomy Level</strong> &#8211; How much human oversight does the agent require?</p><ul><li><p>Following a scheme similar to the <a href="https://www.synopsys.com/blogs/chip-design/autonomous-driving-levels.html">SAE levels</a> (0-5), we can reason about the level of autonomy, from human-controlled with agent assistance to fully autonomous operation.</p></li></ul><p><strong>Planning Architecture</strong> &#8211; How does the agent decide what to do?</p><ul><li><p>Reactive (stimulus-response), deliberative (model-based planning), or hybrid approaches.</p></li><li><p>The choice often depends on temporal constraints from the problem space.</p></li></ul><p><strong>Coordination Capabilities</strong> &#8211; Can the agent work with others?</p><ul><li><p>Communication protocols, negotiation strategies, and team formation abilities.</p></li><li><p>Critical when the problem space includes multi-agent contexts.</p></li></ul><p><strong>Computational Architecture</strong> &#8211; What resources does the agent need?</p><ul><li><p>Edge devices with millisecond response but limited compute, or cloud-based with powerful models but network latency.</p></li><li><p>Memory requirements, energy consumption, and bandwidth constraints.</p></li></ul><p><strong>Perception and Action Capabilities</strong> &#8211; What 
can the agent sense and do?</p><ul><li><p>Sensor modalities (text, vision, audio, proprioceptive).</p></li><li><p>Action space (discrete choices, continuous control, communication acts).</p></li><li><p>Internal representations (symbolic, neural, hybrid).</p></li></ul><p><strong>Temporal Processing</strong> &#8211; How does the agent handle time?</p><ul><li><p>Real-time responsiveness versus batch processing.</p></li><li><p>Planning horizons from immediate reactions to long-term strategic thinking.</p></li><li><p>Memory architecture for maintaining context over extended interactions.</p></li></ul><h2>Bringing it together: Matching solutions to problems</h2><p>The power of this separation becomes clear when designing systems. Let's examine two contrasting examples:</p><h3>Example 1: Recipe Recommendation Agent</h3><p><strong>Problem Space Analysis:</strong></p><ul><li><p>Goal verifiability: Low (taste is subjective, success depends on user satisfaction)</p></li><li><p>Environment: Mostly observable, deterministic (ingredient availability, dietary restrictions are known)</p></li><li><p>Safety risk: Low (worst case: unenjoyable meal, unless handling allergies)</p></li><li><p>Temporal constraints: Relaxed (users can wait 10-30 seconds for suggestions)</p></li><li><p>Multi-agent context: None (single user interaction)</p></li><li><p>Plan complexity: Moderate (matching ingredients, techniques, time constraints)</p></li></ul><p><strong>Solution Space Decisions:</strong></p><ul><li><p>Learning: Basic preference learning from user feedback</p></li><li><p>Autonomy: Level 4 (agent suggests, user can modify)</p></li><li><p>Planning: Simple deliberative (filter &#8594; rank &#8594; present options)</p></li><li><p>Computation: Can run entirely on edge devices</p></li><li><p>Perception: Text input, possibly image recognition for ingredients</p></li><li><p>Coordination: Not needed for basic version</p></li></ul><p>This is a forgiving problem space where we can experiment with 
higher autonomy and simpler architectures. The low stakes allow for learning through trial and error.</p><h3>Example 2: Medical Diagnosis Agent</h3><p><strong>Problem Space Analysis:</strong></p><ul><li><p>Goal verifiability: Mixed (some conditions have clear tests, others require judgment)</p></li><li><p>Environment: Partially observable, stochastic (hidden symptoms, disease progression)</p></li><li><p>Safety risk: High (misdiagnosis can be fatal)</p></li><li><p>Temporal constraints: Moderate (emergency vs. routine care)</p></li><li><p>Multi-agent context: Cooperative (must work with healthcare team)</p></li><li><p>Plan complexity: High (differential diagnosis, test ordering, treatment planning)</p></li></ul><p><strong>Solution Space Decisions:</strong></p><ul><li><p>Learning: Continual learning to incorporate new medical knowledge</p></li><li><p>Autonomy: Level 2-3 (agent assists, human decides)</p></li><li><p>Planning: Hybrid (quick triage reactions + deliberative diagnosis)</p></li><li><p>Computation: Cloud-based for complex reasoning, edge for emergency triage</p></li><li><p>Perception: Multi-modal (text, images, lab results, vital signs)</p></li><li><p>Coordination: Must integrate with electronic health records, other clinical systems</p></li></ul><p>The contrast is striking. The recipe recommender can be playful and experimental, while the medical agent demands conservative design choices, extensive validation, and human oversight. This structured approach prevents both over-engineering (building unnecessary capabilities) and under-engineering (deploying inadequate agents). It makes trade-offs explicit: accepting lower autonomy to manage safety risks, or investing in expensive computational resources to meet temporal constraints.</p><h2>Looking ahead: The evolving landscape of agents</h2><p>As we build more sophisticated AI-powered applications, the agent abstraction becomes increasingly valuable. 
But its value lies not in the buzzword itself, but in the structured thinking it enables. By clearly separating what must be solved from how we choose to solve it, we can:</p><ol><li><p><strong>Communicate clearly</strong> across teams &#8211; product managers can specify problem-space requirements while engineers explore solution-space options</p></li><li><p><strong>Evaluate systematically</strong> &#8211; distinguishing between difficult problems and poorly-designed agents</p></li><li><p><strong>Evolve gracefully</strong> &#8211; as new technologies emerge (better models, new architectures), we can swap solution approaches while keeping problem definitions stable</p></li><li><p><strong>Design responsibly</strong> &#8211; explicitly considering safety and alignment risks before they become afterthoughts</p></li></ol><p>The &#128075;CIAO framework's emphasis on Agents as one of four core components (alongside Conversations, Interfaces, and Orchestration) reflects a fundamental shift in how we build software. No longer are we just coding procedures or training models &#8211; we're designing entities with increasing autonomy that must navigate complex, uncertain environments while maintaining alignment with human values and goals.</p><p>This shift requires new abstractions, new frameworks, and new ways of thinking. The agent-independent versus agent-dependent distinction is just one tool in this evolving toolkit, but it's a powerful one. It forces us to be honest about what we're asking our systems to do, and thoughtful about how we equip them to do it.</p><p>As you design your next AI-powered application, start with the problem space. Understand the inherent challenges before reaching for solutions. Then, and only then, explore the solution space with clear eyes about what capabilities your agents truly need. This discipline &#8211; separating the what from the how &#8211; will give you a head start in building the right sort of solution for the problem at hand. 
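To make this discipline concrete in code, here is a minimal sketch (Python; the field names and the autonomy heuristic are illustrative assumptions, not part of the CIAO framework) of recording a problem-space analysis, like the recipe and medical examples above, and deriving a conservative autonomy cap from it:

```python
from dataclasses import dataclass

@dataclass
class ProblemSpace:
    """Agent-independent properties of a task, captured before any design work."""
    goal_verifiability: str   # "low" | "mixed" | "high"
    observability: str        # "partial" | "full"
    safety_risk: str          # "low" | "moderate" | "high"
    temporal_pressure: str    # "relaxed" | "moderate" | "strict"
    multi_agent: bool         # does the task involve other agents?
    plan_complexity: str      # "low" | "moderate" | "high"

def max_autonomy_level(p: ProblemSpace) -> int:
    """Crude heuristic: cap an SAE-style autonomy level (0-5) by problem risk."""
    cap = {"low": 5, "moderate": 4, "high": 3}[p.safety_risk]
    if p.goal_verifiability == "low":
        cap = min(cap, 4)  # subjective success criteria keep a human in the loop
    return cap

# The two worked examples, encoded as problem-space records
recipe = ProblemSpace("low", "full", "low", "relaxed", False, "moderate")
diagnosis = ProblemSpace("mixed", "partial", "high", "moderate", True, "high")
```

Under these assumed inputs, the recipe agent is capped at Level 4 and the medical agent at Level 3, mirroring the trade-offs discussed above.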
</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[CIAO👋 Core Components: Interfaces]]></title><description><![CDATA[In this article we discuss Interfaces and explore using the Conversational Turn as the constant across any type of Interface - forming the contract between Conversation components and Interfaces.]]></description><link>https://www.agentsdecoded.com/p/ciao-core-components-interfaces</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ciao-core-components-interfaces</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Mon, 07 Jul 2025 16:32:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ktEb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><p><em>CIAO&#128075; is a unifying architectural framework for AI-powered applications that organizes systems into four core components: Conversations (managing dialogue state), Interfaces (handling human interaction), Agents (automated decision-makers), and 
Orchestration (coordinating everything).</em></p><p><em>The framework prioritizes "what" applications need to do (like understanding natural language or reasoning about actions) over "how" they're implemented (specific technologies like LLMs), allowing systems to adapt as technologies evolve.</em></p><p><em>It's designed for the new generation of software that is conversation-centric, handles multiple input/output types, and features increasing levels of automated decision-making.</em></p><p>Check out the <a href="https://www.agentsdecoded.com/p/ciao-a-unifying-architectural-framework">introduction</a>, our deep dive into <a href="https://www.agentsdecoded.com/p/ciao-core-components-conversations">Conversations</a> or read on here for Interfaces. </p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ktEb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ktEb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ktEb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ktEb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!ktEb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ktEb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/def20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150562,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/167707024?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ktEb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ktEb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!ktEb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ktEb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef20ae2-0efb-4228-b976-210159c53ff7_2560x1440.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>LLMs as a technology may eventually be superseded. 
It is entirely possible that we will come up with new ways to &#8220;understand&#8221; language and simulate reasoning. What is unlikely to go away, however, having survived millennia of human interaction, is conversation as a central part of digital interfaces and the main way we interact with machines. In that respect, the switch has flipped: the expectation is set, and there is no going back. Once someone has experienced the ease with which they can just ask their question in ChatGPT or Claude, how can you remove that capability?</p><p>As such, when we think of our Interfaces within the context of CIAO&#128075;, the starting point is undoubtedly the conversational interface. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Kpp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Kpp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg 424w, https://substackcdn.com/image/fetch/$s_!5Kpp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg 848w, https://substackcdn.com/image/fetch/$s_!5Kpp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!5Kpp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Kpp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg" width="1456" height="858" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:858,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64589,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/167707024?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Kpp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg 424w, https://substackcdn.com/image/fetch/$s_!5Kpp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!5Kpp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!5Kpp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51669aed-df61-4039-ad0f-c94a5fa7ed34_1779x1048.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Chat - the ubiquitous interface for LLM-powered assistants</figcaption></figure></div><h2>It starts with text, but 
quickly evolves</h2><p>The basic interaction mode may be a user typing out a question, but things have quickly moved on. Users can now interact across different modalities, and in ways that are not always even that clear. </p><p>We can type out the question, speak it, or provide attachments by uploading files; and if we break out of the screen, we can think of gestures and other types of ambient signals (location, orientation, expression and so on) - all of which contribute to sending a message to the software and leave it with the job of interpreting intent. </p><p>Similarly, the response from the interface can start as text but quickly evolve from there to include a variety of signals and <em>artifacts</em> that encapsulate, in one form or another, the answer to our query. Recently, Anthropic has doubled down on the idea of the <a href="https://support.anthropic.com/en/articles/9487310-what-are-artifacts-and-how-do-i-use-them">artifact</a> - an output that sits apart from the main conversation and can move to front-stage or take a back-seat depending on context. An artifact could be a longer piece of text that will be iterated on, a piece of code or even a full-blown application. </p><p>Ultimately what we have is a very fluid environment that adapts depending on context. When we architect applications around this very fluid concept of interface, the challenge becomes identifying what we can hang on to. What remains constant irrespective of the specific interface, so that we can use a single abstraction within our architecture to represent any type of interface?</p><p>The answer lies not in the interface itself, but in what flows through it: <strong>the conversational turn</strong>.</p><h2>The Universal Pattern</h2><p>Think about it - whether you're typing a question, speaking to your device, uploading a file, or even making a gesture, you're fundamentally doing the same thing: taking a turn in a conversation. 
This pattern has survived millennia of human interaction because it represents something fundamental about how we exchange meaning.</p><p>In the context of CIAO&#128075;, this abstraction sits at the intersection of Conversations and Interfaces, serving as the bridge between them. It's the atomic unit of interaction that remains consistent regardless of how it's expressed or received.</p><h2>Anatomy of a Conversational Turn</h2><p>What makes up this abstraction? At its core, every turn in a conversation - regardless of interface - contains:</p><p><strong>Content</strong>: The multimodal payload itself. This could be text, audio, visual data, files, or even ambient signals like location or device orientation. The abstraction doesn't care about the specific format - it just needs to carry the content.</p><p><strong>Intent</strong>: The interpreted purpose behind the interaction. Whether spoken, typed, or gestured, there's always an underlying intent that needs to be captured and understood.</p><p><strong>Context Reference</strong>: Every turn exists within a conversational flow. It references what came before and influences what comes after. This context travels with each exchange, maintaining coherence across modality switches.</p><p><strong>Metadata</strong>: The supporting information that helps interpret the turn - timestamp, participant ID, modality type, device context, and other environmental factors that might influence interpretation.</p><p><strong>Artifacts</strong>: As Anthropic has demonstrated, responses increasingly include outputs that transcend simple replies - code, documents, applications. These artifacts are part of the turn but have their own lifecycle and interaction patterns.</p><h2>Why This Abstraction Matters</h2><p>This approach aligns perfectly with CIAO's principle of separating "what" from "how". The "what" is the fundamental need to exchange meaningful information between participants. 
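</p><p>Before turning to the "how", the anatomy above can be sketched as a minimal data structure. This is an illustration only - the field names are ours, not part of any CIAO&#128075; specification:</p>

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Artifact:
    """An output with its own lifecycle (code, document, application)."""
    kind: str     # e.g. "code", "document", "application"
    payload: Any

@dataclass
class ConversationalTurn:
    """The atomic unit of interaction, independent of interface modality."""
    content: Any                                  # text, audio bytes, file reference, ambient signal...
    intent: Optional[str] = None                  # interpreted purpose, filled in downstream
    context_ref: Optional[str] = None             # reference to the conversation this turn belongs to
    metadata: dict = field(default_factory=dict)  # timestamp, participant, modality, device context...
    artifacts: List[Artifact] = field(default_factory=list)

# An interface packages raw input into a turn; the rest of the system
# never needs to know whether it was typed, spoken, or gestured.
turn = ConversationalTurn(
    content="Summarise last quarter's sales",
    context_ref="conv-42",
    metadata={"modality": "voice", "participant": "user-1"},
)
```

<p>Whatever the concrete shape, the point is that every modality collapses into the same structure. 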
The "how" is the specific interface implementation.</p><p>Consider a practical example: You start by typing a question about data analysis. The system responds with text and a visualization artifact. You then speak a follow-up question while pointing at a specific part of the chart on your screen. Finally, you upload a CSV file with additional data.</p><p>Each of these interactions uses a different interface modality, but they're all turns in the same conversation. The abstraction allows the system to:</p><ul><li><p>Process each turn uniformly, regardless of input method</p></li><li><p>Maintain conversation coherence as you switch between interfaces</p></li><li><p>Enable agents to focus on intent and content rather than interface specifics</p></li><li><p>Allow orchestration to manage the flow without being coupled to interface details</p></li></ul><h2>Implementation Considerations</h2><p>In practice, this abstraction becomes a contract between Interfaces and Conversations. Interfaces are responsible for:</p><ol><li><p>Capturing raw input in whatever form it takes</p></li><li><p>Packaging it into a standardized turn structure</p></li><li><p>Enriching it with relevant metadata</p></li><li><p>Passing it to the Conversation component</p></li></ol><p>The Conversation component then:</p><ol><li><p>Maintains the dialogue state</p></li><li><p>Routes turns to appropriate Agents via Orchestration</p></li><li><p>Manages the overall flow and context</p></li><li><p>Returns responses that Interfaces can render appropriately</p></li></ol><p>This separation means we can add new interface types without changing the core conversation logic. Want to add a brain-computer interface? As long as it can package thoughts into turns, the rest of the system doesn't need to know or care.</p><h2>The Evolutionary Path</h2><p>This abstraction also gives us a framework for thinking about interface evolution. Early systems might support simple text turns. 
As they mature, they add:</p><ul><li><p>Multimodal input processing</p></li><li><p>Artifact generation and manipulation</p></li><li><p>Ambient context awareness</p></li><li><p>Collaborative features with multiple participants</p></li></ul><p>Each evolution extends the turn abstraction rather than replacing it. The conversation remains the constant, even as interfaces become more sophisticated.</p><h2>For Builders</h2><p>When designing AI-powered applications using CIAO&#128075;, start with the turn abstraction. Ask yourself:</p><ul><li><p>What constitutes a turn in my application's context?</p></li><li><p>What metadata do I need to capture for proper interpretation?</p></li><li><p>How will artifacts flow through the conversation?</p></li><li><p>What happens when users switch interfaces mid-conversation?</p></li></ul><p>By grounding your architecture in this abstraction, you create systems that can adapt to new interfaces while maintaining consistency and coherence. The specific technologies will evolve - LLMs may be superseded, new interaction paradigms will emerge - but the conversational turn will remain.</p><p>After all, as we said at the start, conversations have survived millennia of human interaction. By making the turn our architectural constant, we align our software with this fundamental human pattern. The interface may be fluid, but the conversation endures.</p><div><hr></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AgentsDecoded explores the intersection of AI agents, practical implementation, and strategic thinking. 
Subscribe for regular updates.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Understanding AI Agents through the Generality - Accuracy - Simplicity (GAS) Framework]]></title><description><![CDATA[Understanding and clearly articulating challenges and trade-offs is key to successful AI Agent deployments. A recent paper provides a great framework through which to express them.]]></description><link>https://www.agentsdecoded.com/p/understanding-ai-agents-through-the</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/understanding-ai-agents-through-the</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Fri, 04 Jul 2025 07:50:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ztSO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ztSO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ztSO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!ztSO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ztSO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ztSO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ztSO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1604010,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/167497440?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ztSO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ztSO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ztSO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ztSO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab3eef9f-5a00-48fc-bea6-203db7df5651_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In June 2025, Hasan et al published a really interesting and useful paper called "<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5287538">From Model Design to Organizational Design: Complexity Redistribution and Trade-Offs in Generative AI</a>". In that paper, they present the GAS (Generality - Accuracy - Simplicity) framework as a way to better understand how LLMs are impacting organizations and competitive strategy. </p><p>The GAS framework posits that there is an inherent trade-off in model design: "No model can perfectly replicate the phenomena it represents. Instead, effective models must balance competing priorities across three dimensions: generality, accuracy, and simplicity". Optimising one or two dimensions necessitates compromises in the third.</p><p><strong>Generality</strong>: A model&#8217;s <strong>ability to perform across diverse contexts, domains, or tasks</strong>. For example, a general-purpose LLM can translate policy documents, generate code, summarise wikis without requiring new syntactic knowledge. </p><p><strong>Accuracy</strong>: This denotes the <strong>degree to which outputs align with empirical observations or observable reality</strong>. This includes the correctness of information, the reliability of results, and the precision of outputs. For example, a fine-tuned model is sacrificing generality in the pursuit of increased accuracy in a specific domain. </p><p><strong>Simplicity</strong>: <strong>The effort required for users to understand</strong>, apply, or interact with the model, and its <strong>structural complexity</strong>. 
The typical chat interface to an LLM provides the end-user with a high degree of simplicity, but the application hides or abstracts away the underlying complexity required to run and maintain the LLM infrastructure that supports inference. </p><p>Crucially, organisations "<strong>cannot maximize these dimensions simultaneously</strong>" and thus model design "involves deliberate trade-offs among them." For example, "gains in accuracy may reduce either generality or simplicity." As the figure below illustrates, high-stakes, domain-specific work favors Accuracy over Generality, and so on. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4SkV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4SkV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png 424w, https://substackcdn.com/image/fetch/$s_!4SkV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png 848w, https://substackcdn.com/image/fetch/$s_!4SkV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png 1272w, https://substackcdn.com/image/fetch/$s_!4SkV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!4SkV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png" width="1456" height="1025" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1025,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:286842,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/167497440?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4SkV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png 424w, https://substackcdn.com/image/fetch/$s_!4SkV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png 848w, https://substackcdn.com/image/fetch/$s_!4SkV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4SkV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ed49fe-380e-4f6a-bc7c-164d35c649bf_1914x1348.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From the paper: &#8220;<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5287538">From Model Design to Organizational Design: Complexity Redistribution and Trade-Offs in Generative AI</a>&#8221;</figcaption></figure></div><p></p><h2><strong>Reducing Costs and Redistributing Complexity</strong></h2><p>Focussing on the simplicity dimension, the 
authors explain that what we are really doing when we introduce AI and LLMs is not simply reducing costs. While we reduce the cognitive cost of performing tasks for end users (i.e. introducing simplicity on the <em>front-end</em>), we redistribute complexity into downstream layers: governance complexity, infrastructure complexity, and so on.</p><p>They reference Tesler's Law to support their argument. Formulated in the 1980s by computer scientist Larry Tesler, it states that <strong>for any system, there is an inherent amount of complexity that cannot be eliminated, only redistributed</strong>. The critical question is who must deal with it: the user, or the designers and developers. </p><h2><strong>Understanding GAS in the context of AI Agents</strong></h2><p>While the paper focusses on the broad application of LLMs in organizational contexts, it is also a useful lens through which to consider the design of a specific AI Agent application. </p><h3>You can&#8217;t have it all</h3><p>The GAS framework provides the terminology to describe more formally what a lot of engineers and designers understand instinctively. You simply can&#8217;t have it all. Every orchestrator that promises to dynamically coordinate between capabilities, every agent framework that claims to handle any business process, every vendor pitch about "unlimited agency" - they're all dancing around the same uncomfortable truth. You can&#8217;t have it all. </p><h3>Perceived vs Actual agency</h3><p>Another way to apply the GAS framework to refine our AI Agent design is by considering what capabilities we are communicating to the user and in what ways we are constraining the overall system at different layers of the application. </p><p>In a previous article on AgentsDecoded we talked about the <a href="https://www.agentsdecoded.com/p/agents-all-the-way-down">role and perception of agency</a> across multiple layers of our application.
The user interface <em>communicates</em> a certain level of agency (or Generality) and ease of access (Simplicity) to our system; the control layer is where we can manage agency (typically in pursuit of Accuracy) and also introduce complexity; while the LLM is where a less controllable, emergent agency resides. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!538d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!538d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png 424w, https://substackcdn.com/image/fetch/$s_!538d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png 848w, https://substackcdn.com/image/fetch/$s_!538d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png 1272w, https://substackcdn.com/image/fetch/$s_!538d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!538d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png" width="1456" height="872"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:872,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!538d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png 424w, https://substackcdn.com/image/fetch/$s_!538d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png 848w, https://substackcdn.com/image/fetch/$s_!538d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png 1272w, https://substackcdn.com/image/fetch/$s_!538d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7154de-e2f3-4165-9ce1-f5a55edb3fe2_2414x1445.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The three layers of agency</figcaption></figure></div><p>The goal is to align the different layers. If your user interface promises the moon (high Generality and Simplicity) but your control layer can't deliver the precision (Accuracy) without an army of engineers maintaining it (complexity redistribution), you are heading for trouble.</p><h2>So What Should You Actually Do?</h2><p>For a successful AI Agent deployment you need to pick a clear spot on the GAS triangle and own it. </p><ul><li><p><strong>High Accuracy, Low Generality</strong>: Build domain-specific agents that focus on delivering tightly defined use cases reliably. You will need different agents for different processes but that is an acceptable cost of complexity to pay. </p></li><li><p><strong>High Generality, Lower Accuracy</strong>: Accept that your general-purpose agent will make mistakes. Build in human oversight. 
Make peace with the 80% solution.</p></li><li><p><strong>High Simplicity</strong>: Keep the front-end simple but be prepared to pay for it with backend complexity. Budget for it. Staff for it. Plan for it.</p></li></ul><p>The organisations that will win with AI Agents are those that understand these trade-offs and design around them, not those that pretend they don't exist.</p><h2>There is no free lunch</h2><p>Every "breakthrough" in agent capabilities is really just shifting complexity around. That new orchestration framework that promises to solve all your problems? It's  moving the complexity from your business logic into your infrastructure. That retrieval-augmented agent that never hallucinates? Congratulations, you've just signed up to maintain a knowledge base that needs constant updates.</p><p>The GAS framework isn't telling us something we don't already instinctively know. It's giving us the language to articulate why that demo that looked so good is now causing so many headaches in production.</p><p>Ultimately, the future of AI Agents isn't about transcending the GAS trade-offs. It's about getting increasingly better at managing them.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AgentsDecoded explores the intersection of AI agents, practical implementation, and strategic thinking. 
Subscribe for regular updates.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[You're Building AI Agents Backwards]]></title><description><![CDATA[Why agents as abstractions matter more than agents as technologies]]></description><link>https://www.agentsdecoded.com/p/youre-building-ai-agents-backwards</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/youre-building-ai-agents-backwards</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Thu, 26 Jun 2025 06:50:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lZOh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZOh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZOh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!lZOh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lZOh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lZOh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZOh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:450279,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/166787720?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!lZOh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lZOh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lZOh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lZOh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb722c288-2b9f-4b27-a780-a03c21a8d171_5760x3240.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Last Tuesday, I sat in a meeting that was supposed to be about automating customer onboarding. Fifteen minutes in, we were deep in a heated debate about whether to use GPT-4 or Claude, which vector database had the best recall rates, and what was the best framework.</p><p>The actual problem: &#8220;customers were waiting three days for account activation&#8221; was somehow forgotten. We'd lost the forest for the trees, or more accurately, lost the customer for the embeddings.</p><p>This scene plays out in conference rooms (real or virtual) everywhere, and it succinctly captures one of the biggest misunderstandings about where the value of an agent-based approach lies.</p><p>We tend to think of AI Agents as a specific technology. We then realise that it is actually not clear what this technology is. At least with LLMs we know whether we are using one or not. With agents, when is it that we are &#8220;doing it right&#8221;? As a result, we worry whether we are building &#8220;real&#8221; agents or are just writing software. How do we convince ourselves that the agents are &#8220;real&#8221;? Well, we end up making the technology more complex than it needs to be. We over-engineer in a quest to be credible, not because the problem requires it.</p><p>The way out? Think of agents as an abstraction in service of a strategy first, and a technology second. This isn't just semantic gymnastics. It's the key to cutting through the noise and actually delivering value.</p><h2>The Power of the Right Abstraction</h2><p>Abstractions are incredibly powerful tools and we use them extensively in everything we do. Think about it this way. 
When object-oriented programming emerged, its power wasn't in any specific language feature. It was in providing a new way to think about organizing code &#8211; encapsulating data and behavior together. The abstraction came first; the implementations followed.</p><p>AI Agents offer the same conceptual breakthrough for automation and decision-making. They give us a framework for thinking about:</p><p><strong>Encapsulation</strong>: We can package complex reasoning and decision-making into discrete entities, hiding the complexity while exposing the actions. Unlike traditional objects that encapsulate data and methods, agents encapsulate goals and capabilities.</p><p><strong>Goal-Directed Behavior</strong>: Instead of programming specific behaviors, we define desired outcomes. "Keep customer satisfaction above 90%" rather than "If customer says X, respond with Y."</p><p><strong>Environmental Awareness</strong>: Agents sense and act within their environment, whether that's reading emails, querying databases, or interacting with other systems.</p><p><strong>Autonomy Spectrum</strong>: We can dial autonomy up or down based on the problem. Some agents need tight guardrails; others can be given broad discretion.</p><p>This abstraction is powerful because it lets us design systems that mirror how we naturally think about delegation. When you delegate work to a human colleague, you don't specify every neuron firing &#8211; you describe goals, constraints, and available resources. Agent abstractions let us do the same with machines.</p><h2>Why This Matters Now More Than Ever</h2><p>The current AI landscape is drowning in technical specifications. Every vendor wants to tell you about their proprietary architecture, their unique implementation, their special certification system. They're asking you to care about whether their system qualifies as a "true" agent according to some arbitrary technical checklist.</p><p>This is backwards. 
The question isn't whether your system uses an LLM to control workflow iteration or meets some three-star rating system. The question is: What work can you delegate to it? What problems does it solve? How does it transform your operations?</p><p>When you start with agents as abstractions rather than implementations, everything becomes clearer:</p><ol><li><p><strong>Technology becomes flexible</strong>: Maybe you use an LLM for natural language understanding but deterministic rules for critical decisions. The agent abstraction accommodates both.</p></li><li><p><strong>Evolution becomes natural</strong>: Start simple, add sophistication as needed. The conceptual model remains stable even as capabilities grow.</p></li><li><p><strong>Communication becomes universal</strong>: "We need an agent to handle tier-1 support tickets" makes sense to everyone. Technical implementation details don't.</p></li></ol><h2>From Abstraction to Action</h2><p>So how do you operationalize this? Start with these questions:</p><p><strong>What work and decision-making do you want to delegate?</strong> It could be a specific task, or it could be a broader area of responsibility. </p><p><strong>What does success look like?</strong> Define goals in terms of outcomes, not processes. We can then iteratively refine the technology so it gets better and better at reaching those outcomes. </p><p><strong>What capabilities are needed?</strong> What must the agent sense, decide, and act upon? These begin to form your technical requirements.</p><p><strong>How much autonomy is appropriate?</strong> This determines your guardrails, and it will drive specific architecture choices once you turn your attention to implementation. </p><p>Only after answering these should you even consider implementation specifics. Will you use the latest LLM? Perhaps. Will you need complex reasoning chains? Maybe. 
But these are tactical decisions that flow from your strategic framework, not the other way around.</p><p>That's the power of thinking about agents as abstractions. You're freed from dogma to focus on outcomes.</p><h2>Looking Forward</h2><p>Organizations that understand agents as a conceptual framework for automation will pull ahead. Those waiting for the "perfect" technical definition will be left behind.</p><div class="pullquote"><p>The companies that win won't have the most technically sophisticated agents. They'll have the clearest thinking about what work to delegate and how to structure that delegation. They'll use agents as a lens for reimagining work, not as a checkbox for technical compliance.</p></div><p>So the next time someone asks whether your system is "really" an agent, try this response: "It autonomously pursues goals using its capabilities to solve real problems. That's all that matters."</p><p>Then get back to solving your problem.</p><div><hr></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>AgentsDecoded explores the intersection of AI agents, practical implementation, and strategic thinking. 
Subscribe for regular updates.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[How a Simple Question Exposes Claude 4’s Shortcomings: "Claude, can you write a game?"]]></title><description><![CDATA[The answer, my friends, is "no" - but Claude 4 doesn't know it and Anthropic chooses to leave things this way.]]></description><link>https://www.agentsdecoded.com/p/claude-can-you-write-a-game</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/claude-can-you-write-a-game</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Sun, 25 May 2025 16:12:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uLor!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I had some time on my hands and wanted to explore some of Claude 4&#8217;s new capabilities. 
So I asked a simple question.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d5mC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d5mC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png 424w, https://substackcdn.com/image/fetch/$s_!d5mC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png 848w, https://substackcdn.com/image/fetch/$s_!d5mC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png 1272w, https://substackcdn.com/image/fetch/$s_!d5mC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d5mC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png" width="508" height="148" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:148,&quot;width&quot;:508,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15557,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/164360790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d5mC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png 424w, https://substackcdn.com/image/fetch/$s_!d5mC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png 848w, https://substackcdn.com/image/fetch/$s_!d5mC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png 1272w, https://substackcdn.com/image/fetch/$s_!d5mC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219b9c7d-3713-4cf3-a822-f2abc8454a54_508x148.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>My expectation was that we would have a bit of a back and forth on Claude&#8217;s capabilities and then maybe have an attempt at writing something. 
</p><p>Instead, Claude launched into some &#8220;thinking&#8221; without coming back with any questions about what I meant. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uLor!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uLor!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png 424w, https://substackcdn.com/image/fetch/$s_!uLor!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png 848w, https://substackcdn.com/image/fetch/$s_!uLor!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png 1272w, https://substackcdn.com/image/fetch/$s_!uLor!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uLor!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png" width="1410" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:169877,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/164360790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uLor!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png 424w, https://substackcdn.com/image/fetch/$s_!uLor!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png 848w, https://substackcdn.com/image/fetch/$s_!uLor!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png 1272w, https://substackcdn.com/image/fetch/$s_!uLor!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcde9aad2-fa10-4270-a188-0bb8ee7bae67_1410x600.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!beZ4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!beZ4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png 424w, https://substackcdn.com/image/fetch/$s_!beZ4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png 
848w, https://substackcdn.com/image/fetch/$s_!beZ4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png 1272w, https://substackcdn.com/image/fetch/$s_!beZ4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!beZ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png" width="1402" height="586" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1402,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146532,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/164360790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!beZ4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png 424w, 
https://substackcdn.com/image/fetch/$s_!beZ4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png 848w, https://substackcdn.com/image/fetch/$s_!beZ4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png 1272w, https://substackcdn.com/image/fetch/$s_!beZ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbdfed9-1bcb-489a-b5b9-246a45969d38_1402x586.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" 
y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>After the &#8220;behind-the-scenes&#8221; &#8220;thinking&#8221; it announced:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1sh7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1sh7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png 424w, https://substackcdn.com/image/fetch/$s_!1sh7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png 848w, https://substackcdn.com/image/fetch/$s_!1sh7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png 1272w, https://substackcdn.com/image/fetch/$s_!1sh7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1sh7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png" width="1400" height="138" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:138,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41203,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/164360790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1sh7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png 424w, https://substackcdn.com/image/fetch/$s_!1sh7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png 848w, https://substackcdn.com/image/fetch/$s_!1sh7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png 1272w, https://substackcdn.com/image/fetch/$s_!1sh7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e786081-36d5-4c2e-8e46-23db2ef8337a_1400x138.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><h1>Bias to action</h1><p>The initial interpretation of my request &#8220;the user is asking me to write a game&#8221; is flawed, which is a disapointing in and of itself. 
I am asking if Claude <em>could</em> write a game, not <em>to</em> write a game. Leaving that aside, Claude then &#8220;decides&#8221; to dive straight in and even pick the game. This indicates that Anthropic&#8217;s guidelines for training Claude 4 probably have a bias towards action. The training reinforces less back-and-forth with the user and more of a let&#8217;s-get-things-done attitude. This is a design choice, and one I think is flawed, as it will only lead to misunderstandings and wrong results. Interestingly, asking Claude about the linguistic difference confirms that this is a built-in bias:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Oa2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Oa2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png 424w, https://substackcdn.com/image/fetch/$s_!6Oa2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png 848w, https://substackcdn.com/image/fetch/$s_!6Oa2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png 1272w, https://substackcdn.com/image/fetch/$s_!6Oa2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6Oa2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png" width="1306" height="1370" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1370,&quot;width&quot;:1306,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:313212,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/164360790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Oa2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png 424w, https://substackcdn.com/image/fetch/$s_!6Oa2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png 848w, https://substackcdn.com/image/fetch/$s_!6Oa2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png 1272w, https://substackcdn.com/image/fetch/$s_!6Oa2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6eda38-4028-40bf-8fe5-573cae263e22_1306x1370.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OK. Well - I prefer my AI Assistants to be a bit more &#8220;technical&#8221; in their interpretation, but Claude is not necessarily going down that path. Let&#8217;s move on to the more worrying aspect. </p><h1>Limited Contextual Awareness</h1><p>Claude then proceeds to write a Snake game that didn&#8217;t work. Remember, Claude &#8220;picked&#8221; the game. Worse than that, it &#8220;lied&#8221; to me about how well the game worked. 
Despite all the nice descriptions of gameplay and features that were supposedly there, as soon as you clicked on Play Again the red dot changed position and the game was over. That&#8217;s it. A cursory glance at the code revealed that there was no actual Snake gameplay code, so this was not about some small bug. The code for the features of the game that Claude &#8220;decided&#8221; on its own to write was simply not there. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!75xF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!75xF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png 424w, https://substackcdn.com/image/fetch/$s_!75xF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png 848w, https://substackcdn.com/image/fetch/$s_!75xF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png 1272w, https://substackcdn.com/image/fetch/$s_!75xF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!75xF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png" width="1456" height="895" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:895,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1300589,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/164360790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!75xF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png 424w, https://substackcdn.com/image/fetch/$s_!75xF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png 848w, https://substackcdn.com/image/fetch/$s_!75xF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png 1272w, https://substackcdn.com/image/fetch/$s_!75xF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1ee5279-9b85-4b99-bd97-6b4512a0e3e0_2922x1796.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>All this pointless effort when a simple &#8220;Sure, I can write a game - what did you have in mind?&#8221; would have worked. </p><h1>An active choice to not fix things</h1><p>What is worrying about this is not so much that the technology failed. Of course, this is not quite in line with the bold claims of Anthropic and others about the level of intelligence of these models. But we&#8217;ve learned to ignore those claims and see how the technology will actually work for ourselves, right? 
</p><p>What is worrying is that we are at a place where it is ok for an organisation to release paid software that behaves this badly. Imagine a standard SaaS tool saying: &#8220;Yes, I&#8217;ve saved your document&#8221; - only for the document to not be saved. We would be up in arms!</p><p>Shouldn&#8217;t Claude be able to check for itself whether the game actually works before telling the user that it does? The technological capability is there. Claude can write the game and then try to play it, checking for itself whether what it is claiming is true. There comes a point where we have to start treating these products not as &#8220;experiments&#8221;, like the early ChatGPT, but as things that we are actually paying money for and that are being sold to us as amazing technological miracles. </p><p>I think we are far enough into this journey that it is ok to simply say that Anthropic (and others) are actively choosing not to fix these issues and instead focus efforts elsewhere. That is simply not ok. It would not be ok in any other class of paid software after two years of maturity and it should not be ok for these assistants.</p><p>By the way, I asked ChatGPT for the same thing. It fared better on the first step - it asked me what I would like - but then also proceeded to create a completely non-functioning Snake game and &#8220;lie&#8221; about it. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_And!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_And!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png 424w, https://substackcdn.com/image/fetch/$s_!_And!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png 848w, https://substackcdn.com/image/fetch/$s_!_And!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png 1272w, https://substackcdn.com/image/fetch/$s_!_And!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_And!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png" width="1456" height="815" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:323479,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/164360790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_And!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png 424w, https://substackcdn.com/image/fetch/$s_!_And!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png 848w, https://substackcdn.com/image/fetch/$s_!_And!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png 1272w, https://substackcdn.com/image/fetch/$s_!_And!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cae8ea-de1a-41d1-b407-01dd77e93f53_2898x1622.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Again, this is not about a limitation of the technology. This is a choice by these organisations to release products at this level of reliability.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Agents Decoded is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Super-Assistants Vs The World: Why LLM Wrappers, traditional SaaS tools and Specialist AI Agents all need to worry. ]]></title><description><![CDATA[The big LLM companies are launching do-everything super assistants. Where does that leave LLM Wrappers, specialist AI Agents and traditional SaaS tools?]]></description><link>https://www.agentsdecoded.com/p/super-assistants-vs-the-world-why</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/super-assistants-vs-the-world-why</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Sat, 24 May 2025 07:48:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lOKj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lOKj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!lOKj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lOKj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lOKj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lOKj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lOKj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115303,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/164221618?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lOKj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lOKj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lOKj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lOKj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934ecca-6616-451a-b1af-304a2e0fd75f_2560x1440.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" 
fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The arrival of &#8220;super-assistants&#8221; such as Claude 4 Opus, ChatGPT and its forthcoming upgrades, or Google&#8217;s various instances of Gemini gives us a sense of where the big tech companies want to go next. In short, they are after everything. They are following a path defined by the broad technological capabilities of large language models and the ambition to &#8220;dominate everything&#8221;, building assistants that can connect to every data source and provide an alternative to every UI. Through integration protocols like MCP and A2A and misleading terminology such as the &#8220;open agentic web&#8221; they want access to your data wherever it may be, and they offer adaptable UI interfaces that can morph into anything from coding tools to writing assistants to help you get work done while also chatting. </p><p>This means that these super-assistants are coming for everyone else.</p><p>If your startup is based on the idea of offering a streamlined interface on top of LLM APIs to achieve a specific task such as creating marketing copy, producing a podcast or performing data analysis, you have to start wondering what your real moat is. </p><p>As a traditional SaaS company you are caught between a rock and a hard place. On the one hand, the &#8220;agentic&#8221; web means that people expect you to make your data available through protocols such as MCP. However, the minute you do that you are enabling your users to ignore your painstakingly crafted UI. 
If you are a CRM tool for smaller organisations and Claude 4 can plug into your data through an MCP server and simply answer any question the user may have - what is your actual purpose? How do you differentiate? </p><p>Finally, if you are focusing on building a narrow, deeply-tuned vertical agent for coding, legal drafting, medical triage and other domains, you must be looking at these larger models and wondering at what point they simply become as good as or better than your painstakingly fine-tuned model. What other moats are there to protect your investment?</p><p>In the next few paragraphs I&#8217;ll have a brief look at each category and share some thoughts about what I think is likely to happen (as ever, the usual disclaimers apply for any adventure in futurology!). </p><h1>The Threat to the LLM Wrappers</h1><p>Over the past couple of years there has been a constant barrage of announcements from startups of browser extensions, mobile apps or light SaaS products that do one thing well by exploiting LLM capability. That might be writing marketing copy, summarising PDFs, managing job applications and so on. The backend is typically a concoction of prompts and other instructions stitched together as quickly as possible so that the startup can get to launch and find its audience. </p><p>These companies will feel increasing pressure from multiple directions. </p><ul><li><p><em>Overtaken by a feature of the bigger platform.</em> If the functionality is important enough, the big platforms will simply fold it in. Zoom adds a notetaker, Gmail an email writer, etc. The super-assistants can also build the missing UI in a more flexible way. Every time Anthropic adds another type of artifact to Claude, a startup faces an existential threat. </p></li><li><p><em>Priced out.</em> The wrappers are essentially doing a price arbitrage between the cost of tokens (raw material) and the value of the operation (value to user). 
This is a fragile position to be in within a fast-moving environment, especially when your growth model depends on giving an initial taster for free and hoping to make a profit on (at best) 10%-20% of your users. </p></li><li><p><em>Users empowered to replicate you out.</em> LLMs with coding assistants are like the Star Trek replicators. Any application you need, you can simply create on the spot. What purpose is there for thin LLM wrappers when a user can describe what they want and the super-assistant can replicate it on the spot? What&#8217;s more, they can write it to fit exactly what the user needs rather than having to negotiate the competing needs of different types of users. </p></li></ul><p>Here are some examples of the products under threat:</p><p>- Compose AI: A Chrome extension that helps with writing; it features email replies on its home page. At the same time, the Gmail/Workspace &#8220;Help Me Write&#8221; sidebar does the same in the native UI.</p><p>- Fireflies / Otter.ai: These tools do meeting transcription &amp; summaries, but Zoom, Teams and Google Meet all come with the same functionality. </p><p>- Humata / ChatPDF: These tools specialise as PDF readers, but you can drop a PDF into any of ChatGPT, Claude or Gemini (or NotebookLM) and get the same job done. </p><p>- Copy.ai: They offer marketing copy templates, but HubSpot &amp; Mailchimp Copilots can now generate campaign copy inside the CRM/email studio.</p><p>Now, this is not going to be an immediate shift. Super-assistants are still brittle and a well-crafted series of prompts to achieve a task still has value. But the direction of travel is clear. </p><h1>The Threat to Traditional SaaS</h1><p>Classic SaaS vendors with CRM, ERP, ticketing and HRIS tools have invested decades of effort in perfecting pixel-heavy dashboards and per-seat pricing. The rise of super-assistants flips that equation: users talk to an AI, the AI hits the API, and the GUI is left feeling rather lonely in the corner. 
</p><p>As a SaaS tool you will be feeling a number of different pressures. </p><p><em>UI bypass.</em> Super-assistants are inviting everyone to enable access to data via MCP, which means users can be interacting with a SaaS tool&#8217;s data but never visit the native SaaS screen. As a SaaS provider, where do you do your upsells? Where do you learn what features people want?</p><p><em>Migrating pricing logic.</em> When usage is machine-initiated, &#8220;$79 per month per human&#8221; looks archaic. There is no human; it&#8217;s just a single machine accessing all the data, with multiple humans using that single machine to interact with it. </p><p><em>API fragility.</em> Legacy throttles, rate limits and brittle authentication crumble under the rapid-fire calls of autonomous agents, making the SaaS product look unreliable even when the fault lies in its ageing plumbing.</p><p><em>Data-layer commoditisation.</em> If the assistant can stitch together Snowflake, Notion and Stripe to achieve the same workflow, monolithic suites risk being unbundled into cheaper point services behind the scenes.</p><p>Examples of SaaS incumbents that must be feeling the pressure are:</p><p>Lightweight CRMs like Pipedrive - A tool that logs deals and nudges users to follow up with clients is already feeling the pressure from tools like Google Sheets and Airtable that are able to replicate a lot of the functionality. Add to that the flexibility of LLMs writing queries and achieving more complex data analysis, and the risk is very real. </p><p>HR software for smaller orgs like CharlieHR - Same situation here: we are essentially dealing with a well-crafted data layer and a nice UI. I absolutely appreciate the amount of effort and thought that goes into these tools, but it will become increasingly simple for organisations to replicate or substitute the functionality. </p><p>Expense and receipt management software like Expensify - Log expenses, upload images of receipts and prepare reports. 
The whole &#8220;we can read your receipt&#8221; uniqueness of such tools disappears when you have LLMs that can do the same. </p><p>Essentially, any tool that is a thin workflow + data layer, together with a UI that requires a lot of clicking and moving around to update and configure data, is at risk. When assistants own the conversation, <em>the underlying system becomes just another API</em>. </p><p>Of course, not all SaaS tools are equal. The ones at risk are the lightweight solutions that targeted smaller companies and startups through a nice UI and brand but a thin underlying workflow layer. Larger organisations or SaaS tools in regulated spaces are going to enjoy protection for some time yet, as they benefit from workflow lock-in and regulatory lock-in. </p><p>Nevertheless, I think we will inevitably see a bunch of SaaS tools go through a deceleration in growth and eventually start losing users. The combination of easily created custom throwaway solutions, flexible conversational interfaces and AI Agents that can perform tasks on the fly by combining capabilities as required will mean that fewer people will be on the lookout for the perfect SaaS solution to their problem. They will just spin it up through a super-assistant as and when required. </p><h1>The Threat to the Specialised Vertical Agents</h1><p>After the initial excitement subsided following the release of ChatGPT two years ago, the limitations started coming through: not enough specialised data, failures in the more niche domains, and hallucination unless careful guardrails are put in place. </p><p>This introduced a fertile space for startups to dig in and capture some value. The pitch is simple: <em>we are safer and more accurate than a generic chatbot.</em> I think this space is also at risk, but there are at least two moats that give it a much longer lifetime. 
</p><ul><li><p>Regulatory moats: while the larger technology providers could do the work to keep up with regulations across multiple geographies, this requires deep expertise in a domain and an understanding of its internal dynamics. Why spend time on this when there are far more immediate opportunities to go after? For example, worrying about how to handle insurance claims, where the tolerance for error is close to zero, is far less attractive than <a href="https://blog.google/products/shopping/google-shopping-ai-mode-virtual-try-on-update/">going after the checkout cart</a> of every single consumer-facing brand. </p></li><li><p>Data moats: Niche domains tend not to make their data available publicly. If you have access to a data set that the larger providers cannot get their hands on, you are sitting on an extremely valuable resource right now. Just as the Wall Street Journal cut a deal with OpenAI for access to its stories, your specialised agent can become one of the underlying capability providers for bigger assistants. </p></li></ul><p>Nevertheless, the risks are clear as well.</p><p>The larger models can keep ingesting more data and creep up on the capabilities of specialised agents. You need to pick your domains carefully. For example, there were a lot of initial efforts in the finance domain, but now OpenAI and Gemini have become increasingly good at analysing company balance sheets and end-of-year reports. </p><p>We had a flurry of early startups doing notetaking in specialised domains like medicine, but now Epic is building these tools into its own software, while Microsoft, Google and OpenAI are all offering solutions. </p><p>Startups in this space need to develop a multi-pronged strategy to succeed. The vertical agents need to be extremely clear about how they offer better safety and guardrails compared to the super-assistants. The bet is that proven compliance can beat generic convenience. 
</p><p>At the same time, the solution offered needs to integrate deeply into the ecosystem, reducing the time to value for a customer. If the functionality is reduced to a single API call, it is also easy to switch out. If, instead, the functionality is a combination of deep, specialised backend capability coupled with flexible conversational front-end interfaces that customers can drop into their existing systems in the form of co-pilots, the stickiness is greater. </p><p>Finally, the company needs to build a broad base of partnerships in the ecosystem of choice, integrating itself deeply with the software that operates in the space, such as policy administration systems for insurance or healthcare record management systems, and with the people and organisations that exist in the space. </p><h1>What happens next</h1><p>Super-assistants will not flatten the software landscape instantly, but they will shift its centre of gravity. The transformation won't follow the neat sequential pattern I might have been implying, with LLM wrappers falling first, then small SaaS providers and finally vertical agents. We&#8217;re probably heading toward a more chaotic, bifurcated landscape where disruption and adaptation happen simultaneously across different market segments.</p><p><strong>Fragmentation.</strong> Rather than super-assistants cleanly displacing categories of software, we'll see the market split along entirely new fault lines. Enterprise customers, spooked by data governance concerns and burned by AI hallucinations in business-critical processes, may initially retreat toward more controlled, specialized solutions. Meanwhile, SMBs and individual users will rapidly embrace super-assistants for their speed and flexibility.</p><p><strong>Traditional SaaS companies fight back.</strong> The smart ones won't wait to be disrupted; rather, they'll cannibalize themselves first. 
Salesforce, HubSpot, and others are aggressively rebuilding their platforms around AI-first architectures, using their data advantages and customer relationships to create hybrid experiences that combine conversational ease with visual power. The losers will be those who try to protect their existing UI investments rather than reimagining them.</p><p><strong>LLM wrappers evolve into "AI-native platforms."</strong> Instead of being squeezed out, the most agile wrapper companies will pivot to become the integration layer between super-assistants and legacy systems. They'll survive by solving the "last mile" problems that general-purpose AI can't handle: weird edge cases, industry-specific workflows, and compliance requirements that emerge when AI meets messy real-world business processes.</p><p><strong>Vertical agents become the new middleware.</strong> Rather than retreating to narrow niches, successful vertical agents will position themselves as specialized reasoning engines that super-assistants call upon. A legal AI won't compete with Claude for general tasks; instead, Claude will route complex legal queries to specialized agents that provide verified, compliant responses. This creates a new B2B2C model where vertical agents become invisible infrastructure.</p><p><strong>The reliability wars intensify.</strong> As AI becomes business-critical, a new category of "AI reliability infrastructure" will emerge. Companies will pay premium prices for tools that provide consistency, auditability, and error handling around AI interactions. The winners won't necessarily be the smartest AI&#8212;they'll be the most dependable.</p><p><strong>Geographic and regulatory balkanization.</strong> Different regions will evolve different AI ecosystems based on local regulations, cultural preferences, and data sovereignty requirements. 
European companies may gravitate toward locally-hosted, transparent AI systems, while others prioritize raw capability regardless of provenance.</p><p><strong>The integration complexity explosion.</strong> As every software tool races to add AI capabilities, the challenge shifts from "can AI do this task?" to "how do we manage 47 different AI agents that all want to help with the same workflow?" New categories of AI orchestration and conflict resolution tools will emerge.</p><p>The timeline is compressed at the edges but extended in the middle. Power users and early adopters will experience dramatic workflow changes in the next 2 years, while large enterprises may take many years to fully transition. This creates sustained opportunities for companies that can bridge both worlds.</p><p>Rather than a winner-take-all scenario, we're more likely to see a complex ecosystem where super-assistants serve as the user-facing layer, but rely on a vast network of specialized capabilities, data sources, and compliance tools operating behind the scenes. Along the way there are multiple questions we need to answer about the concentration of power, the consequences for society and real life, and the &#8220;collateral&#8221; costs as AI undoubtedly displaces current ways of doing things. </p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Agents Decoded is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[CIAO Core Components: Conversations]]></title><description><![CDATA[CIAO (Conversations - Interfaces - Agents - Orchestration) is an architectural framework for designing AI powered agent-based applications. In this article we analyse conversations in more detail.]]></description><link>https://www.agentsdecoded.com/p/ciao-core-components-conversations</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ciao-core-components-conversations</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Tue, 20 May 2025 07:28:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!W_el!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W_el!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W_el!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png 424w, 
https://substackcdn.com/image/fetch/$s_!W_el!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!W_el!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!W_el!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W_el!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:368186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/163285962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!W_el!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!W_el!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!W_el!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!W_el!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8a194f6-8862-48b6-b610-2fe2220daa6f_2560x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There is a new breed of applications that are conversation-centric (depending mostly on LLMs to achieve this), can have components that make decisions and take actions independently (Agents) and can interact with the user across a variety of modalities (Interfaces). The CIAO (Conversations - Interfaces - Agents - Orchestration) framework is an attempt to structure our reasoning about these applications in a coherent and consistent way. <a href="https://www.agentsdecoded.com/p/ciao-a-unifying-architectural-framework">We introduced CIAO here</a>, and will continue exploring it over a series of articles. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Agents Decoded is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><h2>Conversations</h2><p>While some AI-powered applications will operate without conversations (e.g. 
a machine learning model that identifies defects in a factory line), the future of user-facing AI-applications is undeniably conversational. </p><p>Even in a possible post-LLM world (as hard as that is to imagine right now), it will be impossible to take users back to a world where they cannot just talk to their software to tell it what they want. </p><p>Similarly, AI Agents will continue to depend on conversation to interact with each other and with humans, and to receive instructions about the goals they should achieve. </p><p>The directness and intuitive nature of simply asking for what we want has become too valuable to abandon.</p><blockquote><p>We've crossed a threshold where technology now adapts to human communication patterns rather than forcing humans to learn machine syntax.</p></blockquote><p>In this context, conversations are a first-order component of any AI application. Now, when thinking about the architecture of the conversations in your application, you have a number of decisions to make. </p><ol><li><p>What is the structure of the conversation? What is the structure of a message? Can every participant (human or AI Agent) post a message into a conversation at any point? Is there a stricter, protocol-driven, turn-taking approach? </p></li><li><p>Are conversations divided into smaller sub-conversations based on context, or is it all one long exchange?</p></li><li><p>How is conversation context understood? Is it purely derived from the messages exchanged, or is context imposed by ongoing business processes?</p></li><li><p>How can past interactions be retrieved and analysed by the participants of the conversation? 
Does the conversation management component offer any abstractions, or is that left to the participants to manage?</p></li><li><p>Is the conversation structure independent of the application domain, or is it optimised to best suit the needs of the specific domain?</p></li></ol><p>Too often, application development simply starts with implicit assumptions (most often led by what an API such as the OpenAI API offers) rather than thinking about what the application actually needs.</p><p>Here we will explore the different components of conversations; in future articles we will describe the different strategies that existing systems have taken. </p><h2>Participants</h2><p>Conversations start with <strong>Participants</strong>. A Participant is any entity that can contribute to a conversation. </p><p><strong>Humans: </strong>Humans are typically the users, although we can imagine other roles such as moderators, reviewers, etc. </p><p><strong>Agents</strong>: Agents, as more independent and goal-directed entities, can also contribute to a conversation. </p><p><strong>Systems: </strong>Finally, you may have other systems, such as tools, API calls, dedicated orchestrators or other management components, that inject information into a conversation in a less proactive manner. </p><h1>Messages &amp; Events</h1><p>Conversations are primarily made up of <strong>Messages</strong>. Messages have a sender, the message content and metadata associated with that message (timestamp, id, etc.). The message content can be plain text, or it can be structured to make it more amenable to representation in various forms. </p><p>Depending on the sophistication of our application, it may also be useful to consider a conversation containing <strong>Events</strong> in addition to messages. Events can manage and structure message flow and broader conversation policy and administration needs. 
</p><p>For example, an event may be used to indicate that a new Participant has joined the conversation, or that a new knowledge document has been added to it.</p><p>An event may be triggered by a workflow tool to indicate a conversation start or end, or an explicit context change, or it can be used to inject information from APIs and external systems. </p><h1><strong>Context &amp; State Representation</strong></h1><p><strong>Context</strong> is a high-level description of what is currently happening in our system. Context can be implicit, i.e. it needs to be derived by interpreting the individual messages and events; explicit, i.e. captured in explicit context labels; or a hybrid of the two.</p><p>Explicit definitions of context can be particularly efficient, especially when considering business processes. Implicit context derived through messages is much more flexible, but carries with it the risk of misinterpretation. </p><p>Another way of managing context and state is to provide specific checkpoints or summaries derived from an existing set of messages. These give a more compact representation, so that a system interacting with the conversation artifact does not need to reason about the entire conversation every time. </p><h1>Conversation Policies</h1><p>A <strong>Conversation Policy </strong>is a description of what can be said at any given point in a conversation. For example, you may adopt a simple <strong>turn-taking</strong> policy, or you may impose more sophisticated controls where specific conditions need to be met before someone can participate in a conversation. A conversation policy can also dictate not only who can participate but what can be said (<strong>allowed intents</strong>) and what can be done (<strong>allowed actions</strong>). 
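</p><p>To make the idea concrete, here is a minimal sketch of a turn-taking policy combined with allowed intents. The names and the API are illustrative assumptions, not part of CIAO:</p>

```python
class TurnTakingPolicy:
    """Participants speak in a fixed order; every message must carry an allowed intent."""

    def __init__(self, order, allowed_intents):
        self.order = order                        # e.g. ["user", "agent"]
        self.allowed_intents = set(allowed_intents)
        self.turn = 0

    def can_post(self, participant_id, intent):
        """True only when it is this participant's turn and the intent is allowed."""
        return (
            participant_id == self.order[self.turn % len(self.order)]
            and intent in self.allowed_intents
        )

    def record(self, participant_id, intent):
        """Admit a message into the conversation, enforcing the policy."""
        if not self.can_post(participant_id, intent):
            raise PermissionError(f"{participant_id} may not post {intent!r} now")
        self.turn += 1

# A user and an agent alternate; only "ask" and "answer" intents are allowed.
policy = TurnTakingPolicy(order=["user", "agent"], allowed_intents={"ask", "answer"})
policy.record("user", "ask")  # the user opens; now it is the agent's turn
```

<p>The point of the sketch is that the policy lives in explicit code, outside the participants: agents and humans alike are checked against the same rules before a message enters the conversation.</p><p>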
Being able to define policies, especially in more complex business settings or settings where multiple agents and humans need to collaborate and co-ordinate, is crucial. </p><h1>Conclusions</h1><p>I think this gives us a solid starting set of concepts for how to think about and structure conversations. What I hope is evident is that it is not a simple case of ordering messages and then letting LLMs reason about them through prompts. Much more needs to be considered, and careful thought should go into the specific needs and goals of your AI application before you settle on the conversational approach that is right for you. </p><p>There is also much that we haven&#8217;t addressed. How do multi-modal conversations impact our understanding of a conversation model (if at all)? How do we recover from deviations from conversation policies? Who is responsible for enforcing a policy? How do we trust the statements other participants make? We will get to all of this in due course! If you want to follow along, please subscribe. </p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Agents Decoded is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[CIAO Design Principle #2: Explicit Rules in Code Beat Implicit Rules in Prompts]]></title><description><![CDATA[CIAO (Conversations-Interfaces-Agents-Orchestration) is an AI application framework that is not afraid to clearly say that explicit code is better than prompts.]]></description><link>https://www.agentsdecoded.com/p/ciao-design-principle-2-explicit</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ciao-design-principle-2-explicit</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Mon, 12 May 2025 16:45:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PE6_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PE6_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PE6_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png 424w, 
https://substackcdn.com/image/fetch/$s_!PE6_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!PE6_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!PE6_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PE6_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:291160,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/163404204?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!PE6_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!PE6_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!PE6_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!PE6_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f8a6-747a-4002-b59b-c029f7df8e1e_2560x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the previous post we introduced <a href="https://www.agentsdecoded.com/p/ciao-a-unifying-architectural-framework">CIAO</a> - an architectural framework for AI applications - and our <strong>first design principle</strong>: &#8220;Focus first on the <em>what</em> of AI applications, the fundamental capabilities and interactions required, rather than the <em>how, </em>the specific technologies and implementations used to realise them&#8221;. Before diving deeper into CIAO, I&#8217;ll discuss the second design principle, one that may ruffle some feathers. </p><blockquote><p>Explicit rules defined in code are better than implicit rules defined in prompts.</p></blockquote><p>Among the first victims in applications designed using LLMs (and non-deterministic AI systems in general) are explicit definitions that can be programmatically reasoned about.</p><p>Remember clear, unambiguous statements like <code>status = "active"</code> or rules like <code>if temperature &gt;= 20 then disable_heating()</code>? These aren't just code&#8212;they're beautiful things. They're efficient. They provide full control. </p><p><strong>If you can solve a problem using just rules and explicit knowledge statements, you should be proud of yourself.</strong></p><p>But somewhere along the way, "rule-based" became a bit of a dirty word. Rule-based is old and unfashionable. Everything now needs to be defined in prompts, as if deterministic code is somehow inferior to natural language instructions. </p><p>Here's the thing: prompts are rules too. 
The only difference is that you cannot explicitly control them.</p><h2>The Hidden Trade-off</h2><p>Every time we replace a deterministic rule with a prompt to an LLM, we're making a trade. We gain flexibility, natural language understanding, and the ability to handle ambiguity. But we lose something precious: certainty.</p><p>This certainty isn't just a technical nicety - it's the foundation of reliable systems. When a traditional rule executes, you know exactly why a decision was made. You can trace it. You can fix it when it breaks. You can guarantee it will behave the same way tomorrow.</p><p>With LLM-based systems, these guarantees evaporate. The same prompt can yield different results depending on the system's mood (or, more technically, the non-deterministic nature of the generation process).</p><h2>The False Dichotomy</h2><p>The industry has constructed a false choice between "outdated" rule-based approaches and "innovative" AI solutions. This framing ignores a fundamental truth: we need both.</p><p>Some problems demand the precision and reliability of explicit rules. Others benefit from the flexibility and contextual understanding of LLMs. The art lies in knowing which is which.</p><h2>Being Explicit About What's Implicit</h2><p>For AI applications, we need both explicit and implicit reasoning, and most importantly, we need to be explicit about what is implicit. We need to be clear about what we don't fully control, so that we can ask whether we need to control it and how we might go about doing that.</p><p>This means documenting where non-determinism exists in our systems. It means setting clear boundaries around where LLMs or other AI technologies make decisions in non-deterministic ways versus where traditional logic prevails. It means building guardrails that constrain the space of possible outputs without eliminating the benefits of flexibility. </p><p>At a cultural level, it means learning to love explicit rules once more. 
If you can solve your problem with good old-fashioned programming, you should absolutely do it. Have the LLM write the code for you if you want to feel better about it ;-) </p><h2>The Path Forward: Thoughtful Hybridization</h2><p>The future doesn't belong to pure LLM applications or pure rule-based systems. It belongs to thoughtful hybrids that leverage the strengths of each approach:</p><ol><li><p>Use explicit rules for critical business logic, especially where legal, ethical, or safety considerations are paramount.</p></li><li><p>Deploy LLMs for understanding context, handling natural language, and managing ambiguity.</p></li><li><p>Create transparent interfaces between deterministic and non-deterministic components.</p></li><li><p>Build tools that help us understand and constrain behaviors.</p></li></ol><h2>Questions We Should Be Asking</h2><p>For every component in our AI systems, we should be asking:</p><ul><li><p>Do we need deterministic behavior here?</p></li><li><p>What's the cost of unpredictability in this context?</p></li><li><p>How will we debug issues when they inevitably arise?</p></li><li><p>Can we provide guarantees about system behavior?</p></li></ul><h2>The Conscious Choice</h2><p>As builders, we have a responsibility to make conscious choices about where we embrace the power of ambiguity and where we maintain the clarity of explicit definitions.</p><p>When we do this well, we get the best of both worlds: systems that can understand and adapt to the messy reality of human needs while maintaining the reliability that makes software trustworthy.</p><p>So before you rush to replace your rule-based system with an LLM prompt, ask yourself: am I gaining more than I'm giving up? Sometimes the answer will be yes. But sometimes, those explicit definitions might be worth keeping after all. </p><p>As we evolve CIAO, we will constantly be on the lookout to make sure there is space to ask these questions. 
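</p><p>As a minimal sketch of what this hybridization can look like in practice, the heating rule mentioned earlier can stay in explicit, deterministic code while a model handles only the ambiguous part; <code>interpret</code> here is a hypothetical stand-in for an LLM call, not a real API:</p>

```python
def disable_heating():
    return "heating off"

def keep_heating():
    return "heating on"

def heating_rule(temperature: float) -> str:
    # Explicit, testable business rule: same input, same output, always.
    if temperature >= 20:
        return disable_heating()
    return keep_heating()

def handle_request(text: str, temperature: float, interpret) -> str:
    # Non-deterministic boundary: the model only classifies the free-text request.
    intent = interpret(text)  # hypothetical LLM call returning e.g. "thermostat"
    if intent == "thermostat":
        # The actual decision stays in the deterministic rule above.
        return heating_rule(temperature)
    return "not a thermostat request"
```

<p>The boundary is the design choice: whatever the interpreter returns, the guarantee that the heating switches off at 20 degrees lives in code you can trace, test, and fix.</p><p>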
</p><p>To follow along with the work on CIAO please subscribe:</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Agents Decoded is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[CIAO 👋 - A Unifying Architectural Framework for AI-powered applications]]></title><description><![CDATA[We are creating a new type of software application.]]></description><link>https://www.agentsdecoded.com/p/ciao-a-unifying-architectural-framework</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ciao-a-unifying-architectural-framework</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Fri, 09 May 2025 14:38:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aoA1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!aoA1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aoA1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!aoA1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!aoA1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!aoA1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aoA1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:302964,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/163188830?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aoA1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!aoA1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!aoA1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!aoA1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a6b6073-1139-449f-b57c-d53ee9a9ff61_2560x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We are creating a new type of software application. With LLMs as the catalyst technology, we are building applications that are conversation-centric with increasing levels of automated decision-making and the ability to handle a variety of input and output types. New applications types call for new architectural frameworks that better capture the core concepts. </p><p>A clear framework enables us to separate concerns, compare and constrast alternative approaches, describe common solutions as reusable patterns and in doing so create more robust and more reliable applications. </p><p>There are several examples of such frameworks and how they&#8217;ve helped build better applications. 
In the web application development world, MVC (Model-View-Controller) helps teams think about the separation of data, business logic and graphical representation. Faced with new cloud-based scaling capabilities, the microservices architecture helps teams reason about how to deploy independent services that interface over standardised APIs. Speaking of APIs, the REST (REpresentational State Transfer) style gave teams a way to think about how to structure and approach their API development.</p><p>Frameworks are not just for software developers. From a design or user experience perspective we have similar needs. Whether it is something high-level like atomic design or user journey maps, or something more specific such as progressive disclosure or designing information architecture based on card sorting, these frameworks are invaluable tools. They allow us to share a common understanding and a common language, and they reduce the cognitive load required to reason about complex problems. </p><p>Creating such a framework for AI-powered applications is something that has been on my mind for some time now. I&#8217;ve been toying around with what the main concepts should be and how they relate to each other, and writing small pieces of code to prove out ideas as I went along. From today I&#8217;ll continue that process, but share the ideas and their evolution in the open on agentsdecoded.com. If you want to stay updated, subscribe. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.agentsdecoded.com/subscribe?"><span>Subscribe now</span></a></p><p></p><h1>CIAO&#128075; : Conversations, Interfaces, Agents, Orchestration </h1><p>CIAO provides a unifying architecture framework to address the challenges of building applications that deal with ongoing context-aware interactions, multimodal communication, automated decision-making and dynamic co-ordination. </p><p>A key design principle of CIAO is to focus first on the <em>what</em> of AI applications, the fundamental capabilities and interactions required, rather that the <em>how, </em>the specific technologies and implementations used to realise them. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zo3b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zo3b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png 424w, https://substackcdn.com/image/fetch/$s_!Zo3b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zo3b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png 1272w, https://substackcdn.com/image/fetch/$s_!Zo3b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zo3b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png" width="1456" height="1299" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1299,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:260395,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/163188830?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zo3b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png 424w, 
https://substackcdn.com/image/fetch/$s_!Zo3b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png 848w, https://substackcdn.com/image/fetch/$s_!Zo3b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png 1272w, https://substackcdn.com/image/fetch/$s_!Zo3b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c9f9e8-7d51-4ac8-ade7-34479d991eb8_1503x1341.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For example, the architectural need to understand natural language (a "what") exists independently of whether an LLM, semantic parser, or hybrid approach (all "how" concerns) is employed. Similarly, the need for an agent to reason about and execute actions independently (a "what") can be implemented through various technologies and methodologies (the "how").</p><p>This separation of concerns allows us to:</p><ol><li><p>Focus on capabilities and interactions before technology selection.</p></li><li><p>Adapt to changes in technological capability without architectural overhaul.</p></li><li><p>Make implementation choices based on specific requirements and constraints.</p></li></ol><p>In short, it is about freeing ourselves as designers and builders from what is &#8220;trendy&#8221; or &#8220;in fashion&#8221; and focussing on what is required and is the right solution for the task at hand. </p><p>In a world where LLM-powered software development is making the actual writing of code increasingly &#8220;cheaper&#8221; the most important question you need to ask yourself is what you need to build. </p><h1>The CIAO High-level Architecture</h1><p>There is still work to do to determine what a high-level diagram capturing the interactions between CIAO components should look like but here is a starting point. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kRms!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kRms!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 424w, https://substackcdn.com/image/fetch/$s_!kRms!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 848w, https://substackcdn.com/image/fetch/$s_!kRms!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 1272w, https://substackcdn.com/image/fetch/$s_!kRms!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kRms!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png" width="1456" height="733" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64668932-6851-453b-ab63-e106d17b4269_1497x754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:733,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/163188830?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kRms!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 424w, https://substackcdn.com/image/fetch/$s_!kRms!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 848w, https://substackcdn.com/image/fetch/$s_!kRms!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 1272w, https://substackcdn.com/image/fetch/$s_!kRms!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64668932-6851-453b-ab63-e106d17b4269_1497x754.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">CIAO high-level architecture</figcaption></figure></div><h2>Human Participants</h2><p>Human Participants represents the users who interact with the application. They bring their own goals, knowledge, and contexts to interactions, and the system must model those internally and adapt to their needs and preferences.</p><h2>Conversations</h2><p>Conversations manage the ongoing exchanges of information and intent between participants - both human and software (agents) participants. Conversation components need to maintain context and track dialogue state among multiple participants. They are the common shared artifacts created through the interactions of participants (human and software agents). 
We can imagine different conversation components that describe different conversation styles, from completely open-ended conversation to highly structured protocols with just a small set of possible speech acts. </p><h2>Interfaces</h2><p>Interfaces provide the interaction channels between human participants and the system. They handle different interaction modalities (text, voice, visual, etc.) and are responsible for capturing inputs and rendering outputs in human-understandable forms. The Interface layer manages presentation concerns, accessibility and device adaptation, and creates appropriate representations of system state and activities for human consumption.</p><h2>Agents</h2><p>Agents are components with their own automated decision-making capabilities (I am striving hard to steer away from using the word autonomy). Agents will have some implicit or explicit representation of goals, and may possess domain knowledge, reasoning capabilities, and the ability to execute actions. They respond to requests coordinated through Orchestration, perform tasks, access external systems, share outputs, and interact with other participants in the system (human or not) through conversations.</p><h2>Orchestration</h2><p>Orchestration serves as the coordinator of the system, connecting all other components. It manages system-wide state, controls workflows, allocates resources, handles errors, and ensures components work together coherently. Orchestration may maintain the overall application logic and enforce policies. You may have a very thin orchestration layer or a very sophisticated one, depending on the needs of your system. We could even envision multiple orchestrator components, depending on the specific type of application, with more elaborate topologies of participants, interfaces and conversations. </p><h1>Next Steps</h1><p>With an initial set of concepts in place, the next article will deal with two different challenges. 
First, we are going to provide more detail about the core concerns of each component (e.g. dialogue management for conversations, or how we think about goals in agents); second, we are going to start using the framework to describe specific use cases and relate them to how they might be implemented in existing tools. The latter will start to indicate where we have space for improvement or where something is missing.</p><p>If you are curious about how this will evolve, sign up to get notified. I&#8217;ll aim to post one or two updates a week. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Agents Decoded is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Changing tack ⛵️]]></title><description><![CDATA[Time to complain less and offer a clearer alternative vision for AI-powered applications, starting with answering the question of why we need any new vision at all.]]></description><link>https://www.agentsdecoded.com/p/changing-tack</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/changing-tack</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Tue, 06 May 2025 16:53:33 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!phqJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!phqJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!phqJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg 424w, https://substackcdn.com/image/fetch/$s_!phqJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg 848w, https://substackcdn.com/image/fetch/$s_!phqJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!phqJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!phqJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg" width="1456" height="975" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:975,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2090047,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/162985176?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!phqJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg 424w, https://substackcdn.com/image/fetch/$s_!phqJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg 848w, https://substackcdn.com/image/fetch/$s_!phqJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!phqJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc059384d-f331-4afb-b13e-c3c469f06749_3803x2546.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Changing tack</figcaption></figure></div><h2>Stopping with the complaints</h2><p>I realised that on agentsdecoded.com I am complaining quite a bit. I <a href="https://www.agentsdecoded.com/p/ai-agent-diaries-the-llm-industrys">criticise the big LLM companies</a>, criticise efforts to <a href="https://www.agentsdecoded.com/p/emerging-standards-a2a-protocol-new">generate standards</a>, <a href="https://www.agentsdecoded.com/p/ai-agents-are-not-a-technology-they">criticise the way AI is introduced</a>, <a href="https://www.agentsdecoded.com/p/ai-agent-diaries-ai-maximalists-vs">criticise how people think about agents</a> and on it goes. There is value in complaining but ultimately it is tiring for all involved. 
</p><p>Don't get me wrong - I am completely sure that I am correct about everything &#128521; - but in a space as hyped-up and fast-moving as AI Agents you can easily feel overwhelmed by the sheer number of things to complain about and easily lose sight of the exciting aspects of the technology. </p><blockquote><p>I want to feel excited about technology, not just shout about all the things that are wrong. </p></blockquote><p>So I&#8217;ll change approach. </p><p>Instead of complaining, I am going to focus on offering an alternative vision of how we can go about building and using applications that use agents. This will come in the form of conceptual frameworks, software artifacts, architecture proposals, patterns and methods. Some of these things I may decide (hopefully with useful input from others) are not actually useful, and I will discard (disown) them. Some we will hopefully keep. By allowing for things to be discarded I am giving myself permission to err on the side of creating and thinking in the open. </p><p>As we do this we are going to connect it to existing efforts other people and organisations are undertaking, and perhaps do just a bit of complaining or "robust contrasting" - but only if there is a clear alternative to the thing we are complaining about. Since offering alternatives involves actual work, hopefully that will keep the complaining to a low level!</p><p>I will be introducing the various ideas and approaches in small, bite-sized pieces to make the task easier and motivate people to keep coming back for more. 
So if you are not subscribed yet - this is your chance!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.agentsdecoded.com/subscribe?"><span>Subscribe now</span></a></p><p>Ok - let's get going! The first thing to talk about is what it is that we are even attempting to build. What are these applications that use automated decision making and dialogue so heavily, and how do they change things? A lot of this is not going to be new, but we need to start at the start. </p><h1>What are these "AIs"? Do we need new ways of thinking specifically for AI-powered applications?</h1><p>Automated decision-making capabilities, coupled with natural language understanding, connected to various bits and pieces of classic UI and UX solutions, and underpinned by plenty of traditional software engineering, are all mashed together into this vague thing that we call an &#8220;AI&#8221; or a &#8220;chatbot&#8221;. Are these things really different to everything we&#8217;ve been doing before? I think the answer is a clear yes. </p><p>The days of menu-driven, button-driven applications, where we had to find the right combination of things to select and click in order to get something done, are not gone, but the context has radically changed. </p><p>Let&#8217;s think about the change from a few different perspectives.</p><p>1. From the <strong>user perspective</strong> there is more of an expectation that we can directly state what we are after using natural language and the application will react to it. As users we no longer need to understand how the application works. At least not initially. The onus is on the application to understand us (or at least our request). 
This frees us up to dive in straight away, and it also makes us even more impatient (and potentially obnoxious) customers. </p><p>2. From a <strong>user experience design perspective</strong> there is a different set of requirements. A lot of the traditional techniques that were designed to guide the user through an unfamiliar space do not apply in the same way. We can let the user ask and the application can reveal stuff along the way. A prime example of this approach is Anthropic's Claude artifacts. The primary interface is chat, but then Claude might decide to throw up a code interface, or a word processing interface - because it is the "right thing" given the request. </p><p>3. From a <strong>software design and development perspective</strong> these new applications require completely new ways of describing and managing them. Old patterns, especially from a software engineering perspective, are still relevant, but they need to be accompanied by new ways of describing and understanding these applications and their behaviour, and better ways of capturing the whole. In addition, we need to consider that not only are the applications different and in need of new ways of describing them, but the way we build them is rapidly changing with the use of code generation and design generation tools. </p><p>4. From an <strong>organisational perspective</strong> these new applications represent opportunity and challenge. The future they promise of boundless automation is so tempting to capital-driven organisations, but also scary, as it is very different from the present and it is not clear how one gets from now to wow without lots of things breaking along the way. New governance structures, new business models, new societal models - it all seems up for grabs. </p><p>5. From a <strong>regulatory perspective</strong> we have a renewed emphasis on responsibility and explainability. 
Who decided what should happen and when, and where the responsibility lies when an application follows a certain path after a series of interactions, are questions that matter now more than ever. It is not completely new - we already have some regulations in place that we would do well to pay more attention to - but there is undeniably something more specific to articulate from a regulatory perspective when dealing with these "AIs".</p><blockquote><p>So any which way you look at things, AI-powered applications are different. As a result, we need new ways to describe, design and maintain these applications. </p></blockquote><p>AI Agents as a high-level abstraction capture the nature of these applications quite well. They allow us to talk about goal-directed behaviour, about proactive and reactive behaviour, and about social ability. </p><p>Thinking of AI-powered applications as just LLM-powered applications is probably not a good enough abstraction, and thinking of AI Agents as just LLM-driven applications is not a good abstraction either. </p><p>Although LLMs have been a crucial catalyst, it is useful and sane to think about the applications independently of LLMs. LLMs are one way to manage conversation, dialogue and reasoning. Even though they currently occupy a lot of mindspace, they are not always the best way to do all the things.</p><p>An understanding of AI-powered applications or AI agents that equates the entire application to an LLM is doing it wrong. It limits our ability to think about the needs of the application, because we are constrained by the capabilities and limitations of LLMs. 
</p><p>If you've been paying attention, I promised smaller, more frequent pieces, so in the cheapest of cliff-hangers after all that setup, please sign up to get a first view of my framework for describing the core components of these new applications in just a few days!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Agents Decoded is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[[Emerging Standards] A2A Protocol: New Era of Agent Collaboration or Just APIs in a Trench Coat?]]></title><description><![CDATA[When announcing new things we get a bit too excited. This is an attempt to bring us back down to earth. 
It doesn't mean that the idea of a protocol for agent communication and collaboration is bad.]]></description><link>https://www.agentsdecoded.com/p/emerging-standards-a2a-protocol-new</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/emerging-standards-a2a-protocol-new</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Sun, 13 Apr 2025 08:13:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sDaR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sDaR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sDaR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png 424w, https://substackcdn.com/image/fetch/$s_!sDaR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png 848w, https://substackcdn.com/image/fetch/$s_!sDaR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png 1272w, https://substackcdn.com/image/fetch/$s_!sDaR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sDaR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png" width="992" height="1330" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1330,&quot;width&quot;:992,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1613943,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/161217005?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sDaR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png 424w, https://substackcdn.com/image/fetch/$s_!sDaR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png 848w, https://substackcdn.com/image/fetch/$s_!sDaR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sDaR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd67415-4fa8-4a27-8a07-5282a70af335_992x1330.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>On April 9th, in a <a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/">blog post</a> of appropriate levels of fanfare several senior Google figures announced the Agent2Agent protocol that would usher &#8220;a new era of Agent Interoperability&#8221;. 
</p><p>I am not exactly sure what the <em>old</em> era of Agent Interoperability was, but rest assured a new era is on its way. Excellent. </p><p>So let&#8217;s dive in. What exactly does the A2A protocol give us?</p><blockquote><p>&#8220;The A2A protocol will allow AI agents to communicate with each other, securely exchange information, and coordinate actions on top of various enterprise platforms or applications.&#8221; - <a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/">Google Blog Post</a></p></blockquote><p>Ok. That sounds interesting. Why do we need it?</p><blockquote><p>&#8220;To maximize the benefits from agentic AI, it is critical for these agents to be able to collaborate in a dynamic, multi-agent ecosystem across siloed data systems and applications. Enabling agents to interoperate with each other, even if they were built by different vendors or in a different framework, will increase autonomy and multiply productivity gains, while lowering long-term costs.&#8221; - <a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/">Google Blog Post</a></p></blockquote><p>Got it. So Google (and several other providers) are envisioning a future where (LLM-powered) applications will be reaching across silos and asking other (LLM-powered) applications to perform tasks for them. An application-to-application interface, if you will - an Application Programming Interface even, or API for short. Fascinating.</p><p>Anybody else trying this? Here is something:</p><blockquote><p>the essential infrastructure to integrate AI agents with your entire IT ecosystem, allowing them to autonomously interact with APIs and drive efficient, high-quality customer interactions at scale. 
This ensures that AI agents can operate effectively within a secure and governed environment, enhancing overall business agility and performance.</p></blockquote><p>This is <a href="https://www.mulesoft.com/platform/agentforce">Mulesoft Anypoint</a> - from Salesforce. Salesforce also supports the A2A protocol. That&#8217;s a bit confusing. So which one do we use, when? Products like Mulesoft have been around for ages. Why exactly do we need a specific Agent 2 Agent protocol? Seems we&#8217;ve been grappling with integration issues for a while here. </p><p>I know! A2A is for when we want to get two Agents to interoperate. We&#8217;ve figured out that an API or data integration could not possibly solve the problem. The methods that we have decades of experience with simply will not cut it. We need to apply agentic capabilities and we need LLM-powered decision making. Got it. </p><p>Then I am sure that A2A will be solving tough, agent-specific issues. Not the age-old API and data integration issues. This is about a new agentic era! Luckily we even know what the issues are. After all, we&#8217;ve been doing multi-agent systems research for several decades. We know what the challenges are around discovery, trustworthiness, reliability and so on. So let&#8217;s not rush to conclusions about A2A. Let&#8217;s dive into the protocol and figure out how it addresses the really tough issues. I am pumped!</p><h2>A2A design principles</h2><p>Let&#8217;s have a look at the design principles for A2A. </p><blockquote><p><em><strong>Embrace Agentic Capabilities</strong>. A2A focuses on enabling agents to collaborate in their natural, unstructured modalities, even when they don&#8217;t share memory, tools and context. We are enabling true multi-agent scenarios without limiting an agent to a &#8220;tool.&#8221;</em></p></blockquote><p>Ok. This is good. You&#8217;d hardly want an Agent 2 Agent protocol to not embrace agent capabilities. It would really, truly be just an API then I guess. 
I am not sure what a &#8220;natural, unstructured&#8221; modality is but sounds great. Ok. What next?</p><blockquote><p><strong>Build on existing standards:</strong> The protocol is built on top of existing, popular standards including HTTP, SSE, JSON-RPC, which means it&#8217;s easier to integrate with existing IT stacks businesses already use daily.</p></blockquote><p>Cool, cool, cool. Imagine if we had to put in place entirely new standards - like a REPLACEMENT FOR HTTP. No, this is wise, very wise. </p><blockquote><p><strong>Secure by default</strong>: A2A is designed to support enterprise-grade authentication and authorization, with parity to OpenAPI&#8217;s authentication schemes at launch.</p></blockquote><p>Authentication and authorisation. Perfect. Good thing this didn&#8217;t slip by. I am sure this will pretty much handle any and all security issues with having two LLM-powered applications hallucinate their way through complex tasks. Fantabulous. </p><blockquote><p><strong>Support for long-running tasks:</strong> We designed A2A to be flexible and support scenarios where it excels at completing everything from quick tasks to deep research that may take hours and or even days when humans are in the loop. Throughout this process, A2A can provide real-time feedback, notifications, and state updates to its users.</p></blockquote><p>Interesting. Yup - sometimes <s>applications</s>agents do take a long time. We need a way to keep in touch and update other <s>applications</s>agents on what is happening. Looking forward to diving into the detail of this one. I wonder if I could <em>subscribe</em> to some sort of event and then be <em>notified</em>. That would be revolutionary. </p><blockquote><p><strong>Modality agnostic:</strong> The agentic world isn&#8217;t limited to just text, which is why we&#8217;ve designed A2A to support various modalities, including audio and video streaming.</p></blockquote><p>Nice. 
This clears things up about that natural, unstructured modality mentioned in the first principle. I thought that might mean different modalities, but actually this <em>modality agnostic</em> thing means that. So that must mean something else. Super clear. </p><p>So we want to support agents, securely, using existing standards, as they work on tasks and notify each other. These are all good requirements; there are some <em>light</em> design principles in there, but it reads mostly as specific decisions rather than guidelines on how to make decisions. Anyway, let&#8217;s not get caught up in definitions. How about a bit more detail on how all this works? </p><h2>How A2A actually works</h2><p>The <a href="https://github.com/google/A2A?tab=readme-ov-file">Github repo</a> and this <a href="https://storage.googleapis.com/gweb-developer-goog-blog-assets/original_videos/A2A_demo_v4.mp4">video</a> provide some good insight into the detail. </p><h3>Discovery of Agents and Skills</h3><p>The first bit is discovery. My agent will identify a need for a skill or a task that it (presumably) cannot perform itself and will look into a &#8220;tool registry&#8221; for agents that might be able to perform this task. Agents describe their capabilities through <a href="https://google.github.io/A2A/#/documentation?id=agent-card">AgentCards</a>. These cards describe both practical information to facilitate communication and information exchange and, crucially, the <em>skills</em> of the agent - the things it is able to do. </p><p>One interesting question here is how we unambiguously describe a skill without assuming some prior shared knowledge. How do we know that the advertised skill is exactly what we need? When we hire people we put out a job ad, we look at a CV, but then we run an interview to make sure the candidate is a good fit. 
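</p><p>To make the discovery question concrete: an AgentCard is essentially a small JSON document advertising an agent and its skills, and a client matches against it. The sketch below is illustrative only - the field names loosely follow the AgentCard idea in the A2A documentation, the registry and matching logic are my own simplification, and the endpoint URL is made up.</p>

```python
# Illustrative sketch of A2A-style discovery. Field names loosely follow the
# AgentCard idea in the A2A docs; the matching logic is a deliberate
# simplification, not part of the protocol.

AGENT_CARD = {
    "name": "HR Helper",
    "url": "https://example.com/a2a",  # hypothetical endpoint
    "skills": [
        {
            "id": "candidate-sourcing",
            "name": "Candidate sourcing",
            "description": "Run HR activities such as candidate sourcing",
        }
    ],
}

def find_agents_with_skill(cards: list, wanted: str) -> list:
    """Naive registry lookup: substring match on free-text skill descriptions.

    This is exactly the weak spot discussed above - matching on prose
    descriptions assumes the words mean what we hope they mean.
    """
    return [
        card["name"]
        for card in cards
        if any(wanted.lower() in skill["description"].lower() for skill in card["skills"])
    ]

print(find_agents_with_skill([AGENT_CARD], "candidate sourcing"))  # → ['HR Helper']
```

<p>Note that the lookup succeeds or fails on string overlap alone - there is no interview step, which is the point being made here.</p><p>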
With agents, if they say they can &#8220;run HR activities such as candidate sourcing&#8221;, are we just going to assume that this will do everything we need, or is there an implicit assumption that it will be good enough? If we have implicit interoperability assumptions in place, where does that leave the dynamic, autonomous agentic future?</p><p>Also, it&#8217;s not clear how my agent decides that it cannot do the task itself, but I assume that this is similar to choosing a tool, only in this case the tool is a whole other agent and the protocol is A2A rather than standard tool calling. </p><p>How agents are dynamically discovered is also not tackled. For the time being I assume that if this were used in a production setting it would simply be a case of letting all the agents involved know of all the agents they might use. </p><h3>Tasks</h3><p>The chosen agent will then be assigned a <a href="https://google.github.io/A2A/#/documentation?id=task">task</a>. I am genuinely disappointed that the protocol designers went with the idea of a task rather than a goal. The whole thing about agents is goal-directed behavior. This was the one chance to make it clearly conceptually different. A goal is a description of a desirable state of affairs; a task is something a bit lower level. Oh well. This is where we are. We assign a task to an agent and a task can be in one of various states. Much like a job. I wonder if there are decades of tools helping us to manage long-running jobs in the IT world? </p><h3>Collaboration</h3><p>The other agent will then review the task and may have some questions. There is a specific &#8216;<em>input-required</em>&#8217; state for this. That&#8217;s it. There is not really much to say about the collaboration mechanism other than that it is <a href="https://google.github.io/A2A/#/documentation?id=multi-turn-conversations">very rudimentary</a> right now. It is geared to some quite specific tasks that seem to make some strong implicit assumptions. 
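</p><p>Pulling the Tasks and Collaboration sections together: a task moves through a handful of states, with &#8216;input-required&#8217; as the one collaboration hook. A toy sketch follows - the state names follow the A2A documentation, but the transition table is my own guess at what is allowed, not the normative spec:</p>

```python
# Toy model of the A2A task lifecycle. State names follow the A2A docs
# (e.g. "input-required"); the transition table is an assumed simplification.

ALLOWED = {
    "submitted": {"working", "canceled"},
    "working": {"input-required", "completed", "failed", "canceled"},
    "input-required": {"working", "canceled"},  # remote agent asked a question
    "completed": set(),
    "failed": set(),
    "canceled": set(),
}

class Task:
    def __init__(self) -> None:
        self.state = "submitted"
        self.history = ["submitted"]

    def transition(self, new_state: str) -> None:
        # Reject anything the (assumed) lifecycle does not permit.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

task = Task()
task.transition("working")
task.transition("input-required")  # the collaboration hook: a question for us
task.transition("working")         # we answered, work resumes
task.transition("completed")
print(task.history)  # → ['submitted', 'working', 'input-required', 'working', 'completed']
```

<p>Seen this way, it really is a job queue with one extra state for asking questions, which is the comparison being drawn above.</p><p>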
It&#8217;s not as if there are decades of research in speech act theory and <a href="https://en.wikipedia.org/wiki/Agent_Communications_Language">agent communication languages</a> based on speech-act theory from the 1970s. </p><h2>Conclusions</h2><p>It might not feel like it given this post, but I do think work on agent-to-agent protocols is very useful. However, this feels rushed. It feels like a proof-of-concept of some technical capabilities without the deep thought that needs to go into an actual protocol. It feels like Google rushing to occupy some space because they have to <em>win</em> everywhere and Anthropic got a head start with <a href="https://www.anthropic.com/news/model-context-protocol">MCP</a>. </p><p>The thing that gets me is what will inevitably happen: some people will get excited, something will be implemented, it will not work, and then the assumption will be that agent-to-agent collaboration is not useful, rather than that the thinking behind this particular protocol was too rushed. </p><p>This protocol does not solve any of the hard conceptual problems. It provides an interesting point of reference to think about them. An organisation as big as Google, however, should be driving standard-setting in a more organised way. We have the W3C and the IETF; there are bodies that can support standards in a much more structured way. It is clear that international organisations are not popular these days, but that doesn&#8217;t mean they are not useful. It seems that we just prefer to barge straight forward, push out some half-baked idea, generate the hype that comes with it (a new era!) and then deal with the consequences later. 
</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Agents Decoded is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[[AI Agent Diaries] The LLM industry's dirty secret and what to do about it.]]></title><description><![CDATA[The LLM industry is experiencing a mini ChatGPT-in-reverse moment this week with The Atlantic's release of the LibGen search tool, showing how much pirated material is used for LLM training.]]></description><link>https://www.agentsdecoded.com/p/ai-agent-diaries-the-llm-industrys</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ai-agent-diaries-the-llm-industrys</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Sat, 22 Mar 2025 08:25:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-Dk_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There are two errors with the title of this diary entry. Firstly, the LLM industry doesn&#8217;t just have one dirty secret, it has many dirty secrets. Secondly, they are not secrets. We all know about all the issues. 
However, it often takes some catalyst moment for something that people do know about to become something that is discussed widely enough, maybe leading to enough people <em>caring about it.</em></p><p>The dirty secret we are going to focus on today is the LLM industry&#8217;s use of copyrighted content. We&#8217;ve all known about this for a very long time. For example, there is the infamous Murati (ex-OpenAI CTO) <a href="https://www.wsj.com/video/series/joanna-stern-personal-technology/openai-made-me-crazy-videosthen-the-cto-answered-most-of-my-questions/C2188768-D570-4456-8574-9941D4F9D7E2">interview with the Wall Street Journal</a> where the incredibly capable and intelligent ex-CTO of OpenAI somehow was &#8220;not sure&#8221; what data was used to train <a href="https://openai.com/sora/">Sora</a> or couldn&#8217;t quite remember. </p><p>Since then we&#8217;ve learned a few more things. Most interestingly, there are the <a href="https://www.theguardian.com/technology/2025/jan/10/mark-zuckerberg-meta-books-ai-models-sarah-silverman">internal Meta communications</a> released following legal action. Through their embarrassingly superficial debates we find out that engineers explicitly discussed whether it would be ok to use LibGen - a library containing millions of pirated books - because Meta&#8217;s competitors were doing it as well. Ultimately, Mark Zuckerberg approved the use of pirated material - living his &#8220;move fast and break things&#8221; motto to the full. </p><p>Then this week The Atlantic released <a href="https://www.theatlantic.com/technology/archive/2025/03/search-libgen-data-set/682094/">a simple search tool</a> that allowed anyone to see if their own books were used to train LLMs. In what I like to think of as a reverse-ChatGPT moment, people were horrified to find out explicitly that their hard work was used to train LLMs, posting the results on social media in absolute outrage. 
The enthusiasm people showed when OpenAI productised LLMs through ChatGPT, making their capabilities plain for all to see, turned to horror when The Atlantic productised the ability to determine the level of piracy that went on behind the scenes to train those LLMs.</p><p>Of course, I did a little search for myself, and while the academic papers that showed up were public domain, there was also a book that very much should not be public domain. I spent a good percentage of weekends in 2018 and 2019 writing <a href="https://link.springer.com/book/10.1007/978-1-4842-5476-9">that book</a>. I don&#8217;t particularly care that its contents are publicly available for anyone to read without paying, but I do care that it was used, without permission, by organisations with access to billions of dollars that will very much complain if I don&#8217;t pay what is due for access to their services. Now, my main job is not writing. I can&#8217;t imagine how horrified people whose livelihood depends on authoring must feel. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-Dk_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-Dk_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png 424w, https://substackcdn.com/image/fetch/$s_!-Dk_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png 848w, https://substackcdn.com/image/fetch/$s_!-Dk_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!-Dk_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-Dk_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png" width="1434" height="1058" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1058,&quot;width&quot;:1434,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:203043,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/159602260?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-Dk_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png 424w, https://substackcdn.com/image/fetch/$s_!-Dk_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png 848w, https://substackcdn.com/image/fetch/$s_!-Dk_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!-Dk_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8ff038-6f37-4d65-b22d-70bd2de6955b_1434x1058.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">My own contribution to the LLM industry</figcaption></figure></div><h2>The LLM industry are Modern Day Robber Barons</h2><p>Let&#8217;s call a spade a spade. </p><p>What the LLM industry is doing is not dissimilar to what robber barons across the ages have done. Sam Altman, Mark Zuckerberg and others have identified a natural resource, content produced over centuries of human labour, and a means to exploit it, productised LLMs. </p><p>The analogies to Rockefeller, Carnegie are so plain. 
<strong>Monopolistic practices</strong> (<a href="https://geopoliticaleconomy.com/2025/02/03/us-ai-monopoly-unipolar-world-china/">they want their government to protect them from external competition</a>), <strong>vertical integration</strong> (they manage every stage from the data centres to the end-user products, maximising profit extraction), <strong>political influence</strong> (literally falling over themselves to gain access to government), while at the same time arguing for <strong>minimal environmental regulations</strong>, <strong>exploiting labour</strong>, <strong>displacing existing industries and people</strong> (replacing jobs) and <strong>financial manipulations</strong> (like the <a href="https://www.bbc.com/news/articles/cy4m84d2xz2o">ridiculous White House $500 billion Stargate announcement</a> or Elon Musk&#8217;s combination of <a href="https://www.ft.com/content/d4616dec-c4c7-417f-8549-134710bbc5b1">X with X.ai</a>). </p><h2>What are we to do about it</h2><p>There is a series of arguments against LLMs, from how the technology is fundamentally broken to how it is unethical. They all have merit for me. However, I also know that the technology is extremely useful and it is here, it is open source and it is not going away. </p><p>We cannot ban LLMs and we should not. I know I am over-simplifying, but LLMs per se are not the problem. The problems are the aforementioned robber barons. They are breaking the rules without consequences and they are then using the gains from the rule-breaking to facilitate further rule-breaking, or rule re-writing, for their benefit. </p><p>I think there are two things that need to be done. </p><ol><li><p><strong>The Robber Barons and the LLM industry need to pay.</strong> There are a number of mechanisms, and far more qualified people to define them, but a combination of ongoing taxation, specific fines, and prosecution to the extent the law allows should take place. The robber barons need to be stopped. 
</p></li><li><p><strong>The technology requires regulation.</strong> Yes, this will slow it down; yes, some use cases are not going to be viable; yes, some startups are going to have to stop or change what they are trying to do. Yes, the regulation is not going to be perfect. That is ok. We&#8217;ve broken enough things. Time we started fixing some of them. </p></li></ol><p>Current developments in the world, and especially the US, give us a pretty good idea of where the &#8220;move fast and break things&#8221; ethos takes us. Perhaps it&#8217;s time we tried something different. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Agents Decoded is a reader-supported publication. To receive new posts and support my work, consider becoming a subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[AI Agents' Reality Gap: 7 Ways LLM-powered AI Agents can Create Real-World Problems (and what to do about it)]]></title><description><![CDATA["When Sarah decided to plan her dream European vacation, she turned to an AI-powered travel app that promised to create the perfect personalized itinerary.]]></description><link>https://www.agentsdecoded.com/p/ai-agents-reality-gap-7-ways-llm</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ai-agents-reality-gap-7-ways-llm</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Tue, 04 Mar 2025 
07:41:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Vu_b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vu_b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vu_b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Vu_b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Vu_b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Vu_b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vu_b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:235051,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.agentsdecoded.com/i/158351954?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vu_b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Vu_b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Vu_b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Vu_b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59061404-074a-428a-a685-90684b3754e1_1792x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>"When Sarah decided to plan her dream European vacation, she turned to an AI-powered travel app that promised to create the perfect personalized itinerary. Made by a startup touting its amazing agentic capabilities, the app's recommendations looked flawless on screen. A sleek, curated mix of attractions, accommodations, and experiences that seemed to perfectly match her preferences and budget. With confidence in the AI's suggestions, Sarah booked everything through the app in a single click.</em></p><p><em>Reality proved disappointingly different. Her &#8216;best-value&#8217; five-star hotel was in an unsafe area and not quite like the photos. 
Although the hotel photo gallery did have beautiful photos of beaches and sunsets, those were 20 minutes away through some shady neighborhoods, not the view from the hotel itself. The AI agent simply assumed that any photos on the hotel page described the hotel, and the summary presented to Sarah didn't include that crucial bit of detail.</em></p><p><em>The "off-the-beaten-path" adventures turned out to be below-average experiences that were very hard to get to without further expensive transportation, at odd times the AI Agent did not account for. The "authentic" restaurants were tourist traps, unsanitary, or overpriced. A relaxing holiday turned into a series of stressful challenges as Sarah negotiated online with an AI Agent that continued, essentially, to gaslight her, insisting that the holiday was exactly as she had asked for and that no refund or change was possible."</em></p><div><hr></div><p>Anyone who has worked with LLMs over the last few years will recognise how all these issues can arise. Hallucinations, misinterpretation, lack of context, overconfidence, and feedback loops that solidify already bad assumptions.</p><p>I think the potential of LLMs to drive positive experiences is significant; however, we will not get there by pretending that it is all rosy and amazing. We need to face the limitations head-on and be clear about where LLMs and LLM-powered AI Agents can drive value and where, plainly, it may not be worth the effort.</p><p>Don&#8217;t get me wrong. I am not arguing that we should ignore or not use this technology. After all, I make my living selling solutions that are powered by LLMs. Actually, that is not true. I make my living solving problems for clients, and if I cannot clearly demonstrate when and how LLMs can actually help them and provide real value, I am not going to be left with any clients.</p><p>So let's dive in. Here are 7 ways LLM-powered AI agents can go wrong and some suggestions about what to do about it.</p><h2>1. 
Data Misinterpretation</h2><p>When AI agents misunderstand or miscontextualize input data, they can make dangerous recommendations. Making leaps from adjectives to real-world actions without any careful checking is a risk. For instance, an AI agent might encounter a hotel's self-description claiming "steps away from the beach" and recommend it to a traveler seeking beachfront accommodation, when in reality the property is a 20-minute walk from the shore&#8212;something a human would quickly spot by checking a map or reading between the lines of guest reviews mentioning the "beach shuttle service." Similarly, when a restaurant describes itself as offering "authentic local cuisine with a modern twist," the AI might present this marketing language as objective fact to a traveler seeking traditional local food, missing the indication that the dishes have been significantly modified from their traditional preparation. In both cases, the AI jumps from descriptive language to concrete recommendations without the crucial step of verification and contextual understanding.</p><p>Combining LLMs with more explicit reasoning and carefully curated databases can address some of these issues. We need to pay close attention to the data that we make available and how and whether we are able to verify data or make explicit assumptions. This is one of the challenges of Operator / Computer Use style agents that operate on the web. How will an Operator Agent judge the trustworthiness of a hotel accommodation website without upfront curation?</p><h2>2. False Information Generation</h2><p>AI agents can confidently present completely fabricated information as fact. 
There are ways to limit this by injecting specific knowledge into prompts, and while the very big issues can be caught - e.g., you can easily prevent an AI Agent from booking into a non-existent hotel - there is a long tail of additional cases that may not be that easy to catch.</p><p>For instance, an AI travel agent might accurately list a hotel's address and basic amenities but fabricate details about the "award-winning breakfast buffet" or "recent renovation" simply because the language for it is statistically plausible. A financial AI might reference real market trends but invent specific analyst recommendations or company statements. These smaller fabrications are harder to detect because they are mixed with truth. LLMs, by their very nature, are particularly adept at wrapping up lies with truths, preparing the infamous <a href="https://en.wikipedia.org/wiki/Truth_sandwich#:~:text=A%20truth%20sandwich%20is%20a,story%20by%20again%20presenting%20truth.">truth sandwiches</a>, making automated fact-checking more challenging. Even more concerning, these fabrications can compound over time - an AI might read and incorporate another AI's fabricated details, creating a web of false information that becomes increasingly difficult to untangle.</p><p>If we are dealing with fully automated services and cannot rely on humans reviewing and fact-checking, then it is important to start out quite conservative in the scope of the problem we are trying to solve. If we use LLMs to generate content with the specific aim of "selling" to users (e.g. adding prompt instructions such as "make it enticing"), we are essentially encouraging them to lie.</p><h2>3. Flawed Logic Chains</h2><p>Even with accurate data, AI agents can make invalid logical connections that lead to harmful decisions. 
Consider a travel booking AI that notices a user often books early morning flights and concludes they "prefer" early departures, when in reality they've been choosing those flights reluctantly due to price and lack of time to search for other options. The AI might then rigidly book 6 AM departures for their vacation, ignoring better-timed options because it misinterpreted past behavior as preference. We&#8217;ve all seen even the most sophisticated recommendation engines (e.g. Netflix&#8217;s &#8220;watch next&#8221;) get completely derailed by a few random choices of ours. In each situation, the AI draws seemingly logical but fundamentally flawed conclusions from accurate data points.</p><p>In building systems, we need to have clarity about where we are using LLMs (or any other system, really) to make assumptions about preferences, and we need to be able to explicitly trace and audit decision-making. In addition, the user needs to be made aware of the choices the AI Agent took. To be clear, the current trend of presenting the user with the &#8220;thoughts&#8221; of reasoning LLMs such as o3 is not the answer. We need to invest in a user experience that is actually helpful, not this information-overload torrent of thoughts.</p><h2>4. Bias Amplification and Lack of Cultural Context</h2><p>AI agents can magnify societal biases present in their training data. A lot has been written and said around this, especially when it comes to cases such as a lending system systematically undervaluing certain neighborhoods based on historical redlining data, but even if your AI Agent is not operating in these more sensitive areas, there is still room for bias.</p><p>Consider an AI travel assistant that provides superficial cultural advice like "always haggle in Morocco - it's what locals expect," reducing rich cultural nuances to stereotypes. This advice might be technically "correct" in some contexts but lacks the crucial nuance of when haggling is appropriate (markets vs. 
established businesses), how it should be conducted respectfully, and how it varies by region and situation. The traveler following this advice might offend locals by haggling in inappropriate settings or miss authentic experiences by approaching every interaction through this oversimplified lens.</p><p>Similarly, an educational AI might provide lower-quality assistance to non-native English speakers by misinterpreting grammatical patterns from other languages as indicators of lower comprehension, leading it to oversimplify explanations unnecessarily.</p><p>To address these issues, AI systems need diverse training data and explicit checks for biased outputs. More importantly, they need transparency about their limitations &#8211; being upfront about the demographics and contexts where their recommendations are most reliable and where users should seek additional perspectives. Designing AI agents with cultural humility rather than cultural authority allows them to make suggestions while acknowledging the limits of their understanding. Including diverse human perspectives in the development and testing of AI systems is essential, not optional.</p><h2>5. Overconfident Speculation</h2><p>AI agents often make bold claims beyond their actual understanding, presenting speculative information with unwarranted certainty. This problem is particularly concerning because confidence in language is not correlated with accuracy &#8211; in fact, LLMs often express their most inaccurate statements with the highest verbal certainty.</p><p>Consider a medical triage AI that confidently tells a user, "Your symptoms strongly indicate Guillain-Barr&#233; syndrome," when faced with common symptoms like tingling and weakness that could signify dozens of conditions. The AI has no ability to acknowledge the true statistical rarity of this diagnosis or understand that physicians would typically rule out more common conditions first. 
This false certainty could cause unnecessary panic or dangerous self-diagnosis.</p><p>Similarly, a legal AI might definitively state, "Based on Section 501(c)(3), your organization qualifies for tax exemption" without understanding the complex, multi-faceted nature of tax law determinations that even specialized attorneys approach with caution. An investment AI might declare with certainty that a specific stock will increase in value, without the capability to understand that market predictions inherently contain uncertainty.</p><p>The problem extends beyond professional domains. An AI travel agent might confidently assure a user that a certain trail is "perfectly safe and suitable for beginners" based on a few positive reviews, lacking the judgment to consider seasonal variations, the traveler's true fitness level, or recent weather events that might affect conditions.</p><p>Addressing this issue requires calibrating AI confidence to match actual certainty. Technical approaches include training models to express appropriate uncertainty and implementing confidence scores that reflect real statistical reliability rather than linguistic confidence. From a design perspective, AI systems should be built to clearly distinguish between facts, recommendations, and speculations in their outputs. Most importantly, users should be educated about the limitations of AI knowledge and encouraged to verify critical information from authoritative sources.</p><h2>6. Time and Context Errors</h2><p>AI agents can fail to account for when and where their knowledge applies, treating outdated or contextually limited information as universally applicable. 
This temporal and contextual myopia creates a dangerous reality gap between AI recommendations and real-world conditions.</p><p>An investment AI trained on pre-pandemic economic data might confidently recommend heavy investment in commercial real estate or airline stocks, completely missing the transformative impact of remote work and travel restrictions. Unlike human advisors, who live through these changes and intuitively adjust their thinking, the AI would continue applying outdated patterns until explicitly retrained. If you would prefer a more time-relevant example, think of the current geopolitical shifts between the USA, Canada, and Europe on tariffs. What will an LLM trained on data up until December 2024 think of the February 2025 world?</p><p>Solving this problem requires multiple approaches. Technical solutions include regular retraining, explicitly tagging information with temporal validity indicators, and designing systems that can actively seek current information rather than relying solely on their training data. From a design perspective, AI agents should express appropriate caution when making time-sensitive recommendations and, ideally, verify current conditions before taking actions with real-world consequences.</p><p>Organizations deploying AI agents need robust update mechanisms and clear documentation about the temporal limitations of their systems. Most importantly, users should be encouraged to verify time-sensitive information and taught to recognize when an AI might be operating with outdated assumptions.</p><h2>7. Error Cascades</h2><p>Small AI mistakes can compound into major problems through repeated interactions and feedback loops, creating cascading failures that grow in severity over time. 
These error cascades are particularly insidious because they can emerge from systems that appear to be functioning correctly on a transaction-by-transaction basis.</p><p>For example, consider an AI travel assistant planning a multi-city international trip. The assistant initially misinterprets a traveler's preference for "morning arrivals" as simply "before noon" rather than "early morning." Based on this minor misunderstanding, it books a flight landing at 11:45 AM. This creates a tight connection for the next transportation leg, which then causes the traveler to miss a pre-arranged tour. The AI, seeing the missed tour, attempts to compensate by rebooking for the next day, which conflicts with other reservations. Each small misalignment compounds into increasingly significant disruptions, ultimately derailing the entire trip itinerary - all stemming from a single initial misinterpretation that seemed inconsequential when viewed in isolation.</p><p>Addressing error cascades requires systems designed with feedback loops and corrective mechanisms. Technical approaches include diversity in AI decision-making (using multiple models or approaches to cross-check recommendations), implementing circuit breakers that prevent rapid sequential actions without verification, and creating monitoring systems that can detect patterns of small errors before they compound.</p><p>From a design perspective, AI systems should maintain uncertainty estimates that increase when sequential decisions are made without human verification, gradually becoming more cautious rather than more confident in extended autonomous operation. 
Most importantly, meaningful human oversight&#8212;not just theoretical ability to intervene, but practical mechanisms for review and correction&#8212;must be maintained in systems where cascading errors could cause significant harm.</p><div><hr></div><h2>Addressing the Reality Gap: Practical Solutions</h2><p>As AI agents become more pervasive, it's crucial that we understand and mitigate these risks. 
Misinterpretations, hallucinations, and flawed logic from LLMs can lead to real-world consequences when actualized by AI agents - what starts as a small error can snowball into a serious deception.</p><p>Addressing these challenges requires a multi-pronged approach across technical, design, and organizational dimensions:</p><h3>Technical Solutions</h3><ol><li><p><strong>Grounding and Verification Systems</strong>: Implement mechanisms for AI agents to verify facts against trusted knowledge bases before taking action or making recommendations.</p></li><li><p><strong>Uncertainty Quantification</strong>: Develop better methods for AI systems to express appropriate levels of uncertainty rather than false confidence.</p></li><li><p><strong>Guardrails and Safety Measures</strong>: Create circuit breakers that prevent AI agents from making high-stakes decisions without verification.</p></li><li><p><strong>Diverse Model Ensembles</strong>: Use multiple models with different training approaches to cross-check conclusions and reduce systematic errors.</p></li><li><p><strong>Regular Retraining and Updates</strong>: Establish protocols for keeping AI knowledge current, especially for time-sensitive applications.</p></li></ol><h3>Design Solutions</h3><ol><li><p><strong>Transparent Reasoning</strong>: Make AI decision processes visible to users in accessible, non-overwhelming ways.</p></li><li><p><strong>Appropriate Agency Distribution</strong>: Design systems that maintain human control over high-consequence decisions while automating low-risk tasks.</p></li><li><p><strong>Friction by Design</strong>: Intentionally add verification steps before irreversible actions, especially in high-stakes contexts.</p></li><li><p><strong>Cultural Humility</strong>: Frame AI recommendations as suggestions rather than authoritative pronouncements in domains with significant cultural variation.</p></li><li><p><strong>Feedback Integration</strong>: Create intuitive ways for users to correct AI 
errors and have those corrections reflected in future interactions.</p></li></ol><h3>Organizational Practices</h3><ol><li><p><strong>Human-in-the-Loop Processes</strong>: Maintain appropriate human oversight, especially for consequential decisions or when working with vulnerable populations.</p></li><li><p><strong>Clear Error Handling</strong>: Develop specific protocols for addressing AI mistakes, including compensation mechanisms for affected users.</p></li><li><p><strong>Realistic Marketing</strong>: Avoid overselling AI capabilities, setting appropriate expectations about system limitations.</p></li><li><p><strong>Rigorous Testing</strong>: Test AI agents with diverse scenarios and edge cases, particularly looking for systematic failure modes.</p></li><li><p><strong>Continuous Monitoring</strong>: Implement systems to detect patterns of errors before they create significant harms.</p></li></ol><p>By approaching these issues systematically rather than treating them as inevitable "growing pains" of AI adoption, we can develop AI agents that genuinely serve user needs rather than creating new problems. The goal isn't perfect AI &#8211; it's AI that fails gracefully, transparently, and with minimal harm when it inevitably makes mistakes while providing clear overall value to both the user and the organization.</p><h2>Conclusions</h2><p>I&#8217;ll close by focussing on what I think is the biggest ask of practitioners and organisations. Move slowly and do less. AI technologies push us to think big and be ambitious. They seem so capable that they tug at very basic human instincts of wanting to do more faster, especially in competitive settings. Instead, we need to cultivate our ability to constrain and contain. While there are multiple mitigation strategies, sometimes they may simply not be enough or worth it. I know this is a big ask - we have developed a culture of more, faster - everything is transactional. Bigger buildings, faster cars, bodies that never age. 
Any risk is acceptable if it might lead to a big enough upside. However, we need to recognise that we are hitting limits and exhausting resources. </p><blockquote><p>Perhaps our technological future lies not in more impressive cathedrals but in creating resilient neighborhoods of focused, well-understood tools, each with clear boundaries and intentional design.</p></blockquote><div><hr></div><p></p>]]></content:encoded></item><item><title><![CDATA[[AI Agent Diaries] Thinking Fast and Slow About AI Agent Development]]></title><description><![CDATA[Using Daniel Kahneman's "Fast and Slow" framework to recognise the traps our minds set for us when working on AI projects.]]></description><link>https://www.agentsdecoded.com/p/ai-agent-diaries-thinking-fast-and</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ai-agent-diaries-thinking-fast-and</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Sat, 08 Feb 2025 11:13:06 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!6zuD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6zuD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6zuD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png 424w, https://substackcdn.com/image/fetch/$s_!6zuD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png 848w, https://substackcdn.com/image/fetch/$s_!6zuD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6zuD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6zuD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90920,&quot;alt&quot;:&quot;Thinking Fast and Slow&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Thinking Fast and Slow" title="Thinking Fast and Slow" srcset="https://substackcdn.com/image/fetch/$s_!6zuD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png 424w, https://substackcdn.com/image/fetch/$s_!6zuD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png 848w, https://substackcdn.com/image/fetch/$s_!6zuD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6zuD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f09d54a-33a0-4a9a-9440-72ae3952e67d_1792x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Thinking Fast and Slow</figcaption></figure></div><p>This week my colleague - <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Maaike Coppens&quot;,&quot;id&quot;:13678644,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/028802b2-8b1e-4464-9f7b-f78a21e6f24e_372x370.jpeg&quot;,&quot;uuid&quot;:&quot;c5d705f7-bc48-4ebd-a86d-6b421deddacd&quot;}" data-component-name="MentionToDOM"></span> - started writing about AI Product Development. Her excellent post on the <a href="https://aiproductfocus.substack.com/p/things-your-ai-product-manager-is">challenges AI Product Managers face</a> inspired me to run a favorite thought exercise of mine - reviewing the challenges of a specific space through the lens of Daniel Kahneman's "Thinking, Fast and Slow". 
</p><p>In the book "<a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow">Thinking, Fast and Slow</a>", psychologist Daniel Kahneman provides a conceptual framework that can help us better understand how we humans approach a wide range of issues. His basic tenet of two competing thought systems is particularly relevant when considering how we develop and deploy AI agents, especially in an environment of intense hype and competition.</p><p>System 1, as Kahneman calls it, is fast, instinctive, and emotional. System 2 is more deliberative, rational, and effortful. </p><blockquote><p>When it comes to AI, we often allow System 1 to lead us down the wrong path by anthropomorphizing AI systems or relating them to patterns we've seen in science fiction and popular media, rather than understanding their true capabilities and limitations. We do not give ourselves the time to engage with System 2 and think through the harder problems.</p></blockquote><p>AI projects, it could be argued, are almost perfectly designed to set these cognitive traps for our minds. Here are some examples.</p><h1>Framing</h1><p>What on the surface can appear as a simple request such as "we need an AI Agent," once unpacked, opens up a significant number of questions that go well beyond what people intuitively consider the domain of AI development. This framing significantly influences our subsequent choices. However, more often than not I come across teams that start by framing the problem of creating an AI agent in terms of "which large language model should we use" as opposed to "what user needs are we trying to address". 
The former will lead us down a very different (and not very useful) path.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5ZAf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5ZAf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg 424w, https://substackcdn.com/image/fetch/$s_!5ZAf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg 848w, https://substackcdn.com/image/fetch/$s_!5ZAf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!5ZAf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5ZAf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg" width="388" height="285.43345543345544" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2410,&quot;width&quot;:3276,&quot;resizeWidth&quot;:388,&quot;bytes&quot;:1709614,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5ZAf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg 424w, https://substackcdn.com/image/fetch/$s_!5ZAf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg 848w, https://substackcdn.com/image/fetch/$s_!5ZAf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!5ZAf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0cb09d-768e-4f86-92b6-236249dbbb1e_3276x2410.jpeg 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Framing influences our future choices - Photo by <a href="https://unsplash.com/@jaredd?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Jaredd Craig</a> on <a href="https://unsplash.com/photos/person-taking-photo-of-ship-hwru6PbAHgI?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure></div><p>For example, a System 1 response might be to immediately jump to implementing GPT-4o or, dare I say, DeepSeek because it's the most advanced or most talked about model available. However, a System 2 analysis might reveal that a smaller, fine-tuned model would be more appropriate for specific use cases like customer service queries about product specifications. </p><h1>Substitution Problem</h1><p>The rapidly evolving nature of AI technology means there is an unusually high level of noise within the available information. 
It's easy to be misled about the actual complexity of implementing AI solutions and become overly concerned with specific technical aspects.  Kahneman's substitution problem is particularly relevant here - we are naturally inclined to substitute complex questions (e.g., "How can we ensure consistent and reliable responses?") with apparently simpler, but less useful, questions (e.g., "What's the best prompting technique to use?").</p><p>Breaking through the noise to get to what is really useful challenges how we naturally think about AI.</p><h1>The Planning Fallacy and Optimism Bias</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wBAm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wBAm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wBAm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg 848w, https://substackcdn.com/image/fetch/$s_!wBAm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wBAm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wBAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg" width="510" height="340.11675824175825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:510,&quot;bytes&quot;:2905783,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wBAm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wBAm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg 848w, https://substackcdn.com/image/fetch/$s_!wBAm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wBAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf2f7f10-a38e-4e98-808e-f5bf4000bfd9_5402x3602.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Implementing AI solutions requires a variety of disciplines. AI strategy, prompt engineering, user experience design, data science, and software engineering all come into play. Each brings its own terminology and set of norms, and stakeholders are often asked to make quick decisions to keep the project on track and within budget. Unfortunately, as Kahneman explains, humans are afflicted by the planning fallacy. 
We tend to consistently underestimate the time required to develop and deploy AI systems, while optimism bias means we overestimate their benefits and capabilities.</p><h1>The Availability Heuristic and Risk Assessment</h1><p>Kahneman's research on the availability heuristic, our tendency to assess probability based on how easily examples come to mind, is also very relevant. When evaluating AI risks, teams often focus disproportionately on widely-publicized failures while overlooking more common but less sensational issues.</p><p>Teams might invest heavily in preventing chatbots from generating clearly inappropriate content (a highly publicized risk) while underinvesting in:</p><ul><li><p>Monitoring for subtle gender or racial bias in model responses (I know it's not cool to talk about these things anymore, but it is still there&#8230;)</p></li><li><p>Testing for degradation in model performance over time</p></li><li><p>Evaluating fairness across different user demographics (see earlier comment about uncool things)</p></li><li><p>Measuring and improving accuracy for less common but critical edge cases</p></li></ul><p>To counter this bias, teams should implement structured risk assessment frameworks that include:</p><ul><li><p>Systematic logging of all types of failures, not just dramatic ones</p></li><li><p>Regular analysis of "quiet" failures that don't make headlines</p></li><li><p>Quantitative metrics for measuring bias and fairness</p></li><li><p>Automated testing for common failure modes</p></li></ul><h1>Think Slow to Allow AI to Act Fast</h1><p>Understanding both how we think about AI and how users will interact with our AI systems is central to delivering successful AI projects. We improve our chances of success by recognizing that designing, planning, and deploying AI systems requires deliberative, rational, and effortful thought. This is a particularly challenging task these days, when fast, impulsive and emotional action seems to be the norm. 
</p><p>We have to avoid numerous pitfalls and stereotypes and challenge our assumptions about AI to create something that achieves our goals. Framing these challenges as normal human traits - ones that can be overcome by recognising them and talking about them - helps teams reason about them more carefully. </p><p>Successful AI implementation requires a team that recognises our limitations as humans so that we can avoid these cognitive pitfalls. This might include comprehensive discovery workshops that directly address the framing issue by getting all stakeholders aligned on the actual problem we're trying to solve with AI. Regular technical debates and continuous learning ensure teams don't shy away from hard questions and avoid the substitution problem. Finally, cross-functional teams working closely with stakeholders through an appropriate methodology help keep the project state in check and maintain realistic planning focused on deliverable outcomes.</p><p>In short, we need to engage our deliberative and rational side when developing AI systems to produce solutions that are efficient, intuitive, and naturally interactive.</p>]]></content:encoded></item><item><title><![CDATA[[AI Agent Diaries] AI Maximalists Vs The People]]></title><description><![CDATA[The people are fighting back.]]></description><link>https://www.agentsdecoded.com/p/ai-agent-diaries-ai-maximalists-vs</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ai-agent-diaries-ai-maximalists-vs</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Sat, 01 Feb 2025 08:40:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ng6i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ng6i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ng6i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Ng6i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Ng6i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Ng6i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ng6i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg" width="1920" height="1281" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1281,&quot;width&quot;:1920,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1202157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ng6i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Ng6i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Ng6i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Ng6i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28fe5201-2afa-494d-8c71-628afab65ebf_1920x1281.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Choice and user agency - Photo by <a href="https://unsplash.com/@garri?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Vladislav Babienko</a> on <a href="https://unsplash.com/photos/man-standing-in-the-middle-of-woods-KTpSVEcU0XU?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure></div><p>While the AI experts and influencers are falling all over themselves trying to outdo each other with discussions and analysis of new LLM models (DeepSeek, <a href="https://qwenlm.github.io/blog/qwen2.5-max/">Alibaba</a>, <a href="https://mistral.ai/news/mistral-small-3/">Mistral</a>, <a href="https://openai.com/index/openai-o3-mini/">OpenAI o3</a>), a rebellion is brewing. From Reddit boards to Twitter and BlueSky threads, the people are fighting back. </p><p>Users are sharing elaborate workarounds to disable AI features in the products they have to use every day. Google Gemini - eager to summarise every single email; Apple Intelligence - keen to let me know that a &#8220;long&#8221; update has arrived (&#129335;&#127997;&#8205;&#9794;&#65039;); Copilot in Word - Clippy&#8217;s arrogant nephew. Perhaps the most telling one, though, is AI answers in Google search, with users explaining how <a href="https://arstechnica.com/google/2025/01/just-give-me-the-fing-links-cursing-disables-googles-ai-overviews/">you can use swearing in your search phrase to switch off the feature</a>. The resistance isn't just digital fatigue; it's a visceral reaction to a maximalist view of AI - more AI always equals better. </p><p>Yet, as users desperately search for ways to reclaim their pre-AI interfaces, a parallel narrative unfolds. 
Writers are getting excited about how they can use AI not as a replacement for creativity but as a <a href="https://substack.com/home/post/p-155946216">sophisticated research assistant.</a> Scientists are accelerating drug discovery. Engineers are able to explore thousands of solutions in minutes.</p><p>This dichotomy, AI as an unwanted intruder Vs AI as a revolutionary tool, reveals a fundamental misunderstanding by tech companies of how humans want to interact with technology. </p><blockquote><p>The AI maximalists, in their fervor to integrate artificial intelligence into every digital crevice, have forgotten a basic principle: tools should serve their users, not the other way around.</p></blockquote><p>The evidence for this disconnect isn't just anecdotal. Recent research from Irrational Labs reveals a stark truth: explicitly labeling features as "AI-powered" actually decreases user trust and doesn't increase willingness to pay. Users aren't impressed by AI buzzwords; they're looking for concrete benefits and real solutions.</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:155781646,&quot;url&quot;:&quot;https://www.growthunhinged.com/p/ai-messaging-study&quot;,&quot;publication_id&quot;:311430,&quot;publication_name&quot;:&quot;Kyle Poyar&#8217;s Growth Unhinged&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7accd6-4dcf-4f8a-a173-0cb8f9214fc7_500x500.png&quot;,&quot;title&quot;:&quot;Customers don't care about your AI feature&quot;,&quot;truncated_body_text&quot;:&quot;&#128075; Hi, it&#8217;s Kyle Poyar and welcome to Growth Unhinged, my weekly newsletter exploring the hidden playbooks behind the fastest-growing 
startups.&quot;,&quot;date&quot;:&quot;2025-01-29T11:50:45.546Z&quot;,&quot;like_count&quot;:105,&quot;comment_count&quot;:9,&quot;bylines&quot;:[{&quot;id&quot;:23170097,&quot;name&quot;:&quot;Kristen Berman&quot;,&quot;handle&quot;:&quot;kristenberman&quot;,&quot;previous_name&quot;:&quot;Product Teardowns&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b3eeb3c-4e65-414a-ba28-a76c355dba39_3000x3000.png&quot;,&quot;bio&quot;:&quot;I take apart popular products and reveal their core psychologies. Along the way, I uncover the secrets to their success and share how you can replicate their design principles in your work.&quot;,&quot;profile_set_up_at&quot;:&quot;2021-09-15T04:36:50.075Z&quot;,&quot;twitter_screen_name&quot;:&quot;bermster&quot;,&quot;is_guest&quot;:true,&quot;bestseller_tier&quot;:null,&quot;primaryPublicationId&quot;:476445,&quot;primaryPublicationName&quot;:&quot;Product Teardowns&quot;,&quot;primaryPublicationUrl&quot;:&quot;https://kristenberman.substack.com&quot;,&quot;primaryPublicationSubscribeUrl&quot;:&quot;https://kristenberman.substack.com/subscribe?&quot;}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:false,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://www.growthunhinged.com/p/ai-messaging-study?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!1TBj!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7accd6-4dcf-4f8a-a173-0cb8f9214fc7_500x500.png"><span class="embedded-post-publication-name">Kyle Poyar&#8217;s Growth Unhinged</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">Customers don't care about your AI 
feature</div></div><div class="embedded-post-body">&#128075; Hi, it&#8217;s Kyle Poyar and welcome to Growth Unhinged, my weekly newsletter exploring the hidden playbooks behind the fastest-growing startups&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">a year ago &#183; 105 likes &#183; 9 comments &#183; Kristen Berman</div></a></div><p>As someone who <a href="https://opendialog.ai">sells tools</a> to embed AI everywhere, I should be the quintessential AI maximalist. But my position in the industry has taught me something crucial: long-term success in AI integration isn't about forcing technology into every possible corner of user experience. It's about thoughtful implementation that prioritizes user choice and concrete benefits.</p><p>The path forward should be clear. Tech companies need to stop treating AI like a marketing checkbox and start treating it like what it is: a powerful tool that users should be able to employ on their own terms. When AI features are optional, transparent, and clearly tied to specific benefits, users embrace them. When they're forced, hidden, and justified with vague promises of "enhancement," users revolt.</p><p>The future of AI won't be built by maximalists pushing for integration at any cost. It will be built by companies who understand that the most powerful feature they can offer isn't AI itself - it's the ability to choose when and how to use it. 
In the end, the difference between AI as an unwanted intruder and AI as a revolutionary tool comes down to one simple principle: respect for user agency.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[[AI Agent Diaries] The One Where China Stood Up]]></title><description><![CDATA[Stargate, Operator, Billionaire spats, DeepSeek and Product Research]]></description><link>https://www.agentsdecoded.com/p/ai-agent-diaries-the-one-where-china</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ai-agent-diaries-the-one-where-china</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Sat, 25 Jan 2025 09:19:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db99ed-e99b-43b2-bcd5-035e248384e9_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I started out AgentsDecoded thinking that in-between opinions, reviews of frameworks and pieces exploring the essence of agency I would include a weekly news column. After three weeks of doing that, though, I realised two things:</p><p>1. 
There is so much news every week I never feel I am actually doing a proper job of it.</p><p>2. There are already <a href="https://theagentarchitect.substack.com/p/the-ai-agent-architects-monday-morning-69b">enough</a> <a href="https://substack.com/@lastweekinai">other</a> newsletters <a href="https://whatdidopenaidothisweek.substack.com/">gathering</a> announcements and news; the world does not need another one. </p><p>What I think I will do instead, which is hopefully more useful to you (and me), is use the weekly column cadence as a way to reflect on the week that has gone by. This will be a very specific point of view: that of a <a href="https://opendialog.ai">startup</a> co-founder and CPTO down in the trenches, working with people to implement AI solutions in the real world, far away from the dizzy heights of the leading AI labs and research centres with billions in investment. </p><p>Here we go.</p><div><hr></div><p>Looking back at my week, we started off with some user research at work to gather data for how we should shape our product roadmap. It is always so powerful to get unbiased views of how people are looking to use AI Agents and where they are today. Two interesting perspectives are emerging. </p><p>1. <strong>There is a lot of opportunity in just</strong> <strong>helping companies upgrade from old Conversational AI tech</strong>. There are many mid-sized organisations that have built chatbots using pre-2023 technologies, with standard ML intent classifiers and what is now <a href="https://www.linkedin.com/posts/kanesimms_genai-nlu-machinelearning-activity-7285190679448698881-hA5S?utm_source=share&amp;utm_medium=member_desktop">essentially obsolete</a> Conversational AI capability. All these organisations are now trying to figure out how to retool and move to technology stacks with GenAI, but it is a slow process (certainly much slower than the rate at which LLMs are progressing). Definitely an opportunity for agencies out there to help people move up. 
</p><p>2. <strong>There is an army of consultant AI Agents coming</strong>. The vertical AI Agent space is ripe with opportunities. People with specific domain expertise (e.g. a specific niche of contract law) can develop solutions that combine their expertise with LLM capability - effectively turning what they know as an experienced consultant into a product. We will see a flurry of startups led by human consultants developing AI Agent Consultants and selling those (together with their standard services).</p><p>In the wider world of AI Agents three news stories dominated the agenda. </p><p>Let's start with the <strong>Stargate Project</strong>. <a href="https://openai.com/index/announcing-the-stargate-project/">Five hundred billion</a> (!) in AI infrastructure investment in the US, confusingly announced from the White House but actually funded entirely by private organisations - and, apparently, only a small part of that funding is currently available, if we are to believe Elon Musk, who is not involved in the project and is annoyed because his rival Sam Altman is. Confused? Yeah, me too. What it does tell us is that nations are getting competitive. They want to lead the AI race, and win the AI wars - whatever that is supposed to mean. </p><p>The second story, which provides an interesting counterpoint to the first one, is that <a href="https://www.deepseek.com/">DeepSeek</a>, a Chinese lab that was previously mostly unknown to the western world and media, is <a href="https://substack.com/home/post/p-155354389?source=queue">crushing</a> <a href="https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1">it</a> at that very <a href="https://www.linkedin.com/posts/emollick_this-is-a-big-deal-deepseek-is-the-first-activity-7288550050916429827-qvxs?utm_source=share&amp;utm_medium=member_desktop">AI race</a>, releasing models that are cheaper, faster, better and open source. 
Sure, they are standing on the shoulders of work done by US-based labs - but they are introducing innovations that are entirely their own. Europe and the US like to tell themselves that China just copies stuff, but that is clearly not the case. Among various geopolitical implications, it also means that prices for access to LLM functionality are going to quickly start going down all around - which creates an interesting and challenging dynamic for the much more expensive-to-run US and European labs. </p><p>Finally, <a href="https://openai.com/index/introducing-operator/">OpenAI released Operator</a>, its first AI agent. It is essentially a more complete version of Computer Use, similar to what Anthropic released last year. I saw two types of reaction online: "WOW!" and "meh". Ignoring all the hype coming from the increasingly large army of &#8220;AI influencers&#8221;, there are good reasons for both excitement and reserve. As I mention <a href="https://www.linkedin.com/feed/update/urn:li:ugcPost:7288454252123881472/?commentUrn=urn%3Ali%3Acomment%3A%28ugcPost%3A7288454252123881472%2C7288459995988348928%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287288459995988348928%2Curn%3Ali%3AugcPost%3A7288454252123881472%29">here</a>, while it is impressive stuff the use cases and user experience are not quite there yet.</p><p>Much more has happened over just the past few days (e.g. Anthropic released <a href="https://www.anthropic.com/news/introducing-citations-api">Citations</a>) but I am already a bit over 700 words and that is the cap I set myself! 
Now, off for the weekend, for hopefully a couple of days where AI is not dominating what I am doing!</p>]]></content:encoded></item><item><title><![CDATA[AI Agents are not a technology. 
They are a strategy.]]></title><description><![CDATA[The debates rage on about what constitutes an AI Agent.]]></description><link>https://www.agentsdecoded.com/p/ai-agents-are-not-a-technology-they</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ai-agents-are-not-a-technology-they</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Mon, 20 Jan 2025 16:42:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5-fs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5-fs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5-fs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png 424w, https://substackcdn.com/image/fetch/$s_!5-fs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png 848w, https://substackcdn.com/image/fetch/$s_!5-fs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!5-fs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5-fs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png" width="1456" height="832" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1118638,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5-fs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png 424w, https://substackcdn.com/image/fetch/$s_!5-fs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png 848w, https://substackcdn.com/image/fetch/$s_!5-fs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!5-fs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1bbe52-737c-412e-bc98-0387538941c2_1792x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The debates rage on about what constitutes an AI Agent. </p><p>Interestingly, a lot of the focus is on defining what should not be considered an agent. "If it's a single prompt, it's not an agent!", "If it's a workflow of prompts, it's not an agent!", "Unless it's writing its own code, it's not an agent!" and on it goes. 
</p><p><strong>Everyone seems to keep raising the ante in a race to distinguish themselves.</strong> In the middle of 2024 Hugging Face <a href="https://huggingface.co/blog/agents">defined agents</a> as "a program driven by an LLM". By the end of 2024 they <a href="https://huggingface.co/blog/smolagents">updated that definition</a> to say "LLM outputs control the workflow" and introduced a star system (3 stars is the max just like Michelin restaurants). You only get 3 stars if an LLM controls "iteration and program continuation". Anthropic are <a href="https://www.agentsdecoded.com/p/an-analysis-of-anthropics-guide-to">similarly</a> trying to create tiers of agents, although the definitions are somewhat confusing. If you have a workflow of prompts it&#8217;s ok to call it an &#8220;agentic&#8221; system but what you created is not an AI agent. Agents need to "direct their own processes". </p><p><strong>Understandably, anyone looking into the field is left baffled.</strong> They want to do the right thing by their team and business. Perhaps they've been tasked to figure out how their organisation is not going to be rendered obsolete by the relentless forward march of AI and they've heard that AI Agents are the future. Surely they want 3-star agents. Why settle for anything less? </p><p>I've been working in the agent space for a long time. I co-authored a book in 2004 called "Agent-based software development". I am a fan of the approach. However, <strong>I am worried that currently technology companies are spending too much time trying to out-agent the other agents and not enough time worrying about how AI Agents can solve actual, real business problems.</strong> </p><p>If you are trying to figure out agents let me offer a different way of thinking about them. <strong>Stop thinking about how they are built. Start thinking about what it is that they enable. 
Specifically, what is it that they automate and how much can you delegate to them.</strong> </p><p><strong>AI Agents are the high-level building blocks of an automation strategy.</strong> They first encapsulate a conceptual approach - "we will delegate work to software" - and, eventually, a technological framework. Conceptually, each agent has specific capabilities and specific goals. They have a degree of autonomy or, to use a term I prefer, self-direction in determining how to deploy their capabilities to achieve their goals. The assignment of goals and the degree of self-direction required are going to depend on what you are trying to automate and what is currently achievable (safely) with the technologies at hand. Through the agent abstraction you can plug into a rich ecosystem and research field that offers decades of outputs: frameworks for defining and managing goals, for determining how best to have these programs communicate and collaborate, and for thinking about scaling, trust, safety, planning, negotiation and even argumentation. </p><p>Which of these technologies do you need to use today? How should you go about building them? It depends on the task and the breadth, ambition and risk appetite of your strategy. They don't need to be any more or less complicated than they need to be. <strong>There is no AI Agent license exam to pass.</strong> Don't worry about the definitions. 3 stars is not better than 0 stars. Look into the technologies, frameworks and patterns and find the one that will achieve your goals with the least amount of risk and in the most efficient way possible. Whether it calls itself a workflow, a multi-agent system or an agentless architecture does not matter. Your strategy does not change.</p><p>Your strategy is one of moving away from whatever way you are executing work today to one where AI Agents are embedded in your infrastructure to execute that work more efficiently and (hopefully) better. 
While technical distinctions between agent implementations matter - they inform what's feasible, maintainable, and scalable - they should flow from your strategic needs rather than define them. Similarly, while frameworks and standardization efforts can provide valuable guidance and common ground for evaluation, they should serve as tools for achieving your goals rather than constraints that dictate them. <strong>AI Agents are a strategy first, a technology second</strong> - but a successful strategy must be grounded in a clear understanding of the technological landscape and its practical implications for your specific context.</p>]]></content:encoded></item><item><title><![CDATA[[AI Agent News] Jan 11 - Jan 17 2025]]></title><description><![CDATA[A collection of news and announcements from the AI Agent world over the past week (with opining thrown in as an extra bonus): Microsoft, OpenAI, Anthropic, Deloitte.]]></description><link>https://www.agentsdecoded.com/p/ai-agent-news-jan-11-jan-17-2025</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/ai-agent-news-jan-11-jan-17-2025</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Sat, 18 Jan 2025 07:45:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hDWs!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db99ed-e99b-43b2-bcd5-035e248384e9_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Microsoft Autogen</strong> has <a href="https://www.microsoft.com/en-us/research/blog/autogen-v0-4-reimagining-the-foundation-of-agentic-ai-for-scale-extensibility-and-robustness/">a new version</a>, primarily focussed on making it easier to build, monitor and improve applications. The underlying concepts remain largely the same but the framework is moving away from a way to experiment with ideas to something that directly addresses the needs of implementation at scale, including a low-code UI.</p><div><hr></div><p><strong>Anthropic</strong> reported it <a href="https://www.anthropic.com/news/anthropic-achieves-iso-42001-certification-for-responsible-ai">achieved ISO 42001 certification for responsible AI</a>. 
&#8220;What does that mean?&#8221;, I hear you wondering. From the ISO website itself:</p><blockquote><p>The ISO/IEC 42001 standard offers organizations the comprehensive guidance they need to use AI responsibly and effectively, even as the technology is rapidly evolving. Designed to cover the various aspects of artificial intelligence and the different applications an organization may be running, it provides an integrated approach to managing AI projects, from risk assessment to effective treatment of these risks.</p></blockquote><p>&#8220;What does that still mean?&#8221;, I still hear you say. Anthropic is trying to distinguish itself from its competitors by spending more time ensuring that appropriate governance is in place for its systems. The hard truth, however, is that while a better level of governance is probably a good thing, it is not a signal of any underlying change in the fundamental challenges these technologies pose. At least Anthropic is willing to do some of the extra work though. Having gone through ISO processes myself, I can say that none of this is exciting or fulfilling, but it can uncover gaps and it does force an organisation to think through at least some issues. </p><div><hr></div><p><strong>OpenAI </strong>has a beta release of <a href="https://help.openai.com/en/articles/10291617-scheduled-tasks-in-chatgpt">scheduled Tasks in ChatGPT</a>. This is definitely very early beta and quite underwhelming. While a few are hailing it as the start of the &#8220;agentic&#8221; era at OpenAI, and it certainly is a small first step, it feels more like someone decided they had to test the basic &#8220;hello world&#8221; implementation of task capability at scale before moving on to something more inspiring, even if it would inevitably get some bad press. 
</p><div class="comment" data-attrs="{&quot;url&quot;:&quot;https://open.substack.com/home&quot;,&quot;commentId&quot;:86877013,&quot;comment&quot;:{&quot;id&quot;:86877013,&quot;date&quot;:&quot;2025-01-18T07:11:54.152Z&quot;,&quot;edited_at&quot;:null,&quot;body&quot;:&quot;While I don&#8217;t like dunking on people&#8217;s work these examples of ChatGPT Tasks from the OpenAI announcement really are uninspiring  to say the least. They are all things you can accomplish differently  with &#8220;remind me about my mom&#8217;s birthday&#8221; almost feeling like a joke. Why do I need an LLM to remind me about a birthday as opposed to a simple calendar notification?&quot;,&quot;body_json&quot;:{&quot;type&quot;:&quot;doc&quot;,&quot;attrs&quot;:{&quot;schemaVersion&quot;:&quot;v1&quot;},&quot;content&quot;:[{&quot;type&quot;:&quot;paragraph&quot;,&quot;content&quot;:[{&quot;type&quot;:&quot;text&quot;,&quot;text&quot;:&quot;While I don&#8217;t like dunking on people&#8217;s work these examples of ChatGPT Tasks from the OpenAI announcement really are uninspiring  to say the least. They are all things you can accomplish differently  with &#8220;remind me about my mom&#8217;s birthday&#8221; almost feeling like a joke. 
Why do I need an LLM to remind me about a birthday as opposed to a simple calendar notification?&quot;}]}]},&quot;restacks&quot;:0,&quot;reaction_count&quot;:0,&quot;attachments&quot;:[{&quot;id&quot;:&quot;3abacdc8-9758-4561-b193-c31a72a9353b&quot;,&quot;type&quot;:&quot;image&quot;,&quot;imageUrl&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a20c23d-4461-406b-a00e-2d8b4ddc83e3_982x210.png&quot;,&quot;imageWidth&quot;:982,&quot;imageHeight&quot;:210,&quot;explicit&quot;:false}],&quot;name&quot;:&quot;Ronald Ashri&quot;,&quot;user_id&quot;:109341091,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/6256182a-3cc0-42ac-85fb-6bd7591f5539_1168x1170.jpeg&quot;,&quot;user_bestseller_tier&quot;:null}}" data-component-name="CommentPlaceholder"></div><div><hr></div><p>In the interesting reading department we have <a href="https://arxiv.org/pdf/2501.07238">this paper</a> from <strong>Microsoft</strong> on &#8220;Lessons from Red Teaming 100 Generative AI Products&#8221;. It does not disappoint. What does disappoint is <a href="https://www2.deloitte.com/content/dam/Deloitte/us/Documents/gen-ai-multi-agents-pov-2.pdf">this report from </a><strong><a href="https://www2.deloitte.com/content/dam/Deloitte/us/Documents/gen-ai-multi-agents-pov-2.pdf">Deloitte</a></strong> on agents and multi-agent systems. Aside from introducing intractable terminology such as making &#8220;cognitive leaps&#8221; and &#8220;reinventing processes&#8221;, it then presents the most classic IT support triage scenario as a potential implementation of a multi-agent system! Look, I am all in on the benefits of these technologies, but the fact that we seem incapable of tempering expectations in actual big, complex organisations is going to hinder sensible adoption. Reinventing processes and making cognitive leaps is not going to come through the current wave of multi-agent technologies. 
It can come if we use these technologies to free people up to think longer-term so they can actually do some re-inventing. Ok, rant over. </p><p>That wraps up another week in the agent world. Don&#8217;t forget to share and subscribe. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/p/ai-agent-news-jan-11-jan-17-2025?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.agentsdecoded.com/p/ai-agent-news-jan-11-jan-17-2025?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.agentsdecoded.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[[Framework Review] Hugging Face's smolagents Framework]]></title><description><![CDATA[Reviewing Hugging Face's SmolAgents Framework, what view of agency it has, what is interesting about it and where it might go next.]]></description><link>https://www.agentsdecoded.com/p/framework-review-hugging-faces-smolagents</link><guid isPermaLink="false">https://www.agentsdecoded.com/p/framework-review-hugging-faces-smolagents</guid><dc:creator><![CDATA[Ronald Ashri]]></dc:creator><pubDate>Tue, 14 Jan 2025 16:01:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-uEG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At the very end of 2024 (literally 
Dec 31st) Hugging Face released <a href="https://huggingface.co/blog/smolagents#%F0%9F%A4%94-what-are-agents">smolagents</a> - &#8220;a simple library to build AI agents&#8221;. As is quite obvious from the name this is all about simplicity. Given that Hugging Face already had a library to support &#8220;agentic&#8221; work (Transformer Agents), it&#8217;s interesting to dive into this library, figure out what their motivations are, what it says about what AI Agents are and when you might want to consider using it. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-uEG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-uEG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png 424w, https://substackcdn.com/image/fetch/$s_!-uEG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png 848w, https://substackcdn.com/image/fetch/$s_!-uEG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!-uEG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-uEG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png" width="1456" height="933" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:378612,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-uEG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png 424w, https://substackcdn.com/image/fetch/$s_!-uEG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png 848w, https://substackcdn.com/image/fetch/$s_!-uEG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!-uEG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b7768c0-aa9a-43f5-b543-b15ef5b0b5b0_2366x1516.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Smolagents really is small &#128578;</figcaption></figure></div><p></p><h1>Why was smolagents created?</h1><p>This is Hugging Face&#8217;s successor to <a href="https://huggingface.co/blog/agents">Transformer Agents</a>, a library that was getting updates all through 2024. I assume that the learnings from that library led the HF team to feel the need to start fresh. It looks like smolagents also captures HF&#8217;s conviction that having the LLM write the code to execute functions, rather than generating JSON that is then used to execute a function, is the way to go. 
</p><p>The following seem to be the high-level reasons:</p><p><strong>Re-focus on simplicity:</strong> The logic for agents is contained within a relatively small amount of code, aiming to keep it easily understandable.</p><p><strong>First-class support for Code Agents:</strong> smolagents prioritises agents that write their actions in code. Code is seen as the most effective way to express actions a computer can perform, offering advantages over JSON-like snippets in terms of composability, object management, generality and its presence in LLM training data. </p><p><strong>Hub Integrations:</strong> smolagents allows for sharing and loading tools to and from the Hugging Face Hub, and it is expected that this functionality will expand.</p><p><strong>Support for any LLM:</strong> It supports various models, including those hosted on the Hub, as well as models from providers such as OpenAI and Anthropic.</p><h1>Agent Design Perspective: smolagents&#8217; view of Agency and Autonomy</h1><div class="pullquote"><p><em>For every framework we analyse at AgentsDecoded we look for a deeper understanding of how AI Agents can be thought of and engineered. Our reference is <a href="https://www.agentsdecoded.com/p/a-universal-framework-for-understanding">our own conceptual framework for understanding agency and autonomy</a>, which provides a consistent set of terms to fall back on.</em></p></div><p>It&#8217;s interesting to track the evolution of HF&#8217;s thinking around agents. </p><p>From the May 13th, 2024 blog post <a href="https://huggingface.co/blog/agents">announcing Transformer Agents</a> we have agents defined as:</p><blockquote><p>(..) an <strong>agent</strong>, which is just a program driven by an LLM. The agent is empowered by <strong>tools</strong> to help it perform actions. 
When the agent needs a specific skill to solve a particular problem, it relies on an appropriate tool from its toolbox.</p></blockquote><p>A quite simple definition, which in today&#8217;s &#8220;discourse&#8221; (i.e. people throwing random definitions on Substack, LinkedIn, X, bluesky, etc) would not quite cut the mustard. </p><p>Fast forward to Dec 31st, 2024 and the introduction of smolagents. While the broad agent definition remains largely the same:</p><blockquote><p>AI Agents are <strong>programs where LLM outputs control the workflow</strong>.</p></blockquote><p>it is further refined to create a star system of agent levels. This refinement is indicative of a move in the overall market to tighten up the definition. While six months ago people were happy to call essentially anything an agent, now it is more about saying &#8220;watch out - not all agents are equal&#8221;. </p><p>Here's a breakdown of the star ratings from HF:</p><p><strong>&#9734; (Zero Stars):</strong> At this level, the LLM output has <strong>no impact</strong> on the program flow. This is described as a "Simple processor", and an example is the pattern </p><blockquote><p><code>process_llm_output(llm_response)</code>. </p></blockquote><p><strong>&#9733; (One Star):</strong> Here, the LLM output determines the <strong>basic control flow</strong>. This is referred to as a "Router", and an example pattern is </p><blockquote><p><code>if llm_decision(): path_a() else: path_b()</code>. </p></blockquote><p>The LLM makes a basic decision that changes which path the program takes.</p><p><strong>&#9733;&#9733; (Two Stars):</strong> In this case, the LLM output determines the <strong>function execution</strong>. This is described as "Tool call", and an example pattern is </p><blockquote><p><code>run_function(llm_chosen_tool, llm_chosen_args)</code>. 
</p></blockquote><p>The LLM chooses which tool (function) to run, and what arguments to provide to it.</p><p><strong>&#9733;&#9733;&#9733; (Three Stars):</strong> This level represents a multi-step agent where the LLM output controls <strong>iteration and program continuation</strong>. The pattern here is:</p><blockquote><p><code>while llm_should_continue(): execute_next_step()</code>. </p></blockquote><p>The LLM decides when to continue the loop or move on to the next step based on its observations.</p><p><strong>&#9733;&#9733;&#9733; (Three Stars):</strong> This rating also applies when one agentic workflow can start another agentic workflow (not sure why this is not a higher rating but it is not my rating system!), which is described as "Multi-Agent". The example pattern is </p><blockquote><p><code>if llm_trigger(): execute_agent()</code></p></blockquote><p>The LLM decides when to trigger another agent.</p><h2>Analysis</h2><p>While I can appreciate the pragmatic view of the different levels, things only really start getting interesting at two stars, because that is when we start to potentially represent goals explicitly. Using the <a href="https://www.agentsdecoded.com/p/a-universal-framework-for-understanding">AgentsDecoded framework</a>, at zero and one stars we are dealing with what we call passive agents. These are software programs that are ascribed a goal by the user but don&#8217;t have any internal representation of that goal. At two stars we can start thinking of self-directed active agents. The software program is <em>assigned</em> a goal and it needs to figure out how to accomplish it. The simple function-calling agent has reduced self-direction: it can only request that some other system execute a function, while the three-star agent has an explicit representation of a goal and can write its own code to effect change in the environment. 
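</p><p><em>The star levels above can be sketched as ordinary Python control flow. This is a toy illustration, not HF&#8217;s code: the <code>llm</code> stub returns canned answers where a real system would call a model, and the tool names are hypothetical.</em></p>

```python
# Stub standing in for a real LLM call; a real system would query a model here.
def llm(prompt: str) -> str:
    canned = {"route": "refund", "tool": "lookup_order", "continue": "no"}
    return canned.get(prompt, "")

# One star (Router): the LLM output picks the control-flow branch.
def handle_refund() -> str: return "refund path"
def handle_other() -> str: return "other path"
answer = handle_refund() if llm("route") == "refund" else handle_other()

# Two stars (Tool call): the LLM output names the function to run.
TOOLS = {"lookup_order": lambda order_id: f"order {order_id}: shipped"}
result = TOOLS[llm("tool")](order_id=42)

# Three stars (Multi-step agent): the LLM output decides whether to keep iterating.
steps = 0
while llm("continue") == "yes":
    steps += 1  # execute_next_step() would go here

print(answer, result, steps)  # refund path order 42: shipped 0
```

<p>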
</p><p>I don&#8217;t agree that having a single agent write its own code merits the same star level as having a multi-agent system. That is a fundamentally different type of system where issues of trust, shared knowledge and observability become far more important.</p><p>Overall, the challenge that smolagents faces, from a definition perspective, is that it ties things too specifically to how something is done (i.e. how we are accomplishing something in code) and does not take a broader perspective on agents as goal-driven systems. </p><h1>When to use smolagents (for Builders)</h1><p>smolagents makes two important bets: </p><ol><li><p><strong>Keep it simple, stupid</strong>. Instead of coming with a bunch of opinions about how you should be designing agent-based systems, it focusses on the minimum required and leaves the rest to you as a builder. This is a double-edged sword. If you have a good idea of what the rest should be then this is a great starting point. However, if you don&#8217;t know what you are missing you may be in for some interesting (and costly) discoveries along the way!</p></li><li><p><strong>Writing code directly is better than writing JSON instructions.</strong> This is a big deal for smolagents. Instead of using JSON or text-based descriptions, <strong>the agent writes actions directly in code, typically Python</strong>. This is considered a significant advantage over other methods of defining actions.</p></li></ol><p>The latter point is worth unpacking, so here&#8217;s a breakdown of what this means and why it matters:</p><p><strong>Advantages of Code-Based Actions</strong>:</p><p><strong>Representation in Training Data</strong>: LLMs are already trained on vast amounts of code, making them better suited to generating actions in code rather than in other formats like JSON.</p><p><strong>Composability</strong>: Code can be nested and reused, similar to defining functions in programming. 
This contrasts with JSON, where such nesting and reusability are much harder to implement.</p><p><strong>Object Management</strong>: Storing the outputs of actions (like an image from an image generator tool) is more easily managed with code. In JSON, this is very difficult.</p><p><strong>Generality</strong>: Code is designed to express any action a computer can perform. This makes it a highly general and flexible way to define actions.</p><p><strong>Contrast with JSON-based Actions</strong>: In contrast to code-based actions, other agents, like the ReactJsonAgent in the Transformers Agents 2.0 framework, generate actions in JSON format. While JSON is a common way to represent data, smolagents argues that it&#8217;s not ideal for representing executable actions due to limitations in the areas described above.</p><p>While the arguments are compelling, I think the jury is still out on this. As ever, as a builder you need to apply nuance and look at your specific context. I can imagine multiple situations where a layer of decoupling between the definition of what to do and the actual execution is useful, especially when dealing with more complex organisational systems. Nevertheless, it is certainly a worthwhile argument to make and explore. To help explore it, here is an outline of some of the disadvantages of this approach. </p><p><strong>Disadvantages of Code-Based Actions</strong></p><p><strong>Complexity for Less Powerful LLMs:</strong> As the smolagents authors note, <strong>less powerful LLMs may struggle to generate good code</strong>. Code-based actions rely on the LLM having a strong ability to generate correct and executable code, which may not always be the case, especially with smaller or less capable models. 
In contrast, JSON structures are simpler to generate, making them more suitable for LLMs that may not be as proficient at coding.</p><p><strong>Potential for Errors and Security Risks</strong>: While smolagents uses sandboxed environments for security, the generation and execution of code introduce <strong>potential risks and error points</strong> that are not present when using simpler JSON or text-based actions. If the LLM generates code with errors, it can cause the agent to malfunction. Furthermore, the code could contain unintended or malicious actions if not properly sandboxed and validated. </p><p><strong>Increased Cognitive Load on the LLM:</strong> Generating code, as opposed to JSON, could place a <strong>higher cognitive load</strong> on the LLM, which could lead to slower performance, particularly with complex tasks or less capable models. </p><p><strong>Debugging and Transparency:</strong> While smolagents emphasises simplicity, debugging issues with code-based actions can be more difficult than debugging JSON-based actions. <strong>Errors in the generated code could be harder to diagnose and fix compared to simpler action formats</strong> such as JSON, requiring more expertise and effort. </p><p><strong>Limited Flexibility for Non-Programmers:</strong> While code provides generality, it also requires some degree of coding knowledge. This might make it <strong>less accessible for users without programming skills</strong>, potentially limiting the usage of code-based agents for non-programmers. </p><p>As you can see, while having LLMs write and execute the code is exciting, it does come at a cost. </p><h1>What smolagents adds to our understanding of Agent Systems (for Thinkers)</h1><p>For me the underlying framework is not really interesting from an agent systems perspective and, to be fair, it is not supposed to be! This is smolagents after all, not bigandcomplexagents with lots of opinions. 
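</p><p><em>The security disadvantage above is worth making concrete. Here is a deliberately naive sketch of executing agent-written code with builtins stripped and only an explicit allow-list of tools exposed. This is not smolagents&#8217; sandbox, which goes much further (a restricted interpreter and remote execution options); it only shows why some isolation layer is non-negotiable.</em></p>

```python
# LLM-generated code is untrusted input. Stripping builtins and exposing only an
# explicit allow-list of tools is the bare minimum; real sandboxes add process
# isolation, import blocking and resource limits on top.
def run_agent_code(code: str, tools: dict) -> dict:
    namespace = {"__builtins__": {}, **tools}
    exec(code, namespace)
    return namespace

tools = {"add": lambda a, b: a + b}

ok = run_agent_code("total = add(2, 3)", tools)
print(ok["total"])  # 5

# With builtins stripped, even a plain import is rejected.
try:
    run_agent_code("import os", tools)
    blocked = False
except ImportError:
    blocked = True
print(blocked)  # True
```

<p>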
The most interesting aspect is the code execution vs task definition part. If I draw some parallels to the <a href="https://www.agentsdecoded.com/p/agents-all-the-way-down">three layers of agency </a>I defined in a previous post, where you have emergent agency (coming from the LLM), structured agency (coming from a control framework) and perceived agency (the actual interface), what smolagents is doing is pushing more of the agency into the emergent part. It is relinquishing control and allowing the LLM to actually write and execute the code. </p><h1>What smolagents means for organisations (for Product Owners, CxOs)</h1><p>I have to admit a certain level of bias here. I work with organisations in regulated environments. I just cannot imagine these clients, who already have difficulty accepting basic LLM functionality, signing up to an agent framework where the agents write and execute their own code. In addition, there is a big step from what smolagents currently does to everything you need for an agent framework running a production-level application. As such it is hard for me to see this as anything other than an interesting experimental approach. </p><p>Having said that, I can imagine that some organisations with a bigger appetite for risk, and their own specific knowledge and opinions around what else they want to build on top of smolagents, could adopt this library as a great kernel from which to start. </p><div><hr></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.agentsdecoded.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Agents Decoded! This is my way of keeping up to date with the world of AI Agents. 
The goal is to provide pragmatic, actionable information and avoid hype.</p></div></div></div><p></p>]]></content:encoded></item></channel></rss>