Evaluating Deep Agents: Crafting Metrics for Success
Deep Agents is an open-source, model-agnostic agent system that underpins products such as Fleet and Open SWE. These agents rely on carefully crafted evaluations (evals) to stay accurate and reliable. Designing effective evals is not merely a technical necessity but a strategic endeavor: each eval shapes agent behavior and determines how useful the agent is in real-world applications.
The Importance of Thoughtful Evaluation
Every evaluation serves as a vector that influences the behavior of a Deep Agent. For instance, if an evaluation focusing on efficient file reading fails, it signals the need to adjust the system prompt or the tool description until the desired behavior is achieved. This iterative process underscores the importance of selecting evaluations that genuinely reflect the behaviors desired in production environments.
While it might be tempting to amass a vast number of evaluations to create an illusion of improvement, more evaluations do not necessarily equate to better agents. Instead, the focus should be on building targeted evaluations that mirror the desired production behaviors.
Curating Data for Evaluations
The process of curating data for evaluations involves several strategies:
Feedback from Dogfooding: By using the agents internally (dogfooding), every encountered error becomes an opportunity to refine the agent through evaluations. This hands-on approach ensures that the evaluations are grounded in real-world usage and challenges.
Adapting External Benchmarks: Evaluations can also be sourced from established benchmarks such as Terminal Bench 2.0 or BFCL. These are often adapted to suit the specific characteristics and requirements of the agent being evaluated.
Handcrafted Evaluations: Writing custom evaluations and unit tests by hand ensures that specific, critical behaviors are tested and validated. This artisanal approach allows for a deeper understanding of the agent's capabilities and limitations.
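A handcrafted evaluation can be as small as an assertion over the agent's tool-call trajectory. Here is a minimal sketch of such a check; the trajectory format and the tool names (`read_file`, `edit_file`) are illustrative assumptions, not Deep Agents' actual API:

```python
def reads_before_edits(trajectory):
    """Return True if every edit_file call is preceded by a read_file
    call on the same path earlier in the trajectory.

    Note: the trajectory shape below is a hypothetical stand-in for
    whatever trace format your agent harness produces.
    """
    seen_reads = set()
    for call in trajectory:
        if call["tool"] == "read_file":
            seen_reads.add(call["args"]["path"])
        elif call["tool"] == "edit_file" and call["args"]["path"] not in seen_reads:
            return False
    return True


# Hypothetical trajectories captured from two agent runs.
good = [
    {"tool": "read_file", "args": {"path": "app.py"}},
    {"tool": "edit_file", "args": {"path": "app.py"}},
]
bad = [
    {"tool": "edit_file", "args": {"path": "app.py"}},  # edits blind
]
```

A check like this turns "the agent should read a file before editing it" from a vague wish into a pass/fail signal you can track across prompt and tool-description changes.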
Defining Metrics for Success
When defining metrics for agent evaluations, the primary consideration is correctness. An agent must reliably complete the tasks it is designed for. To this end, multiple models are tested against a set of evaluations, with metrics such as correctness, step ratio, and tool call ratio playing pivotal roles in assessing performance.
- Correctness: This metric evaluates whether the agent successfully completes the task.
- Step Ratio: This measures the number of agent steps taken versus the ideal number of steps, with a lower ratio indicating higher efficiency.
- Tool Call Ratio: This compares the observed tool calls to the ideal number, again with a lower ratio being preferable.
- Latency Ratio: This assesses the time taken to complete a task relative to the ideal time frame.
- Solve Rate: This measures how quickly an agent can solve a task, normalized by the expected number of steps.
By focusing on these metrics, teams can make informed decisions about which models to deploy based on their accuracy and efficiency.
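The ratio metrics above can be computed straightforwardly once you have an observed run and an "ideal" baseline for the task. A minimal sketch, assuming a simple per-run record (the `RunResult` type and `score` helper are illustrative, not part of Deep Agents):

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    correct: bool
    steps: int
    tool_calls: int
    latency_s: float


def score(run, ideal_steps, ideal_tool_calls, ideal_latency_s):
    """Compute the per-task metrics described above.

    For the three ratios, lower is better: a value of 1.0 means the
    agent matched the ideal baseline, above 1.0 means it was less
    efficient than the ideal.
    """
    return {
        "correctness": 1.0 if run.correct else 0.0,
        "step_ratio": run.steps / ideal_steps,
        "tool_call_ratio": run.tool_calls / ideal_tool_calls,
        "latency_ratio": run.latency_s / ideal_latency_s,
    }


# Example: an agent that solved the task in 12 steps where 10 would be ideal.
metrics = score(
    RunResult(correct=True, steps=12, tool_calls=8, latency_s=30.0),
    ideal_steps=10, ideal_tool_calls=8, ideal_latency_s=25.0,
)
```

Averaging these per-task dictionaries across the whole suite gives a per-model scorecard, which is what makes side-by-side model comparisons possible.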
Running Evaluations
To maintain consistency and reproducibility, evaluations are run using tools such as pytest with GitHub Actions in a continuous integration environment. This setup ensures that changes are tested in a clean, controlled setting, allowing for accurate assessment of the agent's performance.
Moreover, evaluations can be selectively run using tags, which is both cost-effective and targeted. For example, if an agent is designed to handle file operations and tool usage extensively, evaluations tagged with these categories can be prioritized.
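With pytest, tag-based selection maps naturally onto markers. A sketch of how a tagged eval might look; the marker name, task string, and `run_agent` stub are all hypothetical:

```python
import pytest


def run_agent(task):
    # Stand-in for the real agent harness, so this example is
    # self-contained; a real eval would invoke the agent and capture
    # its tool-call trajectory.
    return [{"tool": "read_file", "args": {"path": "notes.txt"}}]


@pytest.mark.filesystem  # tag this eval as a file-operations test
def test_reads_requested_file():
    trajectory = run_agent("summarize notes.txt")
    tools_used = {call["tool"] for call in trajectory}
    assert "read_file" in tools_used
```

Running `pytest -m filesystem` then executes only the evals carrying that tag, so a change to the file tools can be validated without paying for the full suite. (In a real setup the marker would be registered in `pytest.ini` to avoid unknown-marker warnings.)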
Looking Ahead
The journey of refining Deep Agents is ongoing, with plans to expand the evaluation suite and explore the performance of open-source models against closed-source counterparts. The ultimate goal is to leverage evaluations not just as a measure of performance, but as a mechanism for real-time improvement.
Deep Agents, being fully open-source, invites collaboration and feedback from the community. As the landscape of AI continues to evolve, so too will the methods of evaluation, ensuring that Deep Agents remain at the forefront of innovation and utility. By carefully crafting and maintaining evaluations, we can ensure that these agents are not only intelligent but also aligned with the nuanced demands of their users.
Saksham Gupta
Founder & CEO
Saksham Gupta is the Co-Founder and Technology lead at Edubild. With extensive experience in enterprise AI, LLM systems, and B2B integration, he writes about the practical side of building AI products that work in production. Connect with him on LinkedIn for more insights on AI engineering and enterprise technology.