AutoAgent: Open Source Library for Autonomous AI Agent Optimization

Key Takeaways

  • Eliminates the manual, repetitive labor of prompt-tuning and agent configuration by automating the optimization loop.
  • Demonstrates that autonomous meta-agents can outperform human-engineered agent harnesses on complex benchmarks like SpreadsheetBench.
  • Shifts the role of AI engineers from manual coding to high-level direction, accelerating the development of high-performance agentic systems.

AI engineers often face a repetitive and time-consuming cycle of prompt-tuning: writing system prompts, running agents against benchmarks, analyzing failure traces, and manually tweaking configurations. A new open-source library called AutoAgent, developed by Kevin Gu at thirdlayer.inc, aims to eliminate this manual labor by letting an AI autonomously engineer and optimize its own agent harness. In a 24-hour run, it reached the number one spot on SpreadsheetBench with a score of 96.5% and posted the top GPT-5 score on TerminalBench at 55.1%.

Automating the Agentic Loop

AutoAgent works much like Andrej Karpathy’s autoresearch, but it is designed specifically for agent engineering. Where autoresearch iterates to improve machine learning training, AutoAgent applies the same logic to the agent harness: the scaffolding that includes system prompts, tool definitions, routing logic, and orchestration strategies. By automating the propose-test-evaluate loop, the system lets an AI modify its own configuration, run benchmarks, and keep or discard each change depending on whether performance improved.
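
The propose-test-evaluate loop described above can be sketched as a simple greedy search. This is a minimal illustration under stated assumptions, not AutoAgent's actual implementation: the `Harness` type, `optimize`, `toy_propose`, and `toy_evaluate` are hypothetical names, and a single numeric setting stands in for a real harness and benchmark.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for a harness (prompts, tools, routing settings).
Harness = Dict[str, float]


def optimize(
    harness: Harness,
    propose: Callable[[Harness], Harness],
    evaluate: Callable[[Harness], float],
    iterations: int,
) -> Tuple[Harness, float]:
    """Greedy loop: keep a proposed change only if the score improves."""
    best, best_score = harness, evaluate(harness)
    for _ in range(iterations):
        candidate = propose(best)      # propose a modified harness
        score = evaluate(candidate)    # test it against the benchmark
        if score > best_score:         # keep on improvement, else discard
            best, best_score = candidate, score
    return best, best_score


rng = random.Random(0)  # fixed seed so the toy run is repeatable


def toy_propose(h: Harness) -> Harness:
    # Assumed mutation: nudge one numeric setting, clamped to [0, 1].
    nudged = h["temperature"] + rng.uniform(-0.1, 0.1)
    return {"temperature": min(1.0, max(0.0, nudged))}


def toy_evaluate(h: Harness) -> float:
    # Assumed benchmark: the score peaks when temperature is 0.2.
    return 1.0 - abs(h["temperature"] - 0.2)


start = {"temperature": 0.9}
best, score = optimize(start, toy_propose, toy_evaluate, iterations=200)
```

Because a candidate is accepted only when it scores strictly higher, the final score can never fall below the starting one. A real meta-agent would replace `toy_propose` with an LLM-driven rewrite of the harness and `toy_evaluate` with a full benchmark run.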
The architecture relies on a clear separation of concerns between the human and the machine. The human defines the goal in a program.md file, which serves as the directive for the meta-agent. The meta-agent then inspects the agent.py file, which contains the harness under test, and iteratively rewrites it to improve performance. A results.tsv file tracks the history of these experiments, allowing the meta-agent to learn from past attempts and calibrate future iterations.
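
One way to picture the experiment history is as an append-only TSV log that the meta-agent reads back before proposing its next change. The sketch below is hedged: the column names (`iteration`, `score`, `kept`, `note`), the helper functions, and the sample scores are all assumptions made for illustration, not AutoAgent's actual results.tsv schema or data.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumed columns; the real results.tsv layout may differ.
FIELDS = ["iteration", "score", "kept", "note"]


def log_result(path: Path, iteration: int, score: float, kept: bool, note: str) -> None:
    """Append one experiment row, writing the header if the file is new."""
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(
            {"iteration": iteration, "score": score, "kept": int(kept), "note": note}
        )


def load_history(path: Path) -> List[Dict[str, str]]:
    """Read all past experiments; values come back as strings."""
    if not path.exists():
        return []
    with path.open(newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))


def best_kept_score(history: List[Dict[str, str]]) -> float:
    """Highest score among changes that were kept."""
    kept = [float(row["score"]) for row in history if row["kept"] == "1"]
    return max(kept, default=float("-inf"))


# Made-up sample rows, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results.tsv"
    log_result(results, 1, 0.62, True, "baseline")
    log_result(results, 2, 0.58, False, "shorter system prompt")
    log_result(results, 3, 0.71, True, "added retry tool")
    history = load_history(results)
    best = best_kept_score(history)
```

Keeping discarded attempts in the log, rather than only the winners, is what would let a meta-agent avoid retrying changes that already failed.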

Domain-Agnostic Optimization

The library is built to be domain-agnostic, utilizing the Harbor format for benchmarks. Each task includes a configuration file, instructions for the agent, and a test suite that can employ either deterministic checks or an LLM-as-judge to verify performance. Because these tasks run in Docker containers, AutoAgent can be applied to any scorable domain, from spreadsheet manipulation to terminal command execution.
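
To make the idea of a scorable task concrete, the sketch below pairs each instruction with a verification function and reports the fraction of tasks passed. It does not reproduce the Harbor format: `Task`, `score`, `exact`, and `toy_agent` are hypothetical names, and only deterministic checks are shown. An LLM-as-judge could plug in as any other callable with the same `str -> bool` shape.

```python
from dataclasses import dataclass
from typing import Callable, List

Check = Callable[[str], bool]  # takes the agent's output, returns pass/fail


@dataclass
class Task:
    instruction: str  # what the agent is asked to do
    check: Check      # how the output is verified


def score(agent: Callable[[str], str], tasks: List[Task]) -> float:
    """Fraction of tasks whose check passes on the agent's output."""
    if not tasks:
        return 0.0
    passed = sum(1 for task in tasks if task.check(agent(task.instruction)))
    return passed / len(tasks)


def exact(expected: str) -> Check:
    """Deterministic check: output must equal the expected string."""
    return lambda output: output.strip() == expected


def toy_agent(instruction: str) -> str:
    # Stand-in agent that only knows how to add two integers.
    left, right = instruction.split("+")
    return str(int(left) + int(right))


tasks = [
    Task("2+3", exact("5")),
    Task("10+7", exact("17")),
    Task("1+1", exact("3")),  # deliberately wrong expectation, so this one fails
]
result = score(toy_agent, tasks)  # two of the three checks pass
```

Reducing every domain to a single number in this way is what lets the same optimization loop run unchanged across very different benchmarks.
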
This approach shifts the role of the AI engineer from manual coder to director. Instead of editing the agent harness directly, the engineer provides high-level guidance and leaves the technical optimization to the meta-agent. Observations from the project suggest that same-family model pairing, such as using a Claude meta-agent to optimize a Claude task agent, may lead to more accurate failure diagnosis. This indicates that the relationship between the meta-agent and the target agent is a significant factor in the optimization process.
