Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Key Takeaways

  • The paper explores building AI agents by compiling a procedural workflow directly into the weights of a smaller, fine-tuned model, rather than relying on an external orchestration framework.
  • Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex.
  • All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn.
  • Prior work has shown that compiling a procedure into the weights of a small fine-tuned model works, yet developer adoption has overwhelmingly favored orchestration.
  • The authors identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).
Paper Abstract

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
This paper explores a new way to build AI agents by moving away from "orchestration frameworks"—the popular tools that act as a middleman between a user and an AI. Currently, frameworks like LangGraph or CrewAI manage agent behavior by injecting instructions and routing decisions at every step of a conversation. The authors propose a "subterranean" approach: instead of using an external manager, they compile the entire procedural workflow directly into the weights of a smaller, fine-tuned model. This allows the model to "self-orchestrate" naturally, eliminating the need for complex external logic at runtime.

How the Subterranean Approach Works

The process begins by defining an agent’s workflow as a flowchart with specific nodes and decision points. The researchers then generate thousands of synthetic conversations that follow every possible path through that flowchart. By fine-tuning a smaller model on this data, the procedure becomes part of the model’s internal knowledge rather than a set of instructions it has to read repeatedly. At runtime, the user simply talks to the model, which has learned to follow the workflow through its own internal statistical patterns.
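To make the flowchart idea concrete, here is a minimal sketch of how a workflow could be represented as a graph and how its paths could be enumerated before generating training conversations. The node names, the `WORKFLOW` dictionary, and both helper functions are hypothetical illustrations, not the authors' pipeline; in practice, each enumerated path would presumably be handed to a generator model that writes a full synthetic dialogue for it.

```python
from typing import Dict, List

# Hypothetical toy workflow (not from the paper): each node lists the
# nodes it may hand off to. Nodes with no successors are terminal.
WORKFLOW: Dict[str, List[str]] = {
    "greet": ["collect_details"],
    "collect_details": ["search_flights"],
    "search_flights": ["present_options", "no_results"],
    "present_options": ["confirm_booking", "cancel"],
    "no_results": ["cancel"],
    "confirm_booking": [],
    "cancel": [],
}


def enumerate_paths(graph: Dict[str, List[str]], start: str) -> List[List[str]]:
    """Return every simple path from `start` to a terminal node.

    Assumes the flowchart is acyclic; a node already on the current
    path is skipped, so loops are not followed.
    """
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]
        successors = graph.get(node, [])
        if not successors:
            paths.append(path)
            return
        for nxt in successors:
            if nxt not in path:
                walk(nxt, path)

    walk(start, [])
    return paths


def path_to_record(path: List[str]) -> dict:
    """Build a placeholder training record for one path.

    The `turns` list is left empty here; filling it with dialogue is
    the job of whatever conversation generator is used.
    """
    return {"path": path, "turns": []}


all_paths = enumerate_paths(WORKFLOW, "greet")
records = [path_to_record(p) for p in all_paths]
for p in all_paths:
    print(" -> ".join(p))
```

Enumerating paths first makes it easy to check that every branch of the flowchart is covered by at least one training conversation before any fine-tuning starts.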

Performance and Quality

The researchers tested this method across three domains: travel booking, Zoom technical support, and insurance claims. They found that an 8B-parameter model trained this way achieves 87–98% of the quality of a "frontier" model (like Claude Sonnet 4.5) that has the entire procedure provided in its prompt. In many cases, the compiled model actually outperformed the standard orchestration frameworks, particularly in consistency and naturalness. Because the model has internalized the workflow, it avoids the common "routing errors" that occur when an external orchestrator tries to decide which step to take next.

Significant Cost and Speed Advantages

One of the most striking findings is the massive reduction in cost and latency. Compiled models are 128–462 times cheaper per conversation than the standard in-context baseline. This is because the model no longer needs to process long, repetitive procedural instructions in every API call, and it can be self-hosted on smaller hardware. Furthermore, the "recompile" cycle—the time it takes to update the model when a procedure changes—takes only 30–50 minutes. This makes the approach a viable part of a standard software development lifecycle (CI/CD) rather than a slow, one-off research project.
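The cost argument can be illustrated with a rough back-of-the-envelope model. The sketch below assumes that each turn re-sends a fixed prompt plus the conversation so far, and that cost is proportional to input tokens. Every number in it (token counts, turn count, per-token prices) is a made-up placeholder rather than a figure from the paper, and the function `conversation_cost` is a hypothetical helper.

```python
def conversation_cost(
    turns: int,
    fixed_prompt_tokens: int,
    tokens_per_turn: int,
    usd_per_million_tokens: float,
) -> float:
    """Input-token cost of one conversation.

    Assumes turn k sends the fixed prompt plus k turns of history.
    """
    total_tokens = 0
    for turn in range(1, turns + 1):
        total_tokens += fixed_prompt_tokens + turn * tokens_per_turn
    return total_tokens * usd_per_million_tokens / 1_000_000


# Placeholder inputs, chosen only for illustration (not from the paper).
TURNS = 8
TOKENS_PER_TURN = 150

# In-context baseline: the procedure sits in the prompt on every call.
baseline = conversation_cost(
    TURNS, fixed_prompt_tokens=4000, tokens_per_turn=TOKENS_PER_TURN,
    usd_per_million_tokens=3.0,
)

# Compiled model: no procedure in the prompt, cheaper per-token price.
compiled = conversation_cost(
    TURNS, fixed_prompt_tokens=0, tokens_per_turn=TOKENS_PER_TURN,
    usd_per_million_tokens=0.2,
)

ratio = baseline / compiled
print(f"baseline ${baseline:.4f}, compiled ${compiled:.5f}, ratio {ratio:.0f}x")
```

Two factors multiply in this model: the compiled model sends far fewer tokens per turn because the procedure is no longer in the prompt, and each token is cheaper because the model is smaller. The actual ratio depends entirely on the real procedure length, conversation length, and hosting prices.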

Why This Matters

The paper argues that the industry’s preference for orchestration frameworks is based on perceived barriers that are actually quite manageable. While developers often worry that fine-tuning is too rigid or that small models aren't smart enough, this research shows that for procedural tasks, smaller models are highly effective. The authors conclude that persistent structure—the "rules" of a task—belongs in the model's weights, while the transient details of a specific conversation belong in the prompt. By shifting the workflow into the model itself, developers can create faster, cheaper, and more reliable agents.
