Ask a personal agent to do something boring — book a flight, move money between accounts, reply to a customer, generate an invoice — and watch how it actually gets there. In the dominant pattern today, the agent synthesizes its behavior on the fly: it reads your prompt, surveys the tools it can reach, and improvises a chain of calls across inboxes, calendars, files, payment APIs, and message threads. Most of the time it works. The trouble is that "most of the time" is a terrible standard for the small set of tasks where a wrong email address, a misread identity, or a poisoned message turns a routine action into a real loss.
A May 2026 arXiv paper from researchers at Columbia and Google, "Engineering Robustness into Personal Agents with the AI Workflow Store," gives this problem a clean name and a sharper diagnosis. The authors argue that on-the-fly synthesis routinely bypasses the rigorous engineering process that made conventional software reliable in the first place. Their booking example is deliberately mundane, and that is the point: it surfaces failure modes like selecting the wrong recipient from an ambiguous set of contacts, resolving the wrong identity, ingesting a malicious prompt injected through an untrusted channel, and operating with tool access far broader than the task requires. None of these are exotic. They are the predictable cost of letting an agent reinvent its approach every single time.
A workflow store is not a prompt library
The instinct, when an agent misbehaves, is to write more rules into the prompt. Add a guardrail. Add a clarifying instruction. Add a warning about prompt injection. This scales badly, and Anthropic's guidance on context engineering explains why. The guidance defines context engineering as the curation and maintenance of the tokens available during inference (system prompts, tools, message history, retrieved documents, tool outputs, memory, and runtime state), and it frames context as a finite resource. A bigger window does not rescue you: model performance can degrade as context grows, and every rule in a prompt competes for the model's attention with everything else you have stuffed in there. The recommendation is to find the smallest high-signal set of tokens that maximizes the chance of the outcome you want. "Bigger prompt" is precisely the wrong default.
The AI Workflow Store proposes the alternative: a collection of hardened, reusable workflows that an agent invokes instead of improvising a fresh tool chain for each request. Crucially, this is not "no LLMs." The paper moves LLM reasoning into a narrower, bounded role — selecting the right workflow, extracting parameters, and executing under monitoring — rather than asking the model to be the architect, implementer, and operator all at once. Robustness, in this framing, is an engineered property you achieve through process, not a quality bestowed by a sufficiently clever model or prompt. The proposal rests on two bets worth stating plainly: that AI-assisted engineering can produce workflows more reliable and secure than pure improvisation, and that the cost of building those workflows is amortized through reuse.
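To make the narrower role concrete, here is a minimal sketch of what "select a workflow and extract parameters" could look like in code. It is not the paper's implementation. The `Workflow` class, the `REGISTRY` dictionary, the `dispatch` function, and the shape of the model's proposal are all hypothetical names chosen for illustration. The model's output is treated as a proposal that must name a registered workflow and supply exactly the expected parameters before anything runs.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Workflow:
    """Hypothetical registry entry: a name, its expected parameters, and the code to run."""
    name: str
    required_params: frozenset
    run: Callable[[dict], str]


def schedule_payment(params: dict) -> str:
    # Stand-in for the hardened procedure; a real one would call scoped tools.
    return f"scheduled {params['amount']} to {params['vendor']}"


# Assumed registry of hardened workflows the agent may invoke.
REGISTRY = {
    "schedule_payment": Workflow(
        "schedule_payment", frozenset({"vendor", "amount"}), schedule_payment
    ),
}


def dispatch(proposal: dict) -> str:
    """Run a model-proposed workflow only if it is registered and well-formed."""
    name = proposal.get("workflow")
    workflow = REGISTRY.get(name)
    if workflow is None:
        raise LookupError(f"unknown workflow: {name!r}")

    params: Any = proposal.get("params", {})
    if not isinstance(params, dict) or set(params) != workflow.required_params:
        raise ValueError(f"parameters do not match the schema for {name!r}")

    return workflow.run(params)


# The model contributes only the selection and the extracted parameters.
print(dispatch({"workflow": "schedule_payment",
                "params": {"vendor": "Acme", "amount": 120}}))
# prints "scheduled 120 to Acme"
```

The design choice worth noticing is that a proposal the registry does not recognize fails loudly instead of falling back to an improvised tool chain.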
What actually goes in a workflow
If you have shipped real software, the shape of this will feel familiar, which is the strongest argument in its favor. A workflow for "schedule a payment to a known vendor" is not a paragraph of instructions. It is a package:
- Typed inputs — vendor, amount, account, date — not free-text the model guesses at.
- An allowlist of tools and the permissions scoped to exactly what the task needs, nothing more.
- Confirmation points where a human approves the consequential step before it fires.
- Eval cases and adversarial tests, including the injection attempt that arrives disguised as a vendor email.
- Logging and telemetry so you can see what ran, with what parameters, and what happened.
- An owner, a version, and a deprecation or rollback policy so the thing can be maintained rather than silently rotting.
That list is just software engineering applied to agent behavior. The reason it feels novel is that the current agent stack quietly dropped most of it in exchange for the magic of runtime synthesis.
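As a rough illustration of the package described above, the sketch below puts typed inputs, a tool allowlist, a confirmation point, audit logging, and ownership metadata into one place. Everything named here is an assumption made for the example: `PaymentRequest`, `WorkflowSpec`, `run_payment`, the tool names `vendors.lookup` and `payments.schedule`, and the `call_tool` and `confirm` callables that the caller supplies. None of it comes from the paper or from any particular agent framework.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Callable


@dataclass(frozen=True)
class PaymentRequest:
    """Typed inputs: the model fills these fields; it does not free-text them."""
    vendor_id: str
    amount: Decimal
    account_id: str
    pay_on: date

    def __post_init__(self) -> None:
        if self.amount <= 0:
            raise ValueError("amount must be positive")


@dataclass(frozen=True)
class WorkflowSpec:
    """Ownership and permission metadata (hypothetical field names)."""
    name: str
    version: str
    owner: str
    allowed_tools: frozenset


SPEC = WorkflowSpec(
    name="schedule_vendor_payment",
    version="1.0.0",
    owner="payments-team",
    allowed_tools=frozenset({"vendors.lookup", "payments.schedule"}),
)


def run_payment(request: PaymentRequest,
                call_tool: Callable[..., dict],
                confirm: Callable[[str], bool],
                audit_log: list) -> str:
    def guarded(tool: str, **kwargs) -> dict:
        # Allowlist check: anything outside the spec is refused before it runs.
        if tool not in SPEC.allowed_tools:
            raise PermissionError(f"{tool} is not allowed in {SPEC.name}")
        # Telemetry: record what ran and with what parameters.
        audit_log.append({"workflow": SPEC.name, "version": SPEC.version,
                          "tool": tool, "args": kwargs})
        return call_tool(tool, **kwargs)

    vendor = guarded("vendors.lookup", vendor_id=request.vendor_id)

    # Confirmation point: a human approves before the consequential step fires.
    prompt = f"Pay {request.amount} to {vendor['name']} on {request.pay_on.isoformat()}?"
    if not confirm(prompt):
        audit_log.append({"workflow": SPEC.name, "event": "declined"})
        return "declined"

    guarded("payments.schedule", vendor_id=request.vendor_id,
            amount=str(request.amount), account_id=request.account_id,
            pay_on=request.pay_on.isoformat())
    return "scheduled"
```

Because `call_tool` and `confirm` are passed in, an eval case can swap in a stub tool and a confirmer that always declines, then assert that `payments.schedule` was never reached. Adversarial tests for an injected vendor email could be built on the same stubs.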
Why this is timely, not academic
Agent reliability is not a thought experiment for some future product. LangChain's State of Agent Engineering survey, which polled 1,340 professionals in late 2025, found 57.3% of respondents already running agents in production and another 30.4% actively building toward it. The most-cited barrier to production was quality, named by roughly 32%. There is also a revealing asymmetry in tooling: nearly 89% reported observability practices, while only about 52% reported evaluation. Teams can watch their agents, but far fewer can systematically verify them.
That gap is exactly where a workflow store earns its keep. If quality is the top blocker and you can already see what your agents do but struggle to test them, then the practical move is to constrain the task paths that matter most into workflows you can test, version, and observe repeatedly. You don't have to make every interaction rigorous. You have to make the consequential, recurring ones rigorous.
The build path
Resist the urge to convert everything. Start with an inventory of recurring tasks and rank them by volume and risk. The first workflows to harden are the ones that move money, touch customer data, or run often enough that a small error rate compounds into a large problem — paying vendors, triaging inbound email, filing support tickets, issuing refunds. Design narrow permissions for each, write the adversarial tests before you trust the happy path, and roll out to a slice before granting broad autonomy. Keep open-ended agent planning alive, but as a deliberate exception lane for genuinely novel work and supervised one-offs, not as the default execution mode for your highest-stakes actions.
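The inventory step can start as something very small. The sketch below ranks tasks by a simple volume-times-risk score. The scoring rule, the 1-to-5 risk scale, and every number in the task list are made-up assumptions for illustration, not data from the paper or the survey. A real ranking would likely weight risk by the cost of a single failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    runs_per_month: int
    risk: int  # assumed scale: 1 (harmless) to 5 (moves money or touches customer data)


def hardening_order(tasks: list) -> list:
    """Sort tasks so the highest volume-times-risk score comes first."""
    return sorted(tasks, key=lambda t: t.runs_per_month * t.risk, reverse=True)


# Hypothetical inventory; the counts and risk levels are placeholders.
inventory = [
    Task("pay vendors", runs_per_month=40, risk=5),
    Task("triage inbound email", runs_per_month=900, risk=2),
    Task("draft meeting notes", runs_per_month=150, risk=1),
    Task("issue refunds", runs_per_month=120, risk=5),
]

for task in hardening_order(inventory):
    print(task.name, task.runs_per_month * task.risk)
# prints:
# triage inbound email 1800
# issue refunds 600
# pay vendors 200
# draft meeting notes 150
```

A plain product lets sheer volume outrank a rare but costly action, which is why email triage lands above vendor payments here. If that ordering looks wrong for your operation, that is a signal to change the weighting, not the inventory.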
Where it can go wrong
A workflow store is not a safety guarantee, and pretending otherwise just relocates the false confidence. Workflows go stale as the systems around them change. A shared workflow library becomes a supply chain, with all the trust problems that implies — a popular, subtly compromised workflow is a monoculture failure waiting to happen. Marketplace dynamics can reward workflows that are widely adopted over workflows that are actually correct. And over-packaging is its own failure: force genuinely novel work through rigid procedures and you get an agent that is reliable at the wrong things. The discipline cuts both ways. Workflows need owners, reviews, and deprecation as much as any other production dependency.
The shape of the durable stack
The agent that holds up under real use will not look like a single brilliant prompt. It will look like software: packages with typed interfaces, tests that include the adversarial cases, permissions scoped to intent, telemetry you actually read, and a maintenance story for when the world shifts underneath it. The model still does the reasoning that models are uniquely good at — understanding what you meant, picking the right procedure, filling in the parameters. It just stops being asked to improvise the engineering, too. For the work that matters, that trade is worth making.
Sources
- [Engineering Robustness into Personal Agents with the AI Workflow Store](https://arxiv.org/abs/2605.10907)
- [Effective context engineering for AI agents — Anthropic](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
- [State of Agent Engineering — LangChain](https://www.langchain.com/state-of-agent-engineering)