Imagine describing an energy simulation in a single sentence — "Run a reinforcement learning experiment on a 10-building cluster connected to the IEEE 33-bus grid, using a grid-aware reward, and report voltage compliance metrics" — and having an AI agent handle everything else. No framework configuration. No boilerplate code. No debugging obscure module dependencies at midnight. That's the core promise of AutoB2G, a new framework posted in March 2026 and accepted at ACM BuildSys 2026.
On the surface, it's a paper about energy systems. Underneath, it's a compelling demonstration of where multi-agent AI is headed: systems that navigate complex, dependency-heavy software environments through structured reasoning and iteratively fix their own mistakes.
The Problem: Simulation Is Stuck in the Past
Reinforcement learning (RL) has enormous potential for smart building energy management. Buildings account for a massive share of grid electricity demand, and RL agents can learn to optimize HVAC schedules, battery storage dispatch, and EV charging in ways rule-based systems simply can't match.
But here's the problem: the simulation environments researchers use to train and test these RL agents were never built with the grid in mind. Tools like CityLearn, Sinergym, and Energym are excellent at modeling building-side behavior — energy cost, comfort, peak demand. What they don't do well is model how those buildings collectively affect the power grid.
And that gap matters. A building cluster that looks great from an energy-cost perspective might be quietly destabilizing voltage levels on the local distribution network. You'd never know it from the standard simulation metrics.
The previous solution: GridLearn patched CityLearn V1 with grid models, but it became incompatible with CityLearn V2 and its newer features (EV integration, thermal dynamics modeling, custom datasets). AutoB2G picks up where GridLearn left off, building a proper B2G co-simulation on top of CityLearn V2.
There's also a secondary problem: even if you had a good simulation environment, setting one up for a specific research scenario requires substantial programming expertise. You need to understand the framework internals, configure modules correctly, wire together interfaces from multiple tools (CityLearn, EnergyPlus, Pandapower), and debug inevitable integration failures. That's a high barrier that keeps many researchers from exploring grid-aware control strategies at all.
AutoB2G: The Architecture
AutoB2G solves both problems simultaneously. It extends CityLearn V2 to support Building-to-Grid (B2G) interaction, and then layers an LLM-based agentic system on top that can construct, execute, and debug simulation workflows from natural language alone.
The framework has three major components working in concert:
1. The B2G Simulation Environment
AutoB2G integrates CityLearn V2 with Pandapower — a widely used Python library for power system analysis — to create a true co-simulation environment. At each simulation timestep, building electricity demands from CityLearn are mapped onto buses in the Pandapower network model. Power flow analysis runs to compute grid state. Grid variables like bus voltage magnitudes then feed back into the RL agent's observation space, enabling genuinely grid-aware decision-making.
This bidirectional coupling matters. The RL agent isn't just optimizing for electricity cost anymore — it can see and respond to grid conditions in real time.
AutoB2G also introduces a new grid-side evaluation suite, covering:
- Voltage admissibility — are bus voltages staying within safe per-unit limits?
- Thermal loading limits — are distribution lines being overloaded?
- N-1 resilience — can the network keep operating if any single line fails?
- Short-circuit current analysis — do demand-side control strategies inadvertently increase fault current magnitudes to dangerous levels?
That last one is subtle and important. RL-optimized building control could theoretically push the grid into configurations that compromise protection coordination — a risk that building-focused metrics would completely miss.
2. DAG-Based Agentic Retrieval
Here's where the paper gets interesting from an AI systems perspective. The central challenge for any LLM trying to generate simulation code is that it has no inherent knowledge of the framework's internals — which modules exist, what order they need to run in, which functions depend on which outputs.
AutoB2G addresses this with a Directed Acyclic Graph (DAG)-structured codebase. Every callable simulation function is represented as a node in the DAG. Edges encode dependencies — if function B requires an output from function A, there's an edge from A to B. Each node carries metadata: its inputs and outputs, its stage in the pipeline, whether it's mandatory or optional, and a natural-language description of what it does.
When a user submits a task description, a retrieval agent uses this DAG to identify the relevant modules and construct a valid execution sequence. If the proposed sequence violates any dependency constraints, a validator returns structured feedback and the agent iteratively repairs its proposal until all dependencies are satisfied.
Why DAGs beat plain RAG here: Standard retrieval-augmented generation can surface relevant code snippets — but it can't guarantee those snippets compose into a valid executable pipeline. The DAG provides structural constraints that enforce correctness, not just relevance. The agent isn't just finding code; it's finding code that fits together.
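A dependency validator of this kind is straightforward to sketch. The node schema, module names, and error messages below are hypothetical stand-ins, not AutoB2G's actual catalog:

```python
from dataclasses import dataclass

# Hypothetical node schema: each callable carries its dependencies,
# whether it is mandatory, and a natural-language description.
@dataclass
class Node:
    name: str
    deps: list                # upstream modules whose outputs this one needs
    mandatory: bool = False   # must appear in every pipeline?
    description: str = ""

CATALOG = {
    "load_dataset": Node("load_dataset", [], True, "read building time series"),
    "build_env":    Node("build_env", ["load_dataset"], True, "construct the RL env"),
    "attach_grid":  Node("attach_grid", ["build_env"], False, "couple the grid model"),
    "train_agent":  Node("train_agent", ["build_env"], True, "run RL training"),
    "grid_metrics": Node("grid_metrics", ["attach_grid", "train_agent"], False,
                         "voltage / loading / N-1 evaluation"),
}

def validate(sequence):
    """Return structured feedback on a proposed execution sequence.
    An empty list means valid; otherwise each entry names one violation
    for the retrieval agent to repair in its next proposal."""
    errors, seen = [], set()
    for name in sequence:
        node = CATALOG.get(name)
        if node is None:
            errors.append(f"unknown module: {name}")
            continue
        for dep in node.deps:
            if dep not in seen:
                errors.append(f"{name} requires {dep} earlier in the sequence")
        seen.add(name)
    for name, node in CATALOG.items():
        if node.mandatory and name not in seen:
            errors.append(f"mandatory module missing: {name}")
    return errors
```

Because each error names the offending module and the broken constraint, the agent can repair the sequence one violation at a time rather than guessing at a whole new pipeline.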
3. SOCIA: The Multi-Agent Code Generation Loop
Once the retrieval agent produces a validated module template, the SOCIA framework (Simulation Orchestration for Computational Intelligence with Agents) takes over. SOCIA treats the simulation program itself as an optimization variable and improves it through iterative cycles of generation, execution, and repair.
SOCIA coordinates five specialized agents:
| Agent | Responsibility |
|---|---|
| Workflow Manager | Receives the natural-language task; orchestrates the overall pipeline |
| Code Generator | Synthesizes and patches simulation code using the retrieved template |
| Simulation Executor | Runs the generated code; monitors compilation, interface conformance, runtime stability |
| Result Evaluator | Checks constraint violations; computes loss signal from execution results |
| Feedback Generator | Converts violations into targeted natural-language repair directives |
The feedback loop uses a technique called Textual Gradient Descent (TGD) — a conceptual analog to backpropagation, but operating over programs rather than numerical weights. The Feedback Generator produces what the paper calls a "textual gradient": a structured natural-language description of exactly which constraints were violated, which code segments are responsible, and what minimal changes would fix them. The Code Generator applies those changes, and the cycle repeats until the program passes all constraints.
It's a remarkably clean formulation. Rather than asking an LLM to get complex code right on the first try — a notoriously unreliable approach — SOCIA turns code generation into a constrained optimization problem with iterative convergence.
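The cycle can be sketched generically. The agent roles are stubbed as plain callables, and the toy instantiation is purely illustrative; in SOCIA each role would be an LLM-backed agent and "code" an actual simulation program.

```python
# Minimal sketch of the generate -> execute -> evaluate -> repair cycle.

def tgd_loop(task, generate, execute, evaluate, feedback, max_iters=5):
    """Textual-gradient-descent-style repair: keep regenerating until the
    evaluator reports no constraint violations or the budget runs out."""
    code = generate(task, template=None, gradient=None)
    for _ in range(max_iters):
        violations = evaluate(execute(code))
        if not violations:
            return code                   # converged: all constraints satisfied
        gradient = feedback(violations)   # the "textual gradient"
        code = generate(task, template=code, gradient=gradient)
    raise RuntimeError("did not converge within the iteration budget")

# Toy instantiation: the "program" is just an integer the loop must push
# past a constraint threshold, one repair step at a time.
fix = tgd_loop(
    task="x > 3",
    generate=lambda task, template, gradient: (template or 0) + 1,
    execute=lambda code: code,
    evaluate=lambda result: [] if result > 3 else ["x too small"],
    feedback=lambda violations: "increase x",
)
```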
Experimental Results: Grid-Aware RL Works
The researchers tested AutoB2G on a building cluster connected to the IEEE 33-bus distribution network, a standard benchmark for power systems research. They compared RL agents trained with building-only rewards against agents trained with grid-aware rewards that explicitly penalized voltage deviations.
The results show that grid-aware control strategies measurably improve grid-side performance metrics — reducing voltage violations, improving N-1 resilience scores, and keeping line loading within thermal limits — while the automation framework successfully generated and executed the simulation workflows from natural language without manual coding intervention.
AutoB2G demonstrated that the DAG-based retrieval mechanism consistently produced structurally valid execution pipelines, resolving dependency conflicts through the iterative repair loop. The SOCIA framework converged to executable simulation code across diverse task configurations.
Why This Matters Beyond Energy
AutoB2G is nominally a paper about building energy and power systems. But the architectural patterns it demonstrates apply broadly to any domain where AI needs to generate and execute complex workflows in specialized software environments.
The combination of DAG-structured knowledge representation + agentic retrieval + iterative code refinement is a general solution to a general problem: LLMs hallucinate when they don't know a framework's internals, and they fail silently when generated code has structural errors. AutoB2G's approach — encode dependencies explicitly, validate structurally, and iterate with feedback — sidesteps both failure modes.
The pattern in one sentence: Don't ask an LLM to generate complex code from scratch. Give it a dependency graph, a retrieval mechanism, and a self-repair loop. Then let it iterate to correctness.
This is the same pattern you see emerging in production AI agent systems — from coding assistants that run tests and fix failures, to workflow automation platforms that validate API schemas before executing integrations. The specifics differ; the architecture is converging.
Limitations Worth Noting
The paper is honest about its constraints. AutoB2G was evaluated on a single distribution network benchmark (IEEE 33-bus). Real-world grid complexity — transmission networks, meshed topologies, diverse protection schemes — adds dimensions that weren't tested. The DAG-structured codebase also requires upfront engineering investment: someone has to model the dependencies correctly before the agents can use them. And like all multi-agent pipelines, SOCIA's iterative repair loop has a compute cost that scales with task complexity.
The authors flag that extending to multi-building, multi-feeder scenarios and supporting more heterogeneous grid configurations are the next steps. There's also the question of how well TGD-based repair handles truly novel errors that fall outside the constraint categories the Feedback Generator was designed around.
The Bigger Picture
We're watching a pattern crystallize across AI research: the shift from LLMs as single-shot code generators to LLMs as components in self-correcting, constraint-aware agent loops. AutoB2G is one clear data point in that trend — and the energy domain is a high-stakes proving ground. Mistakes in grid simulation don't just produce wrong answers; they lead to control policies that could destabilize real infrastructure.
The fact that an autonomous multi-agent system can navigate that complexity — retrieving the right modules, wiring them correctly, and iterating until the simulation runs — suggests these architectures are maturing faster than most people expected.
Key Takeaways
- AutoB2G extends CityLearn V2 with real power grid models, enabling RL training that optimizes for both building efficiency and grid safety.
- A DAG-structured codebase gives the LLM agent structural knowledge of module dependencies — not just code snippets, but the correct execution order.
- SOCIA's Textual Gradient Descent loop turns code generation into iterative constraint satisfaction, dramatically improving reliability over one-shot generation.
- The architectural pattern (structured knowledge + agentic retrieval + self-repair) is broadly applicable beyond energy simulation.
- Paper: arXiv:2603.26005 — accepted at ACM BuildSys 2026, Banff, Alberta (June 2026).
Build Your Own AI Agent System
The OpenClaw Field Guide covers multi-agent orchestration, agentic workflows, and production deployment patterns — the same architecture behind AutoB2G, in a format you can actually run today.
Get the Field Guide — $10 →