The Alchemic Blog

Practical guides, research breakdowns, and real-world insights on AI agents, automation, and the tools shaping how we work.

The Agent Didn't Finish Just Because It Said It Did

A new arXiv paper proposes learning essential agent workflow states from a few passing traces, then validating future runs against those milestones instead of trusting self-reports.

SFT Might Be the Bottleneck in Your Post-Training Pipeline

A Zhejiang University preprint argues that standard SFT can make downstream RL less effective, and proposes Group Fine-Tuning as a more stable bridge between imitation and reward-based post-training.

The Bottleneck in Agent Evals Isn't Success Rate, It's Exploration

A new arXiv preprint introduces a policy-agnostic way to measure exploration and exploitation errors from agent trajectories alone, and finds that exploration failure, not exploitation, is what separates strong agents from weak ones.

TABQWORLD: Teaching AI to Actually Read Tables

A training-free framework from UCLA, McGill, and HKUST dynamically switches between visual and textual table representations, achieving 4.87% better accuracy while cutting inference latency by a third.

Your RAG Pipeline Might Be Holding Your Reasoning Model Back

A new preprint suggests document RAG can hurt reasoning models on hard benchmarks. Procedural retrieval from a 32M-recipe memory boosts accuracy by up to 19.2% in the paper's tested settings, with no fine-tuning.

Your AI Agent Is Being Played — And It Doesn't Even Know It

A new paper introduces Session Risk Memory (SRM), a lightweight module that detects distributed multi-turn attacks on AI agents by tracking behavioral drift, with a reported perfect F1 and zero false positives.

ProMAS: Catching Multi-Agent Errors Before They Cascade

ProMAS introduces proactive error forecasting for LLM-based multi-agent systems using Markov transition dynamics, detecting reasoning failures before they propagate by monitoring semantic velocity.

OS-Themis: Teaching GUI Agents to Judge Their Own Work

OS-Themis introduces a multi-agent critic framework that decomposes GUI agent evaluation into milestone verification and verdict calibration, achieving 18.8% accuracy gains over baselines for RL-trained agents.

Helium: What If Your Agent Framework Had a SQL Optimizer?

A new paper introduces Helium, a workflow-aware LLM serving framework that treats agentic workloads like database query plans, reporting up to 1.56x speedup by eliminating redundant compute across chained LLM calls.