The Alchemic Blog

Practical guides, research breakdowns, and real-world insights on AI agents, automation, and the tools shaping how we work.

TABQWORLD: Teaching AI to Actually Read Tables

A training-free framework from UCLA, McGill, and HKUST switches between visual and textual table representations on the fly, achieving 4.87% higher accuracy while cutting inference latency by a third.

Your RAG Pipeline Might Be Holding Your Reasoning Model Back

A new preprint suggests document RAG can hurt reasoning models on hard benchmarks. Procedural retrieval from a 32M-recipe memory boosts accuracy by up to 19.2% in the paper's tested settings, with no fine-tuning.

Your AI Agent Is Being Played — And It Doesn't Even Know It

A new paper introduces Session Risk Memory (SRM), a lightweight module that detects distributed multi-turn attacks on AI agents by tracking behavioral drift, reporting a perfect F1 score with zero false positives.

ProMAS: Catching Multi-Agent Errors Before They Cascade

ProMAS introduces proactive error forecasting for LLM-based multi-agent systems using Markov transition dynamics, detecting reasoning failures before they propagate by monitoring semantic velocity.

OS-Themis: Teaching GUI Agents to Judge Their Own Work

OS-Themis introduces a multi-agent critic framework that decomposes GUI agent evaluation into milestone verification and verdict calibration, achieving 18.8% accuracy gains over baselines for RL-trained agents.

Helium: What If Your Agent Framework Had a SQL Optimizer?

A new paper introduces Helium, a workflow-aware LLM serving framework that treats agentic workloads like database query plans, reporting up to a 1.56x speedup from eliminating redundant compute across chained LLM calls.