A new framework from China Telecom's AI lab shows that you can transfer a teacher model's reasoning ability to a student model by editing prompts — not weights. The results are surprisingly competitive with traditional distillation at 1/23rd the cost.
The Problem Nobody Talks About
Knowledge distillation has a dirty secret: it's expensive. Not "need a bigger GPU" expensive — more like "rent a cluster for three days and hope your hyperparameters are right" expensive.
The standard recipe goes like this: take a powerful teacher model, generate a mountain of training data, then fine-tune your smaller student model on that data until it absorbs the teacher's capabilities. It works. It also costs hundreds of dollars in compute, requires access to model weights (sorry, API-only users), and demands large labeled datasets.
For teams running models on edge devices, using black-box APIs, or simply not wanting to retrain every time they need a capability boost — this is a non-starter.
A group of researchers at China Telecom Digital Intelligence asked a deceptively simple question: what if the "student" could learn by updating its context window instead of its weights?
Their answer is TED (Training-Free Experience Distillation), and it might reshape how we think about making small models smarter.
What TED Actually Does
TED replaces gradient-based weight updates with something almost absurdly elegant: it builds a persistent "experience library" that gets injected into the student model's system prompt.
Here's the pipeline:
Step 1 — Trajectory Generation. For each training problem, the student model generates multiple reasoning attempts (typically 5). The teacher model independently solves the same problem. Both sets of raw reasoning traces get compressed into clean, structured chains: Premises → Step 1 → Step 2 → Conclusion.
Step 2 — Experience Extraction. The teacher model plays critic. It examines the student's attempts alongside its own correct solution and the ground truth, then extracts generalized lessons — not "here's how to solve this specific problem" but "here's a reasoning pattern that works" or "here's a failure mode to avoid." These become discrete experience items stored in a growing library.
Step 3 — Experience Compression. This is the clever bit. Naive accumulation would bloat the context window into uselessness. TED tracks usage statistics for every experience item — how often each one gets referenced during subsequent reasoning. When the library exceeds a token budget (4,000 tokens, 15 items max), the teacher selectively merges redundant experiences, rewrites vague ones, and prunes items that never get used.
At inference time, the compressed experience library is simply prepended to the system prompt. The student model's weights never change. The "learning" lives entirely in the prompt.
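The three steps above can be sketched as one short loop. Everything here is illustrative: `chat(model, prompt)` stands in for whatever chat-completion call you use, the prompts are condensed to one-liners, and the compression prompt is a stand-in for the paper's teacher-guided merge/rewrite/prune step.

```python
def ted_epoch(problems, experiences, chat,
              student="student-model", teacher="teacher-model",
              n_attempts=5, budget_items=15):
    """One pass of the TED loop. `chat(model, prompt)` is any function
    that sends a prompt to a model and returns its text response."""
    for x, ground_truth in problems:
        # Step 1 -- trajectory generation: student tries several times,
        # teacher solves the same problem independently.
        student_traces = [chat(student, f"Solve step by step:\n{x}")
                          for _ in range(n_attempts)]
        teacher_trace = chat(teacher, f"Solve step by step:\n{x}")

        # Step 2 -- experience extraction: teacher plays critic and
        # distils generalized lessons, one per line.
        critique_prompt = (
            f"Problem: {x}\nGround truth: {ground_truth}\n"
            f"Reference solution: {teacher_trace}\n"
            "Student attempts:\n" + "\n---\n".join(student_traces) +
            "\nExtract general reasoning lessons, one per line."
        )
        new_items = chat(teacher, critique_prompt).splitlines()
        experiences.extend(item.strip() for item in new_items if item.strip())

        # Step 3 -- compression once the library exceeds its budget.
        if len(experiences) > budget_items:
            merged = chat(teacher,
                          "Merge redundant experience items, rewrite vague "
                          "ones, drop unused ones:\n" + "\n".join(experiences))
            experiences[:] = [ln.strip() for ln in merged.splitlines()
                              if ln.strip()]
    return experiences
```

Running this for 3 epochs over ~100 problems reproduces the shape of the "training" process: nothing but inference calls, with the experience list as the only state.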
The Numbers
Let's cut to what matters.
Multimodal Math (MathVision)
| Model | Direct Inference | TED (100 samples) | Naive KD (3,940 samples) |
|---|---|---|---|
| Qwen3-VL-8B | 0.627 | 0.702 (+12%) | 0.764 |
| Qwen3-VL-235B | 0.746 | 0.762 (+2.1%) | — |
Visual Logic (VisualPuzzles)
| Model | Direct Inference | TED (100 samples) | Naive KD |
|---|---|---|---|
| Qwen3-VL-8B | 0.517 | 0.561 (+8.5%) | 0.583 |
| Qwen3-VL-235B | 0.572 | 0.579 (+1.2%) | — |
Text-Only Math (AIME 2025)
| Model | Direct Inference | TED (100 samples) | Naive KD |
|---|---|---|---|
| Qwen3-8B | 0.673 | 0.733 (+8.9%) | 0.791 |
| Qwen3-235B | 0.815 | 0.846 (+3.8%) | — |
The pattern is consistent: TED gives the biggest lift to smaller models (8B range) where the capacity gap to the teacher is largest. The 235B models still benefit, just less dramatically — they already have strong reasoning representations; the experience library adds refinement rather than fundamental capability.
Cost: This Is Where It Gets Interesting
The researchers ran a head-to-head cost comparison for Qwen3-VL-8B on MathVerse with 100 training samples:
| Method | Infrastructure | Time | Cost |
|---|---|---|---|
| Naive KD | 8× NVIDIA A800 GPUs | 3 days | ~$288 |
| TED | API calls only | 8 hours | ~$12.60 |
That's a 22.9× cost reduction. TED requires no GPUs at all — just API access to the student and teacher models. The entire "training" process is a series of inference calls: generate trajectories, extract experiences, compress, repeat for 3 epochs over 100 samples.
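The headline ratio is just the two table figures divided:

```python
# Sanity check on the table above: the quoted 22.9x reduction
# is the ratio of the two costs.
kd_cost = 288.00   # Naive KD: 8x A800 for 3 days
ted_cost = 12.60   # TED: API calls only
print(f"{kd_cost / ted_cost:.1f}x cheaper")  # -> 22.9x cheaper
```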
For context: $12.60 buys you a ~12% accuracy boost on MathVision for an 8B model. That's the cost of two fancy coffees.
Why This Matters More Than the Numbers Suggest
1. It Works on Black-Box Models
Most distillation requires access to model internals — logits, gradients, or at minimum the ability to fine-tune. TED needs nothing but the ability to prompt the model and read its outputs. This means you can distill knowledge into API-only models like Claude or GPT — something that's been effectively impossible with traditional KD.
2. Experiences Transfer Across Modalities
Here's a surprising finding from the ablation studies: experience learned from multimodal tasks (MathVerse) transfers to text-only benchmarks (AIME25), improving Qwen3-8B from 0.673 to 0.686. And the reverse works too — text-only experience (from DAPO-Math) improves the multimodal MathVision score from 0.627 to 0.692.
The experiences TED extracts aren't modality-specific tricks. They're abstract reasoning patterns — "check your arithmetic before concluding," "when two approaches give different answers, verify the assumptions," that kind of meta-cognitive guidance. The fact that these transfer across vision and language tasks suggests TED is capturing something genuinely general about reasoning.
3. Compression Is Non-Negotiable
The ablation on experience compression is striking: without it, TED actually hurts performance (dropping from 0.627 to 0.594 — below the no-experience baseline). Naive truncation recovers some ground (0.648), random selection does worse (0.632), but only the full teacher-guided compression reaches 0.702.
Key insight: Naive experience accumulation is worse than no experience at all. The teacher-guided compression mechanism — tracking usage, merging redundancies, pruning deadweight — is the most important contribution of the paper.
This confirms a pattern we've seen across the RAG and prompt engineering literature: more context isn't always better. The quality of what's in your context window matters enormously.
4. Stronger Teachers = Better Experiences
The teacher model quality ablation shows a clean gradient:
- Student as its own teacher (Qwen3-VL-8B → 8B): 0.656
- DeepSeek as teacher: 0.681
- Kimi-K2.5 as teacher: 0.702
- ChatGPT-5.2 as teacher: 0.719
This has practical implications: as frontier models get cheaper to query, TED's cost-effectiveness improves. You're essentially converting expensive API calls during a one-time "experience extraction" phase into permanent, reusable reasoning improvements.
How It Works Under the Hood
The mathematical formulation is clean. Traditional KD optimizes parameters θ:
θ ← θ − η ∇_θ L_KD
TED optimizes an experience set E instead:
E ← Update(E; x, y, {τᵢ}, τ_T, {rᵢ})

where x is the input, y the ground truth, {τᵢ} the student's reasoning trajectories, τ_T the teacher's trajectory, and {rᵢ} the correctness signals for the student's attempts.
At inference, the student's prompt becomes [system_prompt; E; input]. The experience set E is the only thing that changes between "untrained" and "distilled" — the model itself is identical.
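In code, the inference-time change is nothing more than string concatenation. A minimal sketch (the exact prompt layout is an assumption, and real chat APIs take structured message lists rather than one string):

```python
def build_prompt(system_prompt: str, experiences: list[str], user_input: str) -> str:
    """Assemble [system_prompt; E; input] as one string.

    Only E differs between the base student and the 'distilled' student;
    the model itself is identical.
    """
    experience_block = "\n".join(f"- {e}" for e in experiences)
    return (f"{system_prompt}\n\n"
            f"Reasoning experiences learned so far:\n{experience_block}\n\n"
            f"{user_input}")
```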
The priority scoring for experience compression uses a logarithmic utility function: s(e) = log(1 + usage_count(e)). This naturally favors frequently-used experiences while preventing any single high-count item from dominating. When compression triggers (context exceeds budget), the teacher performs four possible operations: merge similar items, rewrite for generality, delete low-utility items, or retain as-is.
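The scoring rule and the pruning it drives are easy to make concrete. This sketch implements s(e) = log(1 + usage_count(e)) and keeps the highest-priority items under the budget; the merge and rewrite operations, which the paper delegates to the teacher, are omitted, and `select_within_budget` is an illustrative name, not from the paper:

```python
import math

def priority(usage_count: int) -> float:
    # s(e) = log(1 + usage_count(e)): frequently referenced items score
    # higher, but the log keeps any single heavy hitter from dominating.
    return math.log1p(usage_count)

def select_within_budget(items: list[str], usage: dict[str, int],
                         max_items: int = 15) -> list[str]:
    # Rank experiences by priority and keep the top of the list.
    # Items never referenced score 0 and are pruned first.
    ranked = sorted(items, key=lambda e: priority(usage.get(e, 0)),
                    reverse=True)
    return ranked[:max_items]
```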
Limitations (Being Honest)
TED has real constraints:
Performance ceiling. With enough data, traditional KD outperforms TED. With the full 3,940-sample training set, Naive KD reaches 0.764 on MathVision vs. TED's ~0.71 plateau. TED is best positioned as a low-data, low-cost alternative, not a universal replacement.
Context window dependency. The experience library consumes ~4,000 tokens of context. For models with small context windows or tasks requiring maximum context for the input itself, this overhead matters.
Teacher quality matters a lot. Using a weak teacher (the student itself) gives 0.656 — meaningful, but modest. The headline results require a strong teacher like Kimi-K2.5 or ChatGPT-5.2. This creates a dependency on expensive frontier models, even if only during the experience extraction phase.
Single benchmark family. The evaluations focus on mathematical and logical reasoning. Whether TED's experience mechanism generalizes to code generation, creative writing, or open-domain QA remains untested.
What This Means for Practitioners
If you're running a smaller model (8B–30B range) and want to squeeze more reasoning performance out of it without retraining:
- Pick a strong teacher. Use the best frontier model you can afford for the extraction phase. This is a one-time cost.
- Curate your training samples. TED works with as few as 100 examples. Quality over quantity — pick representative problems from your target domain.
- Don't skip compression. The ablations show this clearly: uncompressed experience accumulation is worse than no experience at all.
- Consider cross-domain transfer. If you've already built experience libraries for one task, try them on adjacent tasks. The cross-modal transfer results suggest this could work.
Our Take
TED is one of those papers that makes you think "why didn't someone try this sooner?" The idea of accumulating reasoning experiences in prompts instead of weights has been gestured at by projects like Reflexion and Memento, but TED is the first to formalize it as a complete distillation framework with proper compression and utility tracking.
The 22.9× cost reduction is the headline, but the real innovation is the compression mechanism. It solves the fundamental scaling problem of in-context learning: you can't just keep adding more context forever. TED's teacher-guided compression — tracking what's useful, merging redundancies, pruning deadweight — is a general solution that could be applied to any system that maintains a growing prompt library.
The biggest open question is whether this scales to non-mathematical reasoning. Math problems have clean ground-truth signals that make teacher critique straightforward. Domains with fuzzier success criteria (creative writing, open-ended analysis, code quality) might not provide the same clear feedback loop. But for anyone working in structured reasoning tasks today — math, logic, code, factual QA — TED offers a genuinely practical way to make small models significantly smarter without training anything at all.