Skill-SD

arXiv Preprint · 2026

Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

Hao Wang¹,⁵*†, Guozhi Wang⁵*‡, Han Xiao²*, Yufeng Zhou⁵, Yue Pan⁵, Jichao Wang¹, Ke Xu³, Yafei Wen⁵, Xiaohu Ruan⁵, Xiaoxin Chen⁵, Honggang Qi⁴§

¹ Hangzhou Institute for Advanced Study, UCAS
² The Chinese University of Hong Kong
³ University of Science and Technology of China
⁴ University of Chinese Academy of Sciences
⁵ vivo AI Lab
* Equal contribution · § Corresponding author · ‡ Project lead · † Intern at vivo

Paper: https://arxiv.org/abs/2604.10674 · Code: coming soon

Abstract

Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill-SD, a framework that turns the agent's own trajectories into dynamic training-only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize training, we derive an importance-weighted reverse-KL loss that provides gradient-correct token-level distillation, and we dynamically synchronize the teacher with the improving student. Experiments on agentic benchmarks show that Skill-SD substantially outperforms standard baselines, improving task accuracy over both vanilla GRPO (+14.0/+10.9 pp on AppWorld/Sokoban) and vanilla OPD (+42.1/+40.6 pp).

The helmsman sets the bearing.
The officer reads the wind.

Task reward from GRPO determines the overall direction; the skill-conditioned teacher supplies fine-grained, token-level guidance for the decisions in between.

Contributions

Key Contributions

Dynamic Skill Summaries

Each completed trajectory is asynchronously summarized into a structured skill: success patterns, mistake analysis, and a golden workflow.

Teacher-Only Guidance

Skills augment the teacher's prompt only. The student always operates under a clean task prompt, eliminating train-test mismatch.

Gradient-Correct Distillation

An importance-weighted reverse-KL loss corrects per-token gradient bias caused by teacher and student distribution mismatch.
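
The page does not spell out the loss, so the following is only a minimal sketch of one common way to importance-weight a sampled reverse-KL term. It assumes per-token log-probabilities of the sampled tokens under the student, the teacher, and the behavior policy that generated the rollout; the function name, the argument names, and the clipping cap are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def iw_reverse_kl(
    student_logp: Sequence[float],
    teacher_logp: Sequence[float],
    behavior_logp: Sequence[float],
    clip: float = 5.0,  # hypothetical cap on the importance weight
) -> float:
    """Mean importance-weighted estimate of reverse KL(student || teacher).

    Each argument holds the log-probability of the tokens that were actually
    sampled, one entry per token, under the named distribution.
    """
    n = len(student_logp)
    if n == 0 or len(teacher_logp) != n or len(behavior_logp) != n:
        raise ValueError("expected three non-empty sequences of equal length")

    total = 0.0
    for s, t, b in zip(student_logp, teacher_logp, behavior_logp):
        weight = min(math.exp(s - b), clip)  # pi_student / pi_behavior, clipped
        total += weight * (s - t)            # sampled reverse-KL term
    return total / n
```

When the behavior policy equals the current student, every weight is 1 and the estimate reduces to the mean of log π_student − log π_teacher over the sampled tokens. How the weight is treated during backpropagation is a design choice this sketch leaves open.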

Method

How it works

01

Plain-Prompt On-Policy Rollouts

The student generates rollouts using only the task prompt — no distilled skills — preserving identical train/test conditioning.

02

Trajectory-to-Skill Distillation

An auxiliary LLM compresses each episode into a reusable skill summary of successes, failures, and workflow.

03

Teacher-Only Skill Replay

Retrieved skills go only to the teacher, which re-scores the trajectory token by token — student inputs stay unchanged.

04

Joint RL + Distillation

GRPO handles trajectory-level reward; importance-weighted reverse-KL distills token-level guidance and corrects teacher-student mismatch.
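
The joint objective in step 04 can be pictured with a small sketch. It assumes the usual GRPO convention of normalizing each rollout's reward by the mean and standard deviation of its group, and it assumes the distillation term is simply added with a coefficient λ (the SDL coefficient swept later on this page). The function names, the use of the population standard deviation, and the `eps` constant are illustrative choices, not details confirmed by the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize trajectory rewards within one group of rollouts for a task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_loss(policy_loss: float, distill_loss: float, lam: float = 0.001) -> float:
    """Combine the RL loss with a lambda-weighted distillation loss (assumed additive)."""
    return policy_loss + lam * distill_loss


# Four rollouts of the same task: two succeed, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With a small λ, the distillation term acts as a shaping signal on top of the trajectory-level advantages rather than replacing them.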

Pipeline

Method Overview

Appendix Insight

What a Skill Looks Like

Skill-SD does not archive full trajectories as supervision. Each completed attempt is compressed into a compact teacher-only JSON artifact that records what to reuse, what to avoid, and what the next best rollout should do.

Teacher-Only Prompt: distilled after a completed attempt, injected only during distillation.

skill.json

{
  "success_analysis": "Using task-specific apps and authenticating restricted APIs first is the right strategy.",
  "mistake_analysis": "The main failure came from acting on unverified assumptions and calling restricted APIs before checking authentication and parameter requirements.",
  "golden_workflow": "1. Retrieve the actual bill from the file system. 2. Get roommate contact info via the contact app. 3. Authenticate Venmo, compute each share, and send the payment requests with the correct API calls."
}
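
To illustrate how such an artifact could be consumed, here is a minimal sketch that parses a skill with the three fields shown above and builds the two prompts: the student's plain task prompt and the teacher's skill-augmented one. The `Skill` class, the helper names, and the prompt template are hypothetical; the page does not show the actual prompt format.

```python
import json
from dataclasses import dataclass
from typing import Sequence

SKILL_FIELDS = ("success_analysis", "mistake_analysis", "golden_workflow")


@dataclass(frozen=True)
class Skill:
    success_analysis: str
    mistake_analysis: str
    golden_workflow: str


def parse_skill(raw: str) -> Skill:
    """Parse a skill JSON string; raises KeyError if a field is missing."""
    data = json.loads(raw)
    return Skill(**{name: str(data[name]) for name in SKILL_FIELDS})


def student_prompt(task: str) -> str:
    """The student always sees the plain task prompt, with no skills."""
    return task


def teacher_prompt(task: str, skills: Sequence[Skill]) -> str:
    """Append retrieved skills to the task (hypothetical template)."""
    parts = [task]
    for i, skill in enumerate(skills, start=1):
        parts.append(
            f"[Skill {i}]\n"
            f"What worked: {skill.success_analysis}\n"
            f"What to avoid: {skill.mistake_analysis}\n"
            f"Suggested workflow: {skill.golden_workflow}"
        )
    return "\n\n".join(parts)
```

Keeping `student_prompt` as the identity function makes the train/test symmetry explicit: only the teacher's input ever changes.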

Benchmark Results

Results on AppWorld & Sokoban

Model: Qwen3-4B-Instruct-2507. Values in parentheses are absolute changes, in percentage points, from the base model.

Method	AppWorld Acc.	AppWorld Comp.	Sokoban Acc.	Sokoban Comp.	Avg. Acc.	Avg. Comp.
Base Model	8.8%	39.1%	12.5%	32.0%	10.6%	35.6%
Vanilla OPD	22.8% (+14.0)	59.7% (+20.6)	21.9% (+9.4)	37.5% (+5.5)	22.4% (+11.7)	48.6% (+13.0)
Vanilla GRPO	50.9% (+42.1)	76.3% (+37.2)	51.6% (+39.1)	68.8% (+36.8)	51.2% (+40.6)	72.5% (+36.9)
Skill-Augmented GRPO	42.1% (+33.3)	76.1% (+37.0)	20.3% (+7.8)	37.5% (+5.5)	31.2% (+20.6)	56.8% (+21.2)
Skill-SD (Ours)	64.9% (+56.1)	84.9% (+45.8)	62.5% (+50.0)	71.1% (+39.1)	63.7% (+53.1)	78.0% (+42.4)
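
As a quick arithmetic check, the gains quoted in the abstract can be recomputed from the accuracy columns of the table above. The snippet below is only a worked calculation; the dictionary and the `gain` helper are illustrative names.

```python
# Accuracy (%) on (AppWorld, Sokoban), copied from the results table.
ACCURACY = {
    "Base Model": (8.8, 12.5),
    "Vanilla OPD": (22.8, 21.9),
    "Vanilla GRPO": (50.9, 51.6),
    "Skill-SD": (64.9, 62.5),
}


def gain(method: str, baseline: str) -> tuple:
    """Absolute accuracy difference in percentage points, per benchmark."""
    return tuple(
        round(m - b, 1) for m, b in zip(ACCURACY[method], ACCURACY[baseline])
    )


print(gain("Skill-SD", "Vanilla GRPO"))  # (14.0, 10.9)
print(gain("Skill-SD", "Vanilla OPD"))   # (42.1, 40.6)
```

These are differences in percentage points, not relative improvements.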

Figure: AppWorld training curves for Skill-SD and baselines.

Figure: Sokoban training curves for Skill-SD and baselines.

Ablations

Ablation Study

Student-Owned Rollout Is Essential

Both off-policy variants collapse during mid-training. The failure is especially severe on Sokoban, where off-policy accuracy falls to 12.5% (frozen teacher) and 10.9% (dynamic teacher), no better than the base model's 12.5%.

Dynamic Sync Keeps the Teacher Calibrated

Within on-policy training, synchronizing the teacher from the latest student checkpoint adds +15.8 pp on AppWorld and +12.5 pp on Sokoban over a frozen teacher.

Skills Should Guide the Teacher

Directly prepending skills to the student hurts performance: Skill-Augmented GRPO underperforms Vanilla GRPO on both AppWorld (42.1% vs. 50.9%) and Sokoban (20.3% vs. 51.6%).
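
The difference between a frozen and a dynamically synchronized teacher can be sketched as follows. This is a toy illustration under stated assumptions: model parameters are stood in for by a plain dictionary, and the `sync_every` interval is a hypothetical knob, since the page does not state how often the teacher is refreshed.

```python
import copy
from typing import Any, Dict, Optional


class TeacherSync:
    """Holds teacher parameters and optionally refreshes them from the student."""

    def __init__(self, student_params: Dict[str, Any], sync_every: Optional[int] = None):
        # The teacher starts as a copy of the student (self-distillation).
        self.teacher_params = copy.deepcopy(student_params)
        self.sync_every = sync_every  # None means a frozen teacher

    def maybe_sync(self, step: int, student_params: Dict[str, Any]) -> bool:
        """Copy the latest student into the teacher on sync steps; report whether it did."""
        if self.sync_every is None or step <= 0 or step % self.sync_every != 0:
            return False
        self.teacher_params = copy.deepcopy(student_params)
        return True


student = {"w": 0.0}
dynamic = TeacherSync(student, sync_every=2)
frozen = TeacherSync(student)
for step in range(1, 5):
    student["w"] += 1.0  # stand-in for one optimizer update
    dynamic.maybe_sync(step, student)
    frozen.maybe_sync(step, student)
```

After four steps the dynamic teacher has tracked the student while the frozen teacher still holds the initial parameters, which is the gap the ablation attributes to calibration.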

Values in parentheses are absolute changes, in percentage points, from Skill-SD (on-policy rollout, dynamic teacher). *Training collapsed during mid-training; values reflect the checkpoint before collapse.

Rollout	Teacher	AppWorld Acc.	AppWorld Comp.	Sokoban Acc.	Sokoban Comp.	Avg. Acc.	Avg. Comp.
On-policy	Frozen	49.1% (−15.8)	79.0% (−5.9)	50.0% (−12.5)	63.3% (−7.8)	49.6% (−14.1)	71.1% (−6.9)
On-policy	Dynamic	64.9%	84.9%	62.5%	71.1%	63.7%	78.0%
Off-policy*	Frozen	45.6% (−19.3)	78.8% (−6.1)	12.5% (−50.0)	31.3% (−39.8)	29.1% (−34.6)	55.0% (−23.0)
Off-policy*	Dynamic	42.1% (−22.8)	76.5% (−8.4)	10.9% (−51.6)	32.0% (−39.1)	26.5% (−37.2)	54.3% (−23.7)

Optimization Dynamics

Teacher and Student Distributions Converge During SDL

Figure: Token-level self-distillation dynamics on a representative AppWorld task. Teacher and student token distributions become progressively aligned, and the SDL loss decreases by 59.3% over training.

Figure: AppWorld training dynamics of the four rollout–teacher configurations (on-policy dynamic, on-policy frozen, off-policy frozen, off-policy dynamic).

Figure: Sokoban training dynamics of the four rollout–teacher configurations (on-policy dynamic, on-policy frozen, off-policy frozen, off-policy dynamic).

Hyperparameter Sweep

SDL Coefficient λ on AppWorld

The SDL coefficient mediates the RL–distillation trade-off. λ = 0.001 achieves 81.19% validation completion, acting as a mild shaping term that guides the student without dominating the RL signal.

λ	Val. Completion
0.01	Unstable, below optimum
0.005	74.66%
0.001	81.19% (best)
0.0005	75.98%
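
Reading the sweep programmatically is straightforward. The sketch below copies the reported validation completion rates and picks the best coefficient, treating the unstable λ = 0.01 run as having no usable score; the names are illustrative.

```python
from typing import Dict, Optional

# Validation completion (%) per SDL coefficient; None marks the unstable run.
SWEEP: Dict[float, Optional[float]] = {
    0.01: None,
    0.005: 74.66,
    0.001: 81.19,
    0.0005: 75.98,
}


def best_coefficient(results: Dict[float, Optional[float]]) -> float:
    """Return the coefficient with the highest reported score."""
    scored = {lam: score for lam, score in results.items() if score is not None}
    if not scored:
        raise ValueError("no run reported a usable score")
    return max(scored, key=scored.get)


print(best_coefficient(SWEEP))  # 0.001
```

The scores fall off on both sides of λ = 0.001, which is consistent with the description of the distillation term as a mild shaping signal.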

Figure: AppWorld training and validation completion-rate curves for four SDL coefficient values. λ = 0.001 achieves the best validation performance.

Training Principle

Reward sets the course.
Skill refines each turn.

GRPO provides global direction; the skill-conditioned teacher refines delicate token choices during training. At inference, the student uses only the plain task prompt.

Citation

Cite this work

@misc{wang2026skillsdskillconditionedselfdistillationmultiturn,
  title={Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents},
  author={Hao Wang and Guozhi Wang and Han Xiao and Yufeng Zhou and Yue Pan and Jichao Wang and Ke Xu and Yafei Wen and Xiaohu Ruan and Xiaoxin Chen and Honggang Qi},
  year={2026},
  eprint={2604.10674},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2604.10674},
}
