Tag Archive

Tag: OPD

All posts tagged "OPD" are collected here so you can quickly revisit them by topic.

OPD

3 posts

Topic archive · 2026-06-17

The Off-Policy Problem in Reasoning SFT, Seen Through LUFFY: From "Memorizing the Expert's Answers" to Learning on the Student's Distribution

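Using LUFFY as an anchor, this post examines the distribution mismatch between teacher traces and the student policy in reasoning SFT, and surveys the current state of research that has since developed along directions such as RLVR, OPD, backtracking, and agent step-wise distillation.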

LLM Reasoning Think SFT Off-Policy RLVR OPD Distillation

Topic archive · 2026-05-16

OPD for Large Models: Classic Work, Development Logic, and the Latest Open Problems

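A systematic overview of On-Policy Distillation for large models: its definition, classic work, development logic, family of methods, and current open problems.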

LLM OPD On-Policy Distillation Post-Training LLM Agent

Topic archive · 2026-05-04

From OPD to OPSD / ExOPD: Reading Several Papers on On-Policy Distillation from a Group Chat

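A walkthrough of the Thinking Machines blog post on On-Policy Distillation and four papers (arXiv:2604.13016, 2603.25562, 2601.18734, 2602.12125), explaining the technical logic behind OPD, SFT cold start, the teacher-supported region, OPSD, self-distillation, multi-expert distillation, and log-prob shift.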

OPD distillation reinforcement-learning LLM-post-training OPSD ExOPD