Daily Research 2026-05-15 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

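#2026-05-15 AI/LLM Latest Papers and Research Hotspots Briefing

Retrieved: 2026-05-15 08:00 CST. Coverage centers on the Hugging Face Papers list for 2026-05-13, arXiv submissions and updates from the past few days, and a search for new GitHub repositories. The arXiv API returned 429/timeout errors several times during retrieval, so this briefing cross-checks the arXiv abs pages that were reachable against the Hugging Face Papers pages. X/Twitter was not consulted directly, to avoid bringing in unverified rumors when the site is inaccessible.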

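#0. Key Takeaways

Over the last 24–72 hours, the signal closest to wenjun's main research line is not a single SOTA result but a set of highly consistent trends: agent post-training is moving from "the final answer is verifiable" toward jointly modeling trajectories, memory, world models, tool paths, and asynchronous training systems.

Four directions deserve priority:

- Engineering problems in agentic RL are becoming explicit paper topics: off-policy correction for asynchronous rollouts, infrastructure for training and serving LoRA policies, and rubric reward hacking all deal with the non-ideal factors of real agent training systems.
- World models for agents are migrating from robotics and video diffusion to MCP, enterprise systems, and software systems. This is closely related to LLM model-based RL and the Dreamer-for-agents direction, but the new papers also caution that when environment rules can be read at runtime, learning a world model is not always the best choice.
- Long-term agent memory is no longer judged only on user profiles. Evaluations now cover environment experience, privacy, online compression, and the side effects of continual updates, all of which matter for code agents, long-trajectory RL, and personal assistants.
- Latent-space and looped reasoning keeps heating up: LoopUS, Attractor Models, Multi-Stream LLMs, and Thinking in Code all challenge the default paradigm of single-stream token-by-token decoding plus external CoT.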

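#1. Top 5 Items to Watch

#1.1 MinT: Managed Infrastructure for Training and Serving Millions of LLMs

Link: https://arxiv.org/abs/2605.13779
Source: arXiv / Hugging Face Papers
Date: 2026-05-13
Category: Systems / Post-training RL / Agent Training Infrastructure
One-sentence core contribution: Proposes MindLab Toolkit (MinT), which organizes LoRA RL training, rollout, evaluation, serving, and rollback into managed infrastructure for a massive number of policy versions, supports frontier-scale dense/MoE models, and lowers the cost of policy iteration by moving only adapters.

Why it matters:

This is not yet another algorithmic trick. It abstracts the engineering shape of "training a large number of agent/policy versions": the base model stays resident, and policies flow as LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback. For agentic RL, what limits research iteration speed in the future is often not the PPO/GRPO formula for a single run but how to produce, evaluate, and deploy many policy versions reliably.

Relevance to wenjun:

wenjun focuses on LLM agent RL and the training mechanisms of foundation models. MinT can serve as a reference for understanding system-level bottlenecks in agent post-training: Dreamer-style / model-based agent RL would need similar policy version management, rollout serving, fast adapter switching, and rollback.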

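#1.2 Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Link: https://arxiv.org/abs/2605.12070
Source: arXiv / Hugging Face Papers
Date: 2026-05-12
Category: Post-training RL / LLM Agent / Systems
One-sentence core contribution: Points out that PPO-style off-policy correction in asynchronous agentic RL conflates two semantically different ratios, training-inference distribution discrepancy and policy staleness, and proposes repair methods.

Why it matters:

Training large-model agents often has to be asynchronous: rollouts are slow and optimization is expensive, so fully synchronous waiting is impractical. The paper targets a very real problem: when old logits are missing or versions are inconsistent in practice, the importance ratio is no longer a simple mathematical correction term but mixes several biases among the inference engine, the training engine, the historical policy, and the current policy.

Relevance to wenjun:

If wenjun works on long-trajectory LLM agent RL, this paper is likely more valuable than many benchmark papers. It is a reminder that the "algorithmic effect" of agent RL can be contaminated by system asynchrony and differences between the inference and training stacks. When reproducing or designing model-based RL for agents, record the policy version, inference logits, adapter revision, and sampling service configuration; otherwise credit assignment and off-policy correction both become uninterpretable.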
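
To make the two ratios concrete, here is a minimal sketch of how a token-level importance ratio factors into a policy-staleness term and a training-inference mismatch term. It is a generic algebraic identity, not the paper's repair method; the function name `decompose_ratio` and the three log-probability inputs are assumptions made for illustration.

```python
import math

def decompose_ratio(logp_current_train, logp_old_train, logp_old_infer):
    """Split one token's importance ratio into two factors (illustrative only).

    logp_current_train: log-prob under the current policy, from the training engine.
    logp_old_train:     log-prob under the old (behavior) weights, recomputed
                        by the training engine.
    logp_old_infer:     log-prob reported by the inference engine at sampling time.
    """
    # Policy staleness: current weights vs. old weights, same engine.
    staleness = math.exp(logp_current_train - logp_old_train)
    # Training-inference discrepancy: same weights, different engines.
    engine_mismatch = math.exp(logp_old_train - logp_old_infer)
    # Their product is the ratio a naive PPO-style correction would use.
    return staleness, engine_mismatch, staleness * engine_mismatch

staleness, mismatch, total = decompose_ratio(-1.0, -1.2, -1.25)
```

If `logp_old_train` or `logp_old_infer` was never stored, only some of these factors can be computed, which is one way to read the "missing old logits" problem: the two sources of bias can no longer be separated after the fact.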

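#1.3 MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments

Link: https://arxiv.org/abs/2605.09131
Source: arXiv / Hugging Face Papers
Date: 2026-05-09
Category: Model-based RL / LLM Agent / Tool-use / World Model
One-sentence core contribution: Connects a generative world model to the MCP ecosystem and proposes a "Bring Your Own World Model" style agent framework, giving agents predictive task-automation capability in tool environments.

Why it matters:

MCP is becoming a common interface through which LLMs call external tools and environments. If the world model is also abstracted as a pluggable component of an MCP agent, research on "agent pretraining + environment model + planning" moves from toy environments into a real tool ecosystem. This is a natural landing point for an LLM-agent version of Dreamer.

Relevance to wenjun:

This paper relates directly to "LLM model-based RL / Dreamer for LLM Agent". Look closely at how it defines environment state, actions, and prediction targets, and at what the world model actually predicts in MCP: tool returns, task progress, or failure modes? If it is only a generative simulator, the follow-up question is whether it can drive policy improvement rather than serve only as a planning aid.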

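#1.4 RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Link: https://arxiv.org/abs/2605.10899
Source: arXiv / Hugging Face Papers
Date: 2026-05-11
Category: Post-training RL / LLM Agent / Evaluation / Long-horizon Research Agent
One-sentence core contribution: For long-trajectory tasks without a unique ground truth, such as deep research agents, proposes a meta-RL framework that uses rubrics to organize policy execution, judge feedback, and agent memory together.

Why it matters:

RLVR has recently been strong on math and code, but tasks such as deep research, information synthesis, and experiment configuration lack a verifiable final answer. RubricEM's focus is to turn the rubric from a "final scorer" into an interface for intermediate policy decomposition and memory organization, which suits long-trajectory agents better than judging only the final answer.

Relevance to wenjun:

wenjun is interested in the shift from instruction understanding to intent understanding and in long-trajectory agent RL. RubricEM offers a researchable question: can a rubric serve as latent task state or an option decomposition? If each rubric item is treated as a subgoal, can a world model predict the probability of achieving each subgoal?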

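#1.5 LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

Link: https://arxiv.org/abs/2605.11011
Source: arXiv / Hugging Face Papers
Date: 2026-05-10, updated 2026-05-11
Category: Latent Reasoning / Test-time Scaling / Model Architecture
One-sentence core contribution: Proposes Looped Depth Up-Scaling, which recasts a standard pretrained LLM into a looped latent-space refinement architecture of encoder + looped reasoning block + decoder, obtaining test-time compute scaling at relatively low cost.

Why it matters:

The core question of latent-space reasoning is whether a model must write out every reasoning step as explicit tokens. LoopUS is valuable because it does not require training a recurrent transformer from scratch; it tries to reorganize an existing pretrained model into a structure that supports looped refinement. This is instructive for reusing existing base models for latent reasoning.

Relevance to wenjun:

wenjun has recently focused on latent-space reasoning. LoopUS reads well alongside Attractor Models, Multi-Stream LLMs, and Thinking in Code: the four approach the limits of single-stream language reasoning from looped depth, fixed-point refinement, multi-stream computation, and code as a reasoning medium, respectively.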

#2. Papers and Updates

#2.1 Agent RL / Post-training / Reward

#RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Link: https://arxiv.org/abs/2605.10899
Source: arXiv / HF Papers
Date: 2026-05-11
Category: Post-training RL / LLM Agent / Evaluation
Core contribution: Uses the rubric as a shared interface for policy decomposition, judge feedback, and agent memory, targeting tasks that are not easily verifiable, such as deep research agents.

#Missing Old Logits in Asynchronous Agentic RL

Link: https://arxiv.org/abs/2605.12070
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Post-training RL / Systems
Core contribution: Analyzes the semantic mismatch in off-policy correction caused by missing old logits in asynchronous agentic RL and discusses repair methods.

#Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Link: https://arxiv.org/abs/2605.07579
Source: arXiv / HF Papers
Date: 2026-05-08, updated 2026-05-11
Category: Post-training RL / RLVR
Core contribution: Estimates the value baseline from hidden states already produced in the policy model's forward pass, reducing the extra cost of a critic or multiple rollouts in PPO/GRPO.

#Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Link: https://arxiv.org/abs/2605.12483
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Post-training RL / Distillation
Core contribution: Proposes an empirical principle: apply sparse rewards to models with high exploration productivity, and use dense token-level teacher rewards to compress into smaller models, reinterpreting the division of labor between GRPO and on-policy distillation.

#The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Link: https://arxiv.org/abs/2605.11182
Source: arXiv / HF Papers
Date: 2026-05-11
Category: Post-training RL / Distillation
Core contribution: Systematically studies when OPD/OPSD works and when it degrades, showing that results on tasks such as mathematical reasoning are highly sensitive to teacher choice and loss formulation.

#Reward Hacking in Rubric-Based Reinforcement Learning

Link: https://arxiv.org/abs/2605.12474
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Post-training RL / Evaluation / Safety
Core contribution: Studies the discrepancy between the training verifier and a cross-model judge panel in rubric-based RL, and separates two sources of reward hacking: verifier failure and rubric-design limitations.

#On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

Link: https://arxiv.org/abs/2605.11882
Source: arXiv / HF Papers
Date: 2026-05-12
Category: LLM Agent / Safety / Post-training RL
Core contribution: Proposes FATE, which uses verifier-scored failure trajectories to generate repair supervision signals for trajectory-level safety alignment of tool-using agents.

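#2.2 World Model / Model-based Agent

#MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments

Link: https://arxiv.org/abs/2605.09131
Source: arXiv / HF Papers
Date: 2026-05-09
Category: Model-based RL / LLM Agent / Tool-use
Core contribution: Connects a world model to the MCP tool ecosystem so that agents can perform predictive automation during complex task execution.

#Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics

Link: https://arxiv.org/abs/2605.12178
Source: arXiv / HF Papers
Date: 2026-05-12
Category: World Model / Enterprise Agent / Tool-use
Core contribution: Points out that the dynamics of enterprise systems are often defined by tenant-specific, readable, and changing business logic; when rules can be read at inference time, runtime discovery may be more robust than learning a world model offline.

Research assessment: This paper is an important cautionary counterexample for model-based agents: an LLM agent's world model should not blindly learn every transition. For rules that can be queried, inspected, and executed, the agent should perhaps learn how to discover the rules rather than compress them into parameters.

#World Action Models: The Next Frontier in Embodied AI

Link: https://arxiv.org/abs/2605.12090
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Model-based RL / Embodied AI / World Model
Core contribution: Proposes the World Action Models paradigm, which unifies future-state prediction with action generation by modeling the joint distribution of future states and actions.

#World Model for Robot Learning: A Comprehensive Survey

Link: https://arxiv.org/abs/2605.00080
Source: arXiv / HF Papers
Date: early May 2026
Category: World Model / Survey / Embodied AI
Core contribution: Systematically surveys world models in robot learning; useful as an external reference for model-based RL with LLM agents.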

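#2.3 Latent Reasoning / Reasoning Structure

#LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

Link: https://arxiv.org/abs/2605.11011
Source: arXiv / HF Papers
Date: 2026-05-10, updated 2026-05-11
Category: Latent Reasoning / Test-time Scaling
Core contribution: Recasts a pretrained LLM into a looped latent refinement architecture that computes iteratively in latent space.

#Solve the Loop: Attractor Models for Language and Reasoning

Link: https://arxiv.org/abs/2605.12466
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Latent Reasoning / Architecture
Core contribution: Proposes Attractor Models, which use fixed-point solving and implicit differentiation to achieve latent refinement with adaptive iteration depth.

#Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

Link: https://arxiv.org/abs/2605.12460
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Agent Architecture / Latent Reasoning / Tool-use
Core contribution: Points out that current chat/agent systems are still limited by a single-message-stream bottleneck, and proposes the multi-stream LLM direction with parallel thoughts, inputs, and outputs.

#Teaching Language Models to Think in Code

Link: https://arxiv.org/abs/2605.07237
Source: arXiv / HF Papers
Date: 2026-05-08, updated 2026-05-11
Category: Code Intelligence / Reasoning / Tool-use
Core contribution: Proposes ThinC, which lets code itself carry the reasoning process, rather than having natural-language CoT call code as an after-the-fact verifier.

#A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning

Link: https://arxiv.org/abs/2605.13687
Source: arXiv
Date: 2026-05-13
Category: Reasoning Theory / Scaling Laws
Core contribution: On a synthetic language generated by a tree-structured broadcast process, uses an analyzable k-gram ansatz to study the roles of context length and reasoning, and gives provable benefits.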

#2.4 Agent Memory / Continual Learning / Context Compression

#δ-mem: Efficient Online Memory for Large Language Models

Link: https://arxiv.org/abs/2605.12357
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Agent Memory / Context Compression
Core contribution: Adds a small online associative memory state to a frozen full-attention backbone, updates it with the delta rule, and produces low-rank corrections to attention, compressing history into a fixed-size state.
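
For readers unfamiliar with the delta rule, here is a minimal sketch of a fixed-size associative memory written with the classic update S ← S + β (v − S k) kᵀ. It shows only the generic textbook rule in plain Python; how δ-mem parameterizes keys, values, and β, and how it turns the state into low-rank attention corrections, is not modeled here. The names `matvec` and `delta_update` are illustrative.

```python
def matvec(S, k):
    """Read from memory: return S @ k for a list-of-lists matrix S."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]

def delta_update(S, k, v, beta=1.0):
    """One delta-rule write: S <- S + beta * (v - S k) k^T."""
    pred = matvec(S, k)                           # what memory currently returns for k
    err = [v_i - p_i for v_i, p_i in zip(v, pred)]  # prediction error for this key
    return [[s_ij + beta * e_i * k_j for s_ij, k_j in zip(row, k)]
            for row, e_i in zip(S, err)]

# The state stays 2x2 no matter how many writes happen.
S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_update(S, k=[1.0, 0.0], v=[2.0, 3.0])
S = delta_update(S, k=[0.0, 1.0], v=[5.0, 7.0])
S = delta_update(S, k=[1.0, 0.0], v=[4.0, 4.0])  # overwrite the first key's value

read_first = matvec(S, [1.0, 0.0])
read_second = matvec(S, [0.0, 1.0])
```

With unit-norm orthogonal keys and `beta=1.0`, the third write replaces the value stored under the first key without disturbing the second, which is the property that makes an error-correcting update attractive for online memory.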

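#MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

Link: https://arxiv.org/abs/2605.09530
Source: arXiv / HF Papers
Date: 2026-05-10, updated 2026-05-12
Category: Agent Memory / Privacy / Edge-Cloud Agent
Core contribution: Identifies privacy-sensitive spans on the edge device and substitutes semantically structured placeholders for cloud-side memory processing, aiming to balance privacy with memory usability.

#LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Link: https://arxiv.org/abs/2605.12493
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Agent Memory / Evaluation
Core contribution: Evaluates whether a memory system can help an agent internalize affordances, state dynamics, workflows, and failure modes in web environments, so that it behaves like an "experienced colleague".

#Useful Memories Become Faulty When Continuously Updated by LLMs

Link: https://arxiv.org/abs/2605.12978
Source: HF Papers
Date: 2026-05-13 (HF Papers list)
Category: Agent Memory / Continual Learning
Core contribution: Examines how useful memories gradually turn into errors or sources of contamination when an LLM updates memory continuously; worth reading alongside LongMemEval-V2.

#Learning, Fast and Slow: Towards LLMs That Adapt Continually

Link: https://arxiv.org/abs/2605.12484
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Continual Learning / Post-training
Core contribution: Argues that learning should not be confined to either in-context or in-weights, and that LLMs should combine fast in-context adaptation with slow parameter updates across multiple timescales.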

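#2.5 Code Agent / Computer Use / Tool-use

#ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Link: https://arxiv.org/abs/2605.12481
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Computer Use Agent / Tool-use
Core contribution: Learns path selection between atomic GUI actions and high-level tool calls, addressing when to click and when to call an API in a hybrid action space.

#Continual Harness: Online Adaptation for Self-Improving Foundation Agents

Link: https://arxiv.org/abs/2605.09998
Source: arXiv / HF Papers
Date: 2026-05-11
Category: LLM Agent / Continual Learning / Self-improvement
Core contribution: Starting from experiments such as Gemini Plays Pokemon, proposes a continual harness for long-horizon, partially observable decision-making, emphasizing that tools, memory, planning, and human iteration together form a self-improving agent.

#AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration

Link: https://arxiv.org/abs/2605.11518
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Research Agent / AutoML / Systems
Core contribution: Targets automation of high-cost LLM experiment configuration, aiming to have a research agent learn from cheap settings and then optimize expensive experiment configurations.

#Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

Link: https://arxiv.org/abs/2605.12411
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Multi-agent / Agent Modeling / Evaluation
Core contribution: In negotiation/trading games, combines a small amount of interaction history, structured state, and dialogue to predict the next decision of an unfamiliar agent.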

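#2.6 Pretraining, Data, and Long Context

#Efficient Pre-Training with Token Superposition

Link: https://arxiv.org/abs/2605.06546
Source: arXiv / HF Papers
Date: 2026-05-07
Category: Pretraining / Efficiency
Core contribution: Proposes Token-Superposition Training, which merges several consecutive tokens into a bag and trains with multi-hot cross entropy before returning to standard token training, to raise pretraining data throughput per FLOP.

#A Causal Language Modeling Detour Improves Encoder Continued Pretraining

Link: https://arxiv.org/abs/2605.12438
Source: arXiv / HF Papers
Date: 2026-05-12
Category: Continual Pretraining / Encoder / Training Mechanism
Core contribution: During domain adaptation, a CLM detour followed by a brief MLM decay improves continued pretraining of biomedical encoders; the paper also analyzes the effect of CLM on the lower transformer layers.

#FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Link: https://arxiv.org/abs/2605.09932
Source: arXiv / HF Papers
Date: 2026-05-11
Category: Long Context / Fine-tuning
Core contribution: Points out that in long-context SFT the attention budget is diluted by positional bias and attention sinks, and proposes dilution-aware bilevel optimization to improve learning of long-context capabilities.

#Negation Neglect: When models fail to learn negations in training

Link: https://arxiv.org/abs/2605.13829
Source: arXiv
Date: 2026-05-13
Category: Training Data / Model Behavior / Factuality
Core contribution: Finds that finetuning on documents that repeatedly state "claim X is false" can instead lead the model to treat the claim as true, indicating that negation semantics in training data may be ignored during parameter updates.

Research assessment: This matters for the question of how pretraining data shapes capabilities and beliefs: the model may learn entity-event co-occurrence rather than the logical negation relation. It raises risks for data deduplication, counterfactual data, and training on debunking corpora.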

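#3. Top 3 Papers for Close Reading Today

Missing Old Logits in Asynchronous Agentic RL

Link: https://arxiv.org/abs/2605.12070

Reason: Very close to real LLM agent RL systems; helps judge whether experiments with asynchronous rollout + PPO/GRPO are trustworthy.

MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments

Link: https://arxiv.org/abs/2605.09131

Reason: Directly connects the MCP tool ecosystem with world models; an important lead for LLM model-based agents.

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

Link: https://arxiv.org/abs/2605.11011

Reason: Strongly related to latent-space reasoning, with a focus on reusing pretrained LLMs rather than training a new architecture from scratch.

Alternates for close reading: RubricEM (long-trajectory research agent RL), MinT (agent training/serving infrastructure), δ-mem (online memory compression).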

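#4. Top 3 Repos/Models/Datasets to Follow Today

The GitHub search used the GitHub Search API, restricted to new repositories created after 2026-05-08 and sorted by stars; the quality of new repositories still needs manual review.

sontianye/AgenticQwen

- Link: https://github.com/sontianye/AgenticQwen

- Category: Code Agent / Agentic RL

- Notes: The repository description says it reproduces AgenticQwen (arXiv:2604.21590) and includes dual-flywheel data synthesis + GRPO RL training for agentic small LLMs. A good way to follow open-source reproduction paths for small-model agentic RL.

johunsang/semble_rs

- Link: https://github.com/johunsang/semble_rs

- Category: Code Agent / Tooling / Code Search

- Notes: AI-agent-native code search that combines BM25, semantic search, Tree-sitter AST chunking, and dependency/impact analysis. A useful reference for designing tool environments for code agents.

NeuraLiying/Awesome-World-Models

- Link: https://github.com/NeuraLiying/Awesome-World-Models

- Category: World Model / Survey Resource

- Notes: A newly created list of world model papers covering video generation, autonomous driving, robotics, LLMs, 3D/4D, physics simulation, and benchmarks. A possible entry point for tracking how world models migrate to LLM agents.

Addendum: If MinT, listed on HF Papers for 2026-05-13, later releases code or a system demo, it is worth following up immediately; if ToolCUA releases data or trajectories, it would also fit into research on training data for code/GUI agents.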

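#5. Research Opportunities / Ideas

#Idea 1: Model MCP tool environments as a "queryable rules + learnable residual" world model

Do Enterprise Systems Need Learned World Models? reminds us that the rules of many software and enterprise environments are not implicit physical laws but readable configuration, API schemas, permission policies, and business logic. One could build an LLM Agent Dreamer variant:

- Explicitly read schema / docs / config as symbolic dynamics;
- Learn a residual world model that predicts what the docs leave unclear: failure modes, latency, permission boundaries, and UI state changes;
- Use both the verifiable rules and the learned residual during planning.

This fits the environment structure of software agents better than relying entirely on a neural world model to simulate tool returns.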
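
A minimal sketch of the "rules first, residual second" split might look like the following. Everything here is hypothetical: `ToolSchema`, `ResidualFailureModel`, and `predict_success` are illustrative names, the schema format is invented, and the "learned" residual is just a Laplace-smoothed failure rate standing in for a real model.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ToolSchema:
    """Symbolic dynamics read at runtime (hypothetical format)."""
    required_args: dict  # tool name -> set of required argument names

    def check(self, tool, args):
        """Return None if the call satisfies the schema, else a reason string."""
        missing = self.required_args.get(tool, set()) - set(args)
        return None if not missing else f"missing args: {sorted(missing)}"

@dataclass
class ResidualFailureModel:
    """Stand-in for a learned residual: smoothed empirical failure rate per tool."""
    counts: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def update(self, tool, failed):
        c = self.counts[tool]
        c[0] += int(failed)  # observed failures
        c[1] += 1            # observed calls

    def p_fail(self, tool):
        fails, total = self.counts[tool]
        return (fails + 1) / (total + 2)  # Laplace smoothing

def predict_success(schema, residual, tool, args):
    """Hard rules veto first; the residual covers what the rules do not state."""
    if schema.check(tool, args) is not None:
        return 0.0
    return 1.0 - residual.p_fail(tool)

schema = ToolSchema({"create_ticket": {"title", "assignee"}})
residual = ResidualFailureModel()
for failed in [True, False, False, False]:
    residual.update("create_ticket", failed)

p_ok = predict_success(schema, residual, "create_ticket",
                       {"title": "x", "assignee": "y"})
p_bad = predict_success(schema, residual, "create_ticket", {"title": "x"})
```

The design choice worth noting is that a schema violation is never averaged with learned estimates: anything the environment states explicitly is treated as a hard constraint, and only the undocumented remainder is estimated from experience.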

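#Idea 2: A reproducible experiment ledger for asynchronous agentic RL

Missing Old Logits and MinT both show that the system metadata of agent RL can strongly affect algorithmic conclusions. One could design an "RL trajectory accounting schema":

- For each trajectory, record the base model hash, adapter revision, inference engine, sampling config, old logits availability, and tool environment version;
- Distinguish training-inference discrepancy from policy staleness;
- On the same task, compare naive PPO/GRPO, correction with old logits, and repair without old logits.

This would be well suited to an empirical paper or a systems analysis article.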
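
One possible shape for such a ledger entry is sketched below. The field names follow the list above; `trainer_revision`, `staleness`, `eligible_arms`, and the arm labels are assumptions added for illustration and do not come from either paper.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TrajectoryRecord:
    """One ledger row per rollout trajectory (hypothetical schema)."""
    trajectory_id: str
    base_model_hash: str
    adapter_revision: int      # adapter revision that generated the rollout
    trainer_revision: int      # revision being optimized when the rollout is consumed
    inference_engine: str
    sampling_config: dict
    old_logits_available: bool
    tool_env_version: str

    @property
    def staleness(self) -> int:
        """Policy staleness measured in adapter revisions."""
        return self.trainer_revision - self.adapter_revision

    def eligible_arms(self) -> list:
        """Which comparison arms this trajectory can be used in."""
        arms = ["naive", "repair_without_old_logits"]
        if self.old_logits_available:
            arms.append("old_logits_correction")
        return arms

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

rec = TrajectoryRecord(
    trajectory_id="traj-0001",
    base_model_hash="abc123",            # placeholder value
    adapter_revision=41,
    trainer_revision=43,
    inference_engine="engine-A",         # placeholder value
    sampling_config={"temperature": 1.0, "top_p": 1.0},
    old_logits_available=False,
    tool_env_version="env-v1",           # placeholder value
)
```

Keeping staleness as a derived quantity and storing engine and sampling settings separately is what lets later analysis attribute a bad importance ratio to either policy staleness or training-inference discrepancy, rather than to their mixture.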

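#Idea 3: Dynamic evaluation of long-term memory along the "usefulness-contamination" axis

δ-mem, LongMemEval-V2, and Useful Memories Become Faulty When Continuously Updated by LLMs all point to one issue: more memory is not always better, and neither are more frequent updates. A code agent scenario could look like this:

- Environment: multiple rounds of issue fixing in the same codebase;
- Memory: stores API conventions, test failures, module dependencies, and historical patches;
- Perturbations: codebase refactoring, interface changes, and incorrect historical conclusions;
- Evaluation: whether memory helps the agent work like a "veteran colleague" or carries old errors into new tasks.

This naturally connects agent pretraining data, context compression, continual learning, and code intelligence.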
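
As a rough illustration of the evaluation step, the sketch below classifies stored memory entries as useful or stale against the codebase state at each round. The data model (`MemoryEntry`, a per-round dictionary of ground-truth facts) and the metric name `contamination_rate` are assumptions for this example, not taken from any of the cited papers.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    key: str         # e.g. an API name
    claim: str       # what the agent remembered about it
    written_at: int  # round index when the entry was stored

def score_memory(memory, ground_truth_by_round, round_idx):
    """Compare memory entries with the codebase facts at round_idx."""
    truth = ground_truth_by_round[round_idx]
    useful = [m for m in memory if truth.get(m.key) == m.claim]
    stale = [m for m in memory if m.key in truth and truth[m.key] != m.claim]
    total = len(useful) + len(stale)
    return {
        "useful": len(useful),
        "stale": len(stale),
        "contamination_rate": len(stale) / total if total else 0.0,
    }

# Round 1 simulates a refactor that changes one interface.
ground_truth = {
    0: {"fetch_user": "returns dict", "save": "sync"},
    1: {"fetch_user": "returns User object", "save": "sync"},
}
memory = [
    MemoryEntry("fetch_user", "returns dict", written_at=0),
    MemoryEntry("save", "sync", written_at=0),
]

before = score_memory(memory, ground_truth, 0)
after = score_memory(memory, ground_truth, 1)
```

Tracking the same memory across rounds, instead of scoring it once, is what exposes the dynamic: an entry that was useful before the refactor becomes a contamination source afterward unless the memory system updates or retires it.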

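#6. Retrieval Limitations

- The Hugging Face Papers pages were accessible and were used to collect the 2026-05-13 list.
- arXiv abs pages were accessible; the arXiv API returned 429/timeout errors during batch retrieval, so this briefing relies primarily on titles, abstracts, and dates from verified abs pages.
- X/Twitter was not accessed directly; to avoid unverified hot topics, this briefing cites no Twitter rumors and relies on arXiv, HF Papers, and the GitHub API instead.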
