Daily Research 2026-05-16 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

#2026-05-16 Briefing on the Latest AI/LLM Papers and Research Hotspots

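Retrieved: 2026-05-16 08:00 (Asia/Shanghai). Main sources: Hugging Face Daily Papers, arXiv abs pages, and GitHub/project-page searches. The arXiv API returned 429 Rate exceeded during this run, so titles, abstracts, dates, and categories were verified against the arXiv paper HTML pages instead. X/Twitter was not used as a primary source, to avoid treating unreliably accessible information as fact; technical updates rely on HF, arXiv, GitHub, and project pages discoverable through DuckDuckGo.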

#0. Today's overview: "environment / memory / self-distillation" work on agents is clearly heating up

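Over the last 24-48 hours, the signals closest to wenjun's current main line cluster along three threads:

Agentic RL is moving from trajectory-level rewards toward finer-grained process supervision and self-distillation: SDAR uses the token-level signal from a privileged teacher as a gated auxiliary objective combined with the RL backbone, directly targeting the sparse-reward problem of long-trajectory agents.
Self-evolution no longer generates only problems or trajectories; it now generates "verifiable environments": Learning to Build the Environment recasts zero-data reasoning RL as an environment-construction loop whose core is solve-verify asymmetry, which is highly relevant to "eliciting self-evolving intelligence through environment design".
Agent memory is evolving from stored content toward systematic evaluation and self-optimization of retrieval mechanisms, visual evidence, and stale state: EvolveMem, PREPING, MemEye, and STALE all point to the same issue. The capability bottleneck for long-lived agents is not whether memory exists, but whether memory can cold-start before a task, update as the environment changes, retain the necessary evidence, and self-diagnose retrieval failures.

In addition, ATLAS discusses the trade-off between agentic and latent visual reasoning, SANA-WM offers an efficient long-video world model, and Orchard tries to fill the gap in open agentic training infrastructure. All three belong on wenjun's medium-term watchlist for model-based RL, latent reasoning, and agent pretraining data.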

#1. Key papers and updates

#1. Self-Distilled Agentic Reinforcement Learning

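Links: https://arxiv.org/abs/2605.15155 ; HF: https://huggingface.co/papers/2605.15155 ; Repo: https://github.com/ZJU-REAL/SDAR/
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 14 May 2026
Categories: LLM Agent / Post-training RL / Agentic RL / Tool-use
One-sentence core contribution: Proposes SDAR (Self-Distilled Agentic Reinforcement Learning), which extends on-policy self-distillation from single-turn reasoning to multi-turn agents, using a gated auxiliary objective to give long-trajectory agents token-level dense guidance while keeping RL as the main optimization backbone.

Why it matters: RL for long-trajectory LLM agents is hard because rewards are too sparse and errors compound across turns. The paper notes that transferring an OPSD-style privileged teacher directly to multi-turn agents leads to multi-turn instability and an asymmetry problem with negative teacher rejection. SDAR's key move is to map the detached token-level teacher signal to a sigmoid gate that strengthens distillation on the positive-gap tokens the teacher endorses and softly attenuates negative rejections, rather than letting the teacher fully dominate the policy update.

Relevance to wenjun's research: This paper lands almost exactly on the "agentic RL / long-trajectory RL for code agents" problem. It offers a reusable training paradigm: RL remains responsible for final task success; the teacher branch supplies intermediate token-level shaping signals; the gating mechanism keeps the teacher from amplifying wrong supervision in multi-turn settings. If wenjun later works on RL for code agents or tool-use agents, SDAR can serve as the "trajectory-level RL + local self-distillation" baseline, with a follow-up question of where the teacher's privileged context comes from: retrieved memory, execution traces, unit tests, environment state, or world model rollouts.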
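
The gating idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's loss: the gap definition (teacher log-prob minus student log-prob on the sampled token), the `temperature` and `beta` parameters, and all function names are hypothetical choices made for the example.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def token_gates(teacher_logp: List[float], student_logp: List[float],
                temperature: float = 1.0) -> List[float]:
    """Per-token gate in (0, 1).

    ASSUMPTION: gap = teacher_logp - student_logp on the sampled token.
    Positive gaps (teacher endorses the token) give gates above 0.5;
    negative gaps are softly attenuated toward 0 instead of being flipped.
    """
    return [sigmoid((t - s) / temperature)
            for t, s in zip(teacher_logp, student_logp)]


def gated_distill_loss(teacher_logp: List[float], student_logp: List[float],
                       temperature: float = 1.0) -> float:
    """Gate-weighted negative log-likelihood of the student's own tokens.

    In an autograd framework the gates would be detached (treated as
    constants), so gradients flow only through student_logp.
    """
    gates = token_gates(teacher_logp, student_logp, temperature)
    return -sum(g * s for g, s in zip(gates, student_logp)) / len(student_logp)


def total_loss(rl_loss: float, teacher_logp: List[float],
               student_logp: List[float], beta: float = 0.1,
               temperature: float = 1.0) -> float:
    """RL stays the main objective; the gated term is auxiliary (weight beta)."""
    return rl_loss + beta * gated_distill_loss(teacher_logp, student_logp,
                                               temperature)
```

With `beta=0.0` the objective reduces to the plain RL loss, which matches the framing that RL is the backbone and distillation is an add-on.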

#2. Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

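Links: https://arxiv.org/abs/2605.14392 ; HF: https://huggingface.co/papers/2605.14392
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 14 May 2026
Categories: Post-training RL / Model-based RL / Environment Design / Self-evolving Agent
One-sentence core contribution: Pushes zero-data reasoning RL from "generating data/trajectories" to "generating reusable, sampleable, scorable verifiable environments", stressing that an environment must have a stable solve-verify asymmetry.

Why it matters: The paper's central claim is that self-improvement should not rely only on the model generating more problems or answers; the model should construct the environments that train it. Each environment is an executable object that can sample instances, compute reference answers, and score responses. What makes this sustainable is solve-verify asymmetry: the model can write the oracle / verifier but cannot reliably solve new instances in natural language. Only then can a model build its own training range and still learn from it.

Relevance to wenjun's research: This paper is very close to wenjun's interests in "eliciting self-evolving intelligence through environment design" and "LLM model-based RL / Dreamer for LLM Agent". It is not a traditional Dreamer-style latent world model, but it turns environment modeling into verifiable executable environment synthesis. For code agents, the environment could be automatically generated repo bugs, unit tests, API docs, and hidden tests; for web/tool agents, a replayable task state machine; for long-horizon RL, the environment generator acts as a curriculum/world generator while the verifier is a learnable/composable reward model. Open questions worth pursuing: if the environment itself is generated by an LLM, how do we avoid verifier leakage, collapse of the environment distribution, reward hacking, and overfitting to the self-made oracle?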
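
To make "an executable object that samples instances, computes reference answers, and scores responses" concrete, here is a toy sketch. The interface (`sample`, `prompt`, `reference`, `score`) and the modular-exponentiation task are assumptions chosen for illustration; they are not taken from the paper. The task was picked because the oracle is one line of code, while working the answer out step by step in natural language is error-prone, which is the flavor of solve-verify asymmetry described above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    base: int
    exponent: int
    modulus: int


class ModPowEnv:
    """Hypothetical verifiable environment: compute base**exponent mod modulus."""

    def sample(self, seed: int) -> Instance:
        # Seeded sampling keeps instances reproducible and replayable.
        rng = random.Random(seed)
        return Instance(
            base=rng.randint(2, 10**6),
            exponent=rng.randint(10**3, 10**6),
            modulus=rng.randint(2, 10**6),
        )

    def prompt(self, inst: Instance) -> str:
        return (f"Compute {inst.base}^{inst.exponent} mod {inst.modulus}. "
                "Answer with a single integer.")

    def reference(self, inst: Instance) -> int:
        # The oracle is cheap in code: three-argument pow.
        return pow(inst.base, inst.exponent, inst.modulus)

    def score(self, inst: Instance, response: str) -> float:
        # Binary reward; unparseable responses score 0.
        try:
            answer = int(response.strip())
        except ValueError:
            return 0.0
        return 1.0 if answer == self.reference(inst) else 0.0
```

A training loop would call `sample`, show `prompt` to the policy, and use `score` as the reward; the policy never sees `reference`.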

#3. ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

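Links: https://arxiv.org/abs/2605.15198 ; HF: https://huggingface.co/papers/2605.15198 ; Repo: https://github.com/ZiyuGuo99/ATLAS
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 14 May 2026
Categories: Latent Reasoning / LLM Agent / Multimodal Reasoning / Tool-use
One-sentence core contribution: Proposes ATLAS, which tries to bridge agentic visual reasoning and latent visual reasoning with a single discrete "word/token", reducing external tool-call latency while improving the generalization of latent methods and their compatibility with autoregressive training.

Why it matters: Visual reasoning has two routes: generating intermediate visual states through code/tools, or reasoning implicitly in latent embeddings. The former is interpretable and generalizes but is slow; the latter is efficient but hard to train and generalizes poorly. The ATLAS abstract compares the two directly: agentic methods incur context-switching latency, while latent methods lack task generalization and are ill-suited to parallel autoregressive training. ATLAS uses a discrete, tokenized intermediate reasoning interface as a compromise between the two.

Relevance to wenjun's research: Although the paper's setting is visual reasoning, the problem structure transfers to LLM latent-space reasoning. Can a small number of discrete latent/action tokens stand for "internal tool calls" or "implicit state transitions"? Can an agent's external tool trace be distilled into latent tokens to reduce tool calls at inference time? Does latent reasoning need a verifiable external trajectory as training scaffolding? These questions inform the research path "from explicit agentic traces to latent-space reasoning".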

#4. Orchard: An Open-Source Agentic Modeling Framework

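Links: https://arxiv.org/abs/2605.15040 ; HF: https://huggingface.co/papers/2605.15040 ; Repo: https://github.com/microsoft/Orchard/tree/readme-only
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 14 May 2026
Categories: LLM Agent / Tool-use / Systems / Agent Training Infrastructure
One-sentence core contribution: Microsoft proposes an open-source agentic modeling framework for scalable agent training, comprising environment services, sandbox lifecycle management, an agent harness, and training/evaluation infrastructure.

Why it matters: Agent research increasingly depends on infrastructure: real tools, sandboxes, task environments, rollout collection, evaluation, reward computation, and failure diagnosis. Orchard states that open research is currently limited by proprietary codebases/models/services, and that existing open-source frameworks lean toward orchestration/evaluation rather than scalable agent training. It aims to open-source the training infrastructure for agentic modeling.

Relevance to wenjun's research: If wenjun works on code agent RL or long-trajectory agent pretraining, infrastructure is itself a research bottleneck. Orchard is a useful comparison point: see how it abstracts the sandbox lifecycle, task domain primitives, the agent harness, and the rollout / reward / evaluation interfaces, and whether it supports a closed data loop across multiple models, multiple environments, and multi-turn interaction. It can also be compared side by side with OpenClaw/Hermes Agent, SWE-bench-style environments, and future model-based RL world simulators.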

#5. EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

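Links: https://arxiv.org/abs/2605.13941 ; HF: https://huggingface.co/papers/2605.13941 ; Repo: https://github.com/aiming-lab/SimpleMem/tree/main/EvolveMem
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 13 May 2026
Categories: LLM Agent / Memory / Self-evolving Agent / Evaluation
One-sentence core contribution: Proposes EvolveMem, which exposes the retrieval configuration of a long-term memory system as a structured action space, letting an LLM diagnosis module automatically adjust retrieval, fusion, and answering strategies based on failure logs, with rollback and exploration mechanisms.

Why it matters: Most memory agents update only what is stored, while the retrieval mechanism, fusion strategy, and generation strategy stay fixed. EvolveMem's position is that truly adaptive memory requires stored knowledge and the retrieval mechanism to co-evolve. Each round, the system reads per-question failure logs, diagnoses root causes, and proposes configuration changes; a guarded meta-analyzer rolls back regressions and explores when progress stalls.

Relevance to wenjun's research: This paper works as a concrete module-level case study of agent self-evolution. It does not change model weights end to end; it performs AutoResearch at the memory-architecture layer. For long-horizon code agents, failure logs might come from compile errors, test failures, issue discussions, historical patches, or API mismatches. One research question: can the memory retrieval policy serve as an RL action space driven by task success rate or verifier reward, rather than relying only on LLM diagnostic heuristics?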
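
The "configuration as action space, with guarded rollback" loop can be sketched as follows. Everything here is an assumption made for illustration: the `RetrievalConfig` fields, the dictionary form of an action, and the `guarded_step` helper are hypothetical and are not EvolveMem's actual API. The sketch covers only the rollback half of the guard; exploration on stagnation is left out.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class RetrievalConfig:
    """Hypothetical retrieval knobs exposed as a structured action space."""
    top_k: int = 5
    rerank: bool = False
    fusion: str = "concat"


# An action is a partial update of the config, e.g. {"top_k": 8}.
Action = Dict[str, object]


def guarded_step(
    config: RetrievalConfig,
    action: Action,
    evaluate: Callable[[RetrievalConfig], float],
    best_score: float,
) -> Tuple[RetrievalConfig, float, bool]:
    """Apply a proposed change; keep it only if the score does not regress.

    Returns (config, score, accepted). On regression the previous config
    and score are returned unchanged, which is the rollback.
    """
    candidate = replace(config, **action)
    score = evaluate(candidate)
    if score < best_score:
        return config, best_score, False
    return candidate, score, True
```

In this framing, the proposer of `action` could be an LLM diagnosis module reading failure logs, or an RL policy rewarded by `evaluate`; the guard is the same in both cases.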

#6. PREPING: Building Agent Memory without Tasks

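Links: https://arxiv.org/abs/2605.13880 ; HF: https://huggingface.co/papers/2605.13880
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 11 May 2026
Categories: LLM Agent / Memory / Pretraining Data / Synthetic Data
One-sentence core contribution: Studies pre-task memory construction: before seeing any task in the target environment, can an agent build procedural memory through self-generated synthetic practice to ease cold start?

Brief comment: The core of PREPING is proposer-guided memory construction. A Proposer generates synthetic tasks based on a structured proposer memory, a Solver executes them, and the system filters the results and consolidates them into procedural memory. It stresses that uncontrolled synthetic interaction is redundant, infeasible, and pollutes memory, so "what to practice and what to store" matters more than "generating a bit more experience". This bears directly on "how agent pretraining data shapes capability": agent pretraining is not indiscriminate collection of interaction trajectories; it requires designing the practice distribution and the memory filtering. For code agents, the analogue is "practicing similar APIs, debugging patterns, and test-modification routines before facing an unknown repo".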

#7. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

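Links: https://arxiv.org/abs/2605.10912 ; HF: https://huggingface.co/papers/2605.10912 ; Repo: https://github.com/InternLM/WildClawBench/
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 11 May 2026
Categories: LLM Agent / Evaluation / Tool-use / Long-horizon Agent
One-sentence core contribution: Proposes a native-runtime long-horizon agent benchmark: 60 human-written, bilingual, multimodal tasks averaging about 8 minutes of wall-clock time and 20+ tool calls, run inside real CLI agent harnesses.

Brief comment: Compared with short tasks, mock APIs, and final-answer checks, WildClawBench is closer to real agent workflows: Docker environments, real tools, harnesses such as OpenClaw/Claude Code/Codex/Hermes Agent, and rule checks + environment-state audits + an LLM/VLM judge. For wenjun, its value lies less in benchmark scores than in the task design and grading protocol: how to evaluate the side effects of long trajectories, environment state, and semantic completion.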

#8. MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

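Links: https://arxiv.org/abs/2605.15128 ; HF: https://huggingface.co/papers/2605.15128
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 14 May 2026
Categories: LLM Agent / Memory / Multimodal / Evaluation
One-sentence core contribution: Proposes a visual-centric evaluation framework for multimodal agent memory that measures whether memory truly retains visual evidence along two dimensions: decisive visual evidence granularity and evidence usage.

Brief comment: Many so-called visual memory problems can be solved through caption or text-trace shortcuts. MemEye deliberately constructs scenarios that require fine-grained visual evidence and synthesis over changing state, and validates them with gates such as answerability, shortcut resistance, visual necessity, and reasoning structure. For agent memory research, the lesson is that evaluation must guard against shortcuts; otherwise memory appears effective when only the language prior is doing the work.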

#9. STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

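Links: https://arxiv.org/abs/2605.06527 ; HF: https://huggingface.co/papers/2605.06527
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 7 May 2026; still appears in today's HF recommendation list
Categories: LLM Agent / Memory / Continual Learning / Evaluation
One-sentence core contribution: Proposes the STALE benchmark, which evaluates whether an agent can recognize implicit conflicts and update stale memories, rather than only performing static fact retrieval.

Brief comment: Its definition of Implicit Conflict is important: a later observation does not explicitly negate an old memory, yet invalidates it. The evaluation covers State Resolution, Premise Resistance, and Implicit Policy Adaptation. For long-lived agents, this is a core failure mode of continual learning / memory update.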

#10. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

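Links: https://arxiv.org/abs/2605.15178 ; HF: https://huggingface.co/papers/2605.15178
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 14 May 2026
Categories: Model-based RL / World Model / Systems / Long-context Modeling
One-sentence core contribution: Proposes a 2.6B-parameter open-source world model targeting 720p, minute-scale video generation and precise 6-DoF camera control; its core components are Hybrid Linear Attention, Dual-Branch Camera Control, and two-stage generation.

Brief comment: This is not an LLM agent world model, but it is a useful reference for efficient long-horizon world modeling. Its Hybrid Linear Attention combines frame-wise Gated DeltaNet with softmax attention for efficient long-context modeling; the camera control and annotation pipeline show that a world model needs high-quality action labels. Mapped to LLM agents: if tools/APIs/environment state are treated as "action labels", then high-quality action-conditioned trajectory annotation is key to training an agent world model.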

#11. Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

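Links: https://arxiv.org/abs/2605.13301 ; HF: https://huggingface.co/papers/2605.13301
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 13 May 2026
Categories: Reasoning Model / Post-training RL / Test-time Scaling
One-sentence core contribution: Proposes a unified recipe for turning a post-trained reasoning backbone into an olympiad-level solver: reverse-perplexity curriculum SFT, two-stage RLVR/proof-level RL, and test-time scaling.

Brief comment: This paper leans toward reasoning-model scaling, but its ingredients are worth noting: curriculum SFT instills proof-search/self-checking, RLVR and proof-level RL amplify it, and test-time scaling comes last. It reports that the model supports stable reasoning trajectories of more than 100K tokens. The problem for long-horizon agents is analogous: first establish behavior with structured-trajectory SFT, then amplify capability with verifiable rewards and test-time search.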

#12. Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

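Links: https://arxiv.org/abs/2605.11458 ; HF: https://huggingface.co/papers/2605.11458
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 12 May 2026
Categories: Post-training RL / Reasoning / Self-distillation
One-sentence core contribution: Points out that in on-policy self-distillation, a teacher that always sees the full reference reasoning causes exposure mismatch, and argues for adaptively controlling teacher exposure according to the student's ability.

Brief comment: This paper pairs well with SDAR. The shared issue is that a stronger teacher is not necessarily better: the exposure of privileged information must match the student's current ability. For agent training, this means privileged environment traces, hidden tests, and solver trajectories should not all be handed to the teacher indiscriminately; otherwise the token targets may fall outside the agent's current policy distribution.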

#13. RewardHarness: Self-Evolving Agentic Post-Training

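Links: https://arxiv.org/abs/2605.08703 ; HF: https://huggingface.co/papers/2605.08703 ; Repo: https://github.com/TIGER-AI-Lab/RewardHarness
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 9 May 2026; still recommended in today's HF list
Categories: Post-training RL / Reward Model / Self-evolving Agent / Tool-use
One-sentence core contribution: Recasts reward modeling as the self-evolution of a context/tool/skill library, rather than large-scale annotation plus training a fixed reward model.

Brief comment: Although the application is instruction-guided image edits, the idea transfers: reward need not live entirely in weights; it can also live in an evolvable library of tools, skills, and criteria. For code agent RL, one can imagine a reward harness comprising a test generator, static analyzer, patch risk checks, user-intent rubrics, and so on, evolving continually with failure cases.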
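
As a rough illustration of "reward living in an evolvable library of checks" for the code-agent case imagined above, consider the sketch below. It is a thought experiment under stated assumptions, not the RewardHarness implementation: the `CheckLibrary` class, the string-based checks, and the weighted-average aggregation are all hypothetical, and real checks would run tests or static analysis rather than substring matching.

```python
from typing import Callable, Dict, Tuple

# A check maps a patch (as text) to a score in [0, 1].
Check = Callable[[str], float]


class CheckLibrary:
    """Hypothetical reward harness: a named, weighted, growable set of checks."""

    def __init__(self) -> None:
        self._checks: Dict[str, Tuple[float, Check]] = {}

    def register(self, name: str, check: Check, weight: float = 1.0) -> None:
        # Registering under an existing name replaces that check,
        # so criteria can be revised as well as added.
        self._checks[name] = (weight, check)

    def score(self, patch: str) -> float:
        # Weighted average over all registered checks.
        total_weight = sum(w for w, _ in self._checks.values())
        if total_weight == 0:
            return 0.0
        weighted = sum(w * check(patch) for w, check in self._checks.values())
        return weighted / total_weight


library = CheckLibrary()
library.register("non_empty", lambda p: 1.0 if p.strip() else 0.0)
library.register("no_todo", lambda p: 0.0 if "TODO" in p else 1.0)
```

When a failure case reveals a gap, for example patches that leave debug output behind, the library evolves by adding a check such as `library.register("no_debug_print", lambda p: 0.0 if "print(" in p else 1.0)`; no reward model is retrained.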

#14. Long Context Pre-Training with Lighthouse Attention

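Links: https://arxiv.org/abs/2605.06554 ; HF: https://huggingface.co/papers/2605.06554
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 7 May 2026; still recommended in today's HF list
Categories: Pretraining / Long Context / Systems / Context Compression
One-sentence core contribution: Proposes Lighthouse Attention, a training-only, symmetric, selection-based hierarchical attention that reduces the quadratic complexity of pretraining on extremely long sequences and can be removed late in training.

Brief comment: Relevant to wenjun's "general-purpose context compressor / long-context training mechanism" direction. It is not inference-time compression but a training-time adaptive compression/decompression wrapper. Worth checking whether it changes how the model forms long-range dependencies, and whether long-context generalization survives its removal.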

#15. Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

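Links: https://arxiv.org/abs/2605.14892 ; HF: https://huggingface.co/papers/2605.14892
Source/date: arXiv / Hugging Face Daily Papers, Submitted on 14 May 2026
Categories: LLM Agent / Multi-Agent / Self-evolving Agent / Survey
One-sentence core contribution: Proposes the LIFE progression (Lay foundation, Integrate collaboration, Find faults, Evolve self-improvement), surveying multi-agent collaboration, failure attribution, and self-evolution as links in a causal chain.

Brief comment: If the goal today is just to quickly fill in the terminology and landscape of multi-agent self-evolution, this survey is worth skimming. For research, the point is less the survey itself than its treatment of failure attribution as the necessary intermediate step between collaboration and self-evolution: without attribution, failures are hard to turn into structural improvements.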

#2. Top 3 papers for close reading today

Self-Distilled Agentic Reinforcement Learning

https://arxiv.org/abs/2605.15155

Reason: Most directly relevant to agentic RL / long-trajectory agent training; focus on how gated token-level self-distillation combines with the RL backbone.

Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

https://arxiv.org/abs/2605.14392

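Reason: Highly relevant to "eliciting self-evolving intelligence through environment design" and to model-based/environment-based RL. Focus on solve-verify asymmetry and the boundary between the environment generator and the verifier.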

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

https://arxiv.org/abs/2605.15198

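Reason: Very informative for latent-space reasoning. Focus on how explicit agentic intermediate states are compressed into discrete latent/action tokens, and whether this compromise transfers to language agents.

Alternates: if the focus today is system deployment, swap the third paper for Orchard; if it is long-lived agents, swap in EvolveMem.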

#3. Top 3 repos / models / datasets to follow today

ZJU-REAL/SDAR

https://github.com/ZJU-REAL/SDAR/

What to watch: whether agent rollout data, training scripts, and the teacher privileged-context design are released; a candidate near-term baseline for agentic RL training pipelines.

microsoft/Orchard

https://github.com/microsoft/Orchard/tree/readme-only

What to watch: how the agentic modeling infrastructure is abstracted, especially the environment service, sandbox lifecycle, harness, and rollout/reward interfaces.

InternLM/WildClawBench

https://github.com/InternLM/WildClawBench/

What to watch: the task format, Docker environments, and grading protocol of a real CLI agent benchmark; usable for testing the long-trajectory ability of Hermes/OpenClaw-style agents.

Also worth following:

ZiyuGuo99/ATLAS: https://github.com/ZiyuGuo99/ATLAS
aiming-lab/SimpleMem/EvolveMem: https://github.com/aiming-lab/SimpleMem/tree/main/EvolveMem
TIGER-AI-Lab/RewardHarness: https://github.com/TIGER-AI-Lab/RewardHarness

#4. Research opportunities / ideas

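#Idea 1: Connect SDAR-style self-distillation to the verifier trace of a code agent

SDAR uses a privileged teacher branch to give the agent token-level dense guidance. For a code agent, the privileged context could be designed as: failure types of hidden tests, coverage information, static analysis results, the repo dependency graph, local diffs from an oracle patch or human patch, and execution traces / stack traces.

Research question: which privileged contexts improve the agent policy during training without creating a dependence on information that is unavailable at inference time? A gated distillation ablation could compare full trace, partial trace, only failure category, and only execution state summary.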

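#Idea 2: Upgrade from "self-generated problems" to "self-generated verifiable repo environments"

The verifiable environment synthesis idea in Learning to Build the Environment suits code intelligence well. An LLM could generate small repos, bugs, tests, oracle patches, and graders, subject to solve-verify asymmetry: the model can write the verifier but cannot easily solve every instance directly in natural language.

Research questions: how do we detect that a self-generated environment is too easy or that the verifier leaks? How can the environment generator form a curriculum instead of producing ever more homogeneous tasks? Can we train a repo-world-model that predicts how tests/behavior change after a file is modified, then correct it with real execution?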

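#Idea 3: Treat the agent memory "retrieval policy" as a learnable action space

EvolveMem exposes the retrieval configuration as a structured action space, but it is driven mainly by an LLM diagnosis module. A further step is to make it RL-based: treat the memory retrieval/fusion/forgetting policy as part of the agent, with task success rate, conflict detection, rejection of stale memories, and long-term consistency as reward.

Research questions: should the memory policy switch dynamically by task phase, such as exploration, bug fixing, summarization, and pre-submission review? For code agents, which memories should persist long term and which should live only within an episode? Can STALE-style implicit conflict serve as a standard training/evaluation task for memory continual learning?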

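#5. Sources and access notes

The Hugging Face Daily Papers page was accessible and was used to discover today's candidate papers.
The arXiv API returned 429 Rate exceeded during this run; titles, dates, abstracts, and categories were instead verified entry by entry against the https://arxiv.org/abs/... pages.
GitHub project links were cross-discovered via DuckDuckGo and GitHub page search; for entries without a stable project page, no repo was invented.
X/Twitter was not used as a source of facts; to follow author discussions later, manually open the accounts of the relevant paper authors/labs to confirm.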