Daily Research 2026-05-13 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

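#2026-05-13 AI/LLM Latest Papers and Research Hotspots Briefing

Retrieved: morning of 2026-05-13 (Asia/Shanghai). Coverage is mainly arXiv and Hugging Face Papers from the last 24-48 hours. Because the arXiv API returned HTTP 429, this briefing instead cross-checks the arXiv recent HTML listings, paper detail pages, and Hugging Face Papers pages. X/Twitter was not used as a primary evidence source; publicly accessible paper pages, HF pages, and GitHub/HF dataset pages take priority.

#0. Today's take: Agent RL is shifting from "training a policy" to "training evolvable external state"

Today's most notable thread is highly concentrated: long-horizon agents are no longer just a matter of prompt and tool-use engineering. Memory, skills, execution traces, rubrics, and on-policy data evolution are all being brought in as objects of RL or meta-control. This fits closely with wenjun's interests in LLM Agents, model-based RL / Dreamer for LLM Agents, long-trajectory RL, and code intelligence.

One judgment that summarizes today's progress:

Future agent post-training may not be pure RLVR but a joint optimization of "policy + skill library + memory compressor + environment execution traces + rubrics/world-state". Whether the external state can be forked, replayed, and read by the reward function will determine whether self-evolving agents truly scale.

#1. Deep reads: the 3-5 most noteworthy items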

#1.1 Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

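Category: LLM Agent / Tool-use / Code Agent / Systems
Link: https://arxiv.org/abs/2605.10913
Source: arXiv cs.AI / cs.PL / cs.SE
Date: Submitted on 11 May 2026
One-line contribution: proposes Shepherd, which records agent-environment interaction as a Git-like typed execution trace, allows past states to be forked and replayed, and formalizes the core meta-agent operations in Lean.

Why it matters:

Shepherd addresses a low-level gap in long-horizon agent research. Many agent benchmarks today record only final results or unstructured logs, but agentic RL, failure recovery, and self-evolution require turning the environment state, the file system, tool calls, and the agent's internal decisions into trajectory objects that can be replayed and branched. The abstract claims that forking the agent process and file system is 5x faster than Docker and that it achieves more than 95% prompt-cache reuse, which suggests it is not merely a conceptual framework but an attempt at a runtime substrate for training and evaluation.

Relation to wenjun's directions:

For model-based RL for LLM Agents, this is key: Dreamer-style methods need a world model or at least replayable state transitions, and Shepherd's typed trace could serve as a "replay buffer / environment state graph for language agents".
For code-agent RL, this is key: the environment state of coding tasks depends heavily on the file system, test results, and command-line context, so a Git-like trace is better suited as training supervision than a natural-language summary.
Worth considering: whether one could train a "next-step value model / branch selection model" on Shepherd-like traces rather than training only on the final answer.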
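
To make the "Git-like typed trace with fork and replay" idea concrete, here is a minimal sketch. It is not Shepherd's API or data model: the names `TraceNode`, `Trace`, `step`, `fork`, and `replay`, and the step kinds such as `"tool_call"`, are hypothetical, and the file system is simulated as a small dictionary snapshot.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TraceNode:
    """One step in the trace: an action plus the state snapshot it produced."""
    parent: Optional[str]  # id of the parent node, None for the root
    kind: str              # hypothetical step type, e.g. "tool_call" or "file_write"
    action: str            # what the agent did
    files: tuple           # simulated file system: sorted (path, content) pairs

    @property
    def node_id(self) -> str:
        # Content-addressed id, similar in spirit to a Git commit hash.
        payload = json.dumps([self.parent, self.kind, self.action, self.files])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


@dataclass
class Trace:
    nodes: dict = field(default_factory=dict)     # node_id -> TraceNode
    branches: dict = field(default_factory=dict)  # branch name -> head node_id

    def init(self, branch="main"):
        root = TraceNode(None, "init", "start", ())
        self.nodes[root.node_id] = root
        self.branches[branch] = root.node_id
        return root.node_id

    def step(self, branch, kind, action, writes=None):
        """Append a step to a branch; `writes` maps path -> new content."""
        head = self.nodes[self.branches[branch]]
        files = dict(head.files)
        files.update(writes or {})
        node = TraceNode(head.node_id, kind, action, tuple(sorted(files.items())))
        self.nodes[node.node_id] = node
        self.branches[branch] = node.node_id
        return node.node_id

    def fork(self, node_id, new_branch):
        """Start a new branch from any past node without copying state."""
        self.branches[new_branch] = node_id

    def replay(self, node_id):
        """Return the list of nodes from the root to `node_id`."""
        path, cur = [], node_id
        while cur is not None:
            node = self.nodes[cur]
            path.append(node)
            cur = node.parent
        return list(reversed(path))


trace = Trace()
trace.init()
a = trace.step("main", "file_write", "write first fix", {"app.py": "v1"})
b = trace.step("main", "tool_call", "run pytest")
trace.fork(a, "retry")  # go back to the state before the test run
c = trace.step("retry", "file_write", "write second fix", {"app.py": "v2"})
print([n.action for n in trace.replay(c)])
```

Because nodes are immutable and addressed by content, a fork is only a new pointer to an existing node, which is the property that makes branch-level value estimation cheap to experiment with.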

#1.2 Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

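Category: LLM Agent / Post-training RL / Tool-use / Skill Learning
Link: https://arxiv.org/abs/2605.10923
Source: arXiv cs.LG / cs.CL; Hugging Face Papers
Date: Submitted on 11 May 2026
One-line contribution: argues that external skills in agentic RL should neither accumulate monotonically nor all be internalized in the end; they should be managed through a dynamic lifecycle in which different tasks and training stages activate different skill sets.

Why it matters:

This paper directly challenges an implicit assumption of many self-evolving agents: that accumulating more experience and skills is always better, or that everything should eventually be internalized into the policy. The authors argue that when parameter capacity is limited and the marginal benefit of skills is uneven, the optimal active skill set is non-monotonic, task-dependent, and stage-dependent. This matches real agent systems well: an ever-growing skill library brings retrieval noise, context pollution, erroneous transfer, and maintenance cost.

Relation to wenjun's directions:

For long-trajectory RL: skill selection itself can become a high-level action, with the low-level token/action policy acting only as the executor.
For how agent pretraining data shapes capabilities: the skill lifecycle can be viewed as a "mechanism for forming externalized capabilities", useful for studying which experience should enter the parameters and which should stay in the retrieval/tool layer.
For self-evolving code agents: a code agent's patch patterns, debug recipes, and tool scripts can all be treated as skills; the key question is not how many to store but when to activate, when to forget, and when to merge.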
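
The "activate, forget" part of this lifecycle can be sketched as follows. This is an illustration only, not the paper's algorithm: the `Skill` fields, the per-task utility estimates, the greedy utility-per-token rule, and the retirement thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    cost: int      # context tokens consumed when the skill is active
    utility: dict  # hypothetical utility estimates keyed by task type
    uses: int = 0
    wins: int = 0


def select_active(skills, task_type, budget):
    """Greedy choice by utility per token under a context budget."""
    scored = [(s.utility.get(task_type, 0.0), s) for s in skills]
    scored = [(u, s) for u, s in scored if u > 0]
    scored.sort(key=lambda pair: pair[0] / pair[1].cost, reverse=True)
    active, used = [], 0
    for _, s in scored:
        if used + s.cost <= budget:
            active.append(s.name)
            used += s.cost
    return active


def retire(skills, min_uses=5, min_win_rate=0.2):
    """Forget skills that were tried often enough and rarely helped."""
    return [s for s in skills
            if s.uses < min_uses or s.wins / s.uses >= min_win_rate]


library = [
    Skill("pytest_triage", 300, {"debug": 0.9}),
    Skill("ast_locate", 200, {"debug": 0.4, "refactor": 0.8}),
    Skill("dep_conflict", 400, {"build": 0.7}, uses=10, wins=1),
]
debug_set = select_active(library, "debug", budget=400)
refactor_set = select_active(library, "refactor", budget=400)
kept = [s.name for s in retire(library)]
print(debug_set, refactor_set, kept)
```

The active set differs by task type even though the library is the same, which is the task-dependence the entry above describes; a stage-dependent variant would let `utility` change over training.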

#1.3 RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

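Category: Post-training RL / LLM Agent / Evaluation / Research Agent
Link: https://arxiv.org/abs/2605.10899
Source: arXiv cs.CL / cs.LG
Date: Submitted on 11 May 2026
One-line contribution: targeting deep research agents, proposes using rubrics not only for scoring but as a memory-like objective that decomposes open-ended tasks into reusable policy-improvement signals.

Why it matters:

RLVR works for math, code, and verifiable QA, but the output of a deep research agent has no unique ground truth, and its trajectory spans multiple stages such as search, evidence assessment, and synthesis writing. The key point of RubricEM is that it elevates the rubric from an "evaluation sheet" to a "mechanism for policy decomposition and experience reuse", in an attempt to push past the boundary of verifiable rewards.

Relation to wenjun's directions:

For moving from instruction understanding to intent understanding: rubrics often explicitly encode the dimensions the user actually cares about, which is closer to intent than a single reward.
For Agent RL: the rubric can be treated as a latent task specification or a high-level value decomposition.
For research assistants and coding assistants: such tasks are hard to close the loop on with unit tests, so rubric-guided RL may be a practical path for post-training open-ended agents.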

#1.4 Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

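Category: LLM Agent / Memory / Context Compression / Long-horizon Agent
Link: https://arxiv.org/abs/2605.10870
Source: arXiv cs.AI
Date: Submitted on 11 May 2026
One-line contribution: redefines agent memory with a rate-distortion framework: the value of memory lies not in faithfully describing the past but in preserving, under a fixed budget, the historical distinctions that affect future decisions.

Why it matters:

This paper connects well with wenjun's interest in a "general-purpose context compressor". Most memory methods optimize relevance, salience, or summary quality, but these metrics are not necessarily equivalent to decision value. The key question in the rate-distortion view is: under a memory budget, which historical distinctions, if collapsed, would change future actions or values? This is closer to what an agent needs than "how closely the summary resembles the original text".

Relation to wenjun's directions:

For LLM Agent memory compression: one can train a decision-preserving compressor rather than a text-preserving summarizer.
For model-based RL: memory compression can correspond to belief state abstraction; the compression objective should preserve policy/value-relevant state.
For long-horizon coding tasks: not every log is worth remembering; what should be kept is the evidence that would change a debug decision.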
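
One way to write down a decision-preserving objective in rate-distortion form is sketched below. The notation is an assumption made for illustration and is not taken from the paper: h is the full interaction history, m is the compressed memory, q is the compressor, π is the agent's next-action distribution, and R is the memory (rate) budget.

```latex
\min_{q(m \mid h)} \;
\mathbb{E}_{h}\, \mathbb{E}_{m \sim q(\cdot \mid h)}
\Big[ D_{\mathrm{KL}}\big( \pi(\cdot \mid h) \,\|\, \pi(\cdot \mid m) \big) \Big]
\quad \text{subject to} \quad I(H; M) \le R
```

The distortion term is measured on decisions rather than on text: two histories that lead to the same next-action distribution can share one memory at zero cost, while a text-reconstruction loss would still penalize merging them.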

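#1.5 SEIF / G-Zero / Evolving-RL: three new threads in self-evolving RL

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

- Link: https://arxiv.org/abs/2605.07465

- Source: arXiv cs.CL; Hugging Face Papers

- Date: Submitted on 8 May 2026

- Core contribution: builds a closed-loop self-evolving RL framework for instruction following, in which the difficulty of the training instructions evolves with the model's own capability.

G-Zero: Self-Play for Open-Ended Generation from Zero Data

- Link: https://arxiv.org/abs/2605.09959

- Source: arXiv cs.LG / cs.AI / cs.CL; Hugging Face Papers

- Date: Submitted on 11 May 2026

- Core contribution: proposes a verifier-free co-evolution framework that uses Hint-δ to measure the predictive shift between unaided answers and answers conditioned on self-generated hints, driving the Proposer to keep finding the Generator's blind spots.

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

- Link: https://arxiv.org/abs/2605.10663

- Source: arXiv cs.AI

- Date: Submitted on 11 May 2026

- Core contribution: moves experience-driven self-evolving agents from a system-design problem to an end-to-end optimization problem, emphasizing how the foundation model's abstraction, generalization, and ICL abilities support the evolution of experience.

Why it matters:

Together these three papers show that self-evolving is no longer just "having the model generate more training data". The research now concerns difficulty adaptation, blind-spot discovery, experience abstraction, and how to form a training signal in open-ended tasks that have no verifier. G-Zero's Hint-δ deserves particular attention because it does not rely on an external judge; it uses the difference in the model's distribution/outputs before and after the hint to construct an intrinsic reward.

Relation to wenjun's directions:

For eliciting self-evolving intelligence through environment design: the key is to design a proposer/environment that keeps exposing blind spots.
For LLM model-based RL: Hint-δ can be read as an internal uncertainty/learnability signal, similar to surprise / learning progress in world models.
For code agents: the differences "before vs. after a test failure, before vs. after a hint, before vs. after trace repair" can serve as a self-improvement reward.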
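
As a rough illustration of a "predictive shift before vs. after a hint" signal, the sketch below scores prompts by the KL divergence between a hinted and an unaided answer distribution. This is an assumption about the general shape of such a signal, not G-Zero's definition of Hint-δ; the function names and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as dicts of probabilities."""
    return sum(pv * math.log((pv + eps) / (q.get(k, 0.0) + eps))
               for k, pv in p.items() if pv > 0)


def predictive_shift(unaided, hinted):
    """How far the hint moves the model's predictions: KL(hinted || unaided)."""
    return kl_divergence(hinted, unaided)


# Toy answer distributions for two prompts (hypothetical numbers).
prompts = {
    "easy": ({"A": 0.9, "B": 0.1}, {"A": 0.9, "B": 0.1}),  # hint changes nothing
    "blind_spot": ({"A": 0.7, "B": 0.3}, {"A": 0.1, "B": 0.9}),
}
scores = {name: predictive_shift(u, h) for name, (u, h) in prompts.items()}
most_informative = max(scores, key=scores.get)
print(most_informative)
```

A proposer rewarded by such a score would be pushed toward prompts where the model's own hint changes its answer, which is one reading of "finding blind spots" without an external judge.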

#2. Other papers and developments worth skimming

#2.1 TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

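Category: LLM Agent / Test-time Scaling / Multi-agent
Link: https://arxiv.org/abs/2605.10344
Source: arXiv cs.AI; Hugging Face Papers
Date: Submitted on 11 May 2026
One-line contribution: organizes test-time compute through multi-agent collaboration, addressing weak coordination among parallel reasoning trajectories and the noise involved in retaining and reusing historical information.
Brief comment: a useful reference at the intersection of test-time scaling and multi-agent reasoning; look in particular at how it decides which historical information should be retained and reused.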

#2.2 WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

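Category: LLM Agent / Evaluation / Long-horizon / Tool-use
Link: https://arxiv.org/abs/2605.10912
Source: arXiv cs.CL
Date: Submitted on 11 May 2026
One-line contribution: proposes long-horizon agent evaluation in real runtimes, so that agents are not evaluated only in synthetic sandboxes, on short tasks, against mock APIs, and with final-answer checks.
Brief comment: highly relevant to wenjun's agent research; the task environments, scoring protocol, and failure-type taxonomy deserve particular attention.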

#2.3 ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

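Category: LLM Agent / Tool-use / Evaluation / MCP
Link: https://arxiv.org/abs/2605.10787
Source: arXiv cs.AI / cs.SE
Date: Submitted on 11 May 2026
One-line contribution: builds a dynamic, interdependent, large-scale tool sandbox on the Model Context Protocol to evaluate agents on the "last mile" of commercial software automation.
Brief comment: MCP is becoming the interface standard for the tool ecosystem; the key value of ComplexMCP is that it puts the reality that "tools are not independent APIs" into a benchmark.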

#2.4 NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

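Category: LLM Agent / Research Agent / Memory / Personalization
Link: https://arxiv.org/abs/2605.10813
Source: arXiv cs.AI
Date: Submitted on 11 May 2026
One-line contribution: for personalized research automation, lets skills, memory, and policy co-evolve according to different researchers' preferences and resource configurations.
Brief comment: directly relevant to the product form of research assistants; watch how personalization signals enter the agent policy.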

#2.5 Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

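Category: LLM Agent / Multimodal Agent / Data Evolution / Tool-use
Link: https://arxiv.org/abs/2605.10832
Source: arXiv cs.CL
Date: Submitted on 11 May 2026
One-line contribution: identifies two bottlenecks in multimodal deep search: visual evidence cannot be repeatedly consumed by subsequent tools, and training data is mostly built by fixed offline pipelines. It proposes on-policy data evolution as a direction.
Brief comment: although it leans multimodal, "on-policy data evolution" offers ideas for general agent post-training.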

#2.6 Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

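Category: Continual Learning / Post-training / Forgetting
Link: https://arxiv.org/abs/2605.09608
Source: arXiv cs.LG / cs.IT; Hugging Face Papers
Date: Submitted on 10 May 2026
One-line contribution: explains, from a geometric-conflict perspective, when LLM continual post-training yields capability transfer and when it yields catastrophic forgetting, and attempts to give control criteria.
Brief comment: highly relevant to continual learning and efficient post-training; check in particular whether it provides a computable update-compatibility metric.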

#2.7 Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

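Category: Continual Learning / Continual Pretraining / Knowledge Acquisition
Link: https://arxiv.org/abs/2605.10640
Source: arXiv cs.CL / cs.AI
Date: Submitted on 11 May 2026
One-line contribution: builds a theoretical framework around continual factual knowledge acquisition and discusses the training dynamics behind CPT techniques such as replay.
Brief comment: a good entry point for mechanistic study of "how continual pretraining injects new facts without erasing old knowledge".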

#2.8 Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

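Category: Pretraining / Long Context / Data Curriculum
Link: https://arxiv.org/abs/2605.10544
Source: arXiv cs.CL
Date: Submitted on 11 May 2026
One-line contribution: points out that in packed long-context training the effective context of target tokens may still be short, and proposes EXACT to reweight tokens that have long effective context.
Brief comment: valuable for understanding how long-context ability forms: a longer window does not mean the supervision actually comes from long context.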
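
The gap between window length and effective context can be illustrated with a small sketch. It assumes attention is masked at document boundaries inside a packed sequence, so a token's effective context is its offset within its own document; the threshold and boost scheme below is a hypothetical stand-in and is not EXACT's actual weighting formula.

```python
def effective_context(doc_lengths):
    """Number of tokens visible before each target token when attention
    cannot cross document boundaries in a packed sequence."""
    ctx = []
    for n in doc_lengths:
        ctx.extend(range(n))
    return ctx


def exposure_weights(ctx, threshold, boost):
    """Upweight tokens with long effective context, then normalize to mean 1."""
    raw = [boost if c >= threshold else 1.0 for c in ctx]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


# Three documents packed into one 10-token window.
ctx = effective_context([3, 2, 5])
long_share = sum(c >= 3 for c in ctx) / len(ctx)
weights = exposure_weights(ctx, threshold=3, boost=4.0)
print(ctx, long_share, weights)
```

In this toy window of 10 tokens, no token sees more than 4 preceding tokens and only 2 of 10 see at least 3, so unweighted loss is dominated by short-context supervision even though the window is "long".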

#2.9 Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

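Category: Pretraining Mechanism / Optimization / Foundation Model Training
Link: https://arxiv.org/abs/2605.10504
Source: arXiv cs.CL
Date: Submitted on 11 May 2026
One-line contribution: finds that in GPT pretraining, upper-layer attention that forms sharp patterns too early hurts training, and that temporarily slowing the learning of upper-layer Q/K projections improves perplexity and downstream performance.
Brief comment: belongs to the study of how foundation-model capabilities form; worth checking whether the results hold consistently across model and data scales.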

#2.10 Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

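Category: Latent Reasoning / Efficient Inference / Architecture
Link: https://arxiv.org/abs/2605.07721
Source: arXiv cs.CL / cs.AI / cs.LG; Hugging Face Papers
Date: Submitted on 8 May 2026
One-line contribution: for embedding-space multi-step reasoning in looped/recurrent LLMs, addresses the problem of the KV cache growing linearly with reasoning depth.
Brief comment: directly related to latent-space reasoning; the key is moving test-time compute from explicit token chains to latent-space iteration while keeping memory under control.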

#2.11 Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

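Category: Memory / Test-time Adaptation / Latent Reasoning
Link: https://arxiv.org/abs/2605.10537
Source: arXiv cs.CL
Date: Submitted on 11 May 2026
One-line contribution: drawing on memory consolidation and cross-frequency coupling from neuroscience, proposes a Hierarchical Memory Module that converts transient experience into more stable structure at test time.
Brief comment: can be read alongside context compression, agent memory, and latent state updates.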

#2.12 Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

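Category: Post-training RL / RLVR / Reasoning
Link: https://arxiv.org/abs/2605.10781
Source: arXiv cs.LG / cs.CL; Hugging Face Papers
Date: Submitted on 11 May 2026
One-line contribution: points out that in self-distillation the teacher may suppress the student's own reasoning on paths where the student succeeds, and proposes reading the teacher signal in reverse to encourage exploration.
Brief comment: interesting for RLVR and reasoning exploration, especially for the "teacher-supported region / cold start / exploration" questions.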

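#2.13 DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

Category: LLM Agent / Knowledge Base / RL / Memory
Link: https://arxiv.org/abs/2605.10488
Source: arXiv cs.CL / cs.AI
Date: Submitted on 11 May 2026
One-line contribution: uses RL to improve agent-compiled knowledge bases, handling missing evidence, wrong or low-confidence claims, redundancy, and referential ambiguity.
Brief comment: can be seen as "RL-based cleaning and maintenance of external long-term memory"; suited to reading together with the agent memory lifecycle work.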

#2.14 Trajectory Supervision for Continual Tool-Use Learning in LLMs

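Category: Tool-use / Continual Learning / Agent Training Data
Link: https://arxiv.org/abs/2605.09734
Source: arXiv cs.SE / cs.AI / cs.MA
Date: Submitted on 10 May 2026
One-line contribution: studies whether, when continually learning new API domains, retaining tool-call trajectories helps more than retaining only the final artifacts.
Brief comment: highly relevant to "how agent pretraining data shapes capabilities": process supervision may be key to continual tool-use learning.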

#2.15 Instruction Adherence in Coding Agent Configuration Files

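Category: Code Agent / Evaluation / Prompt-Config Reliability
Link: https://arxiv.org/abs/2605.10039
Source: arXiv cs.SE / cs.CL
Date: Submitted on 11 May 2026
One-line contribution: systematically studies how structural variables of coding-agent configuration files such as CLAUDE.md, AGENTS.md, and Cursor Rules affect instruction adherence.
Brief comment: very close to day-to-day agent engineering; it could add a "configuration adherence" dimension to repo-level coding agent benchmarks.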

#2.16 Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

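Category: Code Agent / Failure Recovery / Software Engineering
Link: https://arxiv.org/abs/2605.08717
Source: arXiv cs.SE / cs.AI
Date: Submitted on 9 May 2026
One-line contribution: proposes PROBE, which converts heterogeneous runtime evidence into bounded, grounded recovery guidance for the next attempt after a software engineering agent fails.
Brief comment: suited to reading with Shepherd traces and trajectory supervision: failure recovery fundamentally requires structured trajectories and actionable diagnosis.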

#2.17 Semantic Voting: Execution-Grounded Consensus for LLM Code Generation

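Category: Code Intelligence / Code Generation / Evaluation
Link: https://arxiv.org/abs/2605.08680
Source: arXiv cs.SE / cs.AI / cs.LG
Date: Submitted on 9 May 2026
One-line contribution: compares 18 candidate-selection configurations for code generation, studying the relative contributions of components such as textual voting, execution consistency, and semantic voting.
Brief comment: practical for test-time scaling for code; watch how execution-grounded consensus reduces selection errors in oracle-free settings.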
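
A generic form of execution-grounded voting can be sketched as follows: run every candidate on the same inputs, group candidates by their observed outputs, and return one member of the largest group. This is a common baseline written for illustration, not the paper's specific configurations; the entry-point name `solve` and the helper names are assumptions.

```python
from collections import defaultdict


def run_candidate(src, inputs, entry="solve"):
    """Execute one candidate and return its output signature, or None if it
    cannot even be loaded. Untrusted code should run in a sandbox; plain
    exec() is used here only to keep the sketch short."""
    namespace = {}
    try:
        exec(src, namespace)
        fn = namespace[entry]
    except Exception:
        return None
    outputs = []
    for x in inputs:
        try:
            outputs.append(repr(fn(x)))
        except Exception as exc:
            outputs.append("error:" + type(exc).__name__)
    return tuple(outputs)


def semantic_vote(candidates, inputs):
    """Return the index of a candidate from the largest behavior cluster."""
    clusters = defaultdict(list)
    for i, src in enumerate(candidates):
        signature = run_candidate(src, inputs)
        if signature is not None:
            clusters[signature].append(i)
    if not clusters:
        return None
    return max(clusters.values(), key=len)[0]


candidates = [
    "def solve(x): return x * 2",
    "def solve(x):\n    return x + x",  # different text, same behavior
    "def solve(x): return x ** 2",
    "def solve(x) return x",            # syntax error, discarded
]
winner = semantic_vote(candidates, inputs=[1, 3])
print(winner)
```

The first two candidates differ textually but agree on every input, so they vote together; textual majority voting would treat all four strings as distinct and have nothing to choose between.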

#2.18 Compute Where it Counts: Self Optimizing Language Models

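Category: Efficient Inference / Test-time Compute / Systems
Link: https://arxiv.org/abs/2605.10875
Source: arXiv cs.LG / cs.CL
Date: Submitted on 11 May 2026
One-line contribution: studies dynamically allocating the compute budget by token difficulty during autoregressive decoding, instead of using fixed compression or fixed compute for every token.
Brief comment: can be combined with latent reasoning / test-time scaling: not every token deserves the same amount of "thinking".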

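#3. Today's 3 papers most worth a deep read

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

https://arxiv.org/abs/2605.10913

Why read closely: it may be the infrastructure layer that long-horizon Agent RL needs: forkable, replayable, and usable for structured training.

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

https://arxiv.org/abs/2605.10923

Why read closely: it directly discusses dynamic management of external skills in agentic RL and is highly relevant to self-evolving code/research agents.

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

https://arxiv.org/abs/2605.10870

Why read closely: it gives agent memory / context compression a better-matched optimization objective: preserving decision-relevant distinctions rather than producing polished summaries.

Alternates:

RubricEM: if you want to look at RL for open-ended tasks today.
Memory-Efficient Looped Transformer: if you want to look at latent-space reasoning architectures today.
Where Does Long-Context Supervision Actually Go?: if you want to look at long-context training mechanisms for foundation models today.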

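#4. Today's 3 repos / models / datasets most worth following

Note: not every paper found today provides a stable public repo; the resources below prioritize those that are publicly accessible, related to agent/code/data, and recently active.

open-thoughts/AgentTrove

- Link: https://huggingface.co/datasets/open-thoughts/AgentTrove

- Source: Hugging Face Datasets trending

- Status: the page shows about 1.7M entries, recently updated

- Why follow: a large-scale agent traces dataset, suited to studying agent pretraining data, trajectory supervision, and the formation of tool-use capability.

lambda/hermes-agent-reasoning-traces

- Link: https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces

- Source: Hugging Face Datasets trending

- Status: about 14.7k entries, still attracting attention recently

- Why follow: related to real agent reasoning traces; can be used to analyze which steps in long trajectories are worth supervising, compressing, or using as units for RL credit assignment.

ADSKAILab/Zero-To-CAD-1m

- Link: https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-1m

- Source: Hugging Face Datasets trending; related to the BenchCAD direction

- Status: about 1M in scale, recently updated

- Why follow: programmatic CAD / code-like generation data, a useful reference for generalization research on the point that "code intelligence is not just Python and algorithm problems".

Also worth a look:

Qwen/Qwen3.6-27B: https://huggingface.co/Qwen/Qwen3.6-27B
Qwen/Qwen3.6-35B-A3B: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
TuringEnterprises/Open-MM-RL: https://huggingface.co/datasets/TuringEnterprises/Open-MM-RL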

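#5. Research opportunities / ideas

#Idea 1: Use Shepherd-like execution traces as the model-based RL state graph for LLM Agents

Most existing LLM Agent RL still treats the trajectory as a text sequence or prompt history. One could instead represent the agent's execution as a typed trace graph: nodes are environment states, file-system snapshots, tool returns, test results, and memory updates; edges are agent actions. Then train:

state abstraction / belief compressor: compress to a decision-relevant latent state;
value model: predict the probability of eventual success after a given trace prefix;
branch policy: choose which branch to fork and keep exploring;
world model: predict the effect of a tool/action on the environment state.

This would be closer to Dreamer-style agent learning than pure token-level RL.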
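
As a placeholder for the branch policy, one simple option is a UCB1-style rule over trace prefixes: prefer prefixes with a high estimated value, plus a bonus for prefixes that have rarely been forked. This swaps in a standard bandit heuristic for illustration and is not taken from any of the papers above; the dictionary fields `id`, `value`, and `visits` are hypothetical, and `value` would come from the value model in the list above.

```python
import math


def ucb_score(value, visits, total_visits, c=1.0):
    """Estimated value plus an exploration bonus; unvisited prefixes go first."""
    if visits == 0:
        return float("inf")
    return value + c * math.sqrt(math.log(total_visits) / visits)


def choose_fork(prefixes, c=1.0):
    """Pick the trace prefix to fork next."""
    total = sum(p["visits"] for p in prefixes) or 1
    best = max(prefixes,
               key=lambda p: ucb_score(p["value"], p["visits"], total, c))
    return best["id"]


prefixes = [
    {"id": "after_first_patch", "value": 0.6, "visits": 10},
    {"id": "after_repro_script", "value": 0.5, "visits": 1},
]
explore_choice = choose_fork(prefixes, c=1.0)  # bonus favors the rarely forked prefix
exploit_choice = choose_fork(prefixes, c=0.0)  # pure value estimate
print(explore_choice, exploit_choice)
```

A learned branch policy would replace the fixed bonus, but this rule gives a baseline against which a trained value model and branch selector can be compared.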

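#Idea 2: Decision-preserving memory compressor for code/research agents

Building on the rate-distortion memory idea, design a compressor whose objective is not ROUGE or summary quality but "whether the agent's next-step decision distribution and task success rate are preserved after compression". Experiments could start with code debugging or research QA:

Input: long logs, search results, test-failure records, historical patches;
Compression target: retain the information that would change the next debug/search/edit decision;
Training signal: next-action KL, value preservation, final task success rate.

This connects three lines of work: context compression, agent memory, and long-trajectory RL.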
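
The next-action KL signal can be prototyped before any training by using it as a selection criterion: among candidate compressions, keep the shortest one whose induced next-action distribution stays close to the one induced by the full context. The sketch below uses a toy keyword-based policy as a stand-in for a real agent; `toy_policy`, `pick_compression`, and the tolerance value are all hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) over discrete next-action distributions."""
    return sum(pv * math.log((pv + eps) / (q.get(a, 0.0) + eps))
               for a, pv in p.items() if pv > 0)


def toy_policy(context):
    """Stand-in for an agent policy: maps a context to action probabilities."""
    if "AssertionError" in context:
        return {"edit_code": 0.8, "rerun_tests": 0.2}
    return {"edit_code": 0.2, "rerun_tests": 0.8}


def pick_compression(full, candidates, policy, tol):
    """Shortest candidate whose next-action KL to the full context is <= tol;
    fall back to the full context if none qualifies."""
    target = policy(full)
    ok = [c for c in candidates if kl(target, policy(c)) <= tol]
    return min(ok, key=len) if ok else full


full_log = "collected 42 items ... AssertionError in test_parse: expected 3, got 4 ..."
candidates = [
    "tests failed",                 # fluent summary, loses the decisive evidence
    "AssertionError in test_parse", # keeps what changes the next action
]
chosen = pick_compression(full_log, candidates, toy_policy, tol=0.05)
print(chosen)
```

The shorter summary is rejected even though it reads well, because it flips the policy toward rerunning tests; this is the difference between decision preservation and summary quality that the idea relies on.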

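#Idea 3: Skill lifecycle for self-evolving code agent

Break a code agent's reusable experience into skills, such as "pytest failure triage", "dependency conflict resolution", "AST localization", "concurrency bug reproduction", and "benchmark profiling". The research question is not how to add skills without limit, but:

When does a skill enter the library?
When should skills be merged or abstracted?
When should a skill be marked as outdated or dangerous?
When should a skill be internalized into model parameters, and when should it be kept as an external tool or document?

A repo-level benchmark could be used for long-term training, evaluating how growth of the skill library affects success rate, context pollution, and inference cost.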

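#6. Quick reading route

If you only have 30 minutes today:

First read the Shepherd abstract and system design diagram, and judge whether its trace schema could be used in your own agent experiments.
Then read the problem formulation of Dynamic Skill Lifecycle, paying attention to the non-monotonicity assumption on the active skill set.
Finally read Rate-Distortion Agent Memory and rewrite it as your own memory compressor objective.

If you have 2 hours today:

Additionally read RubricEM, G-Zero, and Trajectory Supervision for Continual Tool-Use Learning;
Unify them in one framework: how external state is recorded, compressed, scored, selected, and evolved.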
