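Daily Research 2026-07-01 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

#2026-07-01 AI/LLM Latest Papers and Research Hotspots Briefing

Retrieval time: around 08:00 on 2026-07-01 (Asia/Shanghai)
Main coverage: Hugging Face Daily Papers for 2026-06-29 and 2026-06-30, arXiv links, paper project pages, and the GitHub / Hugging Face APIs.
Note: the arXiv API repeatedly timed out or returned HTTP 429 during this retrieval, so cross-checking relied mainly on the arXiv metadata shown on Hugging Face Daily Papers pages, project pages, and GitHub repositories. X/Twitter was not used as a primary source (no stable login or API environment); HF, project pages, GitHub, and arXiv were used instead.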

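#0. Today's take: agent research is shifting from "can complete tasks" to "controllable, attributable, stoppable, and evaluable over long-horizon interaction"

What most deserves wenjun's attention from the past 24 to 48 hours is not any single benchmark but a set of highly consistent signals:

Agent horizon scaling is starting to be treated as a source of capability on par with parameter scaling: Agents-A1 states its thesis outright as "Scaling the Horizon, Not the Parameters".
Long-horizon agent evaluation is approaching real workflows: OSWorld2.0, TUA-Bench, and SWE-Together all move tasks away from static, short-chain, one-shot prompts toward long horizons, interactive user sessions, and real terminal / GUI / repo workflows.
Credit assignment in agentic RL is being refined down to tool calls, multi-agent connections, and interaction depth: TACO, GBC, and ProMSA each try, in different settings, to answer "which tool call or which sub-agent actually produced the final gain".
"When not to act" is becoming part of agent capability: Agentic Abstention models stopping, refusing, and continuing to gather information as a sequential decision problem, which matters for both long-trajectory RL and real deployment.
Training systems are also addressing the bottlenecks of long-horizon post-training: AsyncOPD, asynchronous pipeline pretraining, and KV/context compression work all point to one issue. If agent / reasoning training needs many rollouts, then system throughput, staleness, and context budget become the core constraints.

#1. Key papers / developments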

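#1. Agentic Abstention: Do Agents Know When to Stop Instead of Act?

Category: LLM Agent / Evaluation / Safety / Sequential Decision
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-27
Links: arXiv; project page
One-sentence contribution: Frames an agent's choice of whether to keep acting, answer, or stop/abstain as a multi-step sequential decision problem rather than traditional single-turn abstention.

Why it matters:

Long-horizon agents often fail not because "one tool call was wrong" but because they "keep acting when the problem is underspecified, the environment is unreachable, and further exploration only adds cost". The value of this work is that it extends abstention from QA settings to interactive environments such as web shopping, terminal, and QA, and evaluates 13 LLM-as-agent systems and 2 agent harnesses.

Relevance to wenjun's research:

If you work on LLM model-based RL / Dreamer for Agent, abstention can be treated as a termination action under world-model uncertainty. It corresponds not to ordinary refusal but to the question "does the value of further exploration exceed its cost?". This fits well with value of information, uncertainty-aware planning, and termination policies in long-horizon RL.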

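#2. Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

Category: LLM Agent / Long-horizon Agent / Post-training / Agent Data
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-29
Links: arXiv; project page; GitHub; HF model
One-sentence contribution: Introduces Agents-A1, a 35B MoE agentic model trained with long-horizon knowledge-action-verifier trajectories averaging 45K tokens, three-stage training, and multi-teacher domain-routed on-policy distillation; the authors claim that horizon scaling brings it close to trillion-parameter agent performance.

Why it matters:

This paper makes "agent capability comes from longer, more heterogeneous, more verifiable action trajectories" its core scaling axis. Rather than simply enlarging the base model, it enlarges the agent horizon, the task domains, and the coverage of verifier feedback.

Relevance to wenjun's research:

This is closely tied to your interest in "how agent pretraining data shapes capability": if the average trajectory length reaches 45K tokens, the model may be learning not just instruction following but long-range structure across observation, action, and verifier steps. Questions worth asking: can the latent state / subgoals in such long trajectories be modeled explicitly? Can Dreamer-style latent dynamics compress long trajectories and support planning?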

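#3. TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

Category: LLM Agent / Tool-use / Post-training RL / Credit Assignment / GRPO
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-29
Links: arXiv
One-sentence contribution: Proposes TACO, a GRPO variant for code-tool / multimodal agents that assigns credit to tool calls through two advantage channels, in particular using a Differential Answer-Probe Reward to judge whether a given tool call really contributed to the final answer.

Why it matters:

The core difficulty for tool-calling agents is that outcome reward is too coarse: a correct final answer does not mean every tool call was useful, and a wrong final answer may still include correct intermediate tool calls. TACO tries to estimate tool contribution without a judge, using the difference in model prediction with and without the tool call.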
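The with/without comparison can be sketched as a leave-one-out probe. This is only an illustration of the general idea, not TACO's actual reward: `Probe`, `tool_call_credits`, and `toy_probe` are hypothetical names, and `toy_probe` stands in for a real model call that would return the probability of the reference answer.

```python
from typing import Callable, Dict, List

# Hypothetical probe: maps a trajectory (list of step strings) to the
# model's probability of producing the reference answer.
Probe = Callable[[List[str]], float]

def tool_call_credits(steps: List[str], tool_idx: List[int], probe: Probe) -> Dict[int, float]:
    """Leave-one-out credit: probe(with the call) - probe(without the call)."""
    base = probe(steps)
    credits = {}
    for i in tool_idx:
        ablated = steps[:i] + steps[i + 1:]  # drop exactly one tool call
        credits[i] = base - probe(ablated)
    return credits

def toy_probe(steps: List[str]) -> float:
    # Stand-in for a model call: only the "crop" result helps the answer.
    return 0.9 if any("crop" in s for s in steps) else 0.3

steps = ["question", "tool:crop(image)", "tool:ocr(image)", "draft answer"]
credits = tool_call_credits(steps, [1, 2], toy_probe)
```

In this toy setup the crop call receives positive credit and the OCR call receives zero, even though both sit in a trajectory that ends with a correct answer.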

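Relevance to wenjun's research:

This is critical for long-trajectory agent RL: to train a self-evolving code agent or tool-use agent, credit assignment cannot rest only on the final pass/fail. TACO's approach could be extended to code agents, for example by estimating the marginal contribution of each grep / read_file / test / patch with a probe or a counterfactual rollout.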

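#4. OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Category: LLM Agent / Computer-use / Evaluation / Long-horizon
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-28
Links: arXiv; project page; GitHub; HF dataset
One-sentence contribution: Releases 108 long-horizon real-world computer-use workflows; the abstract reports a median human completion time of about 1.6 hours and an average of about 318 tool calls for Claude Opus 4.7, far longer than OSWorld 1.0.

Why it matters:

OSWorld2.0 pushes computer-use agents from short tasks to real workflows: cross-source reasoning, dynamic environments, streaming interaction, and long-lived state maintenance. It is one of the long-horizon benchmarks closest to real desktop / browser / file workflows.

Relevance to wenjun's research:

For model-based RL for LLM agents, tasks like those in OSWorld2.0 are a good testbed for world models and latent state tracking: the state is partially observable, the action space is complex, costs are high, and rewards are sparse, which is exactly where model prediction, subgoal decomposition, and memory compression are needed.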

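#5. SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Category: Code Agent / Evaluation / Interactive Coding / User Simulation
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-29
Links: arXiv; project page; GitHub
One-sentence contribution: Reconstructs 109 repo-level multi-turn tasks from real user-agent coding sessions and uses a reactive LLM-based user simulator to replay user intent and feedback.

Why it matters:

Existing SWE-bench-style tasks usually hand the agent the full requirement at once, whereas a real coding assistant works with a user who keeps adding constraints, correcting misunderstandings, and exploring collaboratively. SWE-Together brings "interactive intent understanding" into evaluation.

Relevance to wenjun's research:

This maps directly onto "moving from instruction understanding to intent understanding". If you study agentic RL for code agents, SWE-Together offers a feedback loop closer to real product use than static patches: both final repo correctness and the number of corrective feedback turns can serve as reward / cost signals.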

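#6. TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Category: LLM Agent / Terminal-use / Evaluation / Tool-use
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-26
Links: arXiv; GitHub
One-sentence contribution: Proposes 120 general-purpose terminal-use tasks covering document editing, email management, live-web information retrieval, and PhD-level science / engineering workflows.

Brief comment:

The terminal is the shared interface of coding agents and research agents. TUA-Bench matters because it drops the narrow definition "terminal = writing code / running tests" in favor of broader shell-native digital work. It is especially relevant for systems like Hermes / OpenHands / Claude Code.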

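#7. AsyncOPD: How Stale Can On-Policy Distillation Be?

Category: Post-training RL / On-policy Distillation / Systems / Reasoning Models
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-23
Links: arXiv; GitHub
One-sentence contribution: Systematically studies the effect of stale-policy data in asynchronous on-policy distillation, focusing on how teacher feedback / KL cache affect training stability once rollout and learner are decoupled.

Brief comment:

Rollout cost in LLM reasoning / agent post-training keeps rising, so asynchrony is almost unavoidable. This paper is valuable for the long-trajectory RL system implementation you care about: when a rollout takes tens of thousands of tokens, on-policy "freshness" becomes the core trade-off between throughput and stability.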
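The freshness trade-off can be made concrete with a staleness bound on a rollout buffer. The sketch below is a generic illustration, not AsyncOPD's mechanism: `Rollout`, `select_fresh`, `wasted_fraction`, and the `max_lag` parameter are all assumed names, and the numbers are made up.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    policy_version: int  # learner version that generated this rollout
    tokens: int          # rollout length in tokens

def select_fresh(buffer: List[Rollout], learner_version: int, max_lag: int) -> List[Rollout]:
    """Keep rollouts whose generating policy is at most max_lag versions old."""
    return [r for r in buffer if learner_version - r.policy_version <= max_lag]

def wasted_fraction(buffer: List[Rollout], learner_version: int, max_lag: int) -> float:
    """Share of generated tokens discarded by the staleness bound."""
    total = sum(r.tokens for r in buffer)
    kept = sum(r.tokens for r in select_fresh(buffer, learner_version, max_lag))
    return 1.0 - kept / total if total else 0.0

buffer = [Rollout(10, 40_000), Rollout(8, 30_000), Rollout(5, 30_000)]
fresh = select_fresh(buffer, learner_version=10, max_lag=2)
waste = wasted_fraction(buffer, learner_version=10, max_lag=2)
```

A tight `max_lag` keeps training closer to on-policy but throws away more generated tokens; a loose one keeps throughput high at the price of staler data.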

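#8. One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

Category: Systems / Pretraining / Pipeline Parallelism / Training Mechanism
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-29
Links: arXiv
One-sentence contribution: Challenges the common assumption that a one-step gradient delay destabilizes large-scale asynchronous pipeline LLM pretraining, arguing that the degradation depends strongly on optimizer choice rather than staleness being inherently unworkable.

Brief comment:

If it holds, this matters for foundation-model training systems: asynchronous pipelines reduce bubbles but have long been held back by staleness concerns. Together with AsyncOPD, it suggests that when training large models / agents, the stale signal introduced by system asynchrony is not categorically unusable; what matters is the design of the optimizer and of caching / delay.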
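As a reminder of what a one-step gradient delay means mechanically, here is a toy sketch on a one-dimensional quadratic. It is not the paper's setup or optimizer; `sgd_with_delay` and its parameters are hypothetical, and the example only shows plain gradient descent using a gradient computed at an older iterate.

```python
def sgd_with_delay(lr: float, delay: int, steps: int = 200, w0: float = 5.0) -> float:
    """Minimize f(w) = 0.5 * w**2 using gradients computed `delay` steps ago."""
    history = [w0]  # past iterates; the gradient of f at w is w itself
    w = w0
    for _ in range(steps):
        # gradient evaluated at the iterate from `delay` steps back
        stale_w = history[max(0, len(history) - 1 - delay)]
        w = w - lr * stale_w
        history.append(w)
    return w

fresh_small = sgd_with_delay(lr=0.1, delay=0)
stale_small = sgd_with_delay(lr=0.1, delay=1)
fresh_large = sgd_with_delay(lr=1.5, delay=0)
stale_large = sgd_with_delay(lr=1.5, delay=1)
```

With the small step size both runs converge to the minimum; with the large step size the fresh update still converges while the delayed one diverges. In this toy, whether a one-step delay hurts depends on the update rule and its hyperparameters, not on the delay alone.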

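#9. ReFreeKV: Towards Threshold-Free KV Cache Compression

Category: Context Compression / Inference / Long-context / Systems
Source and date: Hugging Face Daily Papers, 2026-06-30; the arXiv page shows ID 2502.16886, HF lists the publication date as 2026-06-26; GitHub updated 2026-06-30
Links: arXiv; GitHub
One-sentence contribution: Targets KV cache pruning with a threshold-free compression objective, avoiding the need to pre-tune a KV budget threshold for each input / domain.

Brief comment:

The bottleneck for long-context agents is not only whether the model can reason but also how past observations / tool results are compressed. ReFreeKV's direction is worth connecting to the idea of a "general-purpose context compressor": a truly deployable agent cannot depend on hand-tuned compression thresholds per task.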

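#10. GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

Category: GUI Agent / Agent Data / Weak Supervision / Pretraining Data
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-29
Links: arXiv; GitHub
One-sentence contribution: Uses large-scale unannotated screenshots for weakly supervised GUI agent training, reducing dependence on expensive human GUI action annotation.

Brief comment:

This relates directly to "how agent pretraining data shapes capability". Unlike web text, GUI data offers no action traces that can simply be crawled; GUICrafter's core question is how to extract a trainable grounding / action prior from static screenshots.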

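#11. Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Category: Multimodal Agent / GUI Agent / Video Understanding / Tool-use
Source and date: Hugging Face Daily Papers, 2026-06-30; arXiv publication date 2026-06-28
Links: arXiv; project page; GitHub
One-sentence contribution: Proposes VG-GUIBench to evaluate whether MLLM-based GUI agents can learn procedures from video tutorials and transfer them to long-horizon GUI tasks, and proposes TASKER for general keyframe extraction.

Brief comment:

This paper moves "video understanding" from QA toward procedural skill learning: the agent does not just answer questions about a video but turns a tutorial into executable GUI behavior. For agent pretraining data, public video tutorials may be a low-cost source of behavioral priors.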

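#12. Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs

Category: Latent Reasoning / Interpretability / Representation
Source and date: Hugging Face Daily Papers, 2026-06-29; arXiv publication date 2026-05-07; trending on HF that day
Links: arXiv; project page; GitHub
One-sentence contribution: Attempts to formalize the properties of "latent thoughts / thought representation" in LLMs with four axioms.

Brief comment:

Although not published within the past 24 hours, it was trending in the June 29 HF Daily Papers and is closely related to wenjun's recent interest in latent-space reasoning. Focus on how it defines the separability, composability, or decodability of thought representations; this may give a stricter operational definition for the question "does latent-space reasoning really exist?".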

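#13. AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents

Category: LLM Agent / Continual Learning / Long-horizon / World Modeling
Source and date: Hugging Face Daily Papers, 2026-06-29; arXiv publication date 2026-05-29; GitHub still being updated recently
Links: arXiv; project page; GitHub
One-sentence contribution: Builds open-ended long-horizon text games to evaluate exploration, world-knowledge acquisition, episodic memory, and long-term planning in test-time continual learning agents.

Brief comment:

This is a good prototype environment for model-based RL for LLM agents: the text world is procedurally generated, has dynamics, supports continual learning, and exposes memory / planning failures. Compared with GUI / OSWorld environments, text games make fast algorithm iteration easier.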

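#14. GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems

Category: Multi-Agent / Credit Assignment / Agent Optimization
Source and date: Hugging Face Daily Papers, 2026-06-29; arXiv publication date 2026-06-26
Links: arXiv; GitHub / AgentChord
One-sentence contribution: Models a multi-agent system as a computation graph and uses gradient-based connection weights to estimate, at the token level, how much each agent's output influences downstream results.

Brief comment:

Like TACO, this paper is fundamentally about credit assignment, except that the object shifts from tool calls to multi-agent connections. For self-evolving agent systems, the key is not stacking more roles but knowing which role / which piece of communication is actually useful.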

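#15. ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

Category: Multimodal Agent / Tool-use / RL / Search
Source and date: Hugging Face Daily Papers, 2026-06-29; arXiv publication date 2026-06-26
Links: arXiv; project page; GitHub
One-sentence contribution: Proposes a progressive multimodal search agent that iteratively chooses among image search, text search, and stop, trained with a sequence-level RL objective (TN-GSPO) that accounts for generation length and tool-interaction depth.

Brief comment:

Notably, it folds the stop action, tool budget, deduplication, and interaction depth into the training objective. This shares the concern of Agentic Abstention and TACO: an agent must not only "know how to call tools" but also know when to call them, how many times, and what each call gains.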
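A control loop with an explicit stop action, a tool budget, and query deduplication can be sketched as below. This is an inference-time toy, not ProMSA's training objective or code: `run_search_agent`, `toy_policy`, and the two lambda tools are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

def run_search_agent(
    policy: Callable[[List[str]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_calls: int,
) -> Tuple[List[str], int]:
    """Iterate until the policy emits 'stop' or the tool budget is used up."""
    evidence: List[str] = []
    seen = set()
    calls = 0
    while calls < max_calls:
        action, query = policy(evidence)
        if action == "stop":
            break
        calls += 1  # every non-stop decision consumes budget
        if (action, query) in seen:
            continue  # duplicate query: skip the tool call itself
        seen.add((action, query))
        evidence.append(tools[action](query))
    return evidence, calls

def toy_policy(evidence: List[str]) -> Tuple[str, str]:
    # Hypothetical fixed plan: one image search, one text search, then stop.
    plan = [("image_search", "landmark photo"), ("text_search", "landmark height")]
    return plan[len(evidence)] if len(evidence) < len(plan) else ("stop", "")

tools = {"image_search": lambda q: f"img:{q}", "text_search": lambda q: f"txt:{q}"}
evidence, calls = run_search_agent(toy_policy, tools, max_calls=5)
```

Charging duplicates against the budget guarantees termination even when the policy keeps repeating itself, and the final `calls` count is the quantity a depth-aware or cost-aware reward could penalize.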

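#16. How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring

Category: Code Agent / Repository Understanding / Static Analysis / Evaluation
Source and date: Hugging Face Daily Papers, 2026-06-29; arXiv publication date 2026-06-25
Links: arXiv
One-sentence contribution: Studies how injecting lightweight static-analysis structure (call graphs, inheritance hierarchies, configuration dependencies, and so on) into a code agent as deterministic anchors affects localization, trajectories, and run-to-run stability.

Brief comment:

This addresses a real pain point of code agents: they often wander randomly via keyword search. Static structure may not make the model "smarter", but it can make exploration more stable and reproducible and reduce trajectory variance. For agentic RL, it may also substantially reduce reward variance.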
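To give a feel for what such lightweight static structure looks like, the sketch below extracts a rough call graph and class hierarchy from Python source with the standard `ast` module. It is an assumption about the kind of anchor meant, not the paper's tooling: `static_anchors` and `SAMPLE` are hypothetical, and only direct calls by bare name are captured (method calls such as `obj.f()` are ignored).

```python
import ast
from collections import defaultdict

def static_anchors(source: str):
    """Return (call graph, class bases) extracted from Python source text."""
    tree = ast.parse(source)
    calls = defaultdict(set)
    bases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # record base classes that are written as plain names
            bases[node.name] = [b.id for b in node.bases if isinstance(b, ast.Name)]
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # collect every bare-name call inside the function body
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    calls[node.name].add(sub.func.id)
    return {k: sorted(v) for k, v in calls.items()}, bases

SAMPLE = '''
class Base: pass
class Repo(Base): pass
def load(path): return parse(read(path))
def main(): load("cfg.toml")
'''
call_graph, class_bases = static_anchors(SAMPLE)
```

Because the output is deterministic for a given repository snapshot, it can be serialized into the agent's context once and reused across runs, which is the property that makes it an anchor rather than another sampled search result.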

#17. To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair

Category: Code Agent / Program Repair / Tool-use / Cost-aware Evaluation
Source and date: Hugging Face Daily Papers, 2026-06-29; arXiv publication date 2026-06-25
Links: arXiv
One-sentence contribution: Analyzes the execution cost and benefit of the generate-run-revise paradigm in LLM program repair, covering SWE-bench leaderboard traces and 3,000 end-to-end repair attempts.

Brief comment:

Instructive for code agent RL: running tests is not a free action, and reward should account for both correctness and execution cost. A follow-up could model "whether to run tests, which test to run, and when to stop running" as a cost-aware policy.

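#18. Simplified Sparse Attention via Gist Tokens

Category: Context Compression / Sparse Attention / Continued Pretraining
Source and date: Hugging Face Daily Papers, 2026-06-29; arXiv ID 2604.20920; HF shows a recent update
Links: arXiv
One-sentence contribution: Uses continued pretraining so that gist tokens compress chunk information; at inference time the gist tokens first select the top-k chunks, whose original tokens are then expanded, giving sparse attention without architectural changes.

Brief comment:

This is closer to a "trainable context compressor" than plain KV pruning: the compression tokens are learned inside the model. For long-horizon agents, observation history / tool outputs could be compressed chunk by chunk into gist tokens and then expanded dynamically according to the current query.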
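The select-then-expand step can be sketched in a few lines. This is a toy with hand-written vectors, whereas in the paper the gist representations are learned inside the model; `select_chunks`, `expand`, and the example data are hypothetical.

```python
from typing import List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def select_chunks(query: Sequence[float], gists: List[Sequence[float]], k: int) -> List[int]:
    """Indices of the k chunks whose gist vector scores highest against the query."""
    ranked = sorted(range(len(gists)), key=lambda i: dot(query, gists[i]), reverse=True)
    return sorted(ranked[:k])  # restore original order before expansion

def expand(chunks: List[List[str]], selected: List[int]) -> List[str]:
    """Concatenate the raw tokens of the selected chunks only."""
    return [tok for i in selected for tok in chunks[i]]

chunks = [["ls", "src/"], ["pytest", "FAILED"], ["git", "log"]]
gists = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.8]]  # one toy gist vector per chunk
query = [1.0, 0.5]
selected = select_chunks(query, gists, k=2)
context = expand(chunks, selected)
```

Only the selected chunks are expanded back to raw tokens, so attention cost scales with `k` rather than with the full history length.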

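#19. PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation

Category: World Model / Embodied AI / RL / Evaluation
Source and date: Hugging Face Daily Papers, 2026-06-29; arXiv publication date 2026-06-26
Links: arXiv; project page; GitHub
One-sentence contribution: To address the instability of video world models under physical interaction, proposes a training framework that focuses on physics-informative regions to improve physical consistency.

Brief comment:

Although it leans toward embodied / video world models, it offers a useful analogy for Dreamer for LLM Agent: world-model errors tend to concentrate in "task-critical interaction regions" rather than in the average pixel / average token. A world model for LLM agents should likewise focus on action-relevant state instead of reproducing the full observation.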

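#20. Qwen-Image-2.0-RL Technical Report

Category: Post-training RL / RLHF / OPD / Generative Model
Source and date: Hugging Face Daily Papers, 2026-06-29; arXiv publication date 2026-06-25
Links: arXiv
One-sentence contribution: Applies RLHF and on-policy distillation to the Qwen-Image-2.0 diffusion model, builds task-specific composite reward models, and uses GRPO-based RL training to improve image quality and instruction following.

Brief comment:

Although it is not an LLM agent, the reward model + GRPO + OPD pipeline is a useful reference for post-training mechanisms, especially how it combines multiple reward dimensions and keeps RL from damaging pretrained knowledge.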

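#2. Top 3 papers to read closely today

Agents-A1: Scaling the Horizon, Not the Parameters

Why read closely: It speaks most directly to "how agent capability forms": how long trajectories, verifier outcomes, and multi-teacher on-policy distillation can substitute for pure parameter scaling.

TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

Why read closely: If you want to do long-trajectory agent RL, tool-call-level credit assignment is unavoidable; TACO offers a judge-free, counterfactual-style approach.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

Why read closely: Bringing "stop / abstain" into the agent's sequential decision is central to both long-horizon deployment and model-based planning.

Alternatives: if you would rather read about code agents today, swap the third paper for SWE-Together; if you prefer benchmarks / environments, swap it for OSWorld2.0.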

#3. Top 3 repos / models / datasets to follow up today

InternScience/Agents-A1

- GitHub: https://github.com/InternScience/Agents-A1

- HF model: https://huggingface.co/InternScience/Agents-A1

- Follow-up: whether the long-horizon agentic trajectories, domain-routed teachers, and on-policy distillation recipe are released in enough detail to reproduce.

xlang-ai/OSWorld-V2 + HF datasets

- GitHub: https://github.com/xlang-ai/OSWorld-V2

- HF tasks: https://huggingface.co/datasets/xlangai/osworld_v2_tasks

- Follow-up: whether the 108 real long-horizon workflows can be used to train / evaluate model-based agents, memory compression, and cost-aware tool policies.

Togetherbench/SWE-Together

- GitHub: https://github.com/Togetherbench/SWE-Together

- Project page: https://togetherbench.com

- Follow-up: the replay mechanism for real user-agent coding sessions, the reactive user simulator, and the corrective-feedback-turns metric.

Also worth watching:

TUA-Bench: https://github.com/facebookresearch/TUA-Bench
AsyncOPD: https://github.com/furiosa-ai/async-opd
ReFreeKV: https://github.com/Patrick-Ni/ReFreeKV

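#4. Research opportunities / Ideas

#Idea 1: Turn Agentic Abstention into an uncertainty-aware termination policy for model-based agents

Existing agents usually learn only "what to do next" and rarely learn systematically "whether to keep going". One could build a Dreamer-style / model-based RL framework in which:

the latent state tracks task progress, environment reachability, and information gaps;
the world model predicts the observation distribution and success probability after further search / execution;
the policy chooses among answer / abstain / gather-more-info / act;
the reward jointly penalizes wrong answers, the cost of useless tool calls, and giving up too early.

Initial experiments could use Agentic Abstention plus subsets of OSWorld / TUA-Bench.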
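The decision rule behind such a termination policy can be sketched as a one-step expected-utility comparison. Everything here is an assumption for illustration: `Belief`, `answer_value`, `choose_action`, and the utility scale are hypothetical, and in a real model-based agent the two probabilities would come from the learned world model rather than being hand-set.

```python
from dataclasses import dataclass

@dataclass
class Belief:
    p_now: float    # estimated P(correct) if the agent answers immediately
    p_after: float  # estimated P(correct) after one more information-gathering step

def answer_value(p: float, wrong_penalty: float) -> float:
    """Expected utility of answering: +1 if right, -wrong_penalty if wrong."""
    return p - (1.0 - p) * wrong_penalty

def choose_action(b: Belief, step_cost: float, wrong_penalty: float = 1.0) -> str:
    """Pick answer / gather / abstain by one-step-lookahead expected utility."""
    abstain = 0.0  # abstaining is the zero-utility reference point
    utilities = {
        "answer": answer_value(b.p_now, wrong_penalty),
        # after gathering, the agent can still answer or abstain, whichever is better
        "gather": max(answer_value(b.p_after, wrong_penalty), abstain) - step_cost,
        "abstain": abstain,
    }
    return max(utilities, key=utilities.get)

decisions = [
    choose_action(Belief(0.90, 0.92), step_cost=0.05),  # little left to learn
    choose_action(Belief(0.40, 0.80), step_cost=0.05),  # information is valuable
    choose_action(Belief(0.20, 0.25), step_cost=0.05),  # task looks infeasible
]
```

The gap between the "gather" and "answer" utilities is a simple value-of-information estimate; abstention wins only when neither answering now nor one more step is expected to pay for itself.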

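#Idea 2: Tool-call credit assignment for code agents: extend TACO to test / search / patch granularity

TACO's Differential Answer-Probe Reward could be transferred to code agents:

build a counterfactual for each read_file / grep / run_test / patch: without this call, does the model's probability of localizing / fixing the bug at the next step drop?
use pass/fail plus probe prediction to densify the sparse reward;
fold execution cost into the advantage to avoid "blindly running tests".

This could be combined with "To Run or Not to Run" and "Deterministic Anchoring" into a cost-aware SWE-agent training / evaluation framework.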
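One way to combine the three ingredients above is a cost-adjusted trajectory return followed by group-relative normalization, as in GRPO-style training. The sketch is an assumption, not a method from any of the cited papers: the `COST` table, the weights, and both function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List

# Hypothetical per-call costs in arbitrary units.
COST = {"grep": 0.01, "read_file": 0.01, "run_test": 0.20, "patch": 0.05}

def trajectory_return(passed: bool, calls: List[str], probe_gains: List[float],
                      cost_weight: float = 1.0, probe_weight: float = 0.5) -> float:
    """Outcome reward + weighted probe gains - weighted execution cost."""
    outcome = 1.0 if passed else 0.0
    cost = sum(COST[c] for c in calls)
    return outcome + probe_weight * sum(probe_gains) - cost_weight * cost

def group_advantages(returns: List[float]) -> List[float]:
    """Normalize returns within a group of rollouts for the same task."""
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + 1e-8) for r in returns]

# Two rollouts that both pass; the second runs the test suite five times.
frugal = trajectory_return(True, ["grep", "patch", "run_test"], [])
wasteful = trajectory_return(True, ["run_test"] * 5 + ["patch"], [])
advantages = group_advantages([frugal, wasteful])
```

Both rollouts succeed, yet the frugal one gets the positive advantage, which is the behavior needed to discourage blind test execution.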

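#Idea 3: "Gist-state" compression for long-trajectory agents: combine Simplified Sparse Attention with agent memory

The context of a long-horizon agent is not ordinary long text but a structured log of actions, observations, and results. Dedicated gist tokens could be trained so that:

each episode chunk is compressed into a few latent / gist tokens;
the current query / subgoal expands only the relevant chunks;
the expansion choice is trained with RL or verifier feedback;
the performance-cost curve is evaluated on OSWorld2.0 / TUA-Bench / AgentOdyssey.

This line connects three of wenjun's interests: general-purpose context compressors, latent-space reasoning, and long-horizon agent RL.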

#5. Quick index table

Title	Category	Date	Links	One-line summary
Agentic Abstention	LLM Agent / Evaluation	2026-06-27 / HF 06-30	arXiv	Frames whether an agent keeps acting or stops as a sequential decision problem.
Agents-A1	LLM Agent / Long-horizon	2026-06-29 / HF 06-30	arXiv / GitHub	Trains a 35B agent with horizon scaling and long-horizon trajectories.
TACO	Tool-use / RL	2026-06-29 / HF 06-30	arXiv	GRPO variant with judge-free credit assignment for tool calls.
OSWorld2.0	Computer-use / Evaluation	2026-06-28 / HF 06-30	arXiv / GitHub	108 long-horizon real-world computer-use workflows.
SWE-Together	Code Agent / Evaluation	2026-06-29 / HF 06-30	arXiv / GitHub	Interactive coding benchmark reconstructed from real user-agent coding sessions.
TUA-Bench	Terminal Agent	2026-06-26 / HF 06-30	arXiv / GitHub	General-purpose terminal-use agent benchmark.
AsyncOPD	Post-training / Systems	2026-06-23 / HF 06-30	arXiv / GitHub	Systematic study of stale-policy data in asynchronous OPD.
ReFreeKV	Context Compression	2026-06-26 / HF 06-30	arXiv / GitHub	Threshold-free KV cache compression.
GUICrafter	GUI Agent / Data	2026-06-29 / HF 06-30	arXiv / GitHub	Weakly supervised GUI agent training from large-scale unannotated screenshots.
Formalizing Latent Thoughts	Latent Reasoning	2026-05-07 / HF 06-29	arXiv / Project	Axioms formalizing latent thought representation in LLMs.
AgentOdyssey	Continual Agent / World Modeling	2026-05-29 / HF 06-29	arXiv / GitHub	Open-ended long-horizon text games for evaluating test-time continual learning agents.
Deterministic Anchoring	Code Agent	2026-06-25 / HF 06-29	arXiv	Static structure as deterministic anchors for code agents.
To Run or Not to Run	Code Agent / Tool Cost	2026-06-25 / HF 06-29	arXiv	Cost-benefit analysis of running tests in program repair.
Simplified Sparse Attention via Gist Tokens	Context Compression	HF 06-29	arXiv	Gist tokens learned through continued pretraining enable sparse attention.

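#6. Tracking suggestions for tomorrow

Check whether Agents-A1 releases its training data format or trajectory examples; if so, analyze the trajectory schema first.
Check whether OSWorld2.0 / SWE-Together already provide baseline logs that can be used to study failure modes and reward design.
Keep tracking whether TACO releases code; if not, first reproduce its DAPR idea in a toy code-tool environment.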