SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? Paper • 2410.03859 • Published Oct 4, 2024 • 1
OpenThoughts: Data Recipes for Reasoning Models Paper • 2506.04178 • Published Jun 4, 2025 • 55
LongCodeBench: Evaluating Coding LLMs at 1M Context Windows Paper • 2505.07897 • Published May 12, 2025
EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities Paper • 2409.16165 • Published Sep 24, 2024
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration Paper • 2412.15701 • Published Dec 20, 2024
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces Paper • 2601.11868 • Published Jan 17, 2026 • 36
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents Paper • 2602.22124 • Published Feb 25, 2026 • 2
SWE-chat: Coding Agent Interactions From Real Users in the Wild Paper • 2604.20779 • Published 23 days ago • 14
ProgramBench: Can Language Models Rebuild Programs From Scratch? Paper • 2605.03546 • Published 10 days ago • 3
SWE-smith Collection SWE-smith datasets of task instances for different programming languages • 9 items • Updated Mar 9 • 3