Abstract
A benchmark for AI-assisted formal verification is presented, involving the translation of property-based tests from Python into Lean specifications using a multi-agent LLM pipeline.
We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.
Community
FVSpec: Real-World Property-Based Tests as Lean Challenges
We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.
Neat paper. Connecting real-world Python property-based tests to Lean specifications is a clever way to bridge the gap between everyday coding and formal verification. The scale of the dataset seems impressive given how tricky translating imperative code to dependently-typed logic usually is.
How did the three-agent pipeline handle the translation of Python semantics into Lean without losing the original intent of the tests?
I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/5c9523ec-5535-4592-9d7e-95588263aa09
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PBT-Bench: Benchmarking AI Agents on Property-Based Testing (2026)
- Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems (2026)
- VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation (2026)
- Benchmarking Testing in Automated Theorem Proving (2026)
- Agentic Proving for Program Verification (2026)
- A Rust-to-Lean Verification Pipeline with AI Provers: An Experience Report (2026)
- Agentic Separation Logic Specification Synthesis (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2606.01008 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 2
GaloisInc/fvspec-fv
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper