arxiv:2606.01008

FVSpec: Real-World Property-Based Tests as Lean Challenges

Published on May 31

· Submitted by

Quinn Dougherty on Jun 15

Galois Inc

Upvote

Authors:

Quinn Dougherty ,

Abstract

A benchmark for AI-assisted formal verification is presented, involving the translation of property-based tests from Python into Lean specifications using a multi-agent LLM pipeline.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.

View arXiv page View PDF Project page Add to collection

Community

quinn-dougherty

Paper author Paper submitter 1 day ago

FVSpec: Real-World Property-Based Tests as Lean Challenges

noahml

about 10 hours ago

Neat paper. Connecting real-world Python property-based tests to Lean specifications is a clever way to bridge the gap between everyday coding and formal verification. The scale of the dataset seems impressive given how tricky translating imperative code to dependently-typed logic usually is.

How did the three-agent pipeline handle the translation of Python semantics into Lean without losing the original intent of the tests?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/5c9523ec-5535-4592-9d7e-95588263aa09

librarian-bot

about 4 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2606.01008

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.01008 in a model README.md to link it from this page.

Datasets citing this paper 2

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.01008 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.