Fastdedup: Rust-based dataset deduplication — benchmarks on FineWeb sample-10BT

Hey everyone,

I’ve been working on a Rust CLI for dataset deduplication called fastdedup and wanted to share some benchmark results since dataset prep tooling comes up a lot here.

Ran both exact and fuzzy dedup against standard Python baselines on FineWeb sample-10BT (14.8M records, 29GB) on a single Hetzner CCX43 instance.


Exact dedup vs DuckDB + SHA-256

| | fastdedup | DuckDB |
| --- | --- | --- |
| Wall clock | 2:55 | 7:55 |
| Peak RAM | 688 MB | 21.9 GB |
| CPU cores | 1 | 4+ |
| Records/sec | ~85,000 | ~31,000 |
| Duplicates removed | 51,392 | 51,392 |

2.7x faster with 32x less peak RAM, on a single core vs DuckDB's 4+. Duplicate counts match exactly.
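
For context on what's being measured: exact dedup in both tools boils down to hashing each record's text and keeping the first occurrence per digest. A minimal Python sketch of the idea (illustrative only, not fastdedup's actual Rust internals):

```python
import hashlib

def exact_dedup(records, field="text"):
    """Keep only the first record seen per SHA-256 digest of `field`."""
    seen = set()
    kept = []
    for rec in records:
        digest = hashlib.sha256(rec[field].encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept

docs = [{"text": "hello"}, {"text": "world"}, {"text": "hello"}]
print(len(exact_dedup(docs)))  # 2
```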


Fuzzy dedup (MinHash + LSH) vs datatrove

| | fastdedup | datatrove |
| --- | --- | --- |
| Wall clock | 36:44 | 3h50m+ (stage 1 only, terminated) |
| Peak RAM | 23 GB | 1.1 GB |
| Completed | Y | N |
| Duplicates removed | 105,044 (0.7%) | n/a |

datatrove’s stage 1 alone ran for 3h50m before I terminated it. The bottleneck turned out to be spaCy word tokenization on every document before shingling; fastdedup uses character n-grams directly, which is significantly cheaper.
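
To make that concrete: character shingling needs no tokenizer at all. A toy sketch of the technique (n=5 is an arbitrary choice here; the shingle size fastdedup actually uses isn't stated above):

```python
def char_shingles(text, n=5):
    """Set of overlapping character n-grams; no word tokenizer required."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a or b else 1.0

s1 = char_shingles("the quick brown fox")
s2 = char_shingles("the quick brown cat")
print(round(jaccard(s1, s2), 2))  # 0.67 -- only the last word differs
```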

On the RAM trade-off: datatrove streams to disk keeping RAM low at the cost of heavy I/O between stages. fastdedup holds the LSH index in memory for speed. Different trade-offs — worth being transparent about.
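
For anyone unfamiliar, an in-memory LSH index is essentially a dict of band-hash buckets: documents that agree on any whole band of their MinHash signature become candidate pairs. A toy version of the banding scheme (the 16x8 split is my assumption, consistent with a 128-hash signature, not a confirmed fastdedup detail):

```python
from collections import defaultdict

def candidate_pairs(signatures, bands=16, rows=8):
    """Banded LSH: docs sharing any band of their MinHash signature
    (here 128 hashes split as 16 bands x 8 rows) become candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs
```

Keeping `buckets` resident in RAM is exactly why the index grows with the corpus, and why spilling it to disk (the roadmap item below) is the natural mitigation.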


Caveats

  • Fuzzy dedup requires ~23GB RAM at this scale — cloud instance workload, not a laptop workload

  • datatrove is designed for distributed execution and tasks=1 is not its optimal configuration — this benchmark represents how someone might run it locally

  • Tiered storage to spill the LSH index to disk is on the roadmap


There’s a small Gradio demo on HF Spaces if you want to test on a small file: [Spaces Link]

Full benchmark methodology, scripts, and raw results are in the repo: [GitHub link]

Happy to answer questions about the implementation or methodology.


This is an impressive benchmark! Rust for data processing is definitely the way to go.

Some questions:

  1. Are you planning to support fuzzy dedup with different similarity thresholds?
  2. Could this be integrated with the Datatrove library for a unified pipeline?
  3. Any plans for distributed processing across multiple machines?

The 2.7x speedup with 32x less RAM is huge for large-scale dataset preprocessing. Have you considered publishing this as a HuggingFace Space or integrating with the datasets library?

Great work!


Thanks so much! Really appreciate the kind words.

Re: your questions:

1. Fuzzy dedup with different similarity thresholds?

Already supported! The --threshold flag lets you set any Jaccard similarity threshold (0.0-1.0). For example:

```bash
fastdedup fuzzy --input data.parquet --output clean.parquet \
    --threshold 0.85 --field text
```

You can also configure the number of MinHash permutations with --num-hashes (default 128). Higher values = more accurate but slower.
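
The accuracy/speed trade-off comes from signature length: the fraction of positions where two signatures agree estimates their Jaccard similarity, with standard error shrinking as 1/sqrt(num_hashes). A toy Python sketch of the estimator (the xor-mask hash family here is illustrative only, not what fastdedup's Rust core uses):

```python
import random

def minhash_signature(shingles, num_hashes=128, seed=42):
    """One minimum per hash function; more hashes -> lower-variance
    Jaccard estimate at proportionally higher cost."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min((hash(s) ^ m) & 0xFFFFFFFFFFFFFFFF for s in shingles)
            for m in masks]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions approximates true Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

So doubling --num-hashes doubles the hashing work but only cuts estimator noise by about 1/sqrt(2).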

2. Datatrove integration?

Interesting idea! Right now fastdedup is a standalone CLI, but I’ve considered adding Python bindings: the Rust core could be exposed via PyO3 for direct import from Python.

3. Distributed processing?

Not currently planned. The design philosophy is single-machine efficiency.

That said, the architecture could support it:

  • Exact dedup: shardable by hash prefix

  • Fuzzy dedup: could shard the LSH index across machines
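
For the exact case, the hash-prefix idea is embarrassingly parallel, since identical texts always produce identical digests and therefore land on the same shard. A hypothetical routing sketch (shard count and routing scheme are assumptions, not fastdedup features):

```python
import hashlib

def shard_for(text, num_shards=8):
    """Route a record to a shard by its digest prefix. Exact duplicates
    hash identically, so each shard can dedup independently with no
    cross-shard communication."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return digest[0] % num_shards
```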

But honestly, single-machine performance might be enough for most use cases. What scale are you thinking?

Re: HuggingFace integration:

The demo Space is already live: https://huggingface.co/spaces/wapplewhite4/fastdedup-demo

A proper Python package would probably be cleaner long-term, and that’s most likely what will happen next.

What would be most useful for your workflow? Happy to prioritize based on real use cases! Thanks again.
