Common Crawl Pipeline Creator

1. URL Filtering

Filters samples based on their URLs.

use datatrove's built-in lists of banned URLs and words

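A URL filter of this kind can be sketched with the standard library alone. This is a minimal illustration, not datatrove's implementation: the function name `keep_url` and the two tiny block lists are hypothetical stand-ins for the integrated lists.

```python
import re
from urllib.parse import urlparse

# Hypothetical stand-ins for the integrated block lists.
BANNED_DOMAINS = {"blocked.example"}
BANNED_WORDS = {"casino"}


def keep_url(url, banned_domains=BANNED_DOMAINS, banned_words=BANNED_WORDS):
    """Return True if the URL passes both the domain and the word check."""
    host = (urlparse(url).hostname or "").lower()
    # Reject a banned domain and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in banned_domains):
        return False
    # Split the whole URL into lowercase alphanumeric words.
    words = set(re.split(r"[^a-z0-9]+", url.lower()))
    return not (words & banned_words)
```

Matching on whole words rather than substrings avoids rejecting URLs that merely contain a banned word inside a longer, unrelated word.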

2. Text Extraction

Uses the Trafilatura extractor.

prefer less text but correct extraction

trafilatura's deduplicate option

3. Language Filtering

Uses the fastText language identification models.

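The decision step of a language filter can be sketched as follows, assuming the classifier returns (label, score) pairs with fastText-style `__label__xx` labels. The function name `keep_language`, the default language tuple, and the 0.65 threshold are assumptions chosen for illustration; the model call itself is left out.

```python
def keep_language(predictions, languages=("en",), threshold=0.65):
    """Keep a document if its top predicted language is allowed and confident.

    predictions: list of (label, score) pairs, e.g. ("__label__en", 0.98).
    The threshold default is an illustrative assumption.
    """
    if not predictions:
        return False
    # Pick the highest-scoring prediction.
    label, score = max(predictions, key=lambda p: p[1])
    lang = label.replace("__label__", "", 1)
    return lang in languages and score >= threshold
```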

4. Gopher Filtering (repetitions)

Uses the Gopher text repetition filters.

language: the language used by the tokenizer

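One family of repetition checks measures how much of a document consists of duplicated lines. The sketch below computes two such ratios with the standard library; it is a simplified illustration rather than the full filter set, and the function names and threshold defaults are assumptions.

```python
from collections import Counter


def duplicate_line_stats(text):
    """Return (fraction of duplicate lines, fraction of characters in them)."""
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return 0.0, 0.0
    counts = Counter(lines)
    # Every occurrence after the first one counts as a duplicate.
    dup_lines = sum(c - 1 for c in counts.values())
    dup_chars = sum(len(ln) * (c - 1) for ln, c in counts.items())
    total_chars = sum(len(ln) for ln in lines)
    return dup_lines / len(lines), dup_chars / total_chars


def passes_repetition_filter(text, max_dup_line_frac=0.3, max_dup_char_frac=0.2):
    """Illustrative thresholds; tune them for the corpus at hand."""
    line_frac, char_frac = duplicate_line_stats(text)
    return line_frac <= max_dup_line_frac and char_frac <= max_dup_char_frac
```

The same counting approach extends to paragraphs or word n-grams by changing how the text is split.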

8. PII Removal

Replaces email addresses and IP addresses in the document text.

Replace email addresses

Replace IP addresses

by default, only public IP addresses (the ones that count as PII) are replaced
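The replacement logic can be sketched with `re` and `ipaddress` from the standard library. This is a minimal sketch under stated assumptions: the regular expressions are deliberately simple, only IPv4 is handled, and the function name and replacement strings are hypothetical placeholders.

```python
import ipaddress
import re

# Simplified patterns for illustration; real-world matching is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def replace_pii(text, email_repl="email@example.com", ip_repl="0.0.0.0",
                only_public_ips=True):
    """Replace email addresses and (by default only public) IPv4 addresses."""
    text = EMAIL_RE.sub(email_repl, text)

    def _swap_ip(match):
        try:
            ip = ipaddress.ip_address(match.group(0))
        except ValueError:
            # Not a valid address (e.g. an octet above 255): leave untouched.
            return match.group(0)
        if only_public_ips and not ip.is_global:
            # Private, loopback and reserved ranges identify no one.
            return match.group(0)
        return ip_repl

    return IPV4_RE.sub(_swap_ip, text)
```

Private ranges such as 192.168.0.0/16 are left in place because they are reused on countless networks and do not point to a specific person or host.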

7. Custom Filters

Uses the FineWeb custom text filters.

6. C4 Filters

Uses the C4 text size and content filters.

disable to apply the filters to each sentence instead of to each line

language: the language used by the tokenizer

remove wikipedia style citations from the text

remove lines without terminal punctuation marks

drop documents that contain 'lorem ipsum'

drop lines mentioning 'javascript'

drop documents containing a curly brace ('{')

drop lines containing any of the policy phrases (e.g. 'terms of use', 'use cookies')
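The content rules listed above can be sketched as a single line-by-line pass. This is an illustration with all options enabled, not the library's implementation: the function name `c4_clean`, the citation pattern, and the phrase list are assumptions, and the size limits are left out.

```python
import re

# Illustrative patterns and phrases; the real lists may differ.
CITATION_RE = re.compile(r"\[\d*\]|\[edit\]|\[citation needed\]")
END_PUNCTUATION = (".", "!", "?", '"', "'")
POLICY_PHRASES = ("terms of use", "privacy policy", "cookie policy",
                  "use cookies", "uses cookies", "use of cookies")


def c4_clean(text):
    """Return the filtered text, or None if the whole document is dropped."""
    # Document-level rules: placeholder text and curly braces.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None
    kept = []
    for line in text.split("\n"):
        # Remove Wikipedia-style citation markers before the other checks.
        line = CITATION_RE.sub("", line).strip()
        low = line.lower()
        if not line.endswith(END_PUNCTUATION):
            continue  # no terminal punctuation mark
        if "javascript" in low:
            continue
        if any(phrase in low for phrase in POLICY_PHRASES):
            continue
        kept.append(line)
    return "\n".join(kept) if kept else None
```

To work on sentences instead of lines, the `text.split("\n")` call would be swapped for a sentence splitter.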

5. Gopher Filtering (quality)

Uses the Gopher text quality filters.

language: the language used by the tokenizer

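Quality heuristics of this kind are simple statistics over words and lines. The sketch below is a simplified, whitespace-tokenized illustration; the function name and every default threshold are assumptions modeled on the values usually quoted for the Gopher rules, not settings taken from this page.

```python
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
BULLETS = ("-", "*", "\u2022")
ELLIPSES = ("...", "\u2026")


def passes_quality_filter(text, min_words=50, max_words=100_000,
                          min_avg_len=3, max_avg_len=10,
                          max_symbol_ratio=0.1, max_bullet_ratio=0.9,
                          max_ellipsis_ratio=0.3, min_alpha_ratio=0.8,
                          min_stop_words=2):
    """All defaults are illustrative assumptions."""
    words = text.split()
    n = len(words)
    if n == 0 or not (min_words <= n <= max_words):
        return False
    avg_len = sum(len(w) for w in words) / n
    if not (min_avg_len <= avg_len <= max_avg_len):
        return False
    # Symbol-to-word ratios for hash signs and ellipses.
    n_ellipsis = sum(text.count(e) for e in ELLIPSES)
    if text.count("#") / n > max_symbol_ratio or n_ellipsis / n > max_symbol_ratio:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if sum(ln.startswith(BULLETS) for ln in lines) / len(lines) > max_bullet_ratio:
        return False
    if sum(ln.endswith(ELLIPSES) for ln in lines) / len(lines) > max_ellipsis_ratio:
        return False
    # Share of words containing at least one alphabetic character.
    if sum(any(c.isalpha() for c in w) for w in words) / n < min_alpha_ratio:
        return False
    # Count distinct stop words, ignoring case and trailing punctuation.
    found = STOP_WORDS & {w.lower().strip(".,;:!?") for w in words}
    return len(found) >= min_stop_words
```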

powered by datatrove