Common Crawl Pipeline Creator
1. URL Filtering
Filters samples based on their URLs.
use datatrove's built-in lists of banned URLs and words
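As a rough illustration of what URL-based filtering involves, the sketch below rejects a URL when its host is on a block list or when the URL contains a banned word. It uses only the standard library and is not datatrove's implementation; the names `BANNED_DOMAINS`, `BANNED_WORDS`, and `keep_url`, and the list contents, are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical block lists for illustration only; real lists are far larger.
BANNED_DOMAINS = {"spam.example", "ads.example"}
BANNED_WORDS = {"casino", "lottery"}


def keep_url(url: str) -> bool:
    """Return True if the URL passes both the domain check and the word check."""
    host = (urlparse(url).hostname or "").lower()
    # Reject a banned domain itself and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in BANNED_DOMAINS):
        return False
    # Reject URLs containing a banned word anywhere (plain substring match).
    lowered = url.lower()
    return not any(word in lowered for word in BANNED_WORDS)
```

Matching on the parsed host rather than the raw string avoids rejecting unrelated domains that merely end with the same characters, such as `notspam.example`.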
2. Text Extraction
Uses the Trafilatura extractor.
prefer less text but correct extraction
trafilatura's deduplicate option
3. Language Filtering
Uses the fastText language identification models.
4. Gopher Filtering (repetitions)
Uses the Gopher text repetition filters.
tokenizer language
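One of the simpler repetition signals is the share of lines that duplicate an earlier line. The sketch below computes that fraction and compares it to a threshold. It is a minimal, standard-library illustration and not datatrove's code; the function names and the 0.3 default are assumptions. Word-level n-gram checks, which is presumably where a tokenizer language setting matters, are left out.

```python
def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    seen = set()
    duplicates = 0
    for line in lines:
        if line in seen:
            duplicates += 1
        else:
            seen.add(line)
    return duplicates / len(lines)


def passes_repetition_check(text: str, max_dup_fraction: float = 0.3) -> bool:
    # max_dup_fraction is an assumed threshold for this sketch.
    return duplicate_line_fraction(text) <= max_dup_fraction
```

For example, a document made of the lines `a`, `b`, `a`, `a` has two duplicated lines out of four, so it would fail with the assumed threshold.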
8. PII Removal
Replaces email addresses and IP addresses in the document text.
Replace email addresses
Replace IP addresses
by default, only public IP addresses (the ones that can count as PII) are replaced
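The behavior described above can be sketched with regular expressions and the standard `ipaddress` module: emails are always replaced, while IPv4 addresses are replaced only when they are publicly routable. This is an illustrative approximation, not the PIIFormatter implementation; the patterns, placeholder values, and the name `scrub_pii` are assumptions.

```python
import ipaddress
import re

# Deliberately simple patterns; production-grade matching is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Placeholder values are assumptions for this sketch.
EMAIL_PLACEHOLDER = "email@example.com"
IP_PLACEHOLDER = "0.0.0.0"


def _replace_ip(match: re.Match) -> str:
    candidate = match.group(0)
    try:
        address = ipaddress.ip_address(candidate)
    except ValueError:
        return candidate  # e.g. 999.1.1.1 is not a valid address
    # is_global is False for private, loopback, and other reserved ranges.
    return IP_PLACEHOLDER if address.is_global else candidate


def scrub_pii(text: str) -> str:
    """Replace emails and public IPv4 addresses with fixed placeholders."""
    text = EMAIL_RE.sub(EMAIL_PLACEHOLDER, text)
    return IPV4_RE.sub(_replace_ip, text)
```

Private addresses such as `192.168.1.10` are left untouched because they do not identify a host on the public internet.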
7. Custom Filters
Uses the FineWeb custom text filters.
6. C4 Filters
Uses the C4 text size and content filters.
disable to apply the filters to each sentence instead of to each line
tokenizer language
remove wikipedia style citations from the text
remove lines without terminal punctuation marks
drop documents that contain 'lorem ipsum'
drop lines mentioning 'javascript'
drop documents containing a curly bracket ('{')
drop lines containing any of the policy phrases (e.g. 'terms of use', 'use cookies')
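The options above mix document-level drops with line-level drops. The sketch below shows one way these rules could fit together, working line by line. It is a simplified illustration and not the C4QualityFilter code; the citation pattern, the set of terminal punctuation marks, and the two-item phrase list are assumptions, and `c4_style_clean` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed pattern for Wikipedia-style citation markers such as [1] or [citation needed].
CITATION_RE = re.compile(r"\[\d*\]|\[edit\]|\[citation needed\]")
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
# Hypothetical subset of policy phrases, taken from the examples in the option text.
POLICY_PHRASES = ("terms of use", "use cookies")


def c4_style_clean(text: str) -> Optional[str]:
    """Return the cleaned text, or None if the whole document is dropped."""
    # Document-level rules: placeholder text and curly brackets drop the document.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None
    kept = []
    for line in text.splitlines():
        line = CITATION_RE.sub("", line).strip()
        low = line.lower()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # no terminal punctuation mark
        if "javascript" in low:
            continue  # line mentions javascript
        if any(phrase in low for phrase in POLICY_PHRASES):
            continue  # boilerplate policy line
        kept.append(line)
    return "\n".join(kept)
```

Navigation menus and cookie banners tend to fail the punctuation or phrase checks, which is why these rules remove a large share of page boilerplate.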
5. Gopher Filtering (quality)
Uses the Gopher text quality filters.
tokenizer language
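Quality heuristics of this kind typically look at simple document statistics. The sketch below checks word count, mean word length, and the share of words containing a letter. It is not the GopherQualityFilter implementation: the function name is hypothetical, the default thresholds are assumptions loosely modeled on the published Gopher rules, and a whitespace split stands in for a language-aware tokenizer.

```python
def gopher_style_quality_ok(
    text: str,
    min_words: int = 50,
    max_words: int = 100_000,
    min_avg_word_len: float = 3.0,
    max_avg_word_len: float = 10.0,
    min_alpha_word_ratio: float = 0.8,
) -> bool:
    """Return True if the text passes all three heuristic checks."""
    words = text.split()  # whitespace split stands in for a real tokenizer
    if not words or not (min_words <= len(words) <= max_words):
        return False
    avg_len = sum(len(w) for w in words) / len(words)
    if not (min_avg_word_len <= avg_len <= max_avg_word_len):
        return False
    # Share of words that contain at least one alphabetic character.
    alpha_words = sum(1 for w in words if any(c.isalpha() for c in w))
    return alpha_words / len(words) >= min_alpha_word_ratio
```

All thresholds are keyword arguments, so they can be tuned per language or per corpus.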
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import (
    C4QualityFilter,
    FineWebQualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
    LanguageFilter,
    URLFilter,
)
from datatrove.pipeline.formatters import PIIFormatter
from datatrove.pipeline.readers import WarcReader

pipeline_executor = LocalPipelineExecutor(
    pipeline=[
        # Read WARC files from the CC-MAIN-2023-50 crawl.
        WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments", glob_pattern="*/warc/*"),
        URLFilter(),  # 1. URL filtering
        Trafilatura(),  # 2. text extraction
        LanguageFilter(),  # 3. language filtering
        GopherRepetitionFilter(),  # 4. Gopher filtering (repetitions)
        GopherQualityFilter(),  # 5. Gopher filtering (quality)
        C4QualityFilter(),  # 6. C4 filters
        FineWebQualityFilter(),  # 7. custom (FineWeb) filters
        PIIFormatter(),  # 8. PII removal
    ]
)
pipeline_executor.run()
powered by datatrove