Hi, I’m trying to do some simple filtering of the FineWeb dataset. The dataset page provides examples of how to process it using datatrove, but datatrove only supports Slurm as a distributed executor, and I don’t have access to Slurm. I do have access to a Ray cluster, so I wrote a ray.data pipeline to process the FineWeb parquet files and save the results to a GCS bucket (without downloading the dataset to the cluster nodes).
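Here’s roughly what the pipeline looks like (simplified: the real filter logic is omitted and the bucket path is made up):

```python
import pyarrow.fs
import ray
from huggingface_hub import HfFileSystem

# Hypothetical output bucket -- replace with your own.
OUT_URI = "gs://my-bucket/fineweb-filtered/"

# fsspec filesystem backed by the HF Hub, wrapped so pyarrow/Ray can read through it.
hf_fs = HfFileSystem()
files = hf_fs.glob("datasets/HuggingFaceFW/fineweb/sample/10BT/*.parquet")

ds = ray.data.read_parquet(
    files,
    filesystem=pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(hf_fs)),
)

def simple_filter(batch):
    # Placeholder filter: keep high-confidence English rows.
    return batch[batch["language_score"] > 0.9]

ds.map_batches(simple_filter, batch_format="pandas").write_parquet(OUT_URI)
```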
It works fine for the 10BT subsampled version of the dataset when running with 16 parallel tasks. However, when I tried to scale up the number of parallel tasks for a bigger version of the dataset, it looks like I was getting rate-limited: after a few minutes of runtime I started getting “Bad request” errors from Ray’s read_parquet.
So my questions are:
- Is there any rate limit on access/reads of the FineWeb dataset?
- Is there a way to copy/clone the dataset directly to GCS without going through the local file system?
I’ve tried following the docs about cloud storage, but no matter what I do, the code examples always download the dataset locally first (which is “problematic” for a dataset of this size).
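For reference, this is roughly what I tried based on those docs (bucket and project names are made up); as far as I can tell, load_dataset materializes everything in the local cache before save_to_disk ever touches GCS:

```python
from datasets import load_dataset

# This downloads/caches the whole dataset locally first -- the step I want to avoid.
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train")

# Then it gets written out to GCS via fsspec.
ds.save_to_disk("gcs://my-bucket/fineweb-raw", storage_options={"project": "my-gcp-project"})
```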
I would also appreciate any best practices for processing datasets of this size hosted on Hugging Face.