I tested the behavior of the datasets library on Colab for now.
Summary for your case
- Your data is not being physically stored as int64 (in the typical case). If inspecting the backing Arrow column shows the leaf value type is uint8, then the on-disk buffers holding pixel values are uint8.
- The int64 you observe comes from with_format("numpy"), which applies a separate, on-the-fly conversion layer at __getitem__ time. This conversion has known dtype-mismatch bugs/footguns for array-like columns (e.g., returning int64 for integer arrays even when Features says uint8). (Hugging Face)
- The ~2.6× disk bloat is explainable even with uint8 values: it comes from nested Arrow ListArray offset buffers (structural overhead), not from values widening to int64. (GitHub)
1) Why do you see int64 after loading?
with_format("numpy") is not "just reading Arrow"; it converts on access
Hugging Face Datasets explicitly documents that formatting (including "numpy") is applied on-the-fly and that format_kwargs are passed to conversion functions like np.array. (Hugging Face)
So in:
loaded_ds = loaded_ds.with_format("numpy")
sample = loaded_ds[0]["image"]
you are not directly retrieving the Arrow leaf buffer as a NumPy view; you're going through a conversion path that (today) can change dtypes.
This dtype mismatch is a known, repeatedly reported problem
There are multiple issues showing the same pattern:
- uint8 features but NumPy outputs int64 (very similar to your complaint). (GitHub)
- dtype mismatch for Array2D with with_format("numpy") even when the dtype is user-specified. (GitHub)
- with_format("numpy") silently downcasts floats (showing that formatting can alter dtype). (GitHub)
Practical implication
Your observation "after loading it's int64" is consistent with a formatting-layer behavior/bug, not necessarily Arrow storage.
2) Why does disk usage blow up (~2.6×)?
Background: Array3D is stored as nested lists with offsets
For multi-dimensional arrays, Datasets converts a NumPy array into an Arrow ListArray by:
- flattening the values
- wrapping them repeatedly in list layers, creating int32 offset arrays for each list nesting level
The conversion code is visible in numpy_to_pyarrow_listarray: it creates offsets using np.arange(n_offsets + 1) * step_offsets with pa.int32(), then wraps values with pa.ListArray.from_arrays(offsets, values) in a loop. (Hugging Face)
Why offsets cost so much for images
Arrow List arrays store an offset buffer (length = number of lists + 1) that describes where each list starts/ends. (Apache Arrow)
For an image (H, W, C) represented as nested lists, one major overhead is at the "pixel list" level:
- there are H * W lists per image (each list has length C)
- offsets are int32 → 4 bytes each
- the offset buffer size per image at that level is approximately (H*W + 1) * 4 bytes
For your CIFAR-256×256×3 example, per image:
- raw uint8 payload = 256*256*3 = 196,608 bytes
- pixel offsets ≈ (256*256 + 1) * 4 = 262,148 bytes
- plus additional (smaller) offsets for the next list level(s), plus Arrow file/metadata overhead
That already puts you around ~2× before counting higher-level offsets and file overhead; empirically, this aligns with reports that Array2D/Array3D can have "unreasonably high" memory/storage overhead. (GitHub)
This is why a 26 GB directory can be plausible even when the leaf values are uint8: you are paying for structure (offsets), not for value widening.
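The per-image tally above can be checked with a few lines (a back-of-envelope accounting under the stated assumptions; real files add buffer padding and metadata on top):

```python
# Per-image byte accounting for a 256x256x3 uint8 image stored as
# nested ListArrays with int32 offsets (structural overhead only).
H, W, C = 256, 256, 3

payload = H * W * C                  # raw uint8 pixel bytes: 196,608
pixel_offsets = (H * W + 1) * 4      # int32 offsets for the C-length pixel lists
row_offsets = (H + 1) * 4            # int32 offsets for the W-length row lists
image_offsets = (1 + 1) * 4          # offsets for the outermost list (per image)

total = payload + pixel_offsets + row_offsets + image_offsets
print(round(total / payload, 2))  # -> 2.34, before file/metadata overhead
```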
3) Why ds.cast(features_numpy) doesn't "enforce uint8 on disk" (as you expect)
Case A (most common): disk is already uint8; only NumPy formatting is wrong
If the Arrow leaf values are uint8, then cast() did its job for storage; your "int64" symptom comes from the on-the-fly formatting layer. (Hugging Face)
Case B: even if cast() ensures the leaf dtype, it doesn't change the physical representation choice
Features specifies the serialization format and schema. (Hugging Face)
But Array3D still uses the nested ListArray storage strategy described above, so cast() cannot remove offset buffers or switch you to a fixed-size tensor representation. (Hugging Face)
So cast() may enforce the leaf value type, but it doesn't guarantee:
- no dtype change during with_format("numpy")
- that disk size equals the raw uint8 payload
4) What you should do (verification + fixes)
A) Verify whether the Arrow leaf is uint8 (this is the decisive test)
Do this before with_format("numpy"):
import pyarrow as pa
from datasets import load_from_disk

ds = load_from_disk("test_dataset")
tbl = ds.data
col = tbl.column("image").chunk(0)

# unwrap extension storage (Array3D is an Arrow extension type)
arr = col.storage if hasattr(col, "storage") else col

# walk down nested lists to the leaf value array
while pa.types.is_list(arr.type) or pa.types.is_large_list(arr.type) or pa.types.is_fixed_size_list(arr.type):
    arr = arr.values

print("Leaf type:", arr.type)
If it prints uint8, your data is stored as uint8 and the "int64" comes from formatting.
B) Fix the output dtype (workaround for formatting bug)
Because with_format passes kwargs to the conversion function, force the dtype:
import numpy as np
from datasets import load_from_disk

loaded = load_from_disk("test_dataset").with_format(
    "numpy",
    columns=["image"],
    dtype=np.uint8,
    output_all_columns=True,
)
This is consistent with how with_format is documented (format kwargs passed to np.array). (Hugging Face)
C) Fix disk size (change representation; Array3D is offset-heavy)
If disk footprint matters at 256×256×3 scale, consider alternatives:
- Flatten + fixed-length Sequence(length=H*W*C)
  HF staff explicitly state that when Sequence.length is specified, it's stored as a fixed-size PyArrow list and "does not store the offsets," reducing memory. (Hugging Face Forums)
- Store encoded images (Image() feature) or raw bytes (Value("binary"))
  These avoid the "offset per pixel" overhead and typically reduce size dramatically.
Direct answers to your two questions
- Does Array3D default to int64 during Arrow serialization?
  Not as a general rule. What's widely reported is that NumPy formatting (with_format("numpy")) can return int64 regardless of the feature dtype for array-like columns. (GitHub)
- Why does cast() fail to enforce uint8 on physical disk?
  Usually it doesn't fail; the leaf can still be uint8, and the mismatch you see is typically formatting-time conversion. Independently, cast() doesn't change the fact that Array3D uses nested ListArrays with int32 offsets, which is what drives up disk usage. (Hugging Face)