[Bug?] Array3D(dtype="uint8") persists as int64 after save_to_disk despite explicit casting

Description

I am encountering an issue where a dataset defined with Array3D(dtype="uint8") is being saved and loaded as int64. This happens even when I explicitly cast the dataset and ensure the generator yields uint8 numpy arrays. The result is massive disk usage (approx. 2.6× larger than expected), and the data loads back as int64.

Environment

  • datasets version: 4.5.0
  • pyarrow version: 23.0.0

Minimal Reproducible Example

import numpy as np
from datasets import Dataset, Features, Array3D, Value, load_from_disk


# 1. Define features with uint8
features_numpy = Features({
    'image': Array3D(shape=(256, 256, 3), dtype="uint8"),
    'label': Value('int64')
})

# 2. Generator yielding uint8 numpy arrays
def gen():
    for _ in range(10):
        yield {
            # Explicitly casting to uint8 here
            'image': np.zeros((256, 256, 3), dtype=np.uint8),
            'label': 0
        }

# 3. Create and Cast
ds = Dataset.from_generator(gen, features=features_numpy)
ds = ds.cast(features_numpy) # Forced casting before saving

print(f"Dtype before saving: {ds.features['image'].dtype}") # Shows uint8

# 4. Save and Load
ds.save_to_disk('test_dataset')
loaded_ds = load_from_disk('test_dataset')

# 5. Check format after loading
loaded_ds = loaded_ds.with_format("numpy")
sample = loaded_ds[0]['image']

print(f"Dtype after loading: {sample.dtype}")
# Actual: int64 (Expected: uint8)

Key Observations

  • Inconsistent Metadata: Interestingly, the dataset_info.json file generated in the save directory correctly lists the feature as dtype: uint8. However, the actual Arrow data seems to be stored as int64.
  • Disk Usage: When processing CIFAR-100 (resized to 256x256), the resulting directory took up 26GB. For uint8 data (50,000 × 256 × 256 × 3 × 1 byte), it should theoretically be around 9.8GB. The 26GB size suggests it’s being stored in a higher bit-depth.
  • Data Access: When I access the samples using .with_format("numpy"), the resulting array is always int64.

Questions

  1. Is this a known behavior where Array3D defaults to int64 during Arrow serialization if certain conditions aren’t met?
  2. Why does ds.cast(features_numpy) fail to enforce the uint8 type on the physical disk?

I tested the behavior of the datasets library on Colab.


Summary for your case

  • Your data is not being physically stored as int64 (in the typical case). If inspecting the backing Arrow column shows the leaf value type is uint8, then the on-disk buffers holding pixel values are uint8.
  • The int64 you observe comes from with_format("numpy"), which applies a separate, on-the-fly conversion layer at __getitem__ time. This conversion has known dtype-mismatch bugs/footguns for array-like columns (e.g., returning int64 for integer arrays even when Features says uint8). (Hugging Face)
  • The ~2.6× disk bloat is explainable even with uint8 values due to nested Arrow ListArray offset buffers (structural overhead), not because values widened to int64. (GitHub)

1) Why do you see int64 after loading?

with_format("numpy") is not “just reading Arrow”; it converts on access

Hugging Face Datasets explicitly documents that formatting (including "numpy") is applied on-the-fly and that format_kwargs are passed to conversion functions like np.array. (Hugging Face)

So in:

loaded_ds = loaded_ds.with_format("numpy")
sample = loaded_ds[0]["image"]

you are not directly retrieving the Arrow leaf buffer as a NumPy view; you’re going through a conversion path that (today) can change dtypes.

This dtype mismatch is a known, repeatedly reported problem

There are multiple issues showing the same pattern:

  • uint8 features but NumPy outputs int64 (very similar to your complaint). (GitHub)
  • dtype mismatch for Array2D with with_format("numpy") even when dtype is user-specified. (GitHub)
  • with_format("numpy") silently downcasts floats (shows that formatting can alter dtype). (GitHub)

Practical implication

Your observation “after loading it’s int64” is consistent with a formatting-layer behavior/bug, not necessarily Arrow storage.


2) Why does disk usage blow up (~2.6×)?

Background: Array3D is stored as nested lists with offsets

For multi-dimensional arrays, Datasets converts a NumPy array into an Arrow ListArray by:

  1. flattening values
  2. wrapping them repeatedly in list layers, creating int32 offset arrays for each list nesting level

The conversion code is visible in numpy_to_pyarrow_listarray: it creates offsets using np.arange(n_offsets + 1) * step_offsets with pa.int32(), then wraps values with pa.ListArray.from_arrays(offsets, values) in a loop. (Hugging Face)

Why offsets cost so much for images

Arrow List arrays store an offset buffer (length = number of lists + 1) that describes where each list starts/ends. (Apache Arrow)

For an image (H, W, C) represented as nested lists, one major overhead is at the “pixel list” level:

  • there are H * W lists per image (each list length = C)
  • offsets are int32 → 4 bytes each
  • offset buffer size per image at that level is approximately (H*W + 1) * 4

For your CIFAR→256×256×3 example, per image:

  • raw uint8 payload = 256*256*3 = 196,608 bytes
  • pixel offsets ≈ (256*256 + 1) * 4 = 262,148 bytes
  • plus additional (smaller) offsets for the next list level(s), plus Arrow file/metadata overhead
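The arithmetic above can be checked with a quick back-of-envelope script (it ignores higher-level offsets and Arrow file metadata, so the real ratio is somewhat higher):

```python
# Rough estimate of ListArray offset overhead for one 256x256x3 uint8 image
H, W, C = 256, 256, 3

raw_payload = H * W * C            # 196,608 bytes of uint8 pixel values
pixel_offsets = (H * W + 1) * 4    # int32 offsets for the per-pixel (length-C) lists
row_offsets = (H + 1) * 4          # int32 offsets for the per-row (length-W) lists

total = raw_payload + pixel_offsets + row_offsets
print(f"{total / raw_payload:.2f}x raw payload")  # ~2.34x
```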

That already puts you around ~2× before counting higher-level offsets and file overhead; empirically, this aligns with reports that Array2D/Array3D can have “unreasonably high” memory/storage overhead. (GitHub)

This is why a 26GB directory is plausible even when the leaf values are uint8: you’re paying for structure (offsets), not value widening.


3) Why ds.cast(features_numpy) doesn’t “enforce uint8 on disk” (as you expect)

Case A (most common): disk is already uint8; only NumPy formatting is wrong

If Arrow leaf values are uint8, then cast() did its job for storage—your “int64” symptom is from the on-the-fly formatting layer. (Hugging Face)

Case B: even if cast() ensures leaf dtype, it doesn’t change the physical representation choice

Features specifies the serialization format and schema. (Hugging Face)
But Array3D still uses the nested ListArray storage strategy described above, so cast() cannot remove offset buffers or switch you to a fixed-size tensor representation. (Hugging Face)

So cast() may enforce the leaf value type, but it doesn’t guarantee:

  • “no dtype change during with_format("numpy")”
  • “disk size equals raw uint8 payload”

4) What you should do (verification + fixes)

A) Verify whether the Arrow leaf is uint8 (this is the decisive test)

Do this before with_format("numpy"):

import pyarrow as pa
from datasets import load_from_disk

ds = load_from_disk("test_dataset")
tbl = ds.data
col = tbl.column("image").chunk(0)

# unwrap extension storage
arr = col.storage if hasattr(col, "storage") else col

# walk down nested lists to leaf
while pa.types.is_list(arr.type) or pa.types.is_large_list(arr.type) or pa.types.is_fixed_size_list(arr.type):
    arr = arr.values

print("Leaf type:", arr.type)

If it prints uint8, your data is stored as uint8 and the “int64” comes from formatting.

B) Fix the output dtype (workaround for formatting bug)

Because with_format passes kwargs to the conversion function, force the dtype:

import numpy as np
from datasets import load_from_disk

loaded = load_from_disk("test_dataset").with_format(
    "numpy",
    columns=["image"],
    dtype=np.uint8,
    output_all_columns=True,
)

This is consistent with how with_format is documented (format kwargs passed to np.array). (Hugging Face)

C) Fix disk size (change representation; Array3D is offset-heavy)

If disk footprint matters at 256×256×3 scale, consider alternatives:

  1. Flatten + fixed-length Sequence(length=H*W*C)
    HF staff explicitly states that when Sequence.length is specified, it’s stored as a fixed PyArrow list and “does not store the offsets,” reducing memory. (Hugging Face Forums)

  2. Store encoded images (Image() feature) or raw bytes (Value("binary"))
    These avoid “offset per pixel” overhead and typically reduce size dramatically.


Direct answers to your two questions

  1. Does Array3D default to int64 during Arrow serialization?
    Not as a general rule. What’s widely reported is that NumPy formatting (with_format("numpy")) can return int64 regardless of the feature dtype for array-like columns. (GitHub)

  2. Why does cast() fail to enforce uint8 on physical disk?
    Usually it doesn’t; the leaf can still be uint8. The mismatch you see is typically formatting-time conversion. And independently, cast() doesn’t change that Array3D uses nested ListArrays with int32 offsets, which drives up disk usage. (Hugging Face)

