LoadDataSet pyarrow.lib.ArrowCapacityError

I use

data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")

The following error is reported when loading the dataset (approximately 84 GB):

pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 10761561509,

I tried the workaround suggested in other posts:

set(data_set["hash"])

but it still hasn't solved the problem. Do you have any suggestions for how to fix it? Thank you!

My version information is as follows:

  • datasets version: 3.2.0
  • Platform: Linux-4.19.91-014.15-kangaroo.alios7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • huggingface_hub version: 0.26.5
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.2.0

This appears to be a PyArrow issue; although parts of it have been fixed, it still seems unresolved. @lhoestq

Yes, I have seen similar posts with the same issue:

Minhash Deduplication - #11 by conceptofmind

But I tried that method and it didn't resolve the error.
Is there any other way you could help me solve this problem?
Thank you!

How about trying .shard()?

Thank you for your reply.
Doesn't .shard() only partition the dataset after the load_dataset() object has been created?

But this error occurs during load_dataset() itself.

It may also be another limitation of PyArrow. If you set num_shards to around 20, maybe it will work… I hope it does.

I have already set num_shards to 100, but the same error still occurs:

data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")
data_set = data_set.shard(num_shards=100, index=0)

It seems that this error already occurs while load_dataset() is executing, before .shard() ever runs.