I load a dataset (approximately 84 GB) with
data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")
and get the following error during loading:
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 10761561509
I tried the workaround suggested in other posts:
set(data_set["hash"])
but it still hasn't solved the problem. Do you have any way to help me solve it? Thank you!
My version information is as follows:
- datasets version: 3.2.0
- Platform: Linux-4.19.91-014.15-kangaroo.alios7.x86_64-x86_64-with-glibc2.35
- Python version: 3.11.10
- huggingface_hub version: 0.26.5
- PyArrow version: 17.0.0
- Pandas version: 2.2.3
- fsspec version: 2024.2.0
Apparently this is a PyArrow issue; although parts of it have been addressed upstream, it still seems unresolved. @lhoestq
Yes, I have seen similar posts with the same issue:
Minhash Deduplication - #11 by conceptofmind
But I tried this method and it didn't solve the error.
May I ask if there is any way you can help me solve this problem?
Thank you!
How about trying .shard()?
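For reference, a minimal sketch of what that suggestion could look like (the path, shard count, and the "hash" column here are illustrative, and it assumes load_dataset() itself completes):

```python
from datasets import load_dataset

# Assumption: load_dataset() succeeds; .shard() then splits the loaded
# dataset into smaller pieces that can be processed one at a time.
data_set = load_dataset("path/to/data", split="train")

num_shards = 20
for i in range(num_shards):
    shard = data_set.shard(num_shards=num_shards, index=i)
    hashes = set(shard["hash"])  # e.g. per-shard deduplication work
```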
Thank you for your reply.
Doesn't .shard() only partition the dataset after the load_dataset() object has been created?
But this error occurs during load_dataset() itself.
It may also be another limitation of PyArrow. If you set num_shards to around 20, maybe it will work... I hope it does.
I have already set num_shards to 100, but the same error still occurs:
data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")
data_set = data_set.shard(num_shards=100, index=0)
It seems that the error already occurs while load_dataset() is executing.
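Just a thought, not something verified in this thread: since the failure happens while load_dataset() is building the Arrow table, streaming the dataset might sidestep that conversion step entirely. A minimal sketch, assuming the source files support streaming mode:

```python
from datasets import load_dataset

# Streaming returns an IterableDataset and yields examples lazily,
# so the full 84 GB table is never materialized as Arrow arrays up front.
# (Assumption: the data format can be read in streaming mode.)
data_set = load_dataset("path/to/data", split="train", streaming=True)

for example in data_set:
    ...  # process one example at a time
```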