Article FineWeb-C: A Community-Driven Dataset for Educational Quality Annotations in 122 Languages Jul 8, 2025 • 35
Automatic Metadata Generation and Extraction datasets Collection Datasets which can help train or evaluate various approaches to automatic metadata generation and extraction. • 4 items • Updated Oct 16, 2025 • 4
Article The Hugging Face Hub for Galleries, Libraries, Archives and Museums Jun 12, 2023 • 3
Article FineWeb2-C: Help Build Better Language Models in Your Language Dec 23, 2024 • 21
Article Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing Jul 19, 2024 • 20
Probably function calling datasets Collection Created using the https://huggingface.co/spaces/librarian-bots/dataset-column-search-api Space. • 39 items • Updated Jul 17, 2024 • 39
synthetic-data-generation-demos Collection A collection of demos for various approaches to synthetic data generation • 4 items • Updated Jun 25, 2024 • 14
sentence-transformers-from-synthetic-data Collection Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 4 items • Updated Jun 21, 2024 • 23
Article Using Llama3 and distilabel to build fine-tuning datasets Jun 4, 2024 • 79
StarCraftImage: A Dataset For Prototyping Spatial Reasoning Methods For Multi-Agent Environments Paper • 2401.04290 • Published Jan 9, 2024 • 3
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding Paper • 2401.04575 • Published Jan 9, 2024 • 18
AeroPath: An airway segmentation benchmark dataset with challenging pathology Paper • 2311.01138 • Published Nov 2, 2023 • 6
RadioGalaxyNET: Dataset and Novel Computer Vision Algorithms for the Detection of Extended Radio Galaxies and Infrared Hosts Paper • 2312.00306 • Published Dec 1, 2023 • 2
SynFundus: Generating a synthetic fundus images dataset with millions of samples and multi-disease annotations Paper • 2312.00377 • Published Dec 1, 2023 • 3
Enhancing Visually-Rich Document Understanding via Layout Structure Modeling Paper • 2308.07777 • Published Aug 15, 2023 • 2
smol models Collection Models where the size of the model file (model.safetensors or pytorch_model.bin) < 50mb • 58 items • Updated Jul 3, 2024 • 8