Mastodawn

Daniel van Strien Jan 3, 2025

Was 2024 the year of datasets? Is 2025 the year for community-built datasets?

It's exciting to see the progress of many languages in FineWeb-C:
- Total annotations submitted: 41,577
- Languages with annotations: 106
- Total contributors: 363

Daniel van Strien Oct 29, 2024

Researchers: Want your ML datasets to have more impact? Share them on @huggingface Hub!

✨ Benefits:
• Visibility in the ML community
• Interactive data viewer
• Support for TB-scale datasets
• Integration with @DataPolars @pandas_dev @duckdb and more
https://huggingface.co/blog/researcher-dataset-sharing

Creating open machine learning datasets? Share them on the Hugging Face Hub!

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Daniel van Strien Sep 26, 2024

ColPali is revolutionizing multimodal retrieval. Can we make it even more effective with domain-specific fine-tuning?

Check out my latest blog post, where I create a dataset for fine-tuning a ColPali model for a new domain using an open Vision Language Model.

https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html

Generating a dataset of queries for training and fine-tuning ColPali models on a UFO dataset – Daniel van Strien

Using an open VLM to generate queries for a multimodal retrieval model

Daniel van Strien Sep 12, 2024

Can we search for datasets on the @huggingface Hub based on their content?

> Some datasets lack good documentation 😢
> The dataset viewer preview offers a wealth of information

🤔 How about: query -> dataset, based on structure and content?

Check out V1: https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search

Semantic Dataset Search - a Hugging Face Space by librarian-bots

Discover amazing ML apps made by the community
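
The "query -> dataset" idea above can be pictured as ranking datasets by how similar a query is to a text summary of each dataset's viewer preview. The following is only a minimal sketch under stated assumptions: it uses a toy bag-of-words vector in place of a real embedding model, and the dataset IDs and preview strings are made up for illustration. It is not the Space's actual implementation.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical preview summaries (column names plus a short description).
# In practice these would be derived from the dataset viewer preview.
PREVIEWS = {
    "example/movie-reviews": "columns: review_text, sentiment_label; rows: movie reviews with positive or negative sentiment",
    "example/ufo-reports": "columns: page_image, report_text; rows: scanned ufo sighting reports",
    "example/book-blurbs": "columns: title, blurb, genre; rows: short book blurbs by genre",
}


def search(query: str, previews: dict = PREVIEWS, top_k: int = 2) -> list:
    """Return up to top_k (dataset_id, score) pairs with a non-zero score."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), ds) for ds, text in previews.items()]
    scored.sort(reverse=True)
    return [(ds, round(score, 3)) for score, ds in scored[:top_k] if score > 0]


if __name__ == "__main__":
    print(search("ufo sighting reports"))
```

Swapping `embed` for a sentence-embedding model would keep the ranking logic the same while matching on meaning rather than exact tokens.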

Daniel van Strien Sep 10, 2024

You can help improve this project by rating synthetic user search queries for hub datasets. If you have a @huggingface login, you can start annotating in @argilla_io in < 5 seconds here: https://davanstrien-my-argilla.hf.space/dataset/1100a091-7f3f-4a6e-ad51-4e859abab58f/annotation-mode

Argilla

Daniel van Strien Sep 10, 2024

I need to do some tidying, but I'll share all the code and in-progress datasets for this soon!

Daniel van Strien Sep 10, 2024

Almost ready: search for a @huggingface dataset on the Hub using information from the dataset viewer preview!

Soon, you can find deep-cut datasets even if they don't have a full dataset card (you should still document your datasets!)

Daniel van Strien Sep 9, 2024

The @huggingface Semantic Dataset Search is back in action! Find similar datasets by ID or do a semantic search of dataset cards.

Give it a try:
https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search

Daniel van Strien Aug 7, 2024

Is your summer reading list still empty? Curious if an LLM can generate a book blurb you'd enjoy and help build a KTO preference dataset at the same time?

A demo using @huggingface Spaces and @gradio to collect LLM output preferences: https://huggingface.co/spaces/davanstrien/would-you-read-it

Would You Read It - a Hugging Face Space by davanstrien
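
KTO learns from unpaired binary feedback (a single output marked as liked or not liked) rather than from ranked pairs, so collected votes map naturally onto one record per rated output. The sketch below shows one way thumbs-up/down votes could be turned into such records. The `Vote` class and the example blurbs are hypothetical, and the `prompt` / `completion` / `label` field names are an assumption based on a commonly used KTO data layout, not the demo's confirmed schema.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    """Hypothetical record of one user rating for one generated blurb."""
    prompt: str
    completion: str
    liked: bool


def to_kto_records(votes: list) -> list:
    """Convert votes into unpaired records: one row per rated completion."""
    return [
        {"prompt": v.prompt, "completion": v.completion, "label": v.liked}
        for v in votes
    ]


if __name__ == "__main__":
    # Made-up example votes for illustration only.
    votes = [
        Vote("Write a blurb for a sci-fi novel.", "A lone botanist on Europa...", True),
        Vote("Write a blurb for a sci-fi novel.", "In a world where...", False),
    ]
    for record in to_kto_records(votes):
        print(record)
```

Because each row stands alone, the same prompt can appear with both liked and disliked completions without any pairing step.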

Daniel van Strien Jul 15, 2024

SPIQA from @Google is a large-scale question-answering dataset centred on figures, tables, and text paragraphs from scientific research papers in various computer science domains.
https://huggingface.co/datasets/google/spiqa

google/spiqa · Datasets at Hugging Face
