Daniel van Strien

305 Followers
357 Following
199 Posts
📖🤗 Machine learning Librarian at Hugging Face

Was 2024 the year of datasets? Is 2025 the year for community-built datasets?

It's exciting to see the progress of many languages in FineWeb-C:
- Total annotations submitted: 41,577
- Languages with annotations: 106
- Total contributors: 363

Researchers: Want your ML datasets to have more impact? Share them on @huggingface Hub!

✨ Benefits:
• Visibility in the ML community
• Interactive data viewer
• Support for TB-scale datasets
• Integration with @DataPolars @pandas_dev @duckdb and more
https://huggingface.co/blog/researcher-dataset-sharing

Creating open machine learning datasets? Share them on the Hugging Face Hub!

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

ColPali is revolutionizing multimodal retrieval. Can we make it even more effective with domain-specific fine-tuning?

Check out my latest blog post, where I create a dataset for fine-tuning a ColPali model for a new domain using an open Vision Language Model.

https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html

Generating a dataset of queries for training and fine-tuning ColPali models on a UFO dataset – Daniel van Strien

Using an open VLM to generate queries for a multimodal retrieval model

Can we search for datasets on the @huggingface Hub based on their content?

> Some datasets lack good documentation 😢
> The dataset viewer preview offers a wealth of information

🤔 How about: query -> dataset based on structure content?

Check out V1: https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search

Semantic Dataset Search - a Hugging Face Space by librarian-bots

Discover amazing ML apps made by the community

You can help improve this project by rating synthetic user search queries for hub datasets. If you have a @huggingface login, you can start annotating in @argilla_io in < 5 seconds here: https://davanstrien-my-argilla.hf.space/dataset/1100a091-7f3f-4a6e-ad51-4e859abab58f/annotation-mode
Argilla

I need to do some tidying, but I'll share all the code and in-progress datasets for this soon!

Almost ready: search for a @huggingface dataset on the Hub from information in the datasets viewer preview!

Soon, you can find deep-cut datasets even if they don't have a full dataset card (you should still document your datasets!)

The @huggingface's Semantic Dataset Search is back in action! Find similar datasets by ID or do a semantic search of dataset cards.

Give it a try:
https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search

Semantic Dataset Search - a Hugging Face Space by librarian-bots

Discover amazing ML apps made by the community

Is your summer reading list still empty? Curious if an LLM can generate a book blurb you'd enjoy and help build a KTO preference dataset at the same time?

A demo using @huggingface Spaces and @gradio to collect LLM output preferences: https://huggingface.co/spaces/davanstrien/would-you-read-it

Would You Read It - a Hugging Face Space by davanstrien

Discover amazing ML apps made by the community

SPIQA from @Google is a large-scale question-answering dataset centred on figures, tables, and text paragraphs from scientific research papers in various computer science domains.
https://huggingface.co/datasets/google/spiqa
google/spiqa · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.