Daniel van Strien

📖🤗 Machine learning Librarian at Hugging Face

Was 2024 the year of datasets? Is 2025 the year for community-built datasets?

It's exciting to see the progress of many languages in FineWeb-C:
- Total annotations submitted: 41,577
- Languages with annotations: 106
- Total contributors: 363

Researchers: Want your ML datasets to have more impact? Share them on @huggingface Hub!

✨ Benefits:
• Visibility in the ML community
• Interactive data viewer
• Support for TB-scale datasets
• Integration with @DataPolars @pandas_dev @duckdb and more
https://huggingface.co/blog/researcher-dataset-sharing

Creating open machine learning datasets? Share them on the Hugging Face Hub!

We're on a journey to advance and democratize artificial intelligence through open source and open science.
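
As a rough illustration of the integrations mentioned above: Parquet files in a Hub dataset repository can be addressed with `hf://` paths. This is a minimal sketch, assuming `pandas` and `huggingface_hub` are installed; the repository ID and file name are hypothetical placeholders, and `hub_parquet_path` and `load_with_pandas` are helper names invented for this example.

```python
def hub_parquet_path(repo_id: str, filename: str) -> str:
    """Build an hf:// path to a file inside a dataset repository on the Hub."""
    return f"hf://datasets/{repo_id}/{filename}"


def load_with_pandas(repo_id: str, filename: str):
    """Read one Parquet file from the Hub into a DataFrame.

    Assumes pandas and huggingface_hub are installed; the import is kept
    inside the function so the path helper works without them.
    """
    import pandas as pd

    return pd.read_parquet(hub_parquet_path(repo_id, filename))


# Hypothetical repository and file name, for illustration only.
path = hub_parquet_path("my-username/my-dataset", "data/train.parquet")
print(path)
```

Calling `load_with_pandas` needs network access and a real repository, so only the path helper is exercised here.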

ColPali is revolutionizing multimodal retrieval. Can we make it even more effective with domain-specific fine-tuning?

Check out my latest blog post, where I create a dataset for fine-tuning a ColPali model for a new domain using an open Vision Language Model.

https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html

Generating a dataset of queries for training and fine-tuning ColPali models on a UFO dataset – Daniel van Strien

Using an open VLM to generate queries for a multimodal retrieval model

Can we search for datasets on the @huggingface Hub based on their content?

> Some datasets lack good documentation 😢
> The dataset viewer preview offers a wealth of information

🤔 How about: query -> dataset based on structure content?

Check out V1: https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search

Semantic Dataset Search - a Hugging Face Space by librarian-bots

Discover amazing ML apps made by the community

You can help improve this project by rating synthetic user search queries for hub datasets. If you have a @huggingface login, you can start annotating in @argilla_io in < 5 seconds here: https://davanstrien-my-argilla.hf.space/dataset/1100a091-7f3f-4a6e-ad51-4e859abab58f/annotation-mode
Argilla

I need to do some tidying, but I'll share all the code and in-progress datasets for this soon!

Almost ready: search for a @huggingface dataset on the Hub from information in the datasets viewer preview!

Soon, you can find deep-cut datasets even if they don't have a full dataset card (you should still document your datasets!)

The @huggingface Semantic Dataset Search is back in action! Find similar datasets by ID or do a semantic search of dataset cards.

Give it a try:
https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search
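
For readers curious about the general idea behind a semantic search over dataset cards, here is a toy sketch. It is not the Space's implementation: a bag-of-words counter stands in for a learned embedding model, and the dataset IDs and card texts are made up for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, cards: dict, top_k: int = 3) -> list:
    """Rank dataset cards by similarity to the query, dropping zero scores."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), dataset_id) for dataset_id, text in cards.items()]
    scored.sort(reverse=True)
    return [(dataset_id, score) for score, dataset_id in scored[:top_k] if score > 0]


# Hypothetical dataset cards, for illustration only.
cards = {
    "example/ufo-reports": "scanned ufo sighting reports with page images",
    "example/book-blurbs": "short book blurbs with reader preference labels",
    "example/web-quality": "web text annotated for educational quality",
}
results = search("book blurbs", cards)
print(results)
```

Swapping `embed` for a sentence-embedding model would keep the rest of the ranking logic unchanged.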


@arnicas Occasionally the books sound interesting, but often the blurbs are not very good. Think LLMs are still very lacking in this kind of task tbh.

Is your summer reading list still empty? Curious if an LLM can generate a book blurb you'd enjoy and help build a KTO preference dataset at the same time?

A demo using @huggingface Spaces and @gradio to collect LLM output preferences: https://huggingface.co/spaces/davanstrien/would-you-read-it

Would You Read It - a Hugging Face Space by davanstrien
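
A KTO-style preference dataset only needs a binary signal per output, not a pairwise comparison. The sketch below shows one way a single "would you read it?" vote could be stored. The `prompt` / `completion` / `label` field names are an assumption modelled on the common unpaired-preference layout; the Space's actual schema is not shown here, and the example prompt and blurb are made up.

```python
def to_kto_record(prompt: str, completion: str, would_read: bool) -> dict:
    """Turn one thumbs-up/thumbs-down vote into an unpaired preference record.

    Field names are assumed for illustration, not taken from the demo.
    """
    return {"prompt": prompt, "completion": completion, "label": bool(would_read)}


# Hypothetical vote, for illustration only.
record = to_kto_record(
    prompt="Write a blurb for a mystery novel set in a library.",
    completion="When the rare books start vanishing, one archivist follows the trail.",
    would_read=True,
)
print(record)
```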
