Daniel van Strien

📖🤗 Machine learning Librarian at Hugging Face

Was 2024 the year of datasets? Is 2025 the year for community-built datasets?

It's exciting to see the progress of many languages in FineWeb-C:
- Total annotations submitted: 41,577
- Languages with annotations: 106
- Total contributors: 363

ColPali is revolutionizing multimodal retrieval. Can we make it even more effective with domain-specific fine-tuning?

Check out my latest blog post, where I create a dataset for fine-tuning a ColPali model for a new domain using an open Vision Language Model.

https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html

Generating a dataset of queries for training and fine-tuning ColPali models on a UFO dataset – Daniel van Strien

Using an open VLM to generate queries for a multimodal retrieval model

Can we search for datasets on the @huggingface Hub based on their content?

> Some datasets lack good documentation 😢
> The dataset viewer preview offers a wealth of information

🤔 How about: query -> dataset based on structure and content?

Check out V1: https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search

Semantic Dataset Search - a Hugging Face Space by librarian-bots

You can help improve this project by rating synthetic user search queries for hub datasets. If you have a @huggingface login, you can start annotating in @argilla_io in < 5 seconds here: https://davanstrien-my-argilla.hf.space/dataset/1100a091-7f3f-4a6e-ad51-4e859abab58f/annotation-mode

Almost ready: search for a @huggingface dataset on the Hub from information in the datasets viewer preview!

Soon, you can find deep-cut datasets even if they don't have a full dataset card (you should still document your datasets!)

The @huggingface Semantic Dataset Search is back in action! Find similar datasets by ID or do a semantic search of dataset cards.

Give it a try:
https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search

Is your summer reading list still empty? Curious if an LLM can generate a book blurb you'd enjoy and help build a KTO preference dataset at the same time?

A demo using @huggingface Spaces and @gradio to collect LLM output preferences: https://huggingface.co/spaces/davanstrien/would-you-read-it

Would You Read It - a Hugging Face Space by davanstrien

SPIQA from @Google is a large-scale question-answering dataset centred on figures, tables, and text paragraphs from scientific research papers in various computer science domains.
https://huggingface.co/datasets/google/spiqa
google/spiqa · Datasets at Hugging Face

GitHub - huggingface/data-is-better-together: Let's build better datasets, together!

As part of the Multilingual Prompt Evaluation Project (MPEP), we are now automatically exporting the @argilla_io datasets to the @huggingface Hub. We have more than 15 active community-led translation efforts collaborating to enhance datasets for various languages. ❤️