Is Parquet becoming the bottleneck? Why new storage formats are emerging in 2025 (Lance, Vortex, and more)

Parquet gave data lakes a common language: columnar layout, good compression, and fast scans. That still works well for classic analytics. But workloads have changed. We now mix wide scans with point lookups, handle embeddings and images, and run on S3-first stacks. On NVMe you want lots of tiny random reads. On S3 you want fewer, larger range requests. A format tuned for one world can feel chatty or slow in the other.

Databend Cloud
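
The NVMe-versus-S3 trade-off in the preview above can be illustrated with a small sketch of range coalescing: merging nearby byte ranges so that a reader issues fewer, larger requests. This is not taken from the linked article. The function name `coalesce_ranges`, the chunk offsets, and the gap thresholds are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce_ranges(ranges: List[Range], max_gap: int) -> List[Range]:
    """Merge byte ranges whose gap to the previous range is at most max_gap bytes."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= max_gap:
                # Close enough: extend the previous range to cover this one,
                # accepting that the bytes in the gap are read and discarded.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


# Hypothetical column-chunk locations inside one file: (offset, length).
chunks = [(0, 4_096), (6_000, 4_096), (5_000_000, 8_192), (5_010_000, 8_192)]

# NVMe-style plan: no gap tolerance, so every small read stays separate.
nvme_plan = coalesce_ranges(chunks, max_gap=0)

# S3-style plan: tolerate up to 1 MB of wasted bytes to save round trips.
s3_plan = coalesce_ranges(chunks, max_gap=1_000_000)

print(len(nvme_plan), "reads on NVMe;", len(s3_plan), "range requests on S3")
```

With a zero gap tolerance the four chunks stay as four reads. With a generous tolerance they collapse into two range requests, at the cost of fetching some unused bytes.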

Guide: How to work with the Parquet format

Last year we started publishing data in the «Если быть точным» ("To Be Precise") catalog in Parquet format. It was created by engineers at Twitter and Cloudera in 2013, and today it has become the standard for storing analytical data: it is used by Google, Amazon, Netflix, and most modern data platforms. In this guide we explain how to work efficiently with Parquet data using Python.

https://habr.com/ru/articles/1013604/

#parquet #python #анализ_данных

Habr
🐒 Ah, yes, the holy grail of nerd bragging rights: a 47M+ item #archive of Hacker News, now in the culinary delight format of #Parquet for all your "data chef" needs. 🍽️ Updated every 5 minutes, because clearly, what's more riveting than a play-by-play of techies' daily musings? Oh wait, I forgot: 🥱 anything else.
https://huggingface.co/datasets/open-index/hacker-news #HackerNews #DataChef #TechieBraggingRights #DailyUpdates #HackerNews #ngated
open-index/hacker-news · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.


scrapy-contrib-bigexporters 1.1.0 released. Export data scraped with Scrapy in Parquet, Avro, ORC, or Iceberg format. Changes: CI/CD pipeline on Codeberg Actions, updated Actions, and a strict schema applied to the Arrow table if a schema is provided.

https://codeberg.org/ZuInnoTe/scrapy-contrib-bigexporters

#scrapy #python #parquet #iceberg #avro #orc #webscraping

scrapy-contrib-bigexporters

Scrapy exporter for Big Data formats

Codeberg.org
I tried loading a #Wikidata dump into #Parquet and querying it with #DuckDB: it takes less than an hour to extract all 19,939,182 entities that represent people, including subclasses of wdt:Q5.
Definitely better than my deserializer implemented in Go, which takes almost 8 hours to do the same thing.
@zulfian Thanks for bringing this application to my desktop! I'm a fan of #Parquet and #Iceberg/#Delta, and of what these formats offer over CSV/TSV, such as schemas. Anyway, here is your first issue: https://gitlab.com/zulfian1732/munquet/-/issues/1 - if you need help, just let me know, and we'll see what we can do together.
Ignore comments in CSV/TSV files by default (#1) · Issues · Zulfian / munquet · GitLab

After importing a CSV/TSV, which has comments in the first rows marked with #, Munquet tries to use these comments as column headers. Consider providing an option or...

GitLab

Munquet 0.2.1 just landed on Flathub 🚀

Fixed a small race condition when canceling a conversion — turns out the process could finish right before you clicked “Yes” 😅

Two lines later… all good.

https://flathub.org/en/apps/io.gitlab.zulfian1732.munquet

#Flatpak #GTK4 #OpenSource #Parquet #DataScience #Linux #Python #PyArrow

Install Munquet on Linux | Flathub

Convert to Parquet

Munquet 0.2.0 is now available on Flathub 🎉

✨ Display real host paths via XDG Portal
🛠 Introduced a .Devel Flatpak manifest for development builds

Continuing to improve the Linux desktop data workflow 🚀

https://flathub.org/en/apps/io.gitlab.zulfian1732.munquet

#Flathub #Flatpak #XDGPortal #GTK4 #OpenSource #Parquet #Python #DataScience

Munquet is now officially on Flathub 🎉

A native Linux app that converts datasets into Apache Parquet using a PyArrow backend. Perfect for data science workflows, analytics, and anyone needing fast local conversions.

Get it here: https://flathub.org/en/apps/io.gitlab.zulfian1732.munquet

@gnome @xfce @kde @GTK @linux @flathub

#apache #pyarrow #datascience #parquet #csv #OpenSource #Python #GNOME #GTK4 #Adwaita
