An ELT process in a Data Lakehouse architecture on an open-source stack (Kafka, Dagster, S3+Iceberg, Trino, ClickHouse, and dbt)

One of the largest construction holdings in Russia (a group of 10+ legal entities) approached us needing to collect data from all of its branches, analyze it, and visualize it on dashboards. At the start of the project the company had almost no analytics infrastructure: many operational systems, but no centralized data warehouse. The project's scope was unclear; as the saying goes, "appetite comes with eating." An important constraint of the project is a fully air-gapped environment, accessible only through terminal solutions. We chose a Data Lakehouse architecture on an open-source stack built around Kafka, Dagster, S3+Iceberg, Trino, ClickHouse, and dbt. The result so far is more than 1,000 dbt models and 1 TB of compressed data, with the volume still growing. Data consumers include business systems, Power BI reports, analysts and data engineers, web applications, and MDX cubes. The project is run with Scrum, a team of 11 DWH engineers, and greenfield development. A sketch of how such an orchestration layer could be wired up follows below.

https://habr.com/ru/articles/931282/

#dbt #yml_file #datalakehouse #data_engineering #etl_processes #open_source #trino #clickhouse #dagster
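
Below is a minimal sketch of how an ingestion-plus-transformation pipeline on this stack could be wired together with dagster-dbt. All project, asset, and path names are hypothetical; the article itself does not show code.

```python
# Hypothetical sketch: Kafka -> S3/Iceberg ingestion plus the dbt project,
# orchestrated as Dagster assets. Names and paths are made up.
from pathlib import Path

from dagster import AssetExecutionContext, Definitions, asset
from dagster_dbt import DbtCliResource, DbtProject, dbt_assets

dbt_project = DbtProject(project_dir=Path("analytics_dbt"))
dbt_project.prepare_if_dev()  # generates manifest.json during local dev

@asset
def raw_events_iceberg() -> None:
    """Consume Kafka topics and append them to S3/Iceberg staging tables."""
    ...  # e.g. a Kafka consumer writing batches registered in an Iceberg catalog

@dbt_assets(manifest=dbt_project.manifest_path)
def analytics_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Every dbt model (1,000+ in the article's case) becomes a Dagster asset
    # downstream of the raw Iceberg layer.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[raw_events_iceberg, analytics_models],
    resources={"dbt": DbtCliResource(project_dir=dbt_project)},
)
```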

Oh, look! Another magical Python-based data lakehouse 🏠🐍 that promises to solve all your problems by adopting #Bauplan and #marimo. Because clearly, the solution to data workflow woes is yet another tool even fewer people will bother to use, all wrapped up in a blog post dripping with empty jargon. 🎉 Good luck getting those models past the sandbox, nerds! 🚀
https://www.bauplanlabs.com/blog/everything-as-python #Python #DataLakehouse #DataWorkflow #TechJargon #HackerNews #ngated
Everything-as-Python — bauplan

Run AI models, data transformation pipelines, and real-time analytics on your data lake with a self-optimizing, serverless runtime. No infrastructure overhead—just Python.


Not that this is much of a business network here, but is anyone else by chance at TechShow Frankfurt today, or at Big Data & AI World?

Anyone from the @OSBA?

#DataEngineering #databricks #dremio #Stackit #messefrankfurt #DataLakehouse

Attended the Brewing Data with Snowflake event yesterday in Vilnius.

Some of the key insights:

  • Medallion Architecture (good or bad) is widespread.
  • Snowflake and Databricks are clear competitors, targeting a similar landscape.
  • Open formats are trending: file format, table format, catalog, etc. - the more of them that are open source, the better.
  • Time travel is an important feature; many users have already used it for disaster recovery (see the sketch after this list).
  • Clear separation of storage from compute (the generic cloud approach).
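
On the time-travel point, here is a toy disaster-recovery flow, assuming the snowflake-connector-python client; connection parameters and table names are made up.

```python
# Read a table as it was one hour ago with Snowflake Time Travel,
# then restore from that historical state. All names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="ANALYTICS_WH", database="DW", schema="PUBLIC",
)
cur = conn.cursor()

# Query the historical state (OFFSET is in seconds, relative to now).
cur.execute("SELECT COUNT(*) FROM orders AT(OFFSET => -60*60)")
print(cur.fetchone())

# Recover from an accidental delete/corruption by materializing that state.
cur.execute(
    "CREATE OR REPLACE TABLE orders_recovered AS "
    "SELECT * FROM orders AT(OFFSET => -60*60)"
)
```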

Full text of one of the slides presented:

Strategic Architecture Outlook

  • Agility & Future-Proofing - Open, portable data means you can adopt new technologies or switch platforms with minimal friction. No single vendor can hold your data hostage, so you can evolve your architecture as needed.
  • Multi-Cloud and Hybrid - An open data layer can span clouds and on-prem seamlessly. You avoid cloud vendor lock-in and leverage best-of-breed services on different clouds using the same data. This flexibility is key for resilience and optimization.
  • Accelerating Innovation - When any team can access data with the tools of their choice, experimentation flourishes. Open data fosters AI/ML and cross-domain analytics since data isn't locked in silos - more innovation and insights from the same data.
  • Vendor Leverage - Strategically, using open standards increases your leverage in vendor negotiations. You can opt in or out of services more freely, pushing vendors to provide value (since you're not irreversibly locked to them).

#data #datalake #datalakehouse #medallion #architecture #snowflake #vilnius #lithuania #bigdata #event #meetup

👇 𝐑𝐞𝐜𝐚𝐩
✅ Lakehouse = flexibility + reliability
✅ Delta Lake & Iceberg solve schema/transaction headaches
✅ Pilot, migrate, optimize!
#𝐃𝐚𝐭𝐚𝐋𝐚𝐤𝐞𝐡𝐨𝐮𝐬𝐞 #𝐃𝐚𝐭𝐚𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 #𝐁𝐢𝐠𝐃𝐚𝐭𝐚 #𝐌𝐨𝐝𝐞𝐫𝐧𝐃𝐚𝐭𝐚𝐒𝐭𝐚𝐜𝐤
(7/7)
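
On the "schema/transaction headaches" point: a toy illustration of what an open table format adds on top of raw Parquet, here using the deltalake (delta-rs) Python package with a made-up local path.

```python
# ACID appends, schema evolution, and versioned reads on plain files.
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# Version 0: an atomic commit of the initial batch.
v0 = pa.table({"id": [1, 2], "amount": [10.0, 20.0]})
write_deltalake("/tmp/demo_table", v0)

# Version 1: a batch with a new column; schema evolution merges it in.
v1 = pa.table({"id": [3], "amount": [30.0], "region": ["EU"]})
write_deltalake("/tmp/demo_table", v1, mode="append", schema_mode="merge")

print(DeltaTable("/tmp/demo_table").version())               # -> 1
print(DeltaTable("/tmp/demo_table", version=0).to_pandas())  # time travel
```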

With Zero-ETL there is no need to move data: latency is minimised, and data can be transformed and analysed within a single platform. A toy sketch of the idea follows below.
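
As a small illustration (with DuckDB standing in for the platform and a hypothetical bucket), you can aggregate Parquet files in place on S3 without staging them anywhere first:

```python
# Zero-ETL in miniature: query data where it lives instead of copying it.
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # S3 support; credentials setup omitted here
con.sql("LOAD httpfs")

top_customers = con.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM read_parquet('s3://my-bucket/sales/*.parquet')
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
print(top_customers)
```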

Let me know what you know about Zero-ETL  

"Why ETL-Zero? Understanding the Shift in Data Integration" by Sarah Lea on Medium: https://medium.com/towards-data-science/why-etl-zero-understanding-the-shift-in-data-integration-as-a-beginner-d0cefa244154

#python #datalake #cloudcomputing #etl #zeroetl #salesforce #data #tech #technology #datawarehousing #datalakehouse

Why ETL-Zero? Understanding the Shift in Data Integration

When I was preparing for the Salesforce Data Cloud certification, I came across the term Zero-ETL. The Data Cloud offers the possibility to access data directly from other systems such as data…

Towards Data Science
The house at the lake #3 - The Dashboard Diaries

When to use Apache XTable or Delta Lake Uniform for Data Lakehouse Interoperability

The value of the lakehouse model, along with the concept of “shifting left” by moving more data modeling and processing from the data warehouse to the data lake, has seen significant buy-in and…

Data, Analytics & AI with Dremio
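
For reference, the Delta Lake UniForm half of that choice is a table property rather than a separate sync service. A hedged sketch in PySpark, with a made-up table name and assuming a Delta-enabled Spark session:

```python
# A Delta table that also writes Iceberg metadata (UniForm), so Iceberg
# readers such as Trino or Dremio can query it without conversion.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("uniform-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE sales (id BIGINT, amount DOUBLE)
    USING DELTA
    TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```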