The WAP pattern in data engineering

Despite the rapid growth of data engineering, the WAP pattern has long been undeservedly overlooked. Some have heard of it but never apply it; others apply it, but only intuitively. In this article I want to use an example to describe in detail a data-handling pattern that is almost eight years old, yet in all that time not a single article has been written explaining how it works.
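Not from the article itself, just a minimal sketch of the Write-Audit-Publish flow being described, assuming a toy pandas batch, local paths, and hand-picked checks: write the new batch to a staging location, audit it there, and only publish it to the production location if every check passes.

```python
# Toy Write-Audit-Publish (WAP) flow. Paths, schema, and checks are
# illustrative; the point is that consumers only ever see audited data.
import shutil
from pathlib import Path

import pandas as pd

STAGING = Path("warehouse/staging/orders")     # invisible to consumers
PRODUCTION = Path("warehouse/prod/orders")     # what consumers query


def write(df: pd.DataFrame) -> Path:
    """WRITE: land the new batch in the staging area."""
    STAGING.mkdir(parents=True, exist_ok=True)
    path = STAGING / "batch.parquet"
    df.to_parquet(path, index=False)
    return path


def audit(path: Path) -> list[str]:
    """AUDIT: run data-quality checks against the staged batch only."""
    df = pd.read_parquet(path)
    errors = []
    if df.empty:
        errors.append("batch is empty")
    if df["order_id"].duplicated().any():
        errors.append("duplicate order_id values")
    if (df["amount"] < 0).any():
        errors.append("negative amounts")
    return errors


def publish(path: Path) -> None:
    """PUBLISH: promote the audited batch so downstream users can see it."""
    PRODUCTION.mkdir(parents=True, exist_ok=True)
    shutil.copy(path, PRODUCTION / path.name)


batch = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.2]})
staged = write(batch)
problems = audit(staged)
if problems:
    raise ValueError(f"audit failed, not publishing: {problems}")
publish(staged)
```

In real lakehouse setups the same flow is usually expressed with table-format features (for example, staging branches or snapshot swaps in Iceberg-style tables) rather than by copying files around.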

https://habr.com/ru/articles/937738/

#data_engineering #bigdata #big_data #data_warehouse #data_quality #warehouse #datalake #etl

The Russian-language part of the internet has plenty of articles on software design patterns, but I could not find any information about patterns for working with data. In this article I want to talk about the pattern...

Habr

We’re excited to partner with Greptime to teach you how to set up a fully #FOSS observability stack — complete with a Prometheus-compatible #datalake and real-time incident insights! https://t.ly/JNmvQ

#kubernetes #databases #devops #sre #freesoftware #sql #observability #ebpf #sysadmin #linux

📊 Your customer journeys are telling you something.
Are you listening or just watching clicks and opens?

Microsoft Fabric just changed the game for Customer Insights – Journeys users.
Now, every journey interaction, every click, every goal hit lives in OneLake, ready for real-time analysis.

👉 Curious, how are you analyzing journey drop-offs today?
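Not Fabric-specific, and not the only answer to the question above: here is a toy illustration of measuring drop-off once journey interactions are available as a table. The contact IDs, step names, and funnel order are invented; the same grouping logic would apply to a real OneLake interactions table via Spark or SQL.

```python
# Toy funnel / drop-off analysis over a hypothetical journey-interactions
# table. Column names and step order are made up for illustration.
import pandas as pd

interactions = pd.DataFrame(
    {
        "contact_id": [1, 1, 1, 2, 2, 3, 3, 3, 4],
        "step": [
            "email_sent", "email_opened", "link_clicked",
            "email_sent", "email_opened",
            "email_sent", "email_opened", "link_clicked",
            "email_sent",
        ],
    }
)

funnel_order = ["email_sent", "email_opened", "link_clicked", "goal_met"]

# Unique contacts reaching each step, then the drop-off versus the previous step.
reached = (
    interactions.groupby("step")["contact_id"].nunique()
    .reindex(funnel_order, fill_value=0)
)
drop_off = 1 - reached / reached.shift(1)

print(pd.DataFrame({"contacts": reached, "drop_off_vs_prev": drop_off.round(2)}))
```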

#MicrosoftFabric #CustomerInsights #PowerPlatform #Dynamics365 #MarketingAnalytics #DataLake #PowerBI #MarketingOps

http://mytrial365.com/2025/08/14/customer-insights-fabric-the-marketing-analytics-match-you-didnt-know-you-needed/

Customer Insights + Fabric: The Marketing Analytics Match You Didn’t Know You Needed

Let’s face it, marketing data is powerful, but only if you can actually use it. If you’ve ever thought, “I know there’s valuable interaction data in my customer journeys… but where is it? And…

My Trial

There's a lot of talk about "ZeroDisk" infrastructure backed by S3. The pitch is "move your data from locally attached NVMe storage to S3 and your applications will scale easier and be more performant!"

Maybe I'm getting too old for this shit, but I swear to dog this is the 4th such cycle in my career:

1. NFS
2. iSCSI / Fibrechannel
3. Hadoop / HDFS
4. ZeroDisk with S3

Am I the only one that's like: "wait, move TBs of data to S3 from NVMe to increase performance? Are you high?"

It doesn't work, so you scale up. Now you're back to local NVMe "cache disks," running instances that end up as expensive as the locally attached NVMe instances once you add those costs to your S3 bill. The performance is worse, because of course it is.
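For what it's worth, the latency gap being described is easy to eyeball. A rough sketch rather than a benchmark: time a repeated local read against a repeated S3 GET with boto3. The bucket, key, and local path below are placeholders, and a fair comparison would also control for cold caches, object sizes, and concurrency.

```python
# Rough latency comparison: local NVMe file read vs. S3 GET.
# BUCKET, KEY, and LOCAL_PATH are placeholders; this is a sanity check,
# not a benchmark (the local read will mostly hit the page cache).
import time

import boto3

BUCKET = "my-bucket"                           # placeholder
KEY = "data/part-00000.parquet"                # placeholder
LOCAL_PATH = "/nvme/data/part-00000.parquet"   # placeholder

s3 = boto3.client("s3")


def avg_local(n: int = 10) -> float:
    start = time.perf_counter()
    for _ in range(n):
        with open(LOCAL_PATH, "rb") as f:
            f.read()
    return (time.perf_counter() - start) / n


def avg_s3(n: int = 10) -> float:
    start = time.perf_counter()
    for _ in range(n):
        s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    return (time.perf_counter() - start) / n


print(f"local avg: {avg_local() * 1000:.1f} ms")
print(f"s3 avg:    {avg_s3() * 1000:.1f} ms")
```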

It always comes back to the two hard problems in computer science: naming things, cache invalidation, and off-by-one errors. 😂

#zerodisk #s3 #hadoop #cache #datalake #GetOffMyLawn

It is not possible to eliminate the risk of failures, but it is possible to mitigate them by making failures explainable, detectable, and manageable. https://hackernoon.com/diving-deep-into-data-lake-observability-why-it-matters-more-than-ever #datalake
Diving Deep Into Data Lake Observability: Why It Matters More Than Ever | HackerNoon

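As a concrete, and entirely generic, illustration of the "detectable" part (not taken from the linked HackerNoon piece): a tiny health-check sketch that computes a few table-level signals for a Parquet dataset and flags anything outside hand-picked thresholds. The path, timestamp column, and thresholds are assumptions.

```python
# Minimal data-lake health check: compute a few observability signals for
# one Parquet dataset and flag anomalies. Path, column, and thresholds are
# illustrative only.
import pandas as pd

DATASET = "lake/events.parquet"    # placeholder path
TS_COLUMN = "event_time"           # placeholder, assumed tz-aware UTC timestamps
MIN_ROWS = 1_000
MAX_NULL_RATE = 0.05
MAX_STALENESS = pd.Timedelta(hours=6)

df = pd.read_parquet(DATASET)
signals = {
    "row_count": len(df),
    "null_rate": float(df.isna().mean().mean()),              # overall null fraction
    "staleness": pd.Timestamp.now(tz="UTC") - df[TS_COLUMN].max(),
}

alerts = []
if signals["row_count"] < MIN_ROWS:
    alerts.append(f"row count dropped to {signals['row_count']}")
if signals["null_rate"] > MAX_NULL_RATE:
    alerts.append(f"null rate {signals['null_rate']:.1%} above threshold")
if signals["staleness"] > MAX_STALENESS:
    alerts.append(f"freshest record is {signals['staleness']} old")

print(signals)
for alert in alerts:
    print("ALERT:", alert)    # in practice: push to a metrics store or pager
```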

Simplified #metadata definition with the Data Catalog Schema Wizard

Data Fabric Cheat Sheet: #DataFabric #DataLake #InforOS. source

https://quadexcel.com/wp/simplified-metadata-definition-with-the-data-catalog-schema-wizard/

Simplified #metadata definition with the Data Catalog Schema Wizard - QuadExcel.com


QuadExcel.com

⬆️ Data volumes continue to rise. In fact, within industries like #engineering and #finance, the volume and volatility of log data have even outpaced the capacity of traditional #SIEM and analytics tools. 😰 What this means is... with orgs facing high costs and fatigue, the ones that thrive will be the ones that treat storage and retrieval as distinct functions. 🤔

This is where selective retrieval comes in—the ability to triage, park, and later selectively ingest high-volume data from a centralized repository for forensic or compliance-driven investigation. 🙌
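In code terms, the "later selectively ingest" half of that is just a narrow, filtered read against the parked data instead of a full re-ingest. A generic sketch, assuming the logs were parked as hive-partitioned Parquet under an S3 prefix; the bucket, prefix, partition key, and column names are all made up.

```python
# "Park now, retrieve selectively later": rehydrate only the slice of
# parked log data an investigation needs, instead of re-ingesting it all.
# Bucket, prefix, partition key, and column names are assumptions.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs as pafs

s3 = pafs.S3FileSystem(region="us-east-1")   # placeholder region

# Logs parked as hive-partitioned Parquet, e.g. .../date=2025-07-01/part-0.parquet
parked = ds.dataset(
    "my-log-archive/firewall/",              # placeholder bucket/prefix
    filesystem=s3,
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
)

# Pull back three days of traffic for one suspicious host; the filter is
# pushed down, so partitions outside the window are never read.
investigation = parked.to_table(
    columns=["ts", "src_ip", "dst_ip", "action"],
    filter=(
        (ds.field("date") >= "2025-07-01")
        & (ds.field("date") <= "2025-07-03")
        & (ds.field("src_ip") == "10.0.0.42")
    ),
)
print(investigation.num_rows, "rows rehydrated for the investigation")
```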

Read this excellent article by #Graylog's Adam Abernethy in BigDATAwire to learn about:
🌏 Selective retrieval examples in the real world
⚠️ Risk coverage without always-on cost
🔒 Flexibility without architectural lock-in
💻 The technological shifts that are converging to make selective retrieval possible and necessary
↔️ How selective retrieval bridges the gap between data engineering complexity and #security usability
💼 The business case for selective retrieval, especially for mid-size IT teams
🛂 Regaining control over data sprawl
➕ More

https://www.bigdatawire.com/2025/07/14/rethinking-risk-the-role-of-selective-retrieval-in-data-lake-strategies/ #datalake #logdata #datamanagement @bigabe @bigdatawirenews

New project alert! Comparqter, a tool that compacts Parquet files and optimises file sizes.

https://codeberg.org/unticks/comparqter

#rust #parquet #s3 #datalake

comparqter

A small tool to compact Parquet files in an S3 bucket.

Codeberg.org
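comparqter itself is written in Rust and its internals may differ; purely to illustrate what "compacting Parquet files" means, here is a generic pyarrow sketch that reads a directory of many small files and rewrites them as one larger file with big row groups. Paths and the row-group size are arbitrary.

```python
# Generic Parquet compaction sketch (illustrative, unrelated to comparqter's
# implementation): merge many small files into one larger file so scans
# open fewer objects and read less footer metadata.
from pathlib import Path

import pyarrow.dataset as ds
import pyarrow.parquet as pq

SOURCE_DIR = "lake/events/"                               # many small files (placeholder)
COMPACTED_FILE = "lake/events_compacted/part-0.parquet"   # placeholder output

small_files = ds.dataset(SOURCE_DIR, format="parquet")
table = small_files.to_table()    # fine while the dataset fits in memory

Path(COMPACTED_FILE).parent.mkdir(parents=True, exist_ok=True)
pq.write_table(table, COMPACTED_FILE, row_group_size=1_000_000)

print(f"compacted {len(small_files.files)} files into 1 ({table.num_rows} rows)")
```

A streaming, S3-aware version would process partition by partition rather than loading everything into memory at once.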

🎉 Huge thanks to LanceDB CEO and cofounder Chang She for delivering an incredible talk on "Search, Retrieval, Training, and Analytics with Modern AI Data Lake" at the #DataAndAIEngineering #SanFrancisco #meetup!

📹 Great news - the recording is now available! Check it out if you missed it or want to revisit the key concepts. 👇

https://watch.softinio.com/w/mVkLgtcQw8Qv5vA4v8SDHB

#DataEngineering #AIEngineering #SanFrancisco #LanceDB #DataLake #MachineLearning #VectorDB #Database #AI #ArtificialIntelligence

Search, Retrieval, Training, and Analytics with Modern AI Data Lake By Chang She

PeerTube