Treating SparkContext as a control tower shifts how you think about Spark: not just as an API, but as the coordinator for your entire distributed engine.

#ApacheSpark #SparkContext #distributed #systems

Ronan May 4

AQE (Adaptive Query Execution): adapts the execution plan at runtime

DPP (Dynamic Partition Pruning): reads only the partitions that are needed during a join

SPJ (Storage-Partitioned Join): avoids the shuffle by reusing the existing partitioning
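
For anyone who wants to try the three features above, here is a minimal sketch of the Spark SQL settings that usually gate them. The dict name `SPARK_OPTIMIZER_FLAGS` and the helper `apply_flags` are made up for illustration and are not from the post; defaults vary by Spark version, and SPJ also needs a DataSource V2 table (for example Iceberg) that reports its partitioning, so treat this as a starting point rather than a recipe.

```python
# Illustrative sketch only: the names below are hypothetical helpers,
# not part of Spark. The config keys themselves are Spark SQL settings.
SPARK_OPTIMIZER_FLAGS = {
    # AQE: re-optimizes the plan at runtime using shuffle statistics.
    "spark.sql.adaptive.enabled": "true",
    # DPP: prunes partitions of one join side based on the other side.
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
    # SPJ: lets V2 sources report partitioning so a join can skip the shuffle.
    "spark.sql.sources.v2.bucketing.enabled": "true",
}


def apply_flags(builder, flags=SPARK_OPTIMIZER_FLAGS):
    """Apply each setting to anything exposing .config(key, value),
    such as pyspark's SparkSession.builder, and return the builder."""
    for key, value in flags.items():
        builder = builder.config(key, value)
    return builder


# Assumed usage with pyspark installed:
#   from pyspark.sql import SparkSession
#   spark = apply_flags(SparkSession.builder.appName("demo")).getOrCreate()
```

Keeping the flags in a plain dict makes it easy to diff them between jobs or log them next to the query plan.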

#dataengineering #apachespark

Ronan May 4

https://luminousmen.com/post/the-apache-spark-optimization-checklist/ (en)

How do you optimize Apache Spark?
1. Use the DataFrame / Dataset APIs, not RDDs.
2. Filter early, filter hard.
3. Find the data skew.
4. Know AQE, DPP, and SPJ.
5. Look at the UI.
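
Item 3 is the one that tends to need a concrete check. A rough sketch in plain Python, assuming you have already pulled a sample of join keys to the driver; the function name `skew_report` and the "ratio to the median key" measure are my own choices for illustration, not something from the linked checklist. On a real DataFrame the counting step would typically be a `groupBy(key).count()` instead.

```python
from collections import Counter
from statistics import median


def skew_report(keys, top_n=3):
    """Return (key, count, ratio) for the heaviest keys, where ratio is
    the key's count divided by the median per-key count.
    Hypothetical helper for illustration."""
    counts = Counter(keys)
    if not counts:
        return []
    typical = median(counts.values())
    return [(key, n, n / typical) for key, n in counts.most_common(top_n)]


# Toy sample: one hot key and two small ones.
sample = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
for key, n, ratio in skew_report(sample):
    print(f"{key}: {n} rows, {ratio:.1f}x the median key")
```

A key sitting far above the median is a candidate for salting or for letting AQE's skew-join handling split it.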

#dataengineering #apachespark

The Apache Spark Optimization Checklist

Discover essential Apache Spark optimization tips in this comprehensive checklist distilled from real-life incidents. Learn how to enhance performance and avoid common mistakes. Subscribe for more insights!

Blog | iamluminousmen

deitel May 3

Join me Tuesday for my next Python Data Science & AI Full Throttle! https://deitel.com/PYDSFT

O'Reilly Media Pearson #deitel #python #machinelearning #deeplearning #NLP #datamining #ApacheSpark #BigData #IoT #GenAI

InfoQ Apr 8

96% fewer out-of-memory (OOM) failures!

#Pinterest shared how it improved the reliability of its #ApacheSpark workloads.

By focusing on:
✅ Enhanced observability
✅ Configuration tuning
✅ Automatic memory retries

The changes addressed persistent job failures affecting recommendation systems and large-scale data processing.

Details here ⇨ https://bit.ly/4smqrQD

#SoftwareArchitecture #BigData #CostOptimization #Memory #DistributedSystems #Observability #InfoQ

TechLİfe Mar 31

The Data Lakehouse Explained: Why Apache Iceberg Is Quietly Running the Show

https://techlife.blog/posts/data-lakehouse-iceberg

#ApacheIceberg #DataLakehouse #DataWarehouse #DataLake #Snowflake #ApacheSpark #DataEngineering

Data warehouses were expensive. Data lakes turned into swamps. Enter the Lakehouse — and the open table format that makes it actually work.

TechLife — AI, Software Engineering & Emerging Technology

Igor De Souza Mar 17

Kafka vs Flink vs Spark Streaming: What Nobody Tells You Before You Pick One

#apachekafka
#apacheflink
#apachespark

https://alper-korukcu.medium.com/kafka-vs-flink-vs-spark-streaming-what-nobody-tells-you-before-you-pick-one-aa83c26a287a

You’re comparing three things that aren’t the same thing. That’s the first problem. Kafka is a messaging backbone. Flink is a stream…

Medium

Holden Mar 4

Bellevue / Seattle area friends: I’m super stoked for next week’s Spark Community Sprint (Friday Mar 13th: spooky 👻).

If you’ve ever wanted to contribute to Apache Spark, come hang out and get your first Spark PR started with Felix Cheung, Huaxin Gao, Devin Petersohn, and myself :)

We’ll help folks find starter issues, get their dev environments set up, and walk through the contribution process.

There will be free lunch, and if enough people show up… maybe even Taco Bell for an afternoon snack*.

#ApacheSpark #OSS #hackathon #freelunch #tacofridaymaaaaybe

https://luma.com/rrfvx0ey

(* Depends on attendance)

Apache Spark™ Community Sprint · Luma

Apache Spark™ Community Sprint! Join us on March 17th (Tuesday) from 12:00-7:00 PM at the Snowflake Bellevue Office for a Spark community sprint! We'll spend…

InfoQ Mar 3

#Pinterest launched a next-gen CDC-based ingestion framework.

Using #ApacheKafka, #ApacheFlink, #ApacheSpark & #ApacheIceberg, they achieved:
• Latency cut from 24+ hours to 15 minutes
• Processing of only changed records
• Support for incremental updates & deletions
• Petabyte-scale data across 1,000+ pipelines

Win: optimized cost & efficiency!

Read the architectural deep dive on InfoQ 👉 https://bit.ly/4rMJB2H

#SoftwareArchitecture #ChangeDataCapture

arcofai Feb 10

🚀 Big Data meets AI—powered by Iceberg, Spark & LLMs

At #ArcOfAI, Pratik Patel shows how to build a real architecture that lets users query massive datasets with natural language—no dashboards, no SQL, just questions & insights.

https://www.arcofai.com/speaker/1c241471d7f04018a0da70efffd35b32

🎟️ Get tickets: https://arcofai.com

#ArtificialIntelligence #BigData #DataArchitecture #ApacheSpark #ApacheIceberg #LLM #GenAI #EventStreaming #Kafka #Flink #AIEngineering #TechLeadership