Okay, this is kind of absurd, but I guess a learning experience. Merging lots of data in #Python using #PythonPandas, and a couple hundred frames in, it slows to a crawl. 15+ min and I abort.

Instead, I do batch processing of a hundred files apiece, then merge the resulting Parquet files. Less than five minutes.

#TIL

This wasn't a thorough debug, but it appears that working with hundreds of DataFrames during appends creates a lot of memory overhead by default, regardless of the row count. The issue doesn't occur if you work with a few frames, even if they have more rows.
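If the overhead is what I suspect, it's the classic pandas pitfall: appending in a loop copies the accumulated frame on every iteration, so cost grows with the number of frames, not just the rows. A toy comparison (sizes are made up for illustration):

```python
import pandas as pd

frames = [pd.DataFrame({"x": range(1000)}) for _ in range(200)]

# slow pattern: each iteration copies everything merged so far
merged = frames[0]
for f in frames[1:]:
    merged = pd.concat([merged, f], ignore_index=True)

# fast pattern: a single concat copies each row only once
merged_fast = pd.concat(frames, ignore_index=True)
```

Both produce the same result; the loop just does vastly more copying as the frame count grows.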

Go figure. Or maybe the switch to Parquet has side effects of its own. In the end it wasn't *too hard* to work around this.