Mastodawn

Matt Franz Apr 8, 2023

I'd been reading all these blogs about how slow #pandas is compared to #polars and #duckdb, and finally ran some benchmarks of my own on Parquet reads. It's true. I think there are memory leaks in the #duckdb Parquet reader, somehow related to threads. Worse in Python.

Derek Mahar Apr 10, 2023

@mdfranz I have also observed a #duckdb Python script that gradually fills available memory as it repeatedly invokes method fetchmany to read the rows of a very large Parquet file that contains a timestamp column. Change this timestamp column to an integer UNIX epoch timestamp and DuckDB reads the entire Parquet file while its memory consumption remains stable.

https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyConnection.fetchmany

Matt Franz

@derekmahar This is just during parquet load. 😭

Derek Mahar Apr 10, 2023

@mdfranz What column types does the Parquet file contain? I observed the increasing memory consumption problem only with the simple Python script that reads all rows of the Parquet file. DuckDB CLI no longer has this problem.

Matt Franz Apr 10, 2023

@derekmahar See https://github.com/mdfranz/matano-scripts/tree/main/data/cloudtrail/duckdb

Derek Mahar Apr 10, 2023

@mdfranz I think you may have encountered the same memory problem with DuckDB Python API method fetchall when querying the timestamp column (ts) in CloudTrail events table aws_cloudtrail. Do you observe the same memory behaviour when you query table aws_cloudtrail using DuckDB CLI? What do you observe if you query all columns from table aws_cloudtrail except the timestamp?