I'd been reading all these blogs about how slow #pandas is compared to #polars and #duckdb and finally ran some benchmarks of my own on Parquet reads. It's true. I think there are memory leaks in the #duckdb Parquet reader, somehow related to threads. Worse in Python.

@mdfranz I have also observed a #duckdb Python script that gradually fills available memory as it repeatedly invokes method fetchmany to read the rows of a very large Parquet file that contains a timestamp column. Change this timestamp column to an integer UNIX epoch timestamp and DuckDB reads the entire Parquet file while its memory consumption remains stable.

https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyConnection.fetchmany
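
A minimal sketch of the kind of read loop described above, not the actual script from this thread. The helper names `iter_batches` and `count_parquet_rows`, the batch size, and the file path are hypothetical; the sketch also assumes that `read_parquet` accepts the path as a prepared-statement parameter in the installed DuckDB version.

```python
from typing import Any, Iterator, List, Tuple


def iter_batches(cursor: Any, batch_size: int = 10_000) -> Iterator[List[Tuple]]:
    """Yield lists of rows from a DB-API style cursor until it is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:  # fetchmany returns an empty list once no rows remain
            break
        yield rows


def count_parquet_rows(path: str, batch_size: int = 10_000) -> int:
    """Read every row of a Parquet file through DuckDB in fixed-size batches."""
    import duckdb  # imported lazily so iter_batches works without DuckDB installed

    con = duckdb.connect()  # in-memory database
    try:
        # Assumption: the path can be bound as a query parameter here.
        con.execute("SELECT * FROM read_parquet(?)", [path])
        # Only one batch of rows is held in Python at a time.
        return sum(len(rows) for rows in iter_batches(con, batch_size))
    finally:
        con.close()
```

Calling `count_parquet_rows("events.parquet")` (a placeholder path) while watching the process in `top` or a similar tool should show whether memory keeps climbing; running it once against the file with the timestamp column and once against the epoch-integer variant would reproduce the comparison described above.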

@derekmahar This is just during parquet load. 😭
@mdfranz What column types does the Parquet file contain? I observed the increasing memory consumption problem only with the simple Python script that reads all rows of the Parquet file. DuckDB CLI no longer has this problem.
matano-scripts/data/cloudtrail/duckdb at main · mdfranz/matano-scripts
@mdfranz I think you may have encountered the same memory problem with DuckDB Python API method fetchall when querying the timestamp column (ts) in CloudTrail events table aws_cloudtrail. Do you observe the same memory behaviour when you query table aws_cloudtrail using DuckDB CLI? What do you observe if you query all columns from table aws_cloudtrail except the timestamp?
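
One way to set up that comparison, as a rough sketch: build the three queries once and run each separately while watching memory. The table name `aws_cloudtrail` and column name `ts` come from this thread; the helper names `quote_ident` and `diagnostic_queries` are hypothetical. `SELECT * EXCLUDE (...)` is DuckDB SQL syntax for dropping columns from a star expression.

```python
from typing import Dict


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def diagnostic_queries(table: str = "aws_cloudtrail",
                       ts_column: str = "ts") -> Dict[str, str]:
    """Return queries that isolate the timestamp column's effect on memory."""
    t, c = quote_ident(table), quote_ident(ts_column)
    return {
        # Baseline: every column, including the timestamp.
        "all_columns": f"SELECT * FROM {t}",
        # Every column except the timestamp.
        "without_timestamp": f"SELECT * EXCLUDE ({c}) FROM {t}",
        # The timestamp column on its own.
        "timestamp_only": f"SELECT {c} FROM {t}",
    }
```

Each string can be pasted into the DuckDB CLI or passed to `con.execute(sql).fetchall()` in Python. If memory grows only for the queries that include `ts`, that points at the timestamp conversion rather than at the Parquet read itself.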