Mastodawn

Derek Mahar Apr 10, 2023

@mdfranz I think you may have encountered the same memory problem with DuckDB Python API method fetchall when querying the timestamp column (ts) in CloudTrail events table aws_cloudtrail. Do you observe the same memory behaviour when you query table aws_cloudtrail using DuckDB CLI? What do you observe if you query all columns from table aws_cloudtrail except the timestamp?


Derek Mahar Apr 10, 2023

@mdfranz What column types does the Parquet file contain? I observed the increasing memory consumption problem only with the simple Python script that reads all rows of the Parquet file. DuckDB CLI no longer has this problem.


Derek Mahar Apr 10, 2023

@rotnroll666 The #DuckDB COPY command is also very useful for converting CSV files to Parquet format and vice versa!


Derek Mahar Apr 10, 2023

@rotnroll666 I agree that #DuckDB is great for querying small or huge CSV (and Parquet) files. I also often use #csvq to query smaller CSV files.

https://mithrandie.github.io/csvq/

csvq - SQL-like query language for csv


Derek Mahar Apr 10, 2023

@mdfranz I have also observed a #duckdb Python script that gradually fills available memory as it repeatedly invokes method fetchmany to read the rows of a very large Parquet file that contains a timestamp column. Change this timestamp column to an integer UNIX epoch timestamp and DuckDB reads the entire Parquet file while its memory consumption remains stable.

https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyConnection.fetchmany

Python Client API

DuckDB is an in-process database management system focused on analytical query processing. It is designed to be easy to install and easy to use. DuckDB has no external dependencies. DuckDB has bindings for C/C++, Python and R.

DuckDB
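
A minimal sketch of the batch-reading pattern described in the post above, not the original script. To stay runnable without DuckDB or a Parquet file, it uses the standard-library sqlite3 module as a stand-in; the table name `events`, the helpers `to_epoch_seconds` and `iter_batches`, and the sample data are all hypothetical. With DuckDB, the cursor would presumably be the object returned by `duckdb.connect().execute(...)`, which also exposes `fetchmany`. The sketch shows only the loop and the integer-epoch representation; it does not reproduce the memory growth itself.

```python
import sqlite3
from datetime import datetime, timezone


def to_epoch_seconds(ts: datetime) -> int:
    """Convert a timezone-aware datetime to an integer UNIX epoch timestamp."""
    return int(ts.timestamp())


def iter_batches(cursor, batch_size=1000):
    """Yield lists of rows from repeated fetchmany calls until none remain."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Assumption: sqlite3 stands in for DuckDB and the Parquet file here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER, ts INTEGER)")

# Store timestamps as integer epoch seconds instead of a timestamp type.
start = to_epoch_seconds(datetime(2023, 4, 10, tzinfo=timezone.utc))
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, start + i) for i in range(2500)],
)

# Read the table in fixed-size batches so only one batch is held at a time.
cursor = con.execute("SELECT id, ts FROM events ORDER BY id")
batch_sizes = [len(batch) for batch in iter_batches(cursor, batch_size=1000)]
total_rows = sum(batch_sizes)
print(batch_sizes, total_rows)
```

Processing each batch inside the loop, rather than collecting all rows in a list, is what keeps the script's own memory use bounded by the batch size.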


Derek Mahar Feb 25, 2023

@kenkousen Did you happen to find the name of the Groovy library that helps writing scripts?


Derek Mahar Dec 30, 2022

@kenkousen Is Binding related to http://groovy-lang.org/groovy-dev-kit.html#process-management? This is the kind of command pipeline that is trivial to construct in Bash, but more difficult in more general purpose programming languages.

The Apache Groovy programming language - The Groovy Development Kit

Derek Mahar Dec 29, 2022

@kenkousen Ken, can you refer me to some Groovy references about invoking external processes and building command pipelines? How might I use Groovy like I use Bash?

#groovy #java #programming #scripting