Derek Mahar

Software developer in Montreal, Quebec, Canada. Interested in programming languages, data science, machine learning, cryptocurrency, and finance.
@mdfranz I think you may have encountered the same memory problem with DuckDB Python API method fetchall when querying the timestamp column (ts) in CloudTrail events table aws_cloudtrail. Do you observe the same memory behaviour when you query table aws_cloudtrail using DuckDB CLI? What do you observe if you query all columns from table aws_cloudtrail except the timestamp?
@mdfranz What column types does the Parquet file contain? I observed the increasing memory consumption problem only with the simple Python script that reads all rows of the Parquet file. DuckDB CLI no longer has this problem.
@rotnroll666 The #DuckDB COPY command is also very useful for converting CSV files to Parquet format and vice versa!

@rotnroll666 I agree that #DuckDB is great for querying small or huge CSV (and Parquet) files. I also often use #csvq to query smaller CSV files.

https://mithrandie.github.io/csvq/

@mdfranz I have also observed a #duckdb Python script that gradually fills available memory as it repeatedly invokes method fetchmany to read the rows of a very large Parquet file that contains a timestamp column. Change this timestamp column to an integer UNIX epoch timestamp and DuckDB reads the entire Parquet file while its memory consumption remains stable.

https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyConnection.fetchmany

@kenkousen Did you happen to find the name of the Groovy library that helps writing scripts?
@kenkousen Is Binding related to the process management support in the Groovy Development Kit (http://groovy-lang.org/groovy-dev-kit.html#process-management)? This is the kind of command pipeline that is trivial to construct in Bash, but more difficult in general-purpose programming languages.

@kenkousen Ken, can you refer me to some Groovy references about invoking external processes and building command pipelines? How might I use Groovy like I use Bash?

#groovy #java #programming #scripting