Mastodawn

JD Long ✅ Nov 27, 2022

I see lots of confusion around why Polars & DuckDB are faster than Pandas. It’s a combination of three things:

1) Polars & Duck are multithreaded, compiled libraries, while Pandas is a single-threaded mix of compiled NumPy & Python code.

2) Polars & Duck both have query optimizers that look at the whole query before running it and plan execution accordingly. Pandas just does the steps you tell it to, in the order you tell it.

3) Polars & Duck “stream” from disk, meaning they can operate on more data than will fit in RAM.
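
To make the streaming point concrete, here is a toy sketch in plain Python. It is not how Polars or DuckDB are implemented; it only contrasts loading everything first with processing one row at a time. The sample data, column names, and function names are made up for illustration, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical sample data standing in for a (much larger) file on disk.
CSV_TEXT = "city,amount\nOslo,10\nLima,5\nOslo,7\nLima,1\nOslo,3\n"


def eager_total(text, city):
    """Materialize every row first, then filter, then sum."""
    rows = list(csv.DictReader(io.StringIO(text)))  # all rows held at once
    kept = [row for row in rows if row["city"] == city]
    return sum(int(row["amount"]) for row in kept)


def streaming_total(text, city):
    """Read one row at a time and keep only a running total."""
    total = 0
    for row in csv.DictReader(io.StringIO(text)):  # rows are yielded lazily
        if row["city"] == city:
            total += int(row["amount"])
    return total


print(eager_total(CSV_TEXT, "Oslo"), streaming_total(CSV_TEXT, "Oslo"))  # prints: 20 20
```

Both functions give the same answer, but the streaming version never holds more than one parsed row plus the running total, which is the property that lets an engine work on data larger than RAM.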


Marcos Huerta

@Cmastication Polars only works with local file systems, which seems a little limiting. Won’t big data sets that could benefit from the streaming often be in cloud storage rather than on a local disk?

(DuckDB works in theory, but it depends on third-party file system libraries, such as the Azure Blob Filesystem.)