My favourite trick for working with huge datasets in R: even if your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million (15GB). Takes 3-4 mins on my laptop 🙂

#rstats #ApacheArrow

@djnavarro That's impressive! I think the h2o benchmarks need to be updated for the 50GB range: https://h2oai.github.io/db-benchmark/

Question: what's the function of the grouping before export?


@psanker Yeah those benchmarks are a little old

(Don't get me wrong though, I certainly don't think arrow is always the fastest or best solution, it just happens to be the one I know best)

The grouping in that context is used to structure the output: instead of writing everything to one huge Parquet file, the output is split across multiple files, each one containing 1 month of data.
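
A small sketch of the same idea, assuming the dataset path and column names from the post above: arrow's `write_dataset()` also accepts an explicit `partitioning` argument, so you can get the same one-file-per-month layout without a `group_by()` step.

```r
library(arrow)
library(dplyr)

# Equivalent pipeline using the partitioning argument instead of group_by()
open_dataset("nyc-taxi/") |>
  filter(payment_type == "Credit card") |>
  write_dataset("nyc-taxi-credit", partitioning = c("year", "month"))

# The output is a Hive-style directory tree, one subdirectory per partition,
# e.g. (illustrative paths):
# nyc-taxi-credit/year=2019/month=1/part-0.parquet
# nyc-taxi-credit/year=2019/month=2/part-0.parquet
# ...
```

Because each partition lands in its own file, later queries that filter on `year` or `month` can skip irrelevant files entirely.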