Mastodawn

Danielle Navarro Nov 20, 2022

My favourite trick for working with huge datasets in R: if your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

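library(arrow)
library(dplyr)

# Open the multi-file dataset lazily; nothing is read into memory yet
nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  # The grouping variables become the partitioning of the output
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")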

Input is 1.7 billion rows (70GB), output is 500 million (15GB). Takes 3-4 mins on my laptop 🙂

#rstats #ApacheArrow

resub Nov 20, 2022

@djnavarro wow! does the data need to be stored in parquet?

Danielle Navarro

@resub Parquet helps, but it isn't required. You could do this with CSV files if you wanted, but it's much slower: seems to take about 10x as long with CSV
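
For anyone who wants to try the CSV route, here is a minimal sketch under some assumptions. The toy data frame, the column `total_amount`, and the temporary folder names are made up for illustration and are not the real NYC taxi data. The only change to the pipeline itself is `format = "csv"` in `open_dataset()`; `write_dataset()` still writes Parquet by default.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for a folder of CSV files (not the real NYC taxi data)
csv_dir <- file.path(tempdir(), "taxi-csv")
dir.create(csv_dir, showWarnings = FALSE)
toy <- data.frame(
  year = c(2019L, 2019L, 2020L, 2020L),
  month = c(1L, 2L, 1L, 1L),
  payment_type = c("Credit card", "Cash", "Credit card", "Credit card"),
  total_amount = c(12.5, 8.0, 20.25, 7.75)
)
write.csv(toy, file.path(csv_dir, "part-0.csv"), row.names = FALSE)

# Same pipeline as in the original post, but reading CSV instead of Parquet
out_dir <- file.path(tempdir(), "taxi-credit")
open_dataset(csv_dir, format = "csv") |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset(out_dir)

# Reopen the partitioned output and count rows without collecting the data
n_rows <- nrow(open_dataset(out_dir))
```

On real data you would point `open_dataset()` at your own CSV folder and skip the toy setup; converting to Parquet once this way means later queries avoid the slower CSV reads.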