My favourite trick for working with huge data sets in R. If your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million (15GB). Takes 3-4 mins on my laptop 🙂
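The output can then be queried the same way, still without pulling it into memory. A minimal sketch, assuming the "nyc-taxi-credit" directory written above and its year/month columns — only the small summary is materialised in R:

```r
library(arrow)
library(dplyr)

# Open the partitioned output lazily; no data is read into memory yet
credit <- open_dataset("nyc-taxi-credit")

# The query runs against the files on disk; collect() pulls back
# just the aggregated result
credit |>
  group_by(year, month) |>
  summarise(n_rides = n()) |>
  collect()
```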

#rstats #ApacheArrow

@djnavarro Whatttt?! That's amazing! 😃. Never knew that! 😊
@JoranJongerling Yeah, it's not well documented on the R package website. If you dig into the documentation for the underlying C++ library, it describes the backpressure feature that ensures the Dataset reader doesn't outpace the writer. I'm in the process of writing some PRs to make some of this more obvious in the vignettes, but it's a work in progress! 😁

@djnavarro Aaaaah, that's how it works!! 😃. Very clever trick 😃. And very cool 😊.

And looking forward to reading the new vignettes! Thanks in advance for making/updating them 😊

@JoranJongerling @djnavarro yeah, I didn't know this. That's a really neat trick. Nice to know this instead of trying to figure out some of the more esoteric "on disk" things.