My favourite trick for working with huge data sets in R. If your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million (15GB). Takes 3-4 mins on my laptop 🙂
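The output can then be queried the same way, still without pulling it into memory. A minimal sketch, assuming the "nyc-taxi-credit" directory written above and its year/month columns — only the small summary is materialised in R:

```r
library(arrow)
library(dplyr)

# Open the partitioned output lazily; no data is read into memory yet
credit <- open_dataset("nyc-taxi-credit")

# The query runs against the files on disk; collect() pulls back
# just the aggregated result
credit |>
  group_by(year, month) |>
  summarise(n_rides = n()) |>
  collect()
```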

#rstats #ApacheArrow

@djnavarro Whatttt?! That's amazing! 😃. Never knew that! 😊
@JoranJongerling Yeah, it's not well documented on the R package website. If you dig into the documentation for the underlying C++ library, it describes the backpressure feature that ensures the Dataset reader doesn't outpace the writer. I'm in the process of writing some PRs to make some of this more obvious in the vignettes, but it's a work in progress! 😁

@djnavarro Aaaaah, that's how it works!! 😃. Very clever trick 😃. And very cool 😊.

And looking forward to reading the new vignettes! Thanks in advance for making/updating them 😊

@JoranJongerling @djnavarro yeah, I didn't know this. That's a really neat trick. Nice to know this instead of trying to figure out some of the more esoteric "on disk" things.