My favourite trick for working with huge data sets in R. If your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million rows (15GB). Takes 3-4 mins on my laptop 🙂

#rstats #ApacheArrow

@djnavarro this still seems like black magic to me (which presumably means that arrow is way cleverer than me).
@nxskok to be fair it actually is black magic. The R package is doing weird lazy evaluation shit to make it all look normal, and the C++ code under the hood is… a lot. They are doing frightening things with Turing machines and I do not approve quite frankly
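[Editor's note: a minimal sketch of the lazy-evaluation behaviour being described. The nyc-taxi data isn't needed; this uses mtcars written to a temp directory as a stand-in, which is an assumption of this example, not part of the original thread.]

```r
library(arrow)
library(dplyr)

# toy stand-in dataset so this runs without nyc-taxi/
path <- tempfile()
write_dataset(mtcars, path)

ds <- open_dataset(path)

# this "pipeline" does no work yet: it only builds a deferred query object
q <- ds |>
  filter(cyl == 4) |>
  select(mpg, cyl)

class(q)  # an arrow_dplyr_query: a plan, not data

# nothing is pulled into R until you collect() (or write_dataset(),
# which streams the result to disk instead of into memory)
collect(q)
```

The trick in the original post is that the final step is `write_dataset()` rather than `collect()`, so the result streams batch-by-batch to disk and never has to fit in RAM.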
@djnavarro the first time I read it, I was expecting a collect() or something, and then I realised that would have brought the whole edifice (and your computer) crashing down.
@nxskok yeah collect() would absolutely cause a segfault or an error or something. You really can’t do that pipeline and return it to memory. Bad things happen
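[Editor's note: for contrast, `collect()` is fine when the result is small, e.g. after an aggregation. A hedged sketch, again using a mtcars stand-in rather than the real nyc-taxi data.]

```r
library(arrow)
library(dplyr)

# stand-in dataset (assumption of this example)
path <- tempfile()
write_dataset(mtcars, path)
ds <- open_dataset(path)

# aggregating first shrinks the result to a handful of rows,
# so bringing it into memory with collect() is safe
res <- ds |>
  group_by(cyl) |>
  summarise(n = n(), mean_mpg = mean(mpg)) |>
  collect()

res  # one row per cyl value
```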
@djnavarro and then that makes me wonder how it can be done at all. Black magic!