My favourite trick for working with huge datasets in R: even if your dataset is larger than memory, and the query result is too, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million (15GB). Takes 3-4 mins on my laptop 🙂
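(Side note for anyone trying this: because write_dataset() receives a grouped table, the output is partitioned by year and month into Hive-style directories, so later queries can skip whole partitions. A minimal sketch of querying the result back, assuming the "nyc-taxi-credit" directory from above; the filter values are just illustrative:)

```r
library(arrow)
library(dplyr)

# open_dataset() rediscovers the year/month partitions from the directory layout
credit <- open_dataset("nyc-taxi-credit")

# only matching partitions are scanned; nothing loads until collect()
credit |>
  filter(year == 2019, month == 1) |>
  count() |>
  collect()
```
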

#rstats #ApacheArrow

@djnavarro ...wow I previously thought I had big datasets. Nope. This is amazing, and cool.
@smellsofbikes it's upsetting to me that my humble laptop can casually handle data sets this big. My boss is gently nudging me to learn how to make R do this at scale on kubernetes and whatevs but frankly I am already a bit overwhelmed