My favourite trick for working with huge datasets in R: even if your dataset is larger than memory, and the query result is too, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million (15GB). Takes 3-4 mins on my laptop 🙂
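(Side note for anyone trying this: because write_dataset() receives a grouped table, the output is partitioned by year and month into Hive-style directories, so later queries can skip whole partitions. A minimal sketch of querying the result back, assuming the "nyc-taxi-credit" directory from above; the filter values are just illustrative:)

```r
library(arrow)
library(dplyr)

# open_dataset() rediscovers the year/month partitions from the directory layout
credit <- open_dataset("nyc-taxi-credit")

# only matching partitions are scanned; nothing loads until collect()
credit |>
  filter(year == 2019, month == 1) |>
  count() |>
  collect()
```
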

#rstats #ApacheArrow

@djnavarro ...wow I previously thought I had big datasets. Nope. This is amazing, and cool.
@smellsofbikes it's upsetting to me that my humble laptop can casually handle data sets this big. My boss is gently nudging me to learn how to make R do this at scale on kubernetes and whatevs but frankly I am already a bit overwhelmed