My favourite trick for working with huge data sets in R. If your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million rows (15GB). Takes 3-4 mins on my laptop 🙂

#rstats #ApacheArrow

@djnavarro this still seems like black magic to me (which presumably means that arrow is way cleverer than me).
@nxskok to be fair it actually is black magic. The R package is doing weird lazy evaluation shit to make it all look normal, and the C++ code under the hood is… a lot. They are doing frightening things with Turing machines and I do not approve quite frankly
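[Editor's note: a minimal sketch of the lazy-evaluation behaviour being described. The nyc-taxi data isn't needed; this uses mtcars written to a temp directory as a stand-in, which is an assumption of this example, not part of the original thread.]

```r
library(arrow)
library(dplyr)

# toy stand-in dataset so this runs without nyc-taxi/
path <- tempfile()
write_dataset(mtcars, path)

ds <- open_dataset(path)

# this "pipeline" does no work yet: it only builds a deferred query object
q <- ds |>
  filter(cyl == 4) |>
  select(mpg, cyl)

class(q)  # an arrow_dplyr_query: a plan, not data

# nothing is pulled into R until you collect() (or write_dataset(),
# which streams the result to disk instead of into memory)
collect(q)
```

The trick in the original post is that the final step is `write_dataset()` rather than `collect()`, so the result streams batch-by-batch to disk and never has to fit in RAM.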
@djnavarro the first time I read it, I was expecting a collect() or something, and then I realised that would have brought the whole edifice (and your computer) crashing down.
@nxskok yeah collect() would absolutely cause a segfault or an error or something. You really can’t do that pipeline and return it to memory. Bad things happen
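[Editor's note: for contrast, `collect()` is fine when the result is small, e.g. after an aggregation. A hedged sketch, again using a mtcars stand-in rather than the real nyc-taxi data.]

```r
library(arrow)
library(dplyr)

# stand-in dataset (assumption of this example)
path <- tempfile()
write_dataset(mtcars, path)
ds <- open_dataset(path)

# aggregating first shrinks the result to a handful of rows,
# so bringing it into memory with collect() is safe
res <- ds |>
  group_by(cyl) |>
  summarise(n = n(), mean_mpg = mean(mpg)) |>
  collect()

res  # one row per cyl value
```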
@djnavarro and then that makes me wonder how it can be done at all. Black magic!