Lachlan Coin

Researcher in genomic medicine, statistics and big data
ORCID: https://orcid.org/0000-0002-4300-455X
There were so many great abstracts submitted to #ABACBS2022. Looking forward to the talks and posters starting tomorrow.

My favourite trick for working with huge datasets in R: if your dataset is larger than memory and the query result is also larger than memory, you can still use a dplyr/arrow pipeline and write the result straight to disk. Example:

library(arrow)
library(dplyr)

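# Open the on-disk dataset lazily; nothing is read into memory yet
nyc_taxi <- open_dataset("nyc-taxi/")

nyc_taxi |>
  filter(payment_type == "Credit card") |>
  # The grouping variables become the partitioning of the output
  group_by(year, month) |>
  # Write the result to disk instead of collecting it into R
  write_dataset("nyc-taxi-credit")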

Input is 1.7 billion rows (70 GB), output is 500 million rows (15 GB). Takes 3-4 mins on my laptop 🙂
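
For anyone without the taxi data to hand, the same pattern can be tried on a toy scale. This is only a sketch: it assumes the built-in mtcars data as a stand-in, and the directory names under tempdir() are made up for illustration.

```r
library(arrow)
library(dplyr)

# Hypothetical scratch locations; any writable directories would do
src <- file.path(tempdir(), "mtcars-src")
out <- file.path(tempdir(), "mtcars-out")

# Write the stand-in data to disk so it can be opened as a dataset
write_dataset(mtcars, src)

# Same shape as the taxi pipeline: filter, group, write straight to disk
open_dataset(src) |>
  filter(am == 1) |>
  group_by(cyl, gear) |>
  write_dataset(out)

# The grouping variables show up as key=value directories in the output
list.files(out, recursive = TRUE)

# The result can be reopened lazily and queried like the original
open_dataset(out) |>
  count(cyl) |>
  collect()
```

The partitioned output is itself a dataset, so a later query that filters on the partition columns only needs to touch the matching directories.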

#rstats #ApacheArrow