Mastodawn

Lachlan Coin

@[email protected]

225 Followers

289 Following

2 Posts

Researcher in genomic medicine, statistics and big data

ORCID

https://orcid.org/0000-0002-4300-455X

Lachlan Coin Nov 28, 2022

There were so many great abstracts submitted to #ABACBS2022. Looking forward to the talks and posters starting tomorrow.

Lachlan Coin Nov 20, 2022

Danielle Navarro Nov 20, 2022

My favourite trick for working with huge data sets in R. If your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

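library(arrow)
library(dplyr)

# Open the multi-file dataset lazily; nothing is read into memory yet
nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>  # grouping columns become the output partitions
  write_dataset("nyc-taxi-credit")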

Input is 1.7 billion rows (70GB), output is 500 million rows (15GB). Takes 3-4 mins on my laptop 🙂

#rstats #ApacheArrow
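
For anyone without the taxi data to hand, here is a minimal sketch of the same pattern on a toy table. The data frame, its `fare` column, and the temp directory names are made up for illustration and are not from the original post. It assumes the arrow and dplyr packages are installed. The idea, as far as I understand it, is that the pipeline stays lazy until write_dataset() runs, and the group_by() columns are used as the partitioning for the files written to disk.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the much larger taxi dataset
toy <- data.frame(
  year = c(2019L, 2019L, 2020L, 2020L),
  month = c(1L, 2L, 1L, 1L),
  payment_type = c("Credit card", "Cash", "Credit card", "Credit card"),
  fare = c(10.5, 7.0, 12.0, 3.25)
)

# Fresh temporary directories (names are illustrative)
in_dir <- tempfile("toy-taxi-")
out_dir <- tempfile("toy-taxi-credit-")

# Write the toy table to disk so it can be opened as an Arrow dataset
write_dataset(toy, in_dir)

# Same shape as the pipeline in the post: filter, group, write
open_dataset(in_dir) |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset(out_dir)

# Inspect the partition layout that was written
partition_files <- list.files(out_dir, recursive = TRUE)
print(partition_files)

# Read the filtered dataset back; small enough here to collect into memory
result <- open_dataset(out_dir) |> collect()
print(result)
```

On the real data you would skip the collect() step, or put further filter() and summarise() calls in front of it, since the filtered output may itself be larger than memory.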