There were so many great abstracts submitted to #ABACBS2022 .. looking forward to the talks and posters starting tomorrow
My favourite trick for working with huge datasets in R: if your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines, because arrow streams the data from disk to disk instead of loading it all into RAM. Example:
library(arrow)
library(dplyr)

# Lazily scan the on-disk dataset; nothing is read into memory yet
nyc_taxi <- open_dataset("nyc-taxi/")

nyc_taxi |>
  filter(payment_type == "Credit card") |>  # filter is pushed down to the scan
  group_by(year, month) |>                  # groups become the output partitioning
  write_dataset("nyc-taxi-credit")          # results stream straight to disk
Input is 1.7 billion rows (70GB), output is 500 million rows (15GB). Takes 3-4 mins on my laptop 🙂
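The partitioned output can then be queried in the same lazy way. A minimal sketch, assuming the year/month columns from the pipeline above (the 2019 filter value is just an illustration):

# Open the dataset written above; year/month were turned into partition keys
credit <- open_dataset("nyc-taxi-credit")

credit |>
  filter(year == 2019) |>
  summarise(n_rows = n()) |>
  collect()  # only the small summary result is pulled into memory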