My favourite trick for working with huge data sets in R. If your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million rows (15GB). Takes 3-4 mins on my laptop 🙂

#rstats #ApacheArrow

@djnavarro Whatttt?! That's amazing! ๐Ÿ˜ƒ. Never knew that! ๐Ÿ˜Š
@JoranJongerling Yeah, it's not well documented on the R package website. If you dig into the documentation of the underlying C++ library it talks about the backpressure feature that ensures the Dataset reader doesn't outpace the writer. I'm in the process of writing some PRs to make some of this more obvious in the vignettes, but it's a work in progress! ๐Ÿ˜

@djnavarro Aaaaah, that's how it works!! 😃. Very clever trick 😃. And very cool 😊.

And looking forward to reading the new vignettes! Thanks in advance for making/updating them 😊

@JoranJongerling @djnavarro yeah, I didn't know this. That's a really neat trick. Nice to know this instead of trying to figure out some of the more esoteric "on disk" things.
@djnavarro wow! does the data need to be stored in parquet?
@resub Parquet helps, but it isn't required. You could do this with CSV files if you wanted, but it's much slower: seems to take about 10x as long with CSV
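For anyone wanting to try the CSV route, a minimal sketch (directory name hypothetical; the only change is telling open_dataset() what format the files are in):

```r
library(arrow)
library(dplyr)

# Same streaming pipeline, but pointed at a directory of CSV files.
# format = "csv" tells open_dataset() not to assume Parquet
nyc_taxi_csv <- open_dataset("nyc-taxi-csv/", format = "csv")

nyc_taxi_csv |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")
```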
@djnavarro really need to learn arrow! I feel like I've incorrectly internalized, even loving R, that it “isn't for big data”. I ended up on python when my first real use cases came and learned dask. Would open so much back up to escape pandas syntax with huge datasets.

@ryderdavid One really nice thing, if you decide to learn arrow, is that it's very interoperable with Python because the in-memory data structure is the same regardless of which language you're using. So it's pretty easy to pivot between R and Python within a workflow if you find yourself needing that.

(I wrote up some notes on it a couple months ago, actually: https://blog.djnavarro.net/posts/2022-09-09_reticulated-arrow/)

Notes from a data witch - Passing Arrow data between R and Python with reticulate

In a multi-language ‘polyglot’ data science world, it becomes important that we are able to pass large data sets efficiently from one language to another without making unnecessary copies of the data. This post is the story of how I learned to love using reticulate to pass data between R and Python, especially when used together with Apache Arrow which prevents wasteful copying when the handover takes place.
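The handover described in the post looks roughly like this (a sketch assuming pyarrow is installed on the Python side; see the linked post for the full story):

```r
library(arrow)
library(reticulate)

# An Arrow Table lives in Arrow's own memory format, not R's
tbl <- arrow_table(mtcars)

# reticulate hands the same buffers over to Python: no copy is made,
# because R and Python share the Arrow in-memory layout
py_tbl <- r_to_py(tbl)  # a pyarrow Table on the Python side
```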

@djnavarro oh thanks!! This might be the nice entry point for giving it a shot after weeks of yelling at .agg() functions
@ryderdavid oh god i feel your pain. I try not to be pissy about the Pandas API too much because Wes McKinney is CTO at Voltron Data and I like my job (I'm kidding of course, every interaction I've had with Wes has been great)... but goddamn it I just cannot think in Pandas syntax at all. If I have to use Python for data wrangling I sort of prefer Ibis (which also uses Arrow under the hood 😁)
@djnavarro I love Wes from my impression online and think it's obviously better than my rolling my own data frame (lol, what that would look like) but when I pivoted to cloud and had to upskill in py, coming from dplyr it was a huge sting. I am better now but it always feels like a moving target (though I do love .apply(lambda x: )! Glad for dask where I don't have to learn yet another syntax, but hasn't been ideal and good time to unlearn that R is only for RAM-size. Thanks for the springboard!
@ryderdavid Good luck! And please feel free to ping me on here if you run into pain points. One of my jobs involves grumbling to the devs when things in arrow don't actually do what is advertised, so I'm happy to pass feedback upstairs
@djnavarro will definitely keep that in my pocket; even better incentive to jump in!

@djnavarro That's impressive! I think the h2o benchmarks need to be updated for the 50GB range: https://h2oai.github.io/db-benchmark/

Question: what's the function of the grouping before export?

@psanker Yeah those benchmarks are a little old

(Don't get me wrong though, I certainly don't think arrow is always the fastest or best solution, it just happens to be the one I know best)

The grouping in that context is used to structure the output: instead of writing to one huge parquet file, it's written to multiple files, each one containing 1 month of output data
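To make that concrete, the grouping keys become Hive-style partition directories; the same effect can be had by passing them to write_dataset() directly (a sketch, directory listing illustrative):

```r
library(arrow)
library(dplyr)

# Equivalent to grouping before the write: name the partition keys
open_dataset("nyc-taxi/") |>
  filter(payment_type == "Credit card") |>
  write_dataset("nyc-taxi-credit", partitioning = c("year", "month"))

# Either way the keys turn into a directory tree, something like:
# nyc-taxi-credit/year=2019/month=1/part-0.parquet
# nyc-taxi-credit/year=2019/month=2/part-0.parquet
# ...
```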

@djnavarro That's neat. Which function is the magic here? Is it filter() that operates on file streams?
@joachim Lazy evaluation is the core magic. Under the hood, {arrow} supplies methods for most {dplyr} verbs that translate the R code to an Arrow execution plan, but nothing happens until the last step in the pipeline. All the actions are coordinated by a single Scanner object (blah blah) that has a backpressure feature: reading from disk slows down when memory cache fills up, allowing the compute engine and writer to have time to do their thing. Or at least that's my understanding 😁
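A sketch of the laziness, assuming the same dataset path as the original toot: the verbs only build a query, and nothing touches disk until the terminal step.

```r
library(arrow)
library(dplyr)

ds <- open_dataset("nyc-taxi/")

# Nothing is read yet: each verb just extends an Arrow query plan
q <- ds |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  summarise(n = n())

# Execution happens only now. collect() is safe here because the
# aggregated result is tiny; for a larger-than-memory result you'd
# use write_dataset() as the terminal step instead
collect(q)
```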
@joachim I mean, in truth my job is basically to follow the trail of breadcrumbs the engineers have left behind in their partial documentation and find ways to translate it to complete docs... so take it all with a grain of salt. I might be full of shit, as you well know from my math psych work ๐Ÿ™ƒ
@djnavarro ๐Ÿ˜ Understood, I will immediately wager my life savings on this knowledge
@joachim as you should. I've never been wrong on anything of importance ๐Ÿ™ƒ
@djnavarro omg this is useful!
@mcarondiotte Honestly the reason it occurred to me that I should toot this (yes, I am sticking to "toot" no matter what Eugene says) was that someone emailed me a question about Arrow crashing when a query returns a larger-than-memory object. And I was like... omg, none of the documentation *ever* points out that you don't have to return the query result to an in-memory object, you can just write straight to disk...
@djnavarro this is indeed something I would expect to be covered! 😊
@djnavarro File this under "I wish this had existed/I knew about this a few years ago when I was doing a consulting project with datasets that size".
@djnavarro ...wow I previously thought I had big datasets. Nope. This is amazing, and cool.
@smellsofbikes it's upsetting to me that my humble laptop can casually handle data sets this big. My boss is gently nudging me to learn how to make R do this at scale on kubernetes and whatevs but frankly I am already a bit overwhelmed
@djnavarro this still seems like black magic to me (which presumably means that arrow is way cleverer than me).
@nxskok to be fair it actually is black magic. The R package is doing weird lazy evaluation shit to make it all look normal, and the C++ code under the hood is… a lot. They are doing frightening things with Turing machines and I do not approve quite frankly
@djnavarro the first time I read it, I was expecting a collect() or something, and then I realised that would have brought the whole edifice (and your computer) crashing down.
@nxskok yeah collect() would absolutely cause a segfault or an error or something. You really can't do that pipeline and return it to memory. Bad things happen
@djnavarro and then that makes me wonder how it can be done at all. Black magic!
@djnavarro why do you group_by before writing? is it to create a folder structure for the parquet files that mimics the groups?
@djnavarro WTF you don't have time for a coffee. 😂
@djnavarro I just recently discovered arrow (probably based on one of your toots/tweets) and it is life saving with big datasets. loving it so far.
@djnavarro โค๏ธ this, magical! Also I just still can't get over that this works with remote data sources (S3) as well, so we don't even need to download those 70 GB of data first.