My favourite trick for working with huge data sets in R. If your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million rows (15GB). Takes 3-4 mins on my laptop 🙂

#rstats #ApacheArrow

@djnavarro Whatttt?! That's amazing! ๐Ÿ˜ƒ. Never knew that! ๐Ÿ˜Š
@JoranJongerling Yeah, it's not well documented on the R package website. If you dig into the documentation of the underlying C++ library it talks about the backpressure feature that ensures the Dataset reader doesn't outpace the writer. I'm in the process of writing some PRs to make some of this more obvious in the vignettes, but it's a work in progress! ๐Ÿ˜

@djnavarro Aaaaah, that's how it works!! 😃. Very clever trick 😃. And very cool 😊.

And looking forward to reading the new vignettes! Thanks in advance for making/updating them 😊

@JoranJongerling @djnavarro yeah, I didn't know this. That's a really neat trick. Nice to know this instead of trying to figure out some of the more esoteric "on disk" things.
@djnavarro wow! does the data need to be stored in parquet?
@resub Parquet helps, but it isn't required. You could do this with CSV files if you wanted, but it's much slower: seems to take about 10x as long with CSV
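For anyone wanting to try the CSV route, a minimal sketch (directory name hypothetical; the only change is telling open_dataset() what format the files are in):

```r
library(arrow)
library(dplyr)

# Same streaming pipeline, but pointed at a directory of CSV files.
# format = "csv" tells open_dataset() not to assume Parquet
nyc_taxi_csv <- open_dataset("nyc-taxi-csv/", format = "csv")

nyc_taxi_csv |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")
```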
@djnavarro really need to learn arrow! I feel like I've incorrectly internalized, even loving R, that it “isn't for big data”. I ended up on python when my first real use cases came and learned dask. Would open so much back up to escape pandas syntax with huge datasets.

@ryderdavid One really nice thing, if you decide to learn arrow, is that it's very interoperable with Python because the in-memory data structure is the same regardless of which language you're using. So it's pretty easy to pivot between R and Python within a workflow if you find yourself needing that.

(I wrote up some notes on it a couple months ago, actually: https://blog.djnavarro.net/posts/2022-09-09_reticulated-arrow/)

Notes from a data witch - Passing Arrow data between R and Python with reticulate

In a multi-language ‘polyglot’ data science world, it becomes important that we are able to pass large data sets efficiently from one language to another without making unnecessary copies of the data. This post is the story of how I learned to love using reticulate to pass data between R and Python, especially when used together with Apache Arrow which prevents wasteful copying when the handover takes place.
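The handover described in the post looks roughly like this (a sketch assuming pyarrow is installed on the Python side; see the linked post for the full story):

```r
library(arrow)
library(reticulate)

# An Arrow Table lives in Arrow's own memory format, not R's
tbl <- arrow_table(mtcars)

# reticulate hands the same buffers over to Python: no copy is made,
# because R and Python share the Arrow in-memory layout
py_tbl <- r_to_py(tbl)  # a pyarrow Table on the Python side
```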

@djnavarro oh thanks!! This might be the nice entry point for giving it a shot after weeks of yelling at .agg() functions
@ryderdavid oh god i feel your pain. I try not to be pissy about the Pandas API too much because Wes McKinney is CTO at Voltron Data and I like my job (I'm kidding of course, every interaction I've had with Wes has been great)... but goddamn it I just cannot think in Pandas syntax at all. If I have to use Python for data wrangling I sort of prefer Ibis (which also uses Arrow under the hood 😁)
@djnavarro I love Wes from my impression online and think it's obviously better than my rolling my own data frame (lol, what that would look like) but when I pivoted to cloud and had to upskill in py, coming from dplyr it was a huge sting. I am better now but it always feels like a moving target (though I do love .apply(lambda x: )! Glad for dask where I don't have to learn yet another syntax, but hasn't been ideal and good time to unlearn that R is only for RAM-size. Thanks for the springboard!
@ryderdavid Good luck! And please feel free to ping me on here if you run into pain points. One of my jobs involves grumbling to the devs when things in arrow don't actually do what is advertised, so I'm happy to pass feedback upstairs
@djnavarro will definitely keep that in my pocket; even better incentive to jump in!

@djnavarro That's impressive! I think the h2o benchmarks need to be updated for the 50GB range: https://h2oai.github.io/db-benchmark/

Question: what's the function of the grouping before export?

@psanker Yeah those benchmarks are a little old

(Don't get me wrong though, I certainly don't think arrow is always the fastest or best solution, it just happens to be the one I know best)

The grouping in that context is used to structure the output: instead of writing to one huge parquet file, it's written to multiple files, each one containing 1 month of output data
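To make that concrete, the grouping keys become Hive-style partition directories; the same effect can be had by passing them to write_dataset() directly (a sketch, directory listing illustrative):

```r
library(arrow)
library(dplyr)

# Equivalent to grouping before the write: name the partition keys
open_dataset("nyc-taxi/") |>
  filter(payment_type == "Credit card") |>
  write_dataset("nyc-taxi-credit", partitioning = c("year", "month"))

# Either way the keys turn into a directory tree, something like:
# nyc-taxi-credit/year=2019/month=1/part-0.parquet
# nyc-taxi-credit/year=2019/month=2/part-0.parquet
# ...
```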

@djnavarro That's neat. Which function is the magic here? Is it filter() that operates on file streams?
@joachim Lazy evaluation is the core magic. Under the hood, {arrow} supplies methods for most {dplyr} verbs that translate the R code to an Arrow execution plan, but nothing happens until the last step in the pipeline. All the actions are coordinated by a single Scanner object (blah blah) that has a backpressure feature: reading from disk slows down when memory cache fills up, allowing the compute engine and writer to have time to do their thing. Or at least that's my understanding 😁
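A sketch of the laziness, assuming the same dataset path as the original toot: the verbs only build a query, and nothing touches disk until the terminal step.

```r
library(arrow)
library(dplyr)

ds <- open_dataset("nyc-taxi/")

# Nothing is read yet: each verb just extends an Arrow query plan
q <- ds |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  summarise(n = n())

# Execution happens only now. collect() is safe here because the
# aggregated result is tiny; for a larger-than-memory result you'd
# use write_dataset() as the terminal step instead
collect(q)
```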
@joachim I mean, in truth my job is basically to follow the trail of breadcrumbs the engineers have left behind in their partial documentation and find ways to translate it to complete docs... so take it all with a grain of salt. I might be full of shit, as you well know from my math psych work ๐Ÿ™ƒ
@djnavarro ๐Ÿ˜ Understood, I will immediately wager my life savings on this knowledge
@joachim as you should. I've never been wrong on anything of importance ๐Ÿ™ƒ
@djnavarro omg this is useful!
@mcarondiotte Honestly the reason it occurred to me that I should toot this (yes, I am sticking to "toot" no matter what Eugene says) was that someone emailed me a question about Arrow crashing when a query returns a larger-than-memory object. And I was like... omg, none of the documentation *ever* points out that you don't have to return the query result to an in-memory object, you can just write straight to disk...
@djnavarro this is indeed something I would expect to be covered! 😊
@djnavarro File this under "I wish this had existed/I knew about this a few years ago when I was doing a consulting project with datasets that size".
@djnavarro ...wow I previously thought I had big datasets. Nope. This is amazing, and cool.
@smellsofbikes it's upsetting to me that my humble laptop can casually handle data sets this big. My boss is gently nudging me to learn how to make R do this at scale on kubernetes and whatevs but frankly I am already a bit overwhelmed
@djnavarro this still seems like black magic to me (which presumably means that arrow is way cleverer than me).
@nxskok to be fair it actually is black magic. The R package is doing weird lazy evaluation shit to make it all look normal, and the C++ code under the hood is… a lot. They are doing frightening things with Turing machines and I do not approve quite frankly
@djnavarro the first time I read it, I was expecting a collect() or something, and then I realised that would have brought the whole edifice (and your computer) crashing down.
@nxskok yeah collect() would absolutely cause a segfault or an error or something. You really can't do that pipeline and return it to memory. Bad things happen
@djnavarro and then that makes me wonder how it can be done at all. Black magic!
@djnavarro why do you group_by before writing? is it to create a folder structure for the parquet files that mimics the groups?
@djnavarro WTF you don't have time for a coffee. 😂
@djnavarro I just recently discovered arrow (probably based on one of your toots/tweets) and it is life saving with big datasets. loving it so far.
@djnavarro โค๏ธ this, magical! Also I just still can't get over that this works with remote data sources (S3) as well, so we don't even need to download those 70 GB of data first.