My favourite trick for working with huge data sets in R. If your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("nyc-taxi/")
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million (15GB). Takes 3-4 mins on my laptop 🙂
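For anyone who wants to try the pipeline shape without downloading 70GB first, here's a minimal self-contained sketch of the same idea, with mtcars standing in for the taxi data (assumes the arrow and dplyr packages are installed; the tempfile path is just for illustration):

```r
library(arrow)
library(dplyr)

# Write a partitioned dataset to disk, just like the taxi example
out_dir <- tempfile()
mtcars |>
  write_dataset(out_dir, partitioning = "cyl")

# open_dataset() only scans metadata; the query runs lazily and
# collect() materialises just the (small) result in memory
open_dataset(out_dir) |>
  filter(cyl == 4) |>
  count() |>
  collect()
# → 1 row: n = 11
```

The key point is that the full data never has to fit in RAM: filtering and aggregation happen as the files are scanned, and only the result crosses into an R data frame.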

#rstats #ApacheArrow

@djnavarro I really need to learn arrow! I feel like I've incorrectly internalized, even loving R, that it "isn't for big data". I ended up on Python when my first real use cases came along and learned dask. Escaping pandas syntax for huge datasets would open so much back up.

@ryderdavid One really nice thing, if you decide to learn arrow, is that it's very interoperable with Python because the in-memory data structure is the same regardless of which language you're using. So it's pretty easy to pivot between R and Python within a workflow if you find yourself needing that.

(I wrote up some notes on it a couple months ago, actually: https://blog.djnavarro.net/posts/2022-09-09_reticulated-arrow/)
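For a flavour of what that handover looks like, here's a minimal sketch (assuming reticulate is installed alongside a Python environment with pyarrow; the blog post linked above walks through this in detail):

```r
library(arrow)
library(reticulate)

# Build an Arrow Table on the R side
r_table <- arrow_table(mtcars)

# Hand it to Python; with pyarrow available, reticulate passes the
# underlying Arrow buffers across rather than copying the data
py_table <- r_to_py(r_table)

# The same object is now usable from Python via pyarrow
py_table$num_rows  # mtcars has 32 rows
```

Because both languages share the Arrow in-memory format, the "conversion" is really just a change of ownership, not a serialise/deserialise round trip.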


@djnavarro oh thanks!! This might be the nice entry point for giving it a shot after weeks of yelling at .agg() functions
@ryderdavid oh god I feel your pain. I try not to be pissy about the Pandas API too much because Wes McKinney is CTO at Voltron Data and I like my job (I'm kidding of course, every interaction I've had with Wes has been great)... but goddamn it I just cannot think in Pandas syntax at all. If I have to use Python for data wrangling I sort of prefer Ibis (which also uses Arrow under the hood 😁 )
@djnavarro I love Wes from my impression online and think it's obviously better than rolling my own data frame (lol, what would that even look like), but when I pivoted to cloud and had to upskill in py, coming from dplyr it was a huge sting. I'm better now but it always feels like a moving target (though I do love .apply(lambda x: )!). Glad for dask where I don't have to learn yet another syntax, but it hasn't been ideal, and it's a good time to unlearn that R is only for RAM-sized data. Thanks for the springboard!
@ryderdavid Good luck! And please feel free to ping me on here if you run into pain points. One of my jobs involves grumbling to the devs when things in arrow don't actually do what is advertised, so I'm happy to pass feedback upstairs
@djnavarro will definitely keep that in my pocket! Even better incentive to jump in!