Graham MacDonald

@grahamimac
140 Followers
90 Following
10 Posts
Chief Information Officer, Urban Institute. All opinions are my own.
Bio: https://www.urban.org/author/graham-macdonald
Blog: https://urban-institute.medium.com/
Data Catalog: https://datacatalog.urban.org/
Our year end wrap-up here at Urban: Check out how we do our work, behind the scenes, in our top 5 Data@Urban posts of 2022: https://urban-institute.medium.com/data-urbans-top-posts-of-2022-5f913e2b5196
Since Hadley has announced it on Twitter, I'll do the honours on here, but I'll forgo the pirate-speak out of common decency...

There's a new chapter on #ApacheArrow and Parquet data in R4DS. It's mostly based on my work so please let me know if you spot any problems with the chapter and I promise to annoy Hadley with a pull request fixing it #RStats

https://r4ds.hadley.nz/arrow.html


It's crazy how much better the tooling has gotten for background research and literature reviews, so you can understand what's already been done before tackling a topic.

My workflow currently is https://elicit.org/ to discover and quickly summarize papers and https://www.researchrabbit.ai/ to dive deeper into related papers after that initial scan. What's yours?

My favourite trick for working with huge datasets in R: even if your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

# Lazily open the on-disk dataset; nothing is loaded into memory yet
nyc_taxi <- open_dataset("nyc-taxi/")

# Streams through the data batch by batch, writing the result
# to disk partitioned by the grouping variables (year/month)
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70GB), output is 500 million (15GB). Takes 3-4 mins on my laptop 🙂

#rstats #ApacheArrow
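The nice part is that the output of the pipeline above is itself an Arrow dataset, partitioned by year and month, so you can query it lazily too. A minimal sketch of reading it back (column names here are assumed from the post, and `collect()` only pulls the final small summary into memory):

```r
library(arrow)
library(dplyr)

# Re-open the partitioned output lazily; year/month are recovered
# from the directory structure written by write_dataset()
open_dataset("nyc-taxi-credit") |>
  filter(year == 2019) |>        # prune to matching partitions only
  count(month) |>                # aggregation runs in Arrow, not R
  collect()                      # materialize just the tiny result
```

Because the filter matches the partition layout, Arrow only touches the files for that year rather than scanning all 500 million rows.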

Alright folks, I've made the jump from Twitter and will be fully committed here, as I'm finding it much more useful professionally. If there are folks you're loving on here that I should follow, let me know!

Hi folks! A lot more activity on here, so introducing myself.

I lead the Urban Institute's Technology and Data Science team. I post mostly about our cutting-edge work in partnership with our top researchers: building new data and analytics tools that help communities, organizations, advocates, and policymakers make better, more equitable decisions.

If that's your space, follow me and I'll likely follow you back!

My Mastodon tips so far:

- Use the home timeline to get info from the people you actually follow (!)
- Use the # Explore timeline to get the dopamine hit from the most popular posts.

I'm currently doing both to wean myself off the addictive Twitter scrolling, but as more of the people I like to follow move here, I'm hoping to just stick to the home timeline going forward.

How do federal government statistical agencies use data science in their work? A summary from Statistics Canada: https://hdsr.mitpress.mit.edu/pub/x0l4x099/release/1#data-science-applications