alex hayes

@alexpghayes
stats phd candidate @ uw-madison. networks, causal inference & #rstats 🎉. he/him
website: https://www.alexpghayes.com/
github: https://github.com/alexpghayes

FWIW: I'm rarely here these days, but academic Twitter is having something of a resurgence on Bluesky and I'm enjoying that a whole bunch

@alexpghayes.bsky.social if you'd like to connect there :)

Co-factor analysis of citation networks

Alex Hayes, Karl Rohe
https://arxiv.org/abs/2408.14604 https://arxiv.org/pdf/2408.14604 https://arxiv.org/html/2408.14604

arXiv:2408.14604v1 Announce Type: new
Abstract: One compelling use of citation networks is to characterize papers by their relationships to the surrounding literature. We propose a method to characterize papers by embedding them into two distinct "co-factor" spaces: one describing how papers send citations, and the other describing how papers receive citations. This approach presents several challenges. First, older documents cannot cite newer documents, and thus it is not clear that co-factors are even identifiable. We resolve this challenge by developing a co-factor model for asymmetric adjacency matrices with missing lower triangles and showing that identification is possible. We then frame estimation as a matrix completion problem and develop a specialized implementation of matrix completion because prior implementations are memory bound in our setting. Simulations show that our estimator has promising finite sample properties, and that naive approaches fail to recover latent co-factor structure. We leverage our estimator to investigate 237,794 papers published in statistics journals from 1898 to 2022, resulting in the most comprehensive topic model of the statistics literature to date. We find interpretable co-factors corresponding to many statistical subfields, including time series, variable selection, spatial methods, graphical models, GLM(M)s, causal inference, multiple testing, quantile regression, resampling, semi-parametrics, dimension reduction, and several more.
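To make the "two co-factor spaces" idea concrete, here is a rough sketch of low-rank completion on a citation matrix whose lower triangle is unobserved. It is not the estimator from the paper. It is a plain hard-impute loop (repeated truncated SVD), and the function name `cofactor_sketch`, the convention that `A[i, j]` records paper `i` citing paper `j`, and the choice to treat only the strict upper triangle as observed are all assumptions made for illustration.

```python
import numpy as np


def cofactor_sketch(A, rank, n_iter=50):
    """Hard-impute sketch (hypothetical helper, not the paper's method).

    A      : square array; only the strict upper triangle is treated as observed.
    rank   : number of latent co-factors to keep.
    n_iter : number of impute-then-truncate passes.
    Returns (Z, Y): row ("sending") and column ("receiving") embeddings.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Assumption: the diagonal and lower triangle are missing, not zero.
    observed = np.triu(np.ones((n, n), dtype=bool), k=1)

    # Start by filling the missing entries with zeros.
    filled = np.where(observed, A, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed entries fixed; overwrite missing ones with the fit.
        filled = np.where(observed, A, low_rank)

    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    root = np.sqrt(s[:rank])
    Z = U[:, :rank] * root   # how each paper sends citations
    Y = Vt[:rank].T * root   # how each paper receives citations
    return Z, Y


# Toy usage on simulated data (values are random, not real citations).
rng = np.random.default_rng(0)
A_toy = np.triu(rng.random((8, 8)), k=1)
Z_toy, Y_toy = cofactor_sketch(A_toy, rank=2, n_iter=10)
```

Each paper gets one row in `Z` and one row in `Y`, so the same paper can sit near different neighbours depending on whether you look at what it cites or at what cites it. A dense SVD like this would not scale to hundreds of thousands of papers, which is presumably why the abstract mentions a specialized, memory-aware implementation.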

The New York Attorney General's office is hiring a data analyst "to explore, analyze, and create new datasets for information pertinent to OAG investigations. The goal of the data analyst at the OAG is to support a range of investigations and initiatives with consistent, high quality, reproducible research that is communicated in clear and compelling ways." https://www.linkedin.com/posts/gsisodia_come-join-the-research-and-analytics-department-activity-7208578629176299522-06QR/
#NewYork #NYC #DataScience #rstats #python #job #GetFediHired
Gautam Sisodia on LinkedIn: Come join the Research and Analytics Department at the New York State…

Come join the Research and Analytics Department at the New York State Attorney General's Office! We are looking to hire a data analyst. All of what I said in…

Periodic reminder: The only way to write good code is to write tons of shitty code first. Feeling shame about bad code stops you from getting to good code.
is {ggtext} still the best option for multi-color text in {ggplot2}, or has it been superseded by alternatives? #rstats
Interesting part-time opportunity for a grad student interested in data science education: https://www.datascience4everyone.org/jobs It's paid, remote, 10 hours a week. Deadline for applications April 30, 2024
Job Opportunities | DS4E

Looking for employment, volunteer, or internship opportunities in data science education? Join our team at Data Science 4 Everyone, based at the University of Chicago.

K12data
Brutal, but IMHO quite likely correct:
"Sam Altman desperately needs you to believe that generative AI will be essential, inevitable and intractable, because if you don't, you'll suddenly realize that trillions of dollars of market capitalization and revenue are being blown on something remarkably mediocre." 1/2
https://www.wheresyoured.at/peakai/
#AI #hype
Have We Reached Peak AI?

Last week, the Wall Street Journal published a 10-minute-long interview with OpenAI CTO Mira Murati, with journalist Joanna Stern asking a series of thoughtful yet straightforward questions that Murati failed to satisfactorily answer. When asked about what data was used to train Sora, OpenAI's app for generating video with AI,

Ed Zitron's Where's Your Ed At

With this one the thrust is basically there are a number of "seemed like a good idea at the time" type approaches to reusing data analysis work that deliver benefit in the short term, but will get you absolutely wrecked by complexity and technical debt over the long term. I have found only one scalable way to manage the complexity of building data science capability. Yes it involves writing lots of packages 📦 📦 📦 📦 📦 😅

https://www.milesmcbain.com/posts/data-analysis-reuse/

#rstats #DataScience

Before I Sleep: Patterns and anti-patterns of data analysis reuse

A speed-run through four stages of data analysis reuse, to the end game you probably guessed was coming.

Before I Sleep

goal 3: learn new things about how people use computers!

One of my biggest frustrations with programming is always: I’ll read some documentation, and I’ll wonder: okay, sure, but what are people ACTUALLY USING this software for? How are folks using it?

For example: it took me probably 3 years to figure out what kinds of problems strace is useful for. But we all use strace for the exact same things! This is a solved problem! (https://jvns.ca/blog/2021/04/03/what-problems-do-people-solve-with-strace/)

(9/11)

What problems do people solve with strace?


Julia Evans

We’re seeking input from #FOSS maintainers as we design a fellowship program pilot. We want to test a support mechanism that addresses structural issues in the FOSS ecosystem, and support maintainers who work on open digital infrastructure in the public interest.

If you maintain open source projects, we would be very grateful if you could take ten minutes to respond to the survey:
https://survey.sovereigntechfund.de/968766

Please also repost and share with FOSS maintainers you know. Thanks!

STF Open Source Fellowship