FWIW: I'm rarely here these days, but academic Twitter is having something of a resurgence on Bluesky and I'm enjoying that a whole bunch
@alexpghayes.bsky.social if you'd like to connect there :)
| website | https://www.alexpghayes.com/ |
| github | https://github.com/alexpghayes |
Co-factor analysis of citation networks
Alex Hayes, Karl Rohe
https://arxiv.org/abs/2408.14604 https://arxiv.org/pdf/2408.14604 https://arxiv.org/html/2408.14604
arXiv:2408.14604v1 Announce Type: new
Abstract: One compelling use of citation networks is to characterize papers by their relationships to the surrounding literature. We propose a method to characterize papers by embedding them into two distinct "co-factor" spaces: one describing how papers send citations, and the other describing how papers receive citations. This approach presents several challenges. First, older documents cannot cite newer documents, and thus it is not clear that co-factors are even identifiable. We resolve this challenge by developing a co-factor model for asymmetric adjacency matrices with missing lower triangles and showing that identification is possible. We then frame estimation as a matrix completion problem and develop a specialized implementation of matrix completion because prior implementations are memory bound in our setting. Simulations show that our estimator has promising finite sample properties, and that naive approaches fail to recover latent co-factor structure. We leverage our estimator to investigate 237,794 papers published in statistics journals from 1898 to 2022, resulting in the most comprehensive topic model of the statistics literature to date. We find interpretable co-factors corresponding to many statistical subfields, including time series, variable selection, spatial methods, graphical models, GLM(M)s, causal inference, multiple testing, quantile regression, resampling, semi-parametrics, dimension reduction, and several more.
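The pipeline the abstract describes — mask the unobservable lower triangle, complete the matrix, then read off two embeddings — can be sketched with a toy softImpute-style iteration. Everything below (the dimensions, the rank-k Gaussian factors, the iteration count) is illustrative and is not the paper's actual estimator or data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: n "papers" ordered by publication date, rank-k latent structure.
n, k = 60, 2
Z = rng.normal(size=(n, k))   # "sending" co-factors (how papers cite)
Y = rng.normal(size=(n, k))   # "receiving" co-factors (how papers are cited)
P = Z @ Y.T                   # expected citation structure

# Older papers cannot cite newer ones, so the lower triangle is unobservable.
observed = np.triu(np.ones((n, n), dtype=bool), k=1)

# Naive iterative-SVD imputation: fill the missing entries with a rank-k
# approximation, re-truncate, repeat (a hard-impute sketch).
A_hat = np.where(observed, P, 0.0)
for _ in range(50):
    U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]
    A_hat = np.where(observed, P, low_rank)

# Estimated co-factor embeddings from the completed matrix's singular vectors.
U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
Z_hat = U[:, :k] * np.sqrt(s[:k])   # sending space
Y_hat = Vt[:k, :].T * np.sqrt(s[:k])  # receiving space
```

The paper develops a specialized, memory-efficient matrix-completion implementation; this dense-SVD loop is only meant to show the shape of the problem (asymmetric adjacency matrix, structurally missing lower triangle, two distinct embeddings).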
Last week, the Wall Street Journal published a 10-minute-long interview with OpenAI CTO Mira Murati, with journalist Joanna Stern asking a series of thoughtful yet straightforward questions that Murati failed to satisfactorily answer. When asked about what data was used to train Sora, OpenAI's app for generating video with AI, …
With this one, the thrust is basically that there are a number of "seemed like a good idea at the time" approaches to reusing data analysis work that deliver benefit in the short term but will get you absolutely wrecked by complexity and technical debt over the long term. I have found only one scalable way to manage the complexity of building data science capability. Yes, it involves writing lots of packages 📦 📦 📦 📦 📦 😅
goal 3: learn new things about how people use computers!
One of my biggest frustrations with programming is always: I’ll read some documentation, and I’ll wonder — okay, sure, but what are people ACTUALLY USING this software for? How are folks using it?
For example: it took me probably 3 years to figure out what kinds of problems strace is useful for. But we all use strace for the exact same things! This is a solved problem! (https://jvns.ca/blog/2021/04/03/what-problems-do-people-solve-with-strace/)
(9/11)
We’re seeking input from #FOSS maintainers as we design a fellowship program pilot. We want to test a support mechanism that addresses structural issues in the FOSS ecosystem, and support maintainers who work on open digital infrastructure in the public interest.
If you maintain open source projects, we would be very grateful if you could take ten minutes to respond to the survey:
https://survey.sovereigntechfund.de/968766
Please also repost and share with FOSS maintainers you know. Thanks!