Mastodawn

There Will Be a Scientific Theory of Deep Learning

In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (a) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (b) tractable limits that reveal insights into fundamental learning phenomena; (c) simple mathematical laws that capture important macroscopic observables; (d) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (e) universal behaviors shared across systems and settings which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic relationship between learning mechanics and mechanistic interpretability. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at learningmechanics.pub.

arXiv.org

N-gated Hacker News 5h ago

🔍🤔 Oh no! A brave soul has confirmed the alarming #decline of #arXiv papers on Hacker News! 🎓📰 Quick, someone #alert the #academia police—our #intellectual #sanctuary is imploding! 🙄🚨
https://dylancastillo.co/til/llm-research-on-hacker-news-is-dying.html #news #HackerNews #ngated

LLM research on Hacker News is drying up – Dylan Castillo

Dylan Castillo

Francis Villatoro 7h ago

#arXiv Wave physics as a choreographic notation for partner dance arxiv.org/abs/2604.21918 In Bachata Sensual, a dance style in which the wave is the leitmotif, three dance couples (Phase I) are analysed performing five movement sequences and one composite. cc @[email protected]

Curated Hacker News 9h ago

ML supports existence of unrecognized transient astronomical phenomena

https://arxiv.org/abs/2604.18799

#arxiv

Machine Learning Supports Existence of Previously Unrecognized Transient Astronomical Phenomena in Historical Observatory Images

Transient, star-like point sources that appear and vanish over short timescales are described in astronomical images prior to launch of Sputnik. We have reported that transient numbers diminish significantly in Earth's shadow (shadow deficit) and are more likely within (plus/minus) one day of nuclear testing (nuclear window). These findings remain debated with some arguing that transients identified via existing automated pipelines are simply plate defects. Therefore, we use machine learning (ML) to enhance transient identification accuracy and validate the phenomenon. The model was trained against 250 transient image pairs taken 30 minutes apart that were classified as real versus plate defect by expert visual review; the model demonstrated good discrimination (out-of-fold AUC = 0.81; sensitivity = 0.71, specificity = 0.71). After deployment in a dataset of 107,875 previously-identified transients, the model assigned each a probability of being real. After controlling for ML-identified artifacts, transient counts were significantly elevated for dates within a nuclear window (p = .024); transients with the highest probability of being real were more likely to occur within a nuclear window (p < .0001). The shadow deficit was significant (p < .0001) and largest in the highest probability transients relative to lower probability transients (p = .003). Results strongly support existence of an unrecognized population of transient objects in historical astronomical plates warranting further study.

arXiv.org
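
For readers unfamiliar with the sensitivity and specificity figures quoted in the abstract above, here is a minimal sketch of how the two metrics are computed from binary labels. The function name and the toy labels are made up for illustration; they are not the paper's data or code.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels.

    Assumed encoding for this sketch: 1 = real transient, 0 = plate defect.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # real, flagged real
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # real, missed
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # defect, flagged defect
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # defect, flagged real
    sensitivity = tp / (tp + fn)  # share of real transients recovered
    specificity = tn / (tn + fp)  # share of defects correctly rejected
    return sensitivity, specificity


# Toy labels, invented for this example only.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(labels, predictions)
print(sens, spec)
```

Sensitivity is the fraction of real transients the classifier recovers, and specificity is the fraction of plate defects it correctly rejects.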

Curated Hacker News 10h ago

Different Language Models Learn Similar Number Representations

https://arxiv.org/abs/2604.20817

#arxiv

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.

arXiv.org
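
As a rough illustration of what a "period-T spike in the Fourier domain" means in the abstract above, the sketch below builds a synthetic feature over the integers 0..99 that repeats every 10 numbers and recovers that period from a plain discrete Fourier transform. The feature is invented for illustration and is not taken from any model; `dft_magnitudes` and `dominant_period` are hypothetical helper names.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def dominant_period(signal):
    """Period (in samples) of the strongest non-constant frequency."""
    n = len(signal)
    mags = dft_magnitudes(signal)
    # Skip k = 0 (the mean) and the mirrored upper half of the spectrum.
    k_best = max(range(1, n // 2 + 1), key=lambda k: mags[k])
    return n / k_best


# Synthetic "number feature": one value per integer, repeating every T numbers.
T = 10
N = 100
feature = [math.cos(2 * math.pi * number / T) for number in range(N)]

period = dominant_period(feature)
print(period)
```

Note that, per the abstract, such a spectral spike is necessary but not sufficient for the feature to linearly separate numbers mod T; this sketch only illustrates the spectral side.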

sayzard 2d ago

fly51fly (@fly51fly)

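The paper 'NeuroAI and Beyond', which connects advances in neuroscience and artificial intelligence, has been introduced. It covers ways of bringing insights from neuroscience into AI and points to the potential for expanding NeuroAI, a key topic in next-generation AI research.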

https://x.com/fly51fly/status/2047062800795881797

#neuroai #neuroscience #artificialintelligence #airesearch #arxiv

fly51fly (@fly51fly) on X

[AI] NeuroAI and Beyond: Bridging Between Advances in Neuroscience and Artificial Intelligence A Zador, J Fellous, T Sejnowski, G Adam… (2026) https://t.co/C3NucGs81k

X (formerly Twitter)

sayzard 2d ago

fly51fly (@fly51fly)

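The study 'Evaluation-driven Scaling for Scientific Discovery', which scales up models and experiments in an evaluation-centered way for scientific discovery, has been introduced. The paper involves researchers from Stanford University, Peking University, and Tsinghua University, and draws attention to the potential for accelerating scientific research with AI.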

https://x.com/fly51fly/status/2047066214388863101

#scientificdiscovery #evaluation #scaling #airesearch #arxiv

fly51fly (@fly51fly) on X

[LG] Evaluation-driven Scaling for Scientific Discovery H Ye, H Lin, J Tang, Y Luo… [Stanford University & Peking University & Tsinghua University] (2026) https://t.co/dzCtgFclOY

X (formerly Twitter)

sayzard 2d ago

fly51fly (@fly51fly)

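A study reporting that Micro Language Models enable instant responses has been introduced. The 2026 paper, by researchers at Meta AI and the University of Washington, covers technical progress toward fast inference and real-time responses with smaller models.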

https://x.com/fly51fly/status/2047069038665482678

#languagemodel #smallmodel #inference #metai #arxiv

fly51fly (@fly51fly) on X

[CL] Micro Language Models Enable Instant Responses W Cheng, T Chen, K Helwani, S Srinivasan… [University of Washington & Meta AI] (2026) https://t.co/aRW3IkD7RA

X (formerly Twitter)

Curated Hacker News 2d ago

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

https://arxiv.org/abs/2604.15039

#arxiv

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates heavy KVCache traffic that keeps prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization. We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.

arXiv.org

sayzard 3d ago

fly51fly (@fly51fly)

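Researchers at Cornell University and MIT have presented a new method that uses Sequential Monte Carlo to speed up LLM inference. The work has the potential to improve the inference efficiency of large language models, enabling faster and more practical deployment.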

https://x.com/fly51fly/status/2046341249880469533

#llm #inference #sequentialmontecarlo #optimization #arxiv

fly51fly (@fly51fly) on X

[LG] Faster LLM Inference via Sequential Monte Carlo Y Emara, M B d Costa, C Chang, C Freer… [Cornell University & MIT] (2026) https://t.co/d4AiUwGReW

X (formerly Twitter)