There Will Be a Scientific Theory of Deep Learning

https://arxiv.org/abs/2604.21691

#arxiv

In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (a) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (b) tractable limits that reveal insights into fundamental learning phenomena; (c) simple mathematical laws that capture important macroscopic observables; (d) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (e) universal behaviors shared across systems and settings which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic relationship between learning mechanics and mechanistic interpretability. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at learningmechanics.pub.

arXiv.org

🔍🤔 Oh no! A brave soul has confirmed the alarming #decline of #arXiv papers on Hacker News! 🎓📰 Quick, someone #alert the #academia police: our #intellectual #sanctuary is imploding! 🙄🚨
https://dylancastillo.co/til/llm-research-on-hacker-news-is-dying.html #news #HackerNews #ngated

LLM research on Hacker News is drying up – Dylan Castillo

Dylan Castillo

#arXiv Wave physics as a choreographic notation for partner dance arxiv.org/abs/2604.21918 In Bachata Sensual, a dance style in which the wave is the leitmotif, three dance couples (Phase I) are analysed performing five movement sequences and one composite. cc @[email protected]

ML supports existence of unrecognized transient astronomical phenomena

https://arxiv.org/abs/2604.18799

#arxiv

Machine Learning Supports Existence of Previously Unrecognized Transient Astronomical Phenomena in Historical Observatory Images

Transient, star-like point sources that appear and vanish over short timescales are described in astronomical images prior to launch of Sputnik. We have reported that transient numbers diminish significantly in Earth's shadow (shadow deficit) and are more likely within (plus/minus) one day of nuclear testing (nuclear window). These findings remain debated with some arguing that transients identified via existing automated pipelines are simply plate defects. Therefore, we use machine learning (ML) to enhance transient identification accuracy and validate the phenomenon. The model was trained against 250 transient image pairs taken 30 minutes apart that were classified as real versus plate defect by expert visual review; the model demonstrated good discrimination (out-of-fold AUC = 0.81; sensitivity = 0.71, specificity = 0.71). After deployment in a dataset of 107,875 previously-identified transients, the model assigned each a probability of being real. After controlling for ML-identified artifacts, transient counts were significantly elevated for dates within a nuclear window (p = .024); transients with the highest probability of being real were more likely to occur within a nuclear window (p < .0001). The shadow deficit was significant (p < .0001) and largest in the highest probability transients relative to lower probability transients (p = .003). Results strongly support existence of an unrecognized population of transient objects in historical astronomical plates warranting further study.
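For readers unfamiliar with the discrimination metrics quoted in the abstract, here is a minimal sketch of how AUC, sensitivity, and specificity are computed from classifier scores. The labels and scores are hypothetical toy values, not the paper's data, and the 0.5 decision threshold is an assumption; the abstract does not say which threshold was used.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, threshold=0.5):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = "real transient", 0 = "plate defect".
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

print("AUC:", auc(scores, labels))
print("sensitivity, specificity:", sensitivity_specificity(scores, labels))
```

AUC is threshold-free, while sensitivity and specificity depend on where the cut is placed, which is why the two kinds of number are reported separately.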

arXiv.org

Different Language Models Learn Similar Number Representations

https://arxiv.org/abs/2604.20817

#arxiv

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
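The distinction between a period-$T$ Fourier spike and mod-$T$ separability can be illustrated with a toy feature. The sketch below is not the paper's construction or proof: it assumes a hypothetical one-dimensional feature, a single cosine of period $T$ over the integers 0..99, and checks that it has a clean spectral spike at period $T$ while still assigning identical values to two different residue classes.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier coefficient (naive O(N^2) DFT)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


T = 5    # period under study (the abstract reports T = 2, 5, 10)
N = 100  # hypothetical vocabulary: the integers 0..99

# Hypothetical one-dimensional feature: a single cosine with period T.
feature = [math.cos(2 * math.pi * n / T) for n in range(N)]

# Locate the dominant non-zero frequency; bin k corresponds to period N / k.
mags = dft_magnitudes(feature)
peak_bin = max(range(1, N // 2 + 1), key=lambda k: mags[k])
peak_period = N / peak_bin

# Cosine is even, so residues 1 and T - 1 receive the same feature value.
collapsed = math.isclose(feature[1], feature[T - 1], abs_tol=1e-9)

print("dominant period:", peak_period)
print("residues 1 and T-1 share a value:", collapsed)
```

Because numbers congruent to 1 and to 4 (mod 5) land on the same value, no classifier reading only this feature can tell them apart, even though its spectrum is concentrated at period 5. Adding the matching sine component would separate them, which is one way to picture why sparsity alone is not enough.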

arXiv.org

fly51fly (@fly51fly)

The paper 'NeuroAI and Beyond', which connects advances in neuroscience and artificial intelligence, has been introduced. It covers directions for bringing insights from neuroscience into AI and lays out the potential for extending NeuroAI, a core topic of next-generation AI research.

https://x.com/fly51fly/status/2047062800795881797

#neuroai #neuroscience #artificialintelligence #airesearch #arxiv

fly51fly (@fly51fly) on X

[AI] NeuroAI and Beyond: Bridging Between Advances in Neuroscience and Artificial Intelligence A Zador, J Fellous, T Sejnowski, G Adam… (2026) https://t.co/C3NucGs81k

X (formerly Twitter)

fly51fly (@fly51fly)

The study 'Evaluation-driven Scaling for Scientific Discovery', which scales up models and experiments in an evaluation-centered way for scientific discovery, has been introduced. The paper involves researchers from Stanford University, Peking University, and Tsinghua University, and draws attention to the potential for accelerating scientific research with AI.

https://x.com/fly51fly/status/2047066214388863101

#scientificdiscovery #evaluation #scaling #airesearch #arxiv

fly51fly (@fly51fly) on X

[LG] Evaluation-driven Scaling for Scientific Discovery H Ye, H Lin, J Tang, Y Luo… [Stanford University & Peking University & Tsinghua University] (2026) https://t.co/dzCtgFclOY

X (formerly Twitter)

fly51fly (@fly51fly)

๋งˆ์ดํฌ๋กœ ์–ธ์–ด ๋ชจ๋ธ(Micro Language Models)์ด ์ฆ‰๊ฐ์ ์ธ ์‘๋‹ต์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•œ๋‹ค๋Š” ์—ฐ๊ตฌ๊ฐ€ ์†Œ๊ฐœ๋๋‹ค. ๋ฉ”ํƒ€ AI์™€ ์›Œ์‹ฑํ„ด๋Œ€ ์—ฐ๊ตฌ์ง„์˜ 2026๋…„ ๋…ผ๋ฌธ์œผ๋กœ, ๋” ์ž‘์€ ๋ชจ๋ธ๋กœ๋„ ๋น ๋ฅธ ์ถ”๋ก ๊ณผ ์‹ค์‹œ๊ฐ„ ๋ฐ˜์‘์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉํ–ฅ์˜ ๊ธฐ์ˆ  ๋ฐœ์ „์„ ๋‹ค๋ฃฌ๋‹ค.

https://x.com/fly51fly/status/2047069038665482678

#languagemodel #smallmodel #inference #metai #arxiv

fly51fly (@fly51fly) on X

[CL] Micro Language Models Enable Instant Responses W Cheng, T Chen, K Helwani, S Srinivasan… [University of Washington & Meta AI] (2026) https://t.co/aRW3IkD7RA

X (formerly Twitter)

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

https://arxiv.org/abs/2604.15039

#arxiv

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates heavy KVCache traffic that keeps prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization. We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.
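The abstract names three ingredients (selective offloading of long-context prefill, bandwidth-aware scheduling, cache-aware placement) without giving the policy itself. The sketch below is one hypothetical way such a routing decision could look; the `Request` fields, the `route_prefill` function, and both thresholds are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total prompt length
    cached_prefix_tokens: int  # tokens already covered by a local prefix cache


def route_prefill(
    req: Request,
    link_utilization: float,            # fraction of inter-cluster bandwidth in use, 0..1
    min_offload_tokens: int = 8192,     # hypothetical "long context" cutoff
    max_link_utilization: float = 0.8,  # hypothetical congestion limit
) -> str:
    """Return "remote" to offload prefill to a prefill cluster, else "local"."""
    # Cache-aware: only the uncached suffix actually needs prefill compute.
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    # Selective offloading: short prefills are not worth a cross-datacenter hop.
    if uncached < min_offload_tokens:
        return "local"
    # Bandwidth-aware: avoid pushing more KVCache over a congested link.
    if link_utilization > max_link_utilization:
        return "local"
    return "remote"


print(route_prefill(Request(32000, 0), link_utilization=0.3))      # long, uncached, idle link
print(route_prefill(Request(32000, 30000), link_utilization=0.3))  # mostly cached locally
print(route_prefill(Request(32000, 0), link_utilization=0.95))     # link congested
```

A real scheduler would presumably weigh queue depth and expected transfer time rather than apply fixed cutoffs; the point here is only that the three signals from the abstract can jointly gate a single placement decision.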

arXiv.org

fly51fly (@fly51fly)

Researchers at Cornell University and MIT have presented a new method that uses Sequential Monte Carlo to speed up LLM inference. The work has the potential to improve the inference efficiency of large language models, enabling faster and more practical deployment.

https://x.com/fly51fly/status/2046341249880469533

#llm #inference #sequentialmontecarlo #optimization #arxiv

fly51fly (@fly51fly) on X

[LG] Faster LLM Inference via Sequential Monte Carlo Y Emara, M B d Costa, C Chang, C Freer… [Cornell University & MIT] (2026) https://t.co/d4AiUwGReW

X (formerly Twitter)