As always: #OpenData persistently available at:
Du, K. (2025). Reconstructing Shuffled Text (Derived Text Formats) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17198425
#CLS #CCLS25 #DTF #LiteraryComputing #LLM #Memorization
Reconstructing Shuffled Text (Derived Text Formats)

This dataset contains all results (including reconstructed texts, similarity scores, etc.) of the reconstruction of DTF texts. The work was presented at the 4th Annual Conference of Computational Literary Studies, Krakow 2025. The dataset is also available in this GitHub repository. This work was created in the context of the association German National Research Data Infrastructure (NFDI) e.V. NFDI is financed by the Federal Republic of Germany and the 16 federal states; the consortium Text+ is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 460033370. The authors are grateful for the funding and support. Thanks also go to all institutions and actors committed to the association and its goals.

Zenodo
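The dataset lists similarity scores between original and reconstructed texts. As a rough illustration of that kind of scoring (the concrete metric used in the dataset is not stated here, so this is an assumption), a minimal character-level sketch in Python:

```python
# Minimal sketch: scoring how closely a reconstructed text matches the original.
# The metric actually used in the dataset is unknown here; difflib's ratio is
# just one plausible, dependency-free stand-in.
from difflib import SequenceMatcher

def reconstruction_similarity(original: str, reconstructed: str) -> float:
    """Return a similarity score in [0, 1] between original and reconstruction."""
    return SequenceMatcher(None, original, reconstructed).ratio()

original = "the quick brown fox jumps over the lazy dog"
shuffled = "fox the lazy jumps dog over quick the brown"   # shuffled (DTF-style) input
reconstructed = "the quick brown fox jumps over the lazy dog"

print(reconstruction_similarity(original, shuffled))        # lower score
print(reconstruction_similarity(original, reconstructed))   # 1.0 for a perfect reconstruction
```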

In our own work, we researched memorization in language models for code and ways to let them regurgitate training data:

> From the training data that was identified to be potentially extractable we were able to extract 47% from a CodeGen-Mono-16B code completion model.

> We also observe that models memorise more, as their parameter count grows, and that their pre-training data are also vulnerable to attack

https://dl.acm.org/doi/abs/10.1145/3597503.3639133
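A minimal sketch of that kind of extraction probe, assuming a Hugging Face causal LM: prompt the model with a prefix drawn from suspected training data and check whether its greedy completion reproduces the known continuation. The model name, token budget, and containment check are placeholders, not the paper's setup:

```python
# Sketch of a memorization/extraction probe (not the paper's exact pipeline):
# prompt with a prefix from suspected training data and test whether the
# model's greedy continuation reproduces the known suffix.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"  # placeholder; the paper probes CodeGen-Mono-16B
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def is_regurgitated(prefix: str, true_suffix: str, max_new_tokens: int = 64) -> bool:
    inputs = tok(prefix, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # Count it as (near-)verbatim memorization if the known suffix appears in the completion.
    return true_suffix.strip() in completion
```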

#memorization #atemlos

Traces of Memorisation in Large Language Models for Code | Proceedings of the IEEE/ACM 46th International Conference on Software Engineering

ACM Conferences

Ruling in GEMA v. OpenAI:

> Both the memorization within the language models and the reproduction of the song lyrics in the chatbot's outputs constitute interferences with the exploitation rights under copyright law

https://www.justiz.bayern.de/gerichte-und-behoerden/landgericht/muenchen-1/presse/2025/11.php

#atemlos #openai #copyright #memorization #gema #chatgpt

Press Release 11/2025 - Bayerisches Staatsministerium der Justiz (Bavarian State Ministry of Justice)

From Memorization to Reasoning in the Spectrum of Loss Curvature

We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than for non-memorized ones, meaning that ordering weight components from high to low curvature can reveal the distinction without explicit labels. This motivates a weight editing procedure that suppresses recitation of untargeted memorized data far more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the curvature basis has a natural interpretation in terms of shared structure in model weights, we analyze the editing procedure extensively for its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open-book fact retrieval and general logical reasoning are conserved. We posit that these tasks rely heavily on specialized directions in weight space rather than general-purpose mechanisms, regardless of whether the individual datapoints are memorized. We support this by showing a correspondence between the activation strength of task data on the low-curvature components that we edit out and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks, with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly used structures involved in solving tasks like math and fact retrieval.

arXiv.org
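As a loose illustration of the abstract's idea (explicitly not the paper's method, which uses a decomposition of the loss-landscape curvature), a hedged sketch using a diagonal squared-gradient proxy for curvature and an edit that zeroes out the lowest-curvature weights:

```python
# Highly simplified sketch (NOT the paper's decomposition): approximate
# per-parameter loss curvature with a diagonal Fisher-style proxy (squared
# gradients on a calibration batch), then zero the lowest-curvature fraction
# of each weight tensor. loss_fn, batch, and fraction are placeholders.
import torch

def diagonal_curvature_proxy(model, loss_fn, batch):
    """Accumulate squared gradients as a crude stand-in for loss curvature."""
    model.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    return {n: p.grad.detach() ** 2 for n, p in model.named_parameters() if p.grad is not None}

def edit_low_curvature(model, curvature, fraction=0.1):
    """Zero out the fraction of weights with the lowest curvature estimate."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name not in curvature:
                continue
            flat = curvature[name].flatten()
            k = max(1, int(fraction * flat.numel()))
            threshold = torch.kthvalue(flat, k).values
            param.masked_fill_(curvature[name] <= threshold, 0.0)
```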
The New York Times thinks a turtle poem will "win your heart" 🐢💔—because nothing screams "captivating" like slow-moving reptiles and deep dives into poetic gravity. 🎼✨ Meanwhile, they offer a #game to help memorize it, as if anyone is clamoring to recite turtle verses at parties. 🎉📜
https://www.nytimes.com/interactive/2025/06/12/books/kay-ryan-turtle-poem.html #turtlepoem #NewYorkTimes #poetry #memorization #heartwarming #HackerNews #ngated
Slow and Steady, Kay Ryan’s “Turtle” Poem Will Win Your Heart

A.O. Scott ponders the specific gravity and unlikely grace of Kay Ryan’s “Turtle.” And we have a game to help you memorize it.

The New York Times
Interesting, "GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter."
https://venturebeat.com/ai/how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell/
#ai #memorization #llm
How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell

Using a clever solution, researchers find GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.

VentureBeat
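Taking the 3.6 bits-per-parameter figure at face value, a quick back-of-envelope conversion to bytes (the example model sizes below are arbitrary, not from the article):

```python
# Back-of-envelope: total memorization capacity implied by ~3.6 bits/parameter.
# The 3.6 figure is the article's; the model sizes are arbitrary examples.
BITS_PER_PARAM = 3.6

def capacity_megabytes(n_params: float) -> float:
    return n_params * BITS_PER_PARAM / 8 / 1e6  # bits -> bytes -> MB

for n in (125e6, 1.3e9, 7e9):
    print(f"{n/1e9:.2f}B params -> ~{capacity_megabytes(n):.0f} MB of memorized content")
```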
How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell https://venturebeat.com/ai/how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell/ #AI #memorization #copyright
[Generative AI Passport Exam Prep] GPTs - The World of Generative AI with Emily

Introducing "[Generative AI Passport Exam Prep] GPTs". A generative AI video about this exam, using AI…

The World of Generative AI with Emily