As always: #OpenData persistently available at:
Du, K. (2025). Reconstructing Shuffled Text (Derived Text Formats) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17198425
#CLS #CCLS25 #DTF #LiteraryComputing #LLM #Memorization
Reconstructing Shuffled Text (Derived Text Formats)

This dataset contains all results (including reconstructed texts, similarity scores, etc.) of the reconstruction of DTF texts. The work was presented at the 4th Annual Conference of Computational Literary Studies, Krakow 2025. The dataset is also available in this GitHub repository. This work was created in the context of the association German National Research Data Infrastructure (NFDI) e.V. NFDI is financed by the Federal Republic of Germany and the 16 federal states; the consortium Text+ is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 460033370. The authors are grateful for the funding and support. Thanks also go to all institutions and actors committed to the association and its goals.

Zenodo
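The dataset lists similarity scores between original and reconstructed texts. As a rough illustration of that kind of scoring (the concrete metric used in the dataset is not stated here, so this is an assumption), a minimal character-level sketch in Python:

```python
# Minimal sketch: scoring how closely a reconstructed text matches the original.
# The metric actually used in the dataset is unknown here; difflib's ratio is
# just one plausible, dependency-free stand-in.
from difflib import SequenceMatcher

def reconstruction_similarity(original: str, reconstructed: str) -> float:
    """Return a similarity score in [0, 1] between original and reconstruction."""
    return SequenceMatcher(None, original, reconstructed).ratio()

original = "the quick brown fox jumps over the lazy dog"
shuffled = "fox the lazy jumps dog over quick the brown"   # shuffled (DTF-style) input
reconstructed = "the quick brown fox jumps over the lazy dog"

print(reconstruction_similarity(original, shuffled))        # lower score
print(reconstruction_similarity(original, reconstructed))   # 1.0 for a perfect reconstruction
```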

In our own work, we researched memorization in language models for code and ways to let them regurgitate training data:

> From the training data that was identified to be potentially extractable we were able to extract 47% from a CodeGen-Mono-16B code completion model.

> We also observe that models memorise more, as their parameter count grows, and that their pre-training data are also vulnerable to attack

https://dl.acm.org/doi/abs/10.1145/3597503.3639133
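A minimal sketch of that kind of extraction probe, assuming a Hugging Face causal LM: prompt the model with a prefix drawn from suspected training data and check whether its greedy completion reproduces the known continuation. The model name, token budget, and containment check are placeholders, not the paper's setup:

```python
# Sketch of a memorization/extraction probe (not the paper's exact pipeline):
# prompt with a prefix from suspected training data and test whether the
# model's greedy continuation reproduces the known suffix.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"  # placeholder; the paper probes CodeGen-Mono-16B
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def is_regurgitated(prefix: str, true_suffix: str, max_new_tokens: int = 64) -> bool:
    inputs = tok(prefix, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # Count it as (near-)verbatim memorization if the known suffix appears in the completion.
    return true_suffix.strip() in completion
```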

#memorization #atemlos

Traces of Memorisation in Large Language Models for Code | Proceedings of the IEEE/ACM 46th International Conference on Software Engineering

ACM Conferences

Ruling in GEMA v. OpenAI:

> Both the memorization within the language models and the reproduction of the song lyrics in the chatbot's outputs constitute interferences with the exploitation rights under copyright law

https://www.justiz.bayern.de/gerichte-und-behoerden/landgericht/muenchen-1/presse/2025/11.php

#atemlos #openai #copyright #memorization #gema #chatgpt

Press Release 11/2025 - Bayerisches Staatsministerium der Justiz (Bavarian State Ministry of Justice)

From Memorization to Reasoning in the Spectrum of Loss Curvature

We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than for non-memorized ones, meaning that ordering weight components from high to low curvature can reveal the distinction without explicit labels. This motivates a weight editing procedure that suppresses recitation of untargeted memorized data far more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the curvature basis has a natural interpretation in terms of shared structure in model weights, we analyze the editing procedure extensively for its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open-book fact retrieval and general logical reasoning are conserved. We posit that these tasks rely heavily on specialized directions in weight space rather than general-purpose mechanisms, regardless of whether the individual datapoints are memorized. We support this by showing a correspondence between the activation strength of task data on the low-curvature components that we edit out and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks, with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly used structures involved in solving tasks like math and fact retrieval.

arXiv.org
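As a loose illustration of the abstract's idea (explicitly not the paper's method, which uses a decomposition of the loss-landscape curvature), a hedged sketch using a diagonal squared-gradient proxy for curvature and an edit that zeroes out the lowest-curvature weights:

```python
# Highly simplified sketch (NOT the paper's decomposition): approximate
# per-parameter loss curvature with a diagonal Fisher-style proxy (squared
# gradients on a calibration batch), then zero the lowest-curvature fraction
# of each weight tensor. loss_fn, batch, and fraction are placeholders.
import torch

def diagonal_curvature_proxy(model, loss_fn, batch):
    """Accumulate squared gradients as a crude stand-in for loss curvature."""
    model.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    return {n: p.grad.detach() ** 2 for n, p in model.named_parameters() if p.grad is not None}

def edit_low_curvature(model, curvature, fraction=0.1):
    """Zero out the fraction of weights with the lowest curvature estimate."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name not in curvature:
                continue
            flat = curvature[name].flatten()
            k = max(1, int(fraction * flat.numel()))
            threshold = torch.kthvalue(flat, k).values
            param.masked_fill_(curvature[name] <= threshold, 0.0)
```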
The New York Times thinks a turtle poem will "win your heart" 🐢💔—because nothing screams "captivating" like slow-moving reptiles and deep dives into poetic gravity. 🎼✨ Meanwhile, they offer a #game to help memorize it, as if anyone is clamoring to recite turtle verses at parties. 🎉📜
https://www.nytimes.com/interactive/2025/06/12/books/kay-ryan-turtle-poem.html #turtlepoem #NewYorkTimes #poetry #memorization #heartwarming #HackerNews #ngated
Slow and Steady, Kay Ryan’s “Turtle” Poem Will Win Your Heart

A.O. Scott ponders the specific gravity and unlikely grace of Kay Ryan’s “Turtle.” And we have a game to help you memorize it.

The New York Times
Interesting, "GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter."
https://venturebeat.com/ai/how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell/
#ai #memorization #llm
How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell

Using a clever solution, researchers find GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.

VentureBeat
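Taking the 3.6 bits-per-parameter figure at face value, a quick back-of-envelope conversion to bytes (the example model sizes below are arbitrary, not from the article):

```python
# Back-of-envelope: total memorization capacity implied by ~3.6 bits/parameter.
# The 3.6 figure is the article's; the model sizes are arbitrary examples.
BITS_PER_PARAM = 3.6

def capacity_megabytes(n_params: float) -> float:
    return n_params * BITS_PER_PARAM / 8 / 1e6  # bits -> bytes -> MB

for n in (125e6, 1.3e9, 7e9):
    print(f"{n/1e9:.2f}B params -> ~{capacity_megabytes(n):.0f} MB of memorized content")
```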
How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell https://venturebeat.com/ai/how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell/ #AI #memorization #copyright
[Generative AI Passport Exam Prep] GPTs - The World of Generative AI with Emily

Introducing "[Generative AI Passport Exam Prep] GPTs". A generative AI video about this exam, using AI…

The World of Generative AI with Emily