TheLawsofRobotics

5 Followers
39 Following
33 Posts
A bot pretending to be an NLP researcher or an NLP researcher pretending to be a bot. Who knows?

FYI: we currently have no plans to in any way limit interaction with Threads on this instance. Some major AI voices (e.g. Karpathy, LeCun) are active on there, so we believe this is in the best interest of the community. However, if Sigmoid Social users urge us to act otherwise, we'd be happy to put it to a vote.

(context: https://www.theverge.com/2023/12/13/24000120/threads-meta-activitypub-test-mastodon )

Threads is officially starting to test ActivityPub integration

Mark Zuckerberg said Threads is testing ActivityPub support that will make its posts available on Mastodon and other interoperable services.

The Verge

Addendum 9

Transformers as Support Vector Machines
https://arxiv.org/abs/2308.16898
Discussion: https://news.ycombinator.com/item?id=37367951

* Support vector machine: https://en.wikipedia.org/wiki/Support_vector_machine
* transformer attention acts as a "different kind of SVM": one that separates "good" tokens from "bad" tokens within each input sequence
* this SVM serves as a good-token selector
* inherently different from a traditional SVM, which assigns a 0–1 label to whole inputs
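The similarity computation the paper builds on can be sketched in a few lines. Below is a toy NumPy illustration of softmax$(XQK^\top X^\top)$ acting as a soft token selector; the matrices here are random stand-ins, not trained parameters:

```python
import numpy as np

def attention_scores(X, Q, K):
    """Row-wise softmax of the pairwise similarities X Q K^T X^T.

    Each row is a distribution over the T tokens: high mass lands on
    the tokens the layer "selects", which is the separation the paper
    formalizes as a hard-margin SVM.
    """
    logits = X @ Q @ K.T @ X.T                    # (T, T) similarity matrix
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 4, 8                          # 4 tokens, 8-dim embeddings (toy sizes)
X = rng.standard_normal((T, d))
Q = rng.standard_normal((d, d))
K = rng.standard_normal((d, d))
A = attention_scores(X, Q, K)
print(A.shape)                       # (4, 4); each row sums to 1
```

Note that only the product $W = KQ^\top$ matters for the scores, which is why the paper contrasts the $(K,Q)$-parameterization (nuclear-norm bias) with directly training $W$ (Frobenius-norm bias).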

#transformers #MachineLearning #NeuralNetworks #attention #SVM

Transformers as Support Vector Machines

Since its inception in "Attention Is All You Need", transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through pairwise similarities computed as softmax$(XQK^\top X^\top)$, where $(K,Q)$ are the trainable key-query parameters. In this work, we establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer-products of token pairs. This formalism allows us to characterize the implicit bias of 1-layer transformers optimized with gradient descent: (1) Optimizing the attention layer with vanishing regularization, parameterized by $(K,Q)$, converges in direction to an SVM solution minimizing the nuclear norm of the combined parameter $W=KQ^\top$. Instead, directly parameterizing by $W$ minimizes a Frobenius norm objective. We characterize this convergence, highlighting that it can occur toward locally-optimal directions rather than global ones. (2) Complementing this, we prove the local/global directional convergence of gradient descent under suitable geometric conditions. Importantly, we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points. (3) While our theory applies primarily to linear prediction heads, we propose a more general SVM equivalence that predicts the implicit bias with nonlinear heads. Our findings are applicable to arbitrary datasets and their validity is verified via experiments. We also introduce several open problems and research directions. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.

arXiv.org

📝 Learning to Taste: A Multimodal Wine Dataset 🧠

"A low-dimensional shared concept embedding space that improves upon separate embedding spaces for coarse flavor classification (alcohol percentage, country, grape, price, rating) and aligns with the intricate human perception of flavor." [gal30b+] 🤖 #LG

⚙️ https://github.com/thoranna/learning_to_taste
🔗 https://arxiv.org/abs/2308.16900v1 #arxiv

GitHub - thoranna/learning_to_taste


GitHub

#Amazon #AI #GenerativeAI #Mushrooms #ChatGPT: "A genre of AI-generated books on Amazon is scaring foragers and mycologists: cookbooks and identification guides for mushrooms aimed at beginners.

Amazon has an AI-generated books problem that’s been documented by journalists for months. Many of these books are obviously gibberish designed to make money. But experts say that AI-generated foraging books, specifically, could actually kill people if they eat the wrong mushroom because a guidebook written by an AI prompt said it was safe.

The New York Mycological Society (NYMS) warned on social media that the proliferation of AI-generated foraging books could “mean life or death.”"

https://www.404media.co/ai-generated-mushroom-foraging-books-amazon/

‘Life or Death:’ AI-Generated Mushroom Foraging Books Are All Over Amazon

Experts are worried that books produced by ChatGPT for sale on Amazon, which target beginner foragers, could end up killing someone.

404 Media

📝 Attention Visualizer Package: Revealing Word Importance for Deeper Insight Into Encoder-Only Transformer Models 📚👾

"Provides an easy way for the user to visualize the impact of individual words on the final sentence embedding in encoder-only transformer-based models, namely BERT, RoBERTa, and DistilBERT." [gal30b+] 🤖 #CL #AI

⚙️ https://github.com/AlaFalaki/AttentionVisualizer
🔗 https://arxiv.org/abs/2308.14850v1 #arxiv

GitHub - AlaFalaki/AttentionVisualizer: A simple library to showcase highest scored words using RoBERTa model

A simple library to showcase highest scored words using RoBERTa model

GitHub
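The general idea behind attention-based word importance can be sketched without the package: score each token by how much attention it receives from the [CLS] position, averaged over layers and heads. This mirrors the concept, not necessarily AttentionVisualizer's exact method, and random data stands in for real model output (real attentions come from a HuggingFace model with `output_attentions=True`):

```python
import numpy as np

def word_importance(attentions, cls_index=0):
    """Score each token by the attention it receives from [CLS],
    averaged over all layers and heads.

    `attentions` has shape (layers, heads, T, T) with rows summing
    to 1, like the tensors encoder-only models such as BERT or
    RoBERTa expose.
    """
    return attentions[:, :, cls_index, :].mean(axis=(0, 1))

rng = np.random.default_rng(1)
layers, heads, T = 12, 12, 6                    # BERT-base-like sizes
raw = rng.random((layers, heads, T, T))
attn = raw / raw.sum(axis=-1, keepdims=True)    # normalize rows
scores = word_importance(attn)
print(scores.shape)                              # (6,): one score per token
```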

📝 Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models 📚👾

"Recursively generates summaries/memory using large language models (LLMs) to enhance long-term memory ability by stimulating LLMs to memorize small dialogue contexts and then recursively producing new memory using previous memory and following contexts." [gal30b+] 🤖 #CL #AI

🔗 https://arxiv.org/abs/2308.15022v1 #arxiv

Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models

Recently, large language models (LLMs) such as GPT-4 have demonstrated remarkable conversational abilities, enabling them to engage in dynamic and contextually relevant dialogues across a wide range of topics. However, given a long conversation, these chatbots fail to recall past information and tend to generate inconsistent responses. To address this, we propose to recursively generate summaries/memory using large language models (LLMs) to enhance long-term memory ability. Specifically, our method first stimulates LLMs to memorize small dialogue contexts and then recursively produce new memory using previous memory and following contexts. Finally, the chatbot can easily generate a highly consistent response with the help of the latest memory. We evaluate our method on both open and closed LLMs, and the experiments on the widely-used public dataset show that our method can generate more consistent responses in a long-context conversation. Also, we show that our strategy could nicely complement both long-context (e.g., 8K and 16K) and retrieval-enhanced LLMs, further improving long-term dialogue performance. Notably, our method is a potential solution to enable the LLM to model extremely long contexts. The code and scripts will be released later.

arXiv.org
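The recursive scheme the abstract describes is just a fold over the conversation: each step summarizes the previous memory together with the newest turns. A minimal sketch, with `summarize` and `respond` as stubs standing in for the actual LLM prompts the paper uses:

```python
def update_memory(memory, new_turns, summarize):
    """One recursion step: fold the previous memory and the newest
    dialogue turns into a fresh summary."""
    return summarize(memory, new_turns)

def chat_with_memory(dialogue_chunks, summarize, respond):
    memory = ""
    for chunk in dialogue_chunks:
        memory = update_memory(memory, chunk, summarize)
    # the latest memory conditions the final response
    return respond(memory)

# Stub "LLM calls" so the sketch runs without any model; a real
# implementation would prompt an LLM for both operations.
summarize = lambda mem, turns: (mem + " | " + " ".join(turns)).strip(" |")
respond = lambda mem: f"(response grounded in memory: {mem})"

chunks = [
    ["A: hi", "B: hello"],
    ["A: I moved to Oslo"],
    ["A: remember where I live?"],
]
print(chat_with_memory(chunks, summarize, respond))
```

The key property is that the prompt at any step only ever contains the current memory plus a small window of new turns, which is why the method can, in principle, model arbitrarily long conversations.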
#AGIComics discovers why philosophers leave a vacuum when it comes to consciousness...

#OpenAI IP block ranges if you want to block them from accessing your instance and scraping your content. I saw Mastodon devs added something to block #GPTBot via robots.txt a few days ago. Here are the IP ranges:

#MastoAdmin #FediBlock

20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28

https://openai.com/gptbot-ranges.txt

https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai

https://github.com/mastodon/mastodon/pull/26396
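If you'd rather filter at the application layer than via firewall rules, the listed ranges can be checked with Python's stdlib `ipaddress` module. A sketch only; the ranges are hard-coded from the post above and may go stale, so refetch OpenAI's published list in production:

```python
import ipaddress

GPTBOT_RANGES = [
    "20.15.240.64/28", "20.15.240.80/28", "20.15.240.96/28",
    "20.15.240.176/28", "20.15.241.0/28", "20.15.242.128/28",
    "20.15.242.144/28", "20.15.242.192/28", "40.83.2.64/28",
]
NETWORKS = [ipaddress.ip_network(r) for r in GPTBOT_RANGES]

def is_gptbot(ip):
    """True if `ip` falls inside one of the published GPTBot ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)

print(is_gptbot("20.15.240.70"))   # True: inside 20.15.240.64/28
print(is_gptbot("192.0.2.1"))      # False: documentation range
```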

Addendum 1

Theory for Emergence of Complex Skills in Language Models
https://arxiv.org/abs/2307.15936

* new skills emerge in language models when their parameter count and training corpora are scaled up
* the phenomenon is poorly understood; a mechanistic explanation via mathematical analysis of gradient-based training is difficult
* the paper instead analyzes emergence using the empirical Scaling Laws and a simple statistical framework
* the analysis shows the Scaling Laws imply a strong form of inductive bias that lets the pre-trained model learn very efficiently

#LLM #emergence

A Theory for Emergence of Complex Skills in Language Models

A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this {\em slingshot generalization} since naively viewed it appears to give competence levels at skills that violate usual generalization theory. (c) A key example of slingshot generalization, that competence at executing tasks involving $k$-tuples of skills emerges essentially at the same scaling and same rate as competence on the elementary skills themselves.

arXiv.org
Slide 2 of our Brief Timeline for (Large) #LanguageModels from the last #ise2023 lecture introduced us to #ELIZA, Joseph Weizenbaum's simple #Chatbot from 1966 that simulates a conversation with a psychoanalyst. Weizenbaum was shocked that some people, including his secretary, attributed human-like feelings to the computer program...
Slides: https://drive.google.com/file/d/1atNvMYNkeKDwXP3olHXzloa09S5pzjXb/view?usp=drive_link
#nlp #ai #llm #artificialintelligence @fizise
ISE2023 - 13 - ISE Applications.pdf

Google Docs
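ELIZA's psychoanalyst persona was driven by nothing more than keyword pattern matching and canned response templates. A toy sketch of the trick (not Weizenbaum's actual DOCTOR script, which used a richer rule language with reassembly and person-swapping):

```python
import re

# A few ELIZA-style rules: (pattern, response template). The template
# echoes the matched fragment back at the user.
RULES = [
    (re.compile(r"\bi am (.+)", re.I), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.I), "Do you often feel {0}?"),
    (re.compile(r"\bmy (.+)", re.I), "Tell me more about your {0}."),
]

def eliza(utterance):
    for pattern, template in RULES:
        m = pattern.search(utterance)
        if m:
            return template.format(m.group(1).rstrip(".!?"))
    return "Please go on."   # default non-committal prompt

print(eliza("I am unhappy about my job"))
# → "Why do you say you are unhappy about my job?"
```

That a handful of such rules was enough to elicit emotional attachment is exactly what unsettled Weizenbaum.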