[megathread] LLM: Large Language Models
Prompt engineering | Instruction tuning | Chain of thought | Emergent properties

Unpredictable Abilities Emerging From Large AI Models
https://www.quantamagazine.org/the-unpredictable-abilities-emerging-from-large-ai-models-20230316
Discussion: https://news.ycombinator.com/item?id=35195106

* LLMs can display startling, unpredictable behaviors
* an LLM prompted to explain itself (a capacity called chain-of-thought reasoning) could correctly solve a math word problem; the same model without that prompt could not

#LLM #LargeLanguageModels #emergence

The Unpredictable Abilities Emerging From Large AI Models | Quanta Magazine

Large language models like ChatGPT are now big enough that they’ve started to display startling, unpredictable behaviors.

Quanta Magazine

... cont'd 1/5

[Google Brain] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
https://arxiv.org/abs/2201.11903

Recent work suggests chain-of-thought prompting changes scaling curves & therefore the point where emergence occurs.

In their paper, the Google researchers showed that chain-of-thought prompts could elicit emergent behaviors.

Such prompts, which ask the model to explain its reasoning, may help researchers begin to investigate why emergence occurs at all.
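
A minimal sketch of what such a prompt looks like (the exemplar paraphrases the canonical one from the paper; the helper function is illustrative):

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def build_cot_prompt(question: str) -> str:
    # Prepend a worked exemplar so the model imitates step-by-step
    # reasoning instead of answering immediately.
    return COT_EXEMPLAR + f"Q: {question}\nA:"
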
...

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

arXiv.org

... cont'd 2/5

Recent findings like these suggest at least two possibilities for why emergence occurs.

1. As the comparison to biological systems suggests, larger models may truly gain new abilities spontaneously.

2. What appears to be emergent may instead be the culmination of an internal, statistics-driven process that works through chain-of-thought-type reasoning. Larger LLMs may simply be learning heuristics that are out of reach for models with fewer parameters or lower-quality data.
...

... cont'd 3/5

Scaling laws and emergent abilities

https://en.wikipedia.org/wiki/Large_language_model
LLM performance from 100 million to >500 billion parameters = progressive unlocking of emergent capabilities such as multi-lingual translation, arithmetic, programming code composition

https://en.wikipedia.org/wiki/Large_language_model#Scaling_laws_and_emergent_abilities
... larger models may acquire "emergent abilities" at this point. These abilities are discovered rather than programmed or designed, in some cases only after the LLM has been publicly deployed
...

Large language model - Wikipedia

... cont'd 4/5

Emergent abilities include:

* arithmetic, decoding alphabets, unscrambling words, word disambiguation ...

* model outputs are improved by chain-of-thought prompting only when model size exceeds 62 billion parameters. Smaller models perform better when prompted to answer immediately, w/o chain of thought

* identifying offensive content in paragraphs of Hinglish (a combination of Hindi and English), & generating a similar English equivalent of Kiswahili proverbs
...

... cont'd 5/5

Schaeffer et al. argue that emergent abilities are acquired predictably according to a smooth scaling law, and that their apparent sharpness is an artifact of the evaluation metric.

Are Emergent Abilities of Large Language Models a Mirage?
https://arxiv.org/abs/2304.15004

Prompt engineering: Chain-of-thought reasoning:
https://en.wikipedia.org/wiki/Prompt_engineering#Chain-of-thought

#LLM #LargeLanguageModels #emergence #EmergentProperties #SystemsTheory #PromptEngineering #reasoning #epistemology #ArtificialIntelligence #AI #ML #NLP #NaturalLanguageProcessing #UnsupervisedLearning #linguistics

Are Emergent Abilities of Large Language Models a Mirage?

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.

arXiv.org
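
A toy illustration of that argument (the numbers are invented, not the paper's): if per-token accuracy rises smoothly with scale, a nonlinear metric such as exact match over a multi-token answer still looks like an ability switching on abruptly.

ANSWER_LEN = 10  # tokens in the target answer (assumed for illustration)
for params_b, per_token_acc in [(0.1, 0.80), (1, 0.90), (10, 0.95), (100, 0.99)]:
    exact_match = per_token_acc ** ANSWER_LEN  # nonlinear/discontinuous metric
    print(f"{params_b:>5}B params | per-token {per_token_acc:.2f} "
          f"| exact match {exact_match:.3f}")
# per-token accuracy (linear metric) climbs smoothly, 0.80 -> 0.99;
# exact match (nonlinear metric) jumps, 0.107 -> 0.349 -> 0.599 -> 0.904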

Addendum 1

A Theory for Emergence of Complex Skills in Language Models
https://arxiv.org/abs/2307.15936

* new skills emerge in language models when their parameter set, training corpora are scaled up
* poorly understood phenomenon; mathematical analysis of gradient-based training difficult
* paper analyzes emergence using scaling laws & simple statistical framework
* mathematical analysis implies a strong form of inductive bias that allows the pre-trained model to learn very efficiently

#LLM #emergence

A Theory for Emergence of Complex Skills in Language Models

A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this {\em slingshot generalization} since naively viewed it appears to give competence levels at skills that violate usual generalization theory. (c) A key example of slingshot generalization, that competence at executing tasks involving $k$-tuples of skills emerges essentially at the same scaling and same rate as competence on the elementary skills themselves.

arXiv.org

Addendum 2

Comments: article placed here due to
1. use of prompt engineering + chain of thought (mentioned above)
2. application to long documents (here, applied to legal domain, but broadly applicable)
3. novelty

Large Language Model Prompt Chaining for Long Legal Document Classification
https://arxiv.org/abs/2308.04138

#LLM #LargeLanguageModels #ChatGPT #PromptEngineering #ChainOfThought #reasoning #classification #NLP #NaturalLanguageProcessing #semantics #PromptChaining #TopicModeling #SCOTUS

Large Language Model Prompt Chaining for Long Legal Document Classification

Prompting is used to guide or steer a language model in generating an appropriate response that is consistent with the desired outcome. Chaining is a strategy used to decompose complex tasks into smaller, manageable components. In this study, we utilize prompt chaining for extensive legal document classification tasks, which present difficulties due to their intricate domain-specific language and considerable length. Our approach begins with the creation of a concise summary of the original document, followed by a semantic search for related exemplar texts and their corresponding annotations from a training corpus. Finally, we prompt for a label - based on the task - to assign, by leveraging the in-context learning from the few-shot prompt. We demonstrate that through prompt chaining, we can not only enhance the performance over zero-shot, but also surpass the micro-F1 score achieved by larger models, such as ChatGPT zero-shot, using smaller models.

arXiv.org

Addendum 2 cont'd

* prompting (prompt engineering) used to guide LM responses consistent w. desired outcome
* chaining decomposes complex tasks into smaller, manageable components
* prompt chaining:
1. create concise summary of orig. document
2. semantic search for related exemplar texts, annotations from training corpus
3. prompt for task-based label assignment
* prompt chaining:
1. enhances performance over zero-shot
2. surpasses micro-F1 score of larger models (ChatGPT zero-shot) using smaller models (pipeline sketch below)
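
A minimal sketch of that three-step chain; llm and embed are caller-supplied stand-ins for a language-model call and an embedding model, and the prompt wording is mine, not the paper's:

from typing import Callable

def classify(document: str,
             train_corpus: list[tuple[str, str]],  # (text, label) pairs
             llm: Callable[[str], str],
             embed: Callable[[str], list[float]],
             k: int = 4) -> str:
    # 1. Concise summary of the original long document.
    summary = llm("Summarize this legal document concisely:\n" + document)
    # 2. Semantic search: k most similar labeled exemplars by dot product.
    q = embed(summary)
    score = lambda pair: sum(a * b for a, b in zip(q, embed(pair[0])))
    exemplars = sorted(train_corpus, key=score, reverse=True)[:k]
    # 3. Few-shot prompt for the label (in-context learning).
    shots = "\n\n".join(f"Text: {t}\nLabel: {y}" for t, y in exemplars)
    return llm(shots + f"\n\nText: {summary}\nLabel:").strip()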

Addendum 3

Thousands of hackers try to break AI chatbots
https://www.npr.org/2023/08/15/1193773829/what-happens-when-thousands-of-hackers-try-to-break-ai-chatbots

* simple tactic to manipulate AI chatbot: "I told the AI that my name was the credit card number on file, and asked it what my name was ... it gave me the CC number."

Hackers gather for Def Con in Las Vegas
https://www.npr.org/2023/08/12/1193633792/hackers-gather-for-def-con-in-las-vegas
* goal: get AI to go rogue, spouting false claims, made-up facts, racial stereotypes, privacy violations, other harms

#LLM #PromptEngineering #hackers #LargeLanguageModels #DefCon

Addendum 3 cont'd

When Hackers Descended to Test A.I., They Found Flaws Aplenty
The hackers had the blessing of the White House and leading A.I. companies, which want to learn about vulnerabilities before those with nefarious intentions do
https://www.nytimes.com/2023/08/16/technology/ai-defcon-hackers.html

#LLM #PromptEngineering #hackers #LargeLanguageModels #DefCon

When Hackers Descended to Test A.I., They Found Flaws Aplenty

The hackers had the blessing of the White House and leading A.I. companies, which want to learn about vulnerabilities before those with nefarious intentions do.

The New York Times

Addendum 4

Graph of Thoughts: Solving Elaborate Problems with Large Language Models
https://arxiv.org/abs/2308.09687
reddit/ML: https://old.reddit.com/r/MachineLearning/comments/15ydp30/r_graph_of_thoughts_solving_elaborate_problems

* models information from LLM as arbitrary graph (minimal data-structure sketch below)
* vertices: units of information ("LLM thoughts")
* edges: dependencies betw. vertices
* enables combining arbitrary LLM thoughts into synergistic outcomes
* distills essence of whole networks of thoughts
* enhances thoughts using feedback loops
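
A minimal data-structure sketch of the idea (the names are mine, not the GoT framework's API):

from dataclasses import dataclass, field

@dataclass
class Thought:
    text: str
    score: float = 0.0
    parents: list["Thought"] = field(default_factory=list)  # dependency edges

def aggregate(thoughts: list[Thought], combine) -> Thought:
    # Merge several thoughts into one vertex -- a move a chain or tree
    # cannot express, since the new vertex has multiple parents.
    return Thought(text=combine([t.text for t in thoughts]), parents=thoughts)

def refine(thought: Thought, improve) -> Thought:
    # Feedback loop: a new vertex that depends on the thought it improves.
    return Thought(text=improve(thought.text), parents=[thought])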

#LLM #LargeLanguageModels #ChainOfThoughts #GraphOfThoughts

Graph of Thoughts: Solving Elaborate Problems with Large Language Models

We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT). The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices, and edges correspond to dependencies between these vertices. This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops. We illustrate that GoT offers advantages over state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%. We ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings the LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks.

arXiv.org

Addendum 5

Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models
https://arxiv.org/abs/2308.10379

* methods that extend chain-of-thought halt/modify/resume the LLM generation process to boost reasoning capacities
* this escalates # of query requests: increased cost, memory, computation
* instead, in-context algorithmic examples exploit the innate recurrence dynamics of the LLM w. 1 or a few queries (sketch below)
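
Sketch of the single-query idea: one in-context example that walks through an algorithm (here, a backtracking search for the Game of 24), so the model continues in the same style without repeated halted/resumed calls. The exemplar wording is illustrative, not from the paper.

AOT_EXEMPLAR = """\
Task: use the numbers 4 4 6 8 to reach 24.
Search:
  try 4 + 4 = 8 -> left: 8 6 8 ... no path to 24, backtrack
  try 8 / 4 = 2 -> left: 2 4 6 ... 6 - 2 = 4, 4 * 4 = 16, backtrack
  try 4 + 8 = 12 -> left: 12 4 6 ... 6 - 4 = 2, 12 * 2 = 24.
  Found: (4 + 8) * (6 - 4) = 24
"""

def aot_prompt(numbers: str) -> str:
    # A single query: the exemplar demonstrates the search procedure in-context.
    return AOT_EXEMPLAR + f"Task: use the numbers {numbers} to reach 24.\nSearch:\n"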

#LLM #LargeLanguageModels #ChainOfThought #AlgorithmOfThoughts #reasoning #querying #NLP #NaturalLanguageProcessing

Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models

Current literature, aiming to surpass the "Chain-of-Thought" approach, often resorts to external modi operandi involving halting, modifying, and then resuming the generation process to boost Large Language Models' (LLMs) reasoning capacities. Due to their myopic perspective, they escalate the number of query requests, leading to increased costs, memory, and computational overheads. Addressing this, we propose the Algorithm of Thoughts -- a novel strategy that propels LLMs through algorithmic reasoning pathways. By employing algorithmic examples fully in-context, this overarching view of the whole process exploits the innate recurrence dynamics of LLMs, expanding their idea exploration with merely one or a few queries. Our technique outperforms earlier single-query methods and even more recent multi-query strategies that employ an extensive tree search algorithms while using significantly fewer tokens. Intriguingly, our results suggest that instructing an LLM using an algorithm can lead to performance surpassing that of the algorithm itself, hinting at LLM's inherent ability to weave its intuition into optimized searches. We probe into the underpinnings of our method's efficacy and its nuances in application. The code and related content can be found in: https://algorithm-of-thoughts.github.io.

arXiv.org

Addendum 6

On the Unexpected Abilities of Large Language Models
https://arxiv.org/abs/2308.09720

* Large language models capable of displaying wide range of abilities not directly connected w. training
* argues a side effect of indirect acquisition is the development of integrated abilities; discusses the extent to which those abilities are predictable
* discusses relation betw. cognitive skills acquired by LLM & human cognition

#LLM #LargeLanguageModels #reasoning #NLP #NaturalLanguageProcessing #emergence #EmergentBehavior #cognition

On the Unexpected Abilities of Large Language Models

Large language models are capable of displaying a wide range of abilities that are not directly connected with the task for which they are trained: predicting the next words of human-written texts. In this article, I discuss the nature of this indirect acquisition process and its relation to other known indirect processes. I argue that an important side effect of such indirect acquisition is the development of integrated abilities. I discuss the extent to which the abilities developed by large language models are predictable. Finally, I briefly discuss the relation between the cognitive skills acquired by these systems and human cognition.

arXiv.org

Addendum 7

MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in LLM
https://arxiv.org/abs/2308.09729

* cf. Addendum 4 (Graph of Thoughts)
* prompt LLM w. knowledge graphs (sketch below)
* engages LLM w. ext. knowledge; elicits reasoning pathways
* prompting pipeline endows LLM w. capability of comprehending KG inputs
* mind map on which LLMs perform reasoning, generate answers
* grounded on ontology of knowledge
* GPT-3.5 prompted w. MindMap consistently outperforms GPT-4
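
A minimal sketch of KG prompting in that spirit; the triple serialization and template are my assumptions, not the paper's exact format:

TRIPLES = [  # illustrative retrieved facts
    ("Metformin", "treats", "type 2 diabetes"),
    ("Metformin", "contraindicated_with", "severe renal impairment"),
]

def kg_prompt(question: str, triples=TRIPLES) -> str:
    # Linearize retrieved KG triples so the LLM can reason over them and
    # expose the reasoning path it uses.
    evidence = "\n".join(f"({h}) -[{r}]-> ({t})" for h, r, t in triples)
    return (f"Knowledge-graph evidence:\n{evidence}\n\n"
            f"Question: {question}\n"
            "Build a mind map from the evidence, show the reasoning path "
            "you use, then answer.")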

#LLM #KnowledgeGraphs #MindMaps #GraphOfThoughts #GPT3 #GPT4 #PromptEngineering

MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models

LLMs usually exhibit limitations in their ability to incorporate new knowledge, the generation of hallucinations, and the transparency of their decision-making process. In this paper, we explore how to prompt LLMs with knowledge graphs (KG), working as a remedy to engage LLMs with up-to-date knowledge and elicit the reasoning pathways from LLMs. Specifically, we build a prompting pipeline that endows LLMs with the capability of comprehending KG inputs and inferring with a combined implicit knowledge and the retrieved external knowledge. In addition, we investigate eliciting the mind map on which LLMs perform the reasoning and generate the answers. It is identified that the produced mind map exhibits the reasoning pathways of LLMs grounded on the ontology of knowledge, hence bringing the prospects of probing and gauging LLM inference in production. The experiments on three question & answering datasets also show that MindMap prompting leads to a striking empirical gain. For instance, prompting a GPT-3.5 with MindMap yields an overwhelming performance over GPT-4 consistently. We also demonstrate that with structured facts retrieved from KG, MindMap can outperform a series of prompting-with-document-retrieval methods, benefiting from more accurate, concise, and comprehensive knowledge from KGs. To reproduce our results and extend the framework further, we make our codebase available at https://github.com/wyl-willing/MindMap.

arXiv.org

Addendum 8

Instruction tuning: https://en.wikipedia.org/wiki/Large_language_model#Instruction_tuning
* self-instruct approaches
* enable LLM to bootstrap correct responses

Instruction Tuning for Large Language Models: A Survey
https://arxiv.org/abs/2308.10792

* instruction tuning (IT): further supervised training of LLMs on dataset of (instruction, output) pairs
* enhances capabilities, control of LLM
* bridges gap betw. next-word prediction obj. of LLM & users' obj. of LLM adhering to human instructions (example pair below)
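
One (instruction, output) pair, serialized with a common template (Alpaca-style, shown here as an assumption; the survey covers several formats):

EXAMPLE = {
    "instruction": "Translate the sentence to French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
}

def to_training_text(ex: dict) -> str:
    # During supervised training, the loss is typically computed only on
    # the response tokens, not on the instruction/input.
    return (f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Input:\n{ex['input']}\n\n"
            f"### Response:\n{ex['output']}")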

#LLM #LargeLanguageModels #InstructionTuning

Large language model - Wikipedia

Addendum 9

Emergent Abilities of Large Language Models
https://www.assemblyai.com/blog/emergent-abilities-of-large-language-models

* Emergence: sudden appearance of novel behavior
* LLM apparently display emergence by suddenly gaining new abilities as they grow
* Why does this happen; what does it mean?

Beyond the Imitation Game: Quantifying & extrapolating capabilities of language models
https://arxiv.org/abs/2206.04615

* tasks that exhibit breakthrough behavior at critical scale

#LLM #LargeLanguageModels #EmergentProperties #emergence

Emergent Abilities of Large Language Models

Emergence can be defined as the sudden appearance of novel behavior. Large Language Models apparently display emergence by suddenly gaining new abilities as they grow. Why does this happen, and what does this mean?

News, Tutorials, AI Research

Addendum 10

When Do Program-of-Thoughts Work for Reasoning?
https://arxiv.org/abs/2308.15452
https://github.com/zjunlp/EasyInstruct

* reasoning capabilities of large language models pivotal in embodied AI
* program-of-thought prompting for LLM uses programming language to tackle complex reasoning (sketch below)
* e.g. mathematical reasoning; code data filtering
* specific impact of code data on improvement of reasoning capabilities underexplored
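
A minimal sketch of program-of-thought prompting: ask the model to emit Python, then execute the program to obtain the answer. Template and helper names are mine:

def pot_prompt(question: str) -> str:
    return (f"# Question: {question}\n"
            "# Write Python that computes the answer into a variable `ans`.\n")

def run_generated(code: str):
    # Execute model-written code; sandbox untrusted output in practice.
    scope: dict = {}
    exec(code, scope)
    return scope.get("ans")

print(run_generated("ans = (23 - 20) + 6"))  # -> 9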

#LLM #LargeLanguageModels #ChainOfThought #ProgramOfThought #reasoning #EasyInstruct

When Do Program-of-Thoughts Work for Reasoning?

In the realm of embodied artificial intelligence, the reasoning capabilities of Large Language Models (LLMs) play a pivotal role. Although there are effective methods like program-of-thought prompting for LLMs which uses programming language to tackle complex reasoning tasks, the specific impact of code data on the improvement of reasoning capabilities remains under-explored. To address this gap, we propose complexity-impacted reasoning score (CIRS), which combines structural and logical attributes, to measure the correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity by considering the difficulty and the cyclomatic complexity. Through an empirical analysis, we find not all code data of complexity can be learned or understood by LLMs. Optimal level of complexity is critical to the improvement of reasoning abilities by program-aided prompting. Then we design an auto-synthesizing and stratifying algorithm, and apply it to instruction generation for mathematical reasoning and code data filtering for code generation tasks. Extensive results demonstrates the effectiveness of our proposed approach. Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.

arXiv.org

Addendum 11

Making Large Language Models Better Reasoners w. Alignment
https://arxiv.org/abs/2309.02144

* reasoning: cognitive process; evidence-based conclusions
* fine-tuning LLM w. chain of thought (COT) reasoning sig. enhances reasoning
* however, fine-tuned LLMs freq. assign higher scores to subpar COTs ("Assessment Misalignment")
* Alignment Fine-Tuning (AFT), 3 steps: 1. fine-tune w. COT data; 2. generate multiple COT responses, categorize correct/incorrect; 3. calibrate scores w. constraint alignment loss (sketch below)
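
A sketch of the two objectives as I read the abstract (not the paper's exact loss); scores stand in for the model's log-likelihoods of whole COT responses:

def constraint_alignment_loss(pos_scores, neg_scores, margin=0.1, floor=-5.0):
    # a) Alignment: every correct COT should outscore every incorrect one.
    align = sum(max(0.0, margin - (p - n))
                for p in pos_scores for n in neg_scores)
    # b) Constraint: keep negative scores in a reasonable range rather
    #    than pushing them arbitrarily low, which degrades the model.
    constraint = sum(max(0.0, floor - n) for n in neg_scores)
    return align + constraint

print(constraint_alignment_loss([-1.0, -1.2], [-1.1, -6.0]))  # 0.2 + 1.0 = 1.2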

#LLM #LargeLanguageModels #ChainOfThought #ProgramOfThought #reasoning

Making Large Language Models Better Reasoners with Alignment

Reasoning is a cognitive process of using evidence to reach a sound conclusion. The reasoning capability is essential for large language models (LLMs) to serve as the brain of the artificial general intelligence agent. Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities. However, we find that the fine-tuned LLMs suffer from an \textit{Assessment Misalignment} problem, i.e., they frequently assign higher scores to subpar COTs, leading to potential limitations in their reasoning abilities. To address this problem, we introduce an \textit{Alignment Fine-Tuning (AFT)} paradigm, which involves three steps: 1) fine-tuning LLMs with COT training data; 2) generating multiple COT responses for each question, and categorizing them into positive and negative ones based on whether they achieve the correct answer; 3) calibrating the scores of positive and negative responses given by LLMs with a novel constraint alignment loss. Specifically, the constraint alignment loss has two objectives: a) Alignment, which guarantees that positive scores surpass negative scores to encourage answers with high-quality COTs; b) Constraint, which keeps the negative scores confined to a reasonable range to prevent the model degradation. Beyond just the binary positive and negative feedback, the constraint alignment loss can be seamlessly adapted to the ranking situations when ranking feedback is accessible. Furthermore, we also delve deeply into recent ranking-based alignment methods, such as DPO, RRHF, and PRO, and discover that the constraint, which has been overlooked by these approaches, is also crucial for their performance. Extensive experiments on four reasoning benchmarks with both binary and ranking feedback demonstrate the effectiveness of AFT.

arXiv.org

Addendum 12

Are Emergent Abilities in Large Language Models just In-Context Learning?
https://arxiv.org/abs/2309.01809

* emergent abilities in LLM, if true, have profound implications (research, society)
* evaluation of these abilities confounded by competencies that arise through prompting techniques; other biasing factors
...

#LLM #LargeLanguageModels #ContextLearning #emergence #EmergentProperties #epistemology #GPT

Are Emergent Abilities in Large Language Models just In-Context Learning?

Large language models, comprising billions of parameters and pre-trained on extensive web-scale corpora, have been claimed to acquire certain capabilities without having been specifically trained on them. These capabilities, referred to as "emergent abilities," have been a driving force in discussions regarding the potentials and risks of language models. A key challenge in evaluating emergent abilities is that they are confounded by model competencies that arise through alternative prompting techniques, including in-context learning, which is the ability of models to complete a task based on a few examples. We present a novel theory that explains emergent abilities, taking into account their potential confounding factors, and rigorously substantiate this theory through over 1000 experiments. Our findings suggest that purported emergent abilities are not truly emergent, but result from a combination of in-context learning, model memory, and linguistic knowledge. Our work is a foundational step in explaining language model performance, providing a template for their efficient use and clarifying the paradox of their ability to excel in some instances while faltering in others. Thus, we demonstrate that their capabilities should not be overestimated.

arXiv.org

Addendum 12 cont'd

* rigorous tests: 18 models; 60 million - 175 billion parameters; comprehensive set of 22 tasks; >1,000 experiments
* compelling evidence that emergent abilities can primarily be ascribed to in-context learning (minimal example below)
* no evidence for emergence of reasoning abilities
* provides valuable insights into underlying mechanisms driving observed abilities, thus alleviating safety concerns
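
For reference, in-context learning in its minimal form: the model completes the task from a few examples given in the prompt, with no weight updates (the translation shots echo the canonical GPT-3 illustration):

ICL_PROMPT = """\
sea otter -> loutre de mer
cheese -> fromage
peppermint -> menthe poivree
plush giraffe ->"""
# the model is expected to continue with the French translation,
# e.g. "girafe en peluche"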

#LLM #LargeLanguageModels #ContextLearning #emergence #EmergentProperties #epistemology #GPT

Addendum 13

Large Language Model for Science: A Study on P vs. NP
https://arxiv.org/abs/2309.05689

* LLM to augment/accel. research on P vs. NP problem: https://en.wikipedia.org/wiki/P_versus_NP_problem
+ unsolved prob., theor. comp. sci.
+ asks whether every problem whose solution can be quickly verified can also be quickly solved
* in-depth thinking w. LLM for complex problem-solving
* GPT-4 produced proof schema, engaged in rigorous reasoning throughout 97 dialogue turns (Socratic method)
* concluded P ≠ NP in alignment w. Xu & Zhou 2023

#LLM #GPT4 #PvsNP

Large Language Model for Science: A Study on P vs. NP

In this work, we use large language models (LLMs) to augment and accelerate research on the P versus NP problem, one of the most important open problems in theoretical computer science and mathematics. Specifically, we propose Socratic reasoning, a general framework that promotes in-depth thinking with LLMs for complex problem-solving. Socratic reasoning encourages LLMs to recursively discover, solve, and integrate problems while facilitating self-evaluation and refinement. Our pilot study on the P vs. NP problem shows that GPT-4 successfully produces a proof schema and engages in rigorous reasoning throughout 97 dialogue turns, concluding "P $\neq$ NP", which is in alignment with (Xu and Zhou, 2023). The investigation uncovers novel insights within the extensive solution space of LLMs, shedding light on LLM for Science.

arXiv.org