David Schneider-Joseph

74 Followers
250 Following
90 Posts

Math, ML, AI alignment, AI timelines.

Formerly worked on guidance and navigation software for first-stage landing @ SpaceX. Before that: AWS and Google.

Some interesting examples of GPT-4's abilities.

In the first, it uses a Linux terminal to iteratively problem-solve its way through locating and intruding into a poorly secured machine on a local network.

In the second, it navigates through a text-based environment, after which it accurately draws a map of the rooms it walked through.

From Bubeck et al. (2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4: https://arxiv.org/abs/2303.12712

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

arXiv.org

A few minutes ago in a parking lot, a car very slowly backed up towards the car I was sitting in, while my driver saw it coming and blared his horn.

Each second the probability of impact increased, until the impact occurred.

This is basic stuff.

Martingale ≠ random walk.
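The martingale point can be made concrete with a toy simulation (my illustration, not the author's; modeling the approaching car as a symmetric random walk with absorbing barriers is an assumption). The conditional probability of a future event is a martingale: its expectation never moves, yet every individual path ends pinned at 0 or 1, and on paths where the event occurs the probability climbs to 1 along the way, just as it did with the car.

```python
import random

random.seed(0)

N = 10          # absorbing barriers: +N means impact, -N means the car veers off
TRIALS = 5000

probs_at_t5 = []   # P(impact) after 5 steps, one value per trial
final_probs = []   # P(impact) at absorption: exactly 0.0 or 1.0

for _ in range(TRIALS):
    x = 0
    for t in range(1, 10**6):
        x += random.choice((-1, 1))
        if t == 5:
            # Hitting probability of +N from position x is (x + N) / (2N),
            # which is a martingale in t.
            probs_at_t5.append((x + N) / (2 * N))
        if abs(x) == N:
            break
    final_probs.append((x + N) / (2 * N))

# Unconditionally, E[P_t] stays at P_0 = 0.5 for every fixed t (martingale),
# yet each path is eventually pinned to 0 or 1, and on impact paths P_t
# rose all the way to 1. No contradiction: martingale, not random walk.
print(sum(probs_at_t5) / TRIALS)   # close to 0.5
print(sum(final_probs) / TRIALS)   # close to 0.5, but every entry is 0 or 1
```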

Here’s a fun game: identify all the problems with this graphic.

In which Bing asks appropriate clarifying questions to solve a complex puzzle, then works through to the (nearly) correct answer step by step.

(It ignored my admonition not to search the web, but I don't think this puzzle exists on the web.

The puzzle was invented by Alexander Cohen.)

I had Bing write a short sci-fi story. This was its first attempt. Just when it was getting good, the content filter kicked in and the output was replaced with:

"Sorry, I don't know how to discuss this topic. You can try http://bing.com for more information.
Fun fact, did you know only a quarter of the Sahara Desert is sandy"

Prompted by https://twitter.com/BeijingPalmer/status/1626704487434919943

The Cave for the Winter

The Eben Ice Caves exist only in the icy wint

Bing

This is a theory of mind puzzle that ChatGPT consistently gets wrong and Bing consistently gets right (I verified myself). Pretty impressive.

Source: https://twitter.com/sir_deenicus/status/1626732776639561730

Deen Kun A. on Twitter

“This is a theory of mind puzzle I just tried from Gary Marcus's blog that ChatGPT consistently fails. And as I suspected, Bing's model is better at modeling this kind of stuff. They still keep getting better. It's why I don't dismiss LLMs”

Twitter

Z. Zhang et al. (2023). Multimodal Chain-of-Thought Reasoning in Language Models

https://arxiv.org/abs/2302.00923

Multimodal Chain-of-Thought Reasoning in Language Models

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points (75.17%->91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance. Code is publicly available at https://github.com/amazon-science/mm-cot.

arXiv.org
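The two-stage framework the abstract describes can be sketched as follows. This is a minimal illustration under names of my own choosing (`multimodal_cot`, `fake_model`); the paper's actual implementation is at the linked GitHub repository, and `generate` stands in for its vision-language seq2seq model.

```python
def multimodal_cot(question, image_features, generate):
    """Two-stage Multimodal-CoT sketch.

    `generate` is any callable (prompt, image_features) -> text, standing
    in for the paper's vision-language seq2seq model.
    """
    # Stage 1: rationale generation from text + image, before any answer.
    rationale = generate(f"{question}\nLet's think step by step.", image_features)
    # Stage 2: answer inference, conditioned on the stage-1 rationale so the
    # answer can lean on the multimodal reasoning chain.
    return generate(f"{question}\nRationale: {rationale}\nAnswer:", image_features)

# Toy demonstration with a fake model that records the prompts it sees.
calls = []
def fake_model(prompt, image_features):
    calls.append(prompt)
    return ("the leaf converts sunlight to energy"
            if "step by step" in prompt else "photosynthesis")

answer = multimodal_cot("What process feeds the plant shown?", None, fake_model)
print(answer)                      # photosynthesis
print("Rationale:" in calls[1])    # True: stage 2 saw the stage-1 rationale
```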

Figures from Kosinski (2023). Theory of Mind May Have Spontaneously Emerged in Large Language Models: https://arxiv.org/abs/2302.02083
Theory of Mind May Have Spontaneously Emerged in Large Language Models

We explore the intriguing possibility that theory of mind (ToM), or the uniquely human ability to impute unobservable mental states to others, might have spontaneously emerged in large language models (LLMs). We designed 40 false-belief tasks, considered a gold standard in testing ToM in humans, and administered them to several LLMs. Each task included a false-belief scenario, three closely matched true-belief controls, and the reversed versions of all four. Smaller and older models solved no tasks; GPT-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of six-year-old children observed in past studies. These findings suggest the intriguing possibility that ToM, previously considered exclusive to humans, may have spontaneously emerged as a byproduct of LLMs' improving language skills.

arXiv.org