Aran Komatsuzaki

ML research @ GaTech, DuckAI, EleutherAI, LAION.

OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Presents an open web-scale filtered dataset of interleaved image-text documents comprising 141M web pages extracted from Common Crawl, 353M associated images, and 115B text tokens

repo:… https://twitter.com/i/web/status/1674590148812521474

LLaVAR: Enhanced Visual Instruction Tuning
for Text-rich Image Understanding

Substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement)

proj: https://llavar.github.io/
abs: https://arxiv.org/abs/2306.17107

Automatic Calibration and Error Correction for Large Language Models via Pareto Optimal Self-Supervision

Significant improvement for off-the-shelf LLMs, boosting GPT-4 results past SotA supervised results on challenging evaluation datasets

https://arxiv.org/abs/2306.16564

Pareto Optimal Learning for Estimating Large Language Model Errors

Large Language Models (LLMs) have shown impressive abilities in many applications. When a concrete and precise answer is desired, it is important to have a quantitative estimation of the potential error rate. However, this can be challenging due to the text-in-text-out nature of generative models. We present a method based on Pareto optimization that generates a risk score to estimate the probability of error in an LLM response by integrating multiple sources of information. We prove theoretically that the error estimator optimized in our framework aligns with the LLM and the information sources in a Pareto optimal manner. Experimental results show that the risk scores estimated by our method are well correlated with the true LLM error rate, thus facilitating error correction. By dynamically combining with prompting strategies such as self-verification and information retrieval, we demonstrate that the proposed method can be utilized to increase the performance of an LLM, surpassing state-of-the-art task-specific models.
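As a rough illustration of the abstract's idea of integrating multiple information sources into one probability-of-error score, here is a minimal logistic-combination sketch. This is a hedged stand-in only: the function name, weights, and signal encoding are assumptions, and the paper's Pareto-optimal estimator uses a more involved objective.

```python
import math

def risk_score(source_flags, weights, bias=0.0):
    """Combine disagreement signals from several information sources
    into a single estimated probability that the LLM answer is wrong,
    via a simple logistic model. flag = 1 means the source disagrees
    with the LLM's answer, 0 means it agrees."""
    z = bias + sum(w * f for w, f in zip(weights, source_flags))
    return 1.0 / (1.0 + math.exp(-z))

# Two sources disagree with the LLM answer, one agrees; the weights
# and bias here are illustrative, not learned values from the paper.
p_err = risk_score([1, 1, 0], weights=[1.5, 1.0, 0.8], bias=-2.0)
```

In the paper's setup the analogous weights are optimized so the combined score aligns with both the LLM and the sources; here they are fixed by hand purely for illustration.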

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals

Presents a novel method for generating high-quality images directly from brain EEG signals, without the need to translate thoughts into text.

https://arxiv.org/abs/2306.16934

This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text. DreamDiffusion leverages pre-trained text-to-image models and employs temporal masked signal modeling to pre-train the EEG encoder for effective and robust EEG representations. Additionally, the method further leverages the CLIP image encoder to provide extra supervision to better align EEG, text, and image embeddings with limited EEG-image pairs. Overall, the proposed method overcomes the challenges of using EEG signals for image generation, such as noise, limited information, and individual differences, and achieves promising results. Quantitative and qualitative results demonstrate the effectiveness of the proposed method as a significant step towards portable and low-cost ``thoughts-to-image'', with potential applications in neuroscience and computer vision.
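The temporal masked signal modeling mentioned above can be pictured as hiding contiguous time windows of the EEG recording and training the encoder to reconstruct them. A minimal sketch of the masking step follows; the window size, mask ratio, and function name are assumptions, not the paper's settings.

```python
import random

def mask_time_windows(eeg, window=10, mask_ratio=0.5, seed=0):
    """Zero out randomly chosen contiguous time windows of a
    (channels x time) EEG signal. The pre-training objective would
    then be to reconstruct the hidden windows, forcing the encoder
    to learn robust temporal EEG representations."""
    rng = random.Random(seed)
    n_time = len(eeg[0])
    n_windows = n_time // window
    masked = rng.sample(range(n_windows), int(n_windows * mask_ratio))
    out = [row[:] for row in eeg]  # copy; leave the input untouched
    for w in masked:
        for row in out:
            for t in range(w * window, (w + 1) * window):
                row[t] = 0.0
    return out, masked

# Toy input: 2 channels, 100 time steps of constant signal.
eeg = [[1.0] * 100 for _ in range(2)]
out, masked = mask_time_windows(eeg)
```

Masking whole windows rather than isolated samples reflects the temporal structure of EEG, where informative patterns span many consecutive samples.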

Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Presents a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts.

proj: https://neuralcarver.github.io/michelangelo/… https://twitter.com/i/web/status/1674579678047227908

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

- HyenaDNA scales sub-quadratically in sequence length
- Reaches SotA on 12 of 18 datasets using a model with orders of magnitude fewer parameters and pretraining data

https://arxiv.org/abs/2306.15794

Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions, was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at single-nucleotide resolution - an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables - including the first use of in-context learning in genomics. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets on average by +10 accuracy points. Code at https://github.com/HazyResearch/hyena-dna.
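The single-nucleotide tokenization the abstract contrasts with k-mer tokenizers can be sketched as a plain character-level vocabulary: every base is its own token, so a single nucleotide polymorphism changes exactly one token. This is an illustration of the idea only, not HyenaDNA's actual vocabulary or special tokens.

```python
# Character-level DNA vocabulary: A, C, G, T plus N for unknown bases.
VOCAB = {ch: i for i, ch in enumerate("ACGTN")}

def tokenize(seq: str) -> list[int]:
    """Map each nucleotide to its own integer id, preserving
    single-nucleotide resolution (no k-mer aggregation)."""
    return [VOCAB[ch] for ch in seq.upper()]
```

With five symbols per position instead of 4^k k-mer units, the vocabulary stays tiny, and the cost of the long context shifts entirely onto the sequence-mixing layer — which is where Hyena's sub-quadratic implicit convolutions come in.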

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Finds that LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever.

https://arxiv.org/abs/2306.16410

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
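The modular design described above amounts to: vision modules emit text, and a frozen LLM reasons over that text. A minimal sketch of assembling such a prompt from module outputs follows; the field names and layout are assumptions for illustration, not the paper's exact prompt format.

```python
def build_vision_prompt(tags, attributes, captions, question):
    """Turn the outputs of independent vision modules (taggers,
    attribute detectors, captioners) into a single text prompt that
    any off-the-shelf LLM can answer without multimodal training."""
    return (
        "Tags: " + ", ".join(tags) + "\n"
        "Attributes: " + ", ".join(attributes) + "\n"
        "Captions: " + " ".join(captions) + "\n"
        "Question: " + question + "\n"
        "Short answer:"
    )

# Hypothetical module outputs for one image.
prompt = build_vision_prompt(
    tags=["dog", "frisbee", "grass"],
    attributes=["brown dog", "green grass"],
    captions=["A dog leaps to catch a frisbee in a park."],
    question="What is the dog catching?",
)
```

Because the LLM only ever sees text, swapping in a bigger or better language model requires no retraining of the vision side — the modularity the abstract emphasizes.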

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Develops a quantitative framework to evaluate whose opinions model-generated responses are more similar to.

proj: https://llmglobalvalues.anthropic.com/
data: https://huggingface.co/datasets/Anthropic/llm_global_opinions
abs:… https://twitter.com/i/web/status/1674218677678362624

Extending Context Window of Large Language Models via Positional Interpolation

Presents Position Interpolation (PI) that extends the context window sizes of LLaMA to up to 32k with minimal fine-tuning (within 1000 steps)

https://arxiv.org/abs/2306.15595

We present Position Interpolation (PI) that extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization from LLaMA 7B to 65B. Meanwhile, models extended by Position Interpolation preserve quality relatively well on tasks within their original context window. To achieve this goal, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length, which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain their original architecture and can reuse most pre-existing optimization and infrastructure.
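The down-scaling step is simple to sketch: position indices for the extended context are linearly squeezed back into the range the model saw during pretraining, before rotary embeddings are applied. The snippet below is illustrative only; the context sizes follow the abstract (LLaMA's 2048 pretrained window extended to 32768) and the function name is hypothetical.

```python
def interpolated_positions(seq_len, original_ctx=2048, extended_ctx=32768):
    """Linearly down-scale position indices so that positions in an
    extended context map back into [0, original_ctx), the range the
    RoPE-based model was pretrained on. With these defaults the
    scale factor is 2048 / 32768 = 1/16."""
    scale = original_ctx / extended_ctx
    return [i * scale for i in range(seq_len)]

# All 32768 positions land strictly inside the trained range [0, 2048),
# so self-attention never sees an index beyond what it was trained on.
pos = interpolated_positions(32768)
```

Interpolating between seen positions, rather than extrapolating past them, is what avoids the catastrophically high attention scores the abstract describes.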

LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

Provides the first set of open-source LLM-based theorem provers without any proprietary datasets and releases them under a permissive MIT license to facilitate further research.

proj: https://leandojo.org/
repo:… https://twitter.com/i/web/status/1673857618757001219

AI-Driven Formal Theorem Proving in the Lean Ecosystem