Aran Komatsuzaki

953 Followers
0 Following
921 Posts
ML research @ GaTech, DuckAI, EleutherAI, LAION.
OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Presents an open web-scale filtered dataset of interleaved image-text documents comprising 141M web pages extracted from Common Crawl, 353M associated images, and 115B text tokens

repo:… https://twitter.com/i/web/status/1674590148812521474
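A record in an interleaved image-text dataset can be pictured as an ordered mix of text and image elements, preserving their original web-page order. A minimal sketch under a hypothetical schema (field names are illustrative, not OBELISC's actual format):

```python
# Hypothetical sketch of one "interleaved image-text document" record:
# an ordered list of text and image elements. Field names are
# illustrative, not the dataset's actual schema.
doc = {
    "url": "https://example.com/article",
    "elements": [
        {"type": "text", "content": "A paragraph introducing the topic."},
        {"type": "image", "src": "https://example.com/fig1.jpg"},
        {"type": "text", "content": "A paragraph discussing the image above."},
    ],
}

def token_and_image_counts(document):
    """Count whitespace-separated tokens and images in one document."""
    tokens = sum(len(e["content"].split())
                 for e in document["elements"] if e["type"] == "text")
    images = sum(1 for e in document["elements"] if e["type"] == "image")
    return tokens, images
```

Corpus-level figures like "353M images, 115B text tokens" are aggregates of exactly these per-document counts.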


LLaVAR: Enhanced Visual Instruction Tuning for Text-rich Image Understanding

Substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement)

proj: https://llavar.github.io/
abs: https://arxiv.org/abs/2306.17107


Automatic Calibration and Error Correction for Large Language Models via Pareto Optimal Self-Supervision

Significant improvement for off-the-shelf LLMs, boosting GPT-4 results past SotA supervised results on challenging evaluation datasets

https://arxiv.org/abs/2306.16564

Pareto Optimal Learning for Estimating Large Language Model Errors

Large Language Models (LLMs) have shown impressive abilities in many applications. When a concrete and precise answer is desired, it is important to have a quantitative estimation of the potential error rate. However, this can be challenging due to the text-in-text-out nature of generative models. We present a method based on Pareto optimization that generates a risk score to estimate the probability of error in an LLM response by integrating multiple sources of information. We prove theoretically that the error estimator optimized in our framework aligns with the LLM and the information sources in a Pareto optimal manner. Experimental results show that the risk scores estimated by our method are well correlated with the true LLM error rate, thus facilitating error correction. By dynamically combining with prompting strategies such as self-verification and information retrieval, we demonstrate the proposed method can be utilized to increase the performance of an LLM, surpassing state-of-the-art task-specific models.

arXiv.org
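The core idea, integrating several information sources into a single error risk score, can be sketched with a simple weighted combination. This is a conceptual illustration only: the paper learns the combination via Pareto optimal self-supervision, whereas the signals and weights below are hypothetical fixed values:

```python
import math

def risk_score(signals, weights, bias=0.0):
    """Combine multiple error signals into a probability-like risk score.

    Illustrative only: the paper learns the combination via Pareto
    optimization; here the weights are hypothetical fixed values.
    """
    z = bias + sum(w * s for w, s in zip(weights, signals))
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing to (0, 1)

# Hypothetical signals for one LLM response (each in [0, 1]):
#   disagreement with a self-verification pass,
#   mismatch with retrieved documents,
#   a response-level uncertainty proxy.
signals = [0.8, 0.6, 0.3]
weights = [2.0, 1.5, 1.0]  # hypothetical; learned in practice
score = risk_score(signals, weights, bias=-2.0)
```

A well-calibrated score of this kind can then gate error correction: responses above a risk threshold get re-prompted with self-verification or retrieval, as the abstract describes.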

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals

Presents a novel method for generating high-quality images directly from brain EEG signals, without the need to translate thoughts into text.

https://arxiv.org/abs/2306.16934


This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text. DreamDiffusion leverages pre-trained text-to-image models and employs temporal masked signal modeling to pre-train the EEG encoder for effective and robust EEG representations. Additionally, the method further leverages the CLIP image encoder to provide extra supervision to better align EEG, text, and image embeddings with limited EEG-image pairs. Overall, the proposed method overcomes the challenges of using EEG signals for image generation, such as noise, limited information, and individual differences, and achieves promising results. Quantitative and qualitative results demonstrate the effectiveness of the proposed method as a significant step towards portable and low-cost "thoughts-to-image", with potential applications in neuroscience and computer vision.
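The temporal masked signal modeling step, hiding random stretches of the EEG signal so the encoder must learn to reconstruct them, can be sketched as follows. The patch length and mask ratio here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def mask_temporal_segments(eeg, mask_ratio=0.5, patch_len=8, rng=None):
    """Zero out a random subset of fixed-length temporal patches.

    Sketch of the masking step in temporal masked signal modeling;
    patch length and mask ratio are illustrative, not the paper's values.
    eeg: array of shape (channels, time).
    """
    rng = rng or np.random.default_rng(0)
    _, length = eeg.shape
    n_patches = length // patch_len
    n_masked = int(round(mask_ratio * n_patches))
    masked_idx = rng.choice(n_patches, size=n_masked, replace=False)
    mask = np.zeros(n_patches, dtype=bool)
    mask[masked_idx] = True
    out = eeg.copy()
    for i in masked_idx:
        out[:, i * patch_len:(i + 1) * patch_len] = 0.0
    return out, mask

# A pretraining objective would then train the EEG encoder to reconstruct
# the hidden patches, e.g. mean-squared error on masked positions only.
```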


Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Presents a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts.

proj: https://neuralcarver.github.io/michelangelo/… https://twitter.com/i/web/status/1674579678047227908

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

- HyenaDNA scales sub-quadratically in sequence length
- Reaches SotA on 12 of 18 datasets using a model with orders of magnitude fewer parameters and pretraining data

https://arxiv.org/abs/2306.15794


Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions, was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level - an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables - including the first use of in-context learning in genomics. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets on average by +10 accuracy points. Code at https://github.com/HazyResearch/hyena-dna.
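Single-nucleotide tokenization, as opposed to k-mer aggregation, can be sketched in a few lines; the vocabulary indices below are illustrative, not HyenaDNA's actual mapping:

```python
# Sketch of single-nucleotide tokenization as described in the abstract:
# each base is its own token, so no k-mer aggregation blurs resolution.
# The vocabulary indices are illustrative, not HyenaDNA's actual ones.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # N = unknown base

def tokenize(seq):
    """Map a DNA string to one integer token per nucleotide."""
    return [VOCAB[base] for base in seq.upper()]

tokens = tokenize("ACGTN")
# A 1M-token context therefore covers 1M bases directly, and a single-base
# substitution (an SNP) changes exactly one token.
```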


Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Finds that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever.

https://arxiv.org/abs/2306.16410


We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
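The modular recipe, where vision modules emit text descriptions and a frozen LLM reasons over them, can be sketched as prompt assembly. The template and module outputs below are hypothetical, not the paper's exact prompt format:

```python
def build_lens_prompt(tags, attributes, captions, question):
    """Assemble a text-only prompt from vision-module outputs so a frozen
    LLM can answer a visual question.

    A conceptual sketch of the LENS idea; the template and the set of
    modules here are illustrative, not the paper's exact format.
    """
    lines = [
        "Tags: " + ", ".join(tags),
        "Attributes: " + ", ".join(attributes),
        "Captions: " + " ".join(captions),
        "Question: " + question,
        "Answer:",
    ]
    return "\n".join(lines)

prompt = build_lens_prompt(
    tags=["dog", "frisbee", "park"],
    attributes=["brown dog", "red frisbee"],
    captions=["A dog leaps to catch a frisbee in a park."],
    question="What is the dog trying to catch?",
)
# The prompt is then passed to any off-the-shelf LLM; the LLM itself
# never sees pixels, so no multimodal training is involved.
```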


Towards Measuring the Representation of Subjective Global Opinions in Language Models

Develops a quantitative framework to evaluate whose opinions model-generated responses are more similar to.

proj: https://llmglobalvalues.anthropic.com/
data: https://huggingface.co/datasets/Anthropic/llm_global_opinions
abs:… https://twitter.com/i/web/status/1674218677678362624
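One natural way to quantify "whose opinions model-generated responses are more similar to" is to compare answer distributions over a survey question's options. The sketch below uses one minus the Jensen-Shannon divergence; treating this as the paper's metric is an assumption, and the distributions are hypothetical:

```python
import math

def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base 2) between two answer
    distributions. One plausible similarity metric; whether it matches
    the paper's exact choice is an assumption."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 1.0 - (kl(p, m) + kl(q, m)) / 2

# Hypothetical answer distributions over one question's three options:
model     = [0.7, 0.2, 0.1]
country_a = [0.6, 0.3, 0.1]
country_b = [0.1, 0.2, 0.7]
# A higher score means the model's responses look more like that
# country's aggregate survey responses.
```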


Should I request $250/hr for consulting?