Interesting developments in subquadratic alternatives to self-attention based transformers for large sequence modeling (32k and more).
Hyena Hierarchy: Towards Larger Convolutional Language Models
https://arxiv.org/abs/2302.10866
They propose replacing the quadratic self-attention layers with an operator built from implicitly parametrized long-kernel 1D convolutions.
#DeepLearning #LLMs #PaperThread
1/4
Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.
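The two ingredients the abstract names, implicitly parametrized long convolutions and data-controlled gating, can be illustrated with a minimal numpy sketch. This is not the authors' code: the positional-feature filter, the sizes, and all names here are illustrative stand-ins, but the structure (a kernel generated by a function of position, applied via FFT in O(L log L), then gated elementwise by a projection of the input) mirrors the idea.

```python
# Minimal sketch (not the Hyena authors' code) of an implicitly
# parametrized long convolution combined with data-controlled gating.
# All sizes, projections, and the positional filter are illustrative.
import numpy as np

rng = np.random.default_rng(0)
L, D = 512, 8                      # sequence length, model width

def implicit_filter(L, D):
    """Kernel values produced by a function of position rather than
    stored directly, so parameter count does not grow with L."""
    t = np.linspace(0, 1, L)[:, None]                      # (L, 1) positions
    feats = np.concatenate(
        [np.sin(2 * np.pi * k * t) for k in range(1, 5)], axis=1)  # (L, 4)
    W = rng.standard_normal((feats.shape[1], D)) / feats.shape[1]
    decay = np.exp(-5.0 * t)                               # bias toward recency
    return (feats @ W) * decay                             # (L, D)

def fft_long_conv(u, h):
    """Convolve u (L, D) with kernel h (L, D) in O(L log L) via FFT,
    zero-padding to length 2L to avoid circular wrap-around."""
    n = 2 * u.shape[0]
    U = np.fft.rfft(u, n=n, axis=0)
    H = np.fft.rfft(h, n=n, axis=0)
    return np.fft.irfft(U * H, n=n, axis=0)[: u.shape[0]]

def hyena_like_operator(x, Wv, Wg):
    """One gated long-convolution step: y = gate(x) * conv(value(x))."""
    v, g = x @ Wv, x @ Wg                                  # value, gate
    y = fft_long_conv(v, implicit_filter(*v.shape))
    return y * g                                           # data-controlled gating

x = rng.standard_normal((L, D))
Wv, Wg = rng.standard_normal((D, D)), rng.standard_normal((D, D))
y = hyena_like_operator(x, Wv, Wg)
print(y.shape)                                             # (512, 8)
```

The FFT is what makes the operator subquadratic: a dense attention map costs O(L²), while the convolution above costs O(L log L), which is consistent with the large speedups the abstract reports at 64K tokens.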

Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation (MT). We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. We argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches.
5/5
Our dataset also comprises CT and MRI scans, with patients' lesions segmented by an expert.
This allowed us to look at the distribution of lesions cluster-wise and to validate the associations between symptoms and lesions.
Check out our pre-print and comment, ask questions, offer suggestions!
Although sharing the data is not simple, we will release the code soon so the approach can be replicated on similar data and beyond.
The link is already in the paper!
And let us know if you have data you'd like to share and analyse with the methods we are developing👨🏾💻
We are deciding which journal would be the best match to review and, hopefully, publish this work, of which I am super proud. Thanks to co-authors Andrea Zanola, Antonio Bisogno, Silvia Facchini, Lorenzo Pini, Manfredo Atzori, and Maurizio Corbetta!
#scicomm #paperthread #preprints #neuroscience #machinelearning #mri #stroke #clustering
1/n
Our pre-print is finally out!
Here's my first #paperthread 🧵
In this work, co-authors and I clustered the profiles of ischaemic stroke patients and recovered common patterns of cognitive and sensorimotor damage.
...Historically, many focal lesions to specific cortical areas were associated with specific deficits, but most strokes involve subcortical regions and produce multivariate patterns of deficits.
To characterize those patterns, many studies have turned to correlation analysis, factor analysis, and PCA, focusing on the relations among variables (i.e., domains of impairment)...
BACKGROUND: Stroke is one of the leading causes of death and disability. The resulting behavioral deficits can be measured with clinical scales of motor, sensory, and cognitive impairment. The most common of such scales is the National Institutes of Health Stroke Scale, or NIHSS. Computerized tomography (CT) and magnetic resonance imaging (MRI) scans show predominantly subcortical or subcortical-cortical lesions, with pure cortical lesions occurring less frequently. While many experimental studies have correlated specific deficits (e.g. motor or language impairment) with stroke lesion locations, the mapping between symptoms and lesions is not straightforward in clinical practice. The advancement of machine learning and data science in recent years has shown unprecedented opportunities, even in the biomedical domain. Nevertheless, their application to medicine is not simple, and the development of data-driven methods to learn general mathematical models of diseases from healthcare data is still an unsolved challenge.

METHODS: In this paper we measure statistical similarities of stroke patients based on their NIHSS scores, and we aggregate symptom profiles through two different unsupervised machine learning techniques: spectral clustering and affinity propagation.

RESULTS: We identify clusters of patients with largely overlapping, coherent lesions, based on the similarity of behavioral profiles.

CONCLUSIONS: Overall, we show that an unsupervised learning workflow, open source and transferable to other conditions, can identify coherent mathematical representations of stroke lesions based only on NIHSS data.

### Competing Interest Statement
The authors have declared no competing interest.

### Funding Statement
This work was supported by the "Department of Excellence 2018-2022" initiative of the Italian Ministry of Education (MIUR), awarded to the Department of Neuroscience, University of Padua.
### Author Declarations
Ethical approval was obtained from the Internal Review Board of Washington University School of Medicine (WUSM) for the Saint Louis cohort, and from the Ethics Committee of the Azienda Ospedale Università Padova (AOUP) for the Padua cohort. Data can be made available upon reasonable request to Maurizio Corbetta at maurizio.corbetta{at}unipd.it.

AP: Affinity Propagation. GDM: General Distance Measure. GSM: General Similarity Measure. NIHSS: National Institutes of Health Stroke Scale. RSC: Repeated Spectral Clustering.
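The methods step can be sketched in a few lines of scikit-learn. This is a hedged illustration, not the paper's released code: the synthetic NIHSS-like scores and the RBF similarity below are stand-ins (the paper defines its own similarity measures, e.g. the GDM/GSM listed in its abbreviations), but it shows how a precomputed patient-by-patient similarity matrix feeds both spectral clustering and affinity propagation.

```python
# Hedged sketch (not the paper's code) of clustering patients by the
# similarity of their NIHSS symptom profiles, using the two unsupervised
# methods named in the abstract. Scores and similarity are illustrative.
import numpy as np
from sklearn.cluster import SpectralClustering, AffinityPropagation

rng = np.random.default_rng(42)
# Synthetic NIHSS-like item scores: 60 patients x 11 items, two crude
# severity groups so the clustering has structure to find.
scores = np.vstack([
    rng.integers(0, 3, size=(30, 11)),          # milder profiles
    rng.integers(2, 5, size=(30, 11)),          # more severe profiles
]).astype(float)

# Pairwise similarity between patient profiles: an RBF kernel on squared
# Euclidean distance (a stand-in for the paper's similarity measures).
d2 = ((scores[:, None, :] - scores[None, :, :]) ** 2).sum(-1)
sim = np.exp(-d2 / d2.mean())

# Both algorithms accept a precomputed similarity/affinity matrix.
spec = SpectralClustering(n_clusters=2, affinity="precomputed",
                          random_state=0).fit(sim)
ap = AffinityPropagation(affinity="precomputed", damping=0.9,
                         random_state=0).fit(sim)

print(np.unique(spec.labels_))    # cluster ids from spectral clustering
print(len(np.unique(ap.labels_)))  # number of exemplars found by AP
```

A design note: spectral clustering needs the number of clusters up front, while affinity propagation chooses it from the data (via exemplars), which is one reason a study might run both and compare the resulting patient groupings.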
📝 Now reading: "From empirical problem-solving to theoretical problem-finding: perspectives on the cognitive sciences" -- by @fedeadolfi, Laura van de Braak, and @mariekewoe (2023, PsyArXiv) #PaperThread 🧵