Adam Auton

@adamauton

In the latest edition of "Huh, I didn't expect that to work", our new paper shows that LLMs can outperform existing methods for identifying causal genes in genome-wide association studies.

https://www.medrxiv.org/content/10.1101/2024.05.30.24308179v1

Large language models identify causal genes in complex trait GWAS

Identifying underlying causal genes at significant loci from genome-wide association studies (GWAS) remains a challenging task. Literature evidence for disease-gene co-occurrence, whether through automated approaches or human expert annotation, is one way of nominating causal genes at GWAS loci. However, current automated approaches are limited in accuracy and generalizability, and expert annotation is not scalable to hundreds of thousands of significant findings. Here, we demonstrate that large language models (LLMs) can accurately identify genes likely to be causal at loci from GWAS. By evaluating the performance of GPT-3.5 and GPT-4 on datasets of GWAS loci with high-confidence causal gene annotations, we show that these models outperform state-of-the-art methods in identifying putative causal genes. These findings highlight the potential of LLMs to augment existing approaches to causal gene discovery.

### Competing Interest Statement

S.S., W.W., S.K., X.W., A.R., A.A., A.K., are employed by and hold stock or stock options in 23andMe, Inc.

### Funding Statement

This study did not receive any funding

### Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

All source data were openly available. Download links:

* OpenTargets - https://github.com/opentargets/genetics-gold-standards/
* Pharmaprojects - https://github.com/ericminikel/genetic_support
* Weeks et al. - https://www.finucanelab.org/data
* GWAS Catalog - https://www.ebi.ac.uk/gwas/docs/file-downloads

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

We will share all processed datasets used in our analysis, as well as the prediction results from all methods on all datasets, intermediate outputs like gene and phenotype embeddings using Zenodo (doi: 10.5281/zenodo.11391053).

medRxiv

Delighted to share our latest preprint. We identified a rare, large-effect coding variant in the Puerto Rican population, contributing to cataract risk, and underscoring the importance of diverse genetic studies. #Genomics #RareVariants #23andme

https://www.medrxiv.org/content/10.1101/2023.07.25.23293173v1

GWAS of cataract in Puerto Ricans identifies a novel large-effect variant in ITGA6

Cataract is a common cause of vision loss and affects millions of people worldwide. Genome-wide association studies (GWAS) and family studies of cataract have demonstrated a role for genetics in cataract susceptibility. However, most of these studies have been conducted in populations of European or Asian descent, leaving the genetic etiology of cataract among Hispanic/Latino (HL) populations unclear. Here we perform the first GWAS of cataract in a Puerto Rican population of research participants derived from the customer base of 23andMe, Inc. In our analysis with 3,060 self-reported cases and 41,890 controls, we found a novel association of large effect size with a rare coding variant in the ITGA6 gene (rs200560853, p-value=2.9*10^(-12), OR=12.7, 95% CI=[6.5, 24.7]). ITGA6 is part of the integrin alpha chain in the laminin receptor subfamily, and likely contributes to eye lens homeostasis, transparency, and cell survival. We found that this coding variant is associated with a 13.7 year earlier disease onset on average, as well as a 4.3-fold higher rate of cataract events in the Puerto Rican population. The variant has a minor allele frequency (MAF) of 0.089% in Puerto Rico and is extremely rare elsewhere in the world. Population genetic analyses showed that the variant is only found in individuals with ancestry from the Americas and countries bordering the Mediterranean Sea, suggesting a North African origin. Our discovery identifies a novel genetic risk factor for cataract in Puerto Ricans and highlights the importance of including underrepresented populations in genomics research to improve our understanding of disease in all populations.

### Competing Interest Statement

All authors are employed by and hold stock or stock options in 23andMe, Inc.

### Funding Statement

This study did not receive any funding.

### Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Eligible research participants were selected from the 23andMe customer base who provided informed consent and volunteered to participate in the research online, under a protocol approved by the external AAHRPP-accredited IRB, Ethical & Independent (E&I) Review Services. As of 2022, E&I Review Services is part of Salus IRB (https://www.versiticlinicaltrials.org/salusirb). Inclusion criteria of the research participants was based on their consent status at the time when the data analysis was initiated.

The full GWAS summary statistics for the 23andMe discovery data set will be made available through 23andMe to qualified researchers under an agreement with 23andMe that protects the privacy of the 23andMe participants. Datasets will be made available at no cost for academic use. Please visit https://research.23andme.com/collaborate/#dataset-access/ for more information and to apply to access the data.

medRxiv

I’m trying toot! So far so good.

What are people’s favorite iOS clients for using this site? The default one seems a little clunky.

So, folks seem to have done a good job importing their Tw*tter networks… may I ask how?