Martin Steinegger

Developing data intensive computational methods • PI @ Seoul National University 🇰🇷 • #FirstGen • #newPI • he/him • Hauptschüler • @thesteinegger on Twitter
Web 🌐 https://steineggerlab.com
Github 💾 https://github.com/martin-steinegger
Twitter 🐦 @thesteinegger

Structural motif search across the protein-universe with Folddisco

https://www.biorxiv.org/content/10.1101/2025.07.06.663357v1

I was just wondering the other day if there was already a tool to do this :) @martinsteinegger

@bioinformatics @strucbio #Bioinformatics

Detecting similar protein structural motifs, functionally crucial short 3D patterns, in large structure collections is computationally prohibitive. Therefore, we developed Folddisco, which overcomes this through an index of position-independent geometric features, including side-chain orientation, combined with a rarity-based scoring system. Folddisco indexes 53 million AFDB50 structures into 1.45 terabytes within 24 hours, enabling rapid detection of discontinuous or segment motifs. Folddisco is more accurate and storage-efficient than state-of-the-art methods, while being an order of magnitude faster. Folddisco is free software available at folddisco.foldseek.com and as a webserver at https://search.foldseek.com/folddisco.

Competing Interest Statement: M.S. acknowledges outside interest in Stylus Medicine. The remaining authors declare no competing interests.

Funding: National Research Foundation of Korea, https://ror.org/013aysd81, 2020M3A9G7103933, RS-2021-NR061659, RS-2021-NR056571, RS-2024-00396026, RS-2023-00250470; Novo Nordisk Foundation, https://ror.org/04txyc737, NNF24SA0092560

bioRxiv
Metagenomic-scale analysis of the predicted protein structure universe https://www.biorxiv.org/content/10.1101/2025.04.23.650224v1?med=mas

Protein structure prediction breakthroughs, notably AlphaFold2 and ESMfold, have led to an unprecedented influx of computationally derived structures. The AlphaFold Protein Structure Database now provides over 200 million models, while the ESM Metagenomic Atlas includes more than 600 million predictions from uncultured microbes. Here, we combine these two resources into the AFESM, an 821-million-entry dataset, and cluster them using a two-step pipeline based on sequence and structure similarity, yielding 5.12 million non-singleton structural clusters. We identify common ancestors and biomes for these clusters to explore their environmental diversity and specificity, and we investigate their domain composition for structural novelties. Initial ESMfold-based predictions revealed no novel domain folds, and re-predicting 2.3 million proteins with AlphaFold2 yielded only one new fold, suggesting near-saturation of the domain space and limitations of predictors. Nevertheless, we discovered many previously unseen domain combinations, highlighting how ESMatlas expands coverage of the known protein fold space. In particular, we find 11,941 multi-domain architectures not observed before, underscoring the importance of metagenomic data for illuminating underexplored regions of the protein structural universe.

Availability: An interactive webserver and data are available at afesm.foldseek.com.

Competing Interest Statement: M.S. declares an outside interest in Stylus Medicine.

bioRxiv
GPU-accelerated homology search with MMseqs2 http://biorxiv.org/cgi/content/short/2024.11.13.623350v1?rss=1

Multiple Protein Structure Alignment at Scale with FoldMason https://www.biorxiv.org/content/10.1101/2024.08.01.606130v1?med=mas
FoldMason progressively aligns thousands of protein structures in seconds, enabling remote MSA for distant phylogeny. Highlights: structural flexible MSA, LDDT conservation score, friendly webserver.
💾 https://github.com/steineggerlab/foldmason
🌐 https://search.foldseek.com/foldmason
📄 https://www.biorxiv.org/content/10.1101/2024.08.01.606130v1

Each year, the Overton Prize is awarded to a scientist for their significant contributions to computational biology. This year, the International Society for Computational Biology (ISCB) has the pleasure of honoring Dr Martin Steinegger with this award at the 32nd Annual Intelligent Systems for Molecular Biology (ISMB) conference being held in Montreal, Quebec, Canada from July 12 to 16.

https://academic.oup.com/bioinformatics/article/40/Supplement_1/i3/7700861

Very well deserved @martinsteinegger 😃

Foldseek-Multimer is a protein complex aligner that is up to 10,000 times faster than SOTA methods without sacrificing quality, enabling the comparison of billions of complex pairs per day.

📄 https://www.biorxiv.org/content/10.1101/2024.04.14.589414v1
💾 https://github.com/steineggerlab/foldseek
🕸️ https://search.foldseek.com

Penguin is our new assembler that reconstructs many-fold more accurate strain-level viral genomes and 16S rRNAs from metagenomes through a novel greedy AA/DNA-hybrid Bayesian overlap extension strategy. Work by Annika Jochheim et al.
📄 https://www.biorxiv.org/content/10.1101/2024.03.29.587318v1
💾 https://github.com/soedinglab/plass

✨ New preprint!!! ✨
'A comprehensive evaluation of taxonomic classifiers in marine vertebrate eDNA studies'

https://www.biorxiv.org/content/10.1101/2024.02.15.580601v1

We realised there are not many papers evaluating taxonomic classifiers in marine vertebrate contexts, and none that use simulations or exclusion databases to measure false positives.

So we designed a bunch of simulations for commonly used primers of 12S, 16S, COI, and evaluated a whole bunch of classifiers on these datasets.

In conclusion: we propose MMseqs2 or Metabuli with 12S/16S, and MMseqs2 or a Naive Bayes classifier (i.e., Mothur) with COI. With these, you can get up to ~10% more correct species identifications.