#RobSelects preprint of the week
#ChemRxiv: Training and finetuning chemical foundation models applied to developing open-source MoLFormer models.
#aichem https://doi.org/10.26434/chemrxiv-2025-4glrl-v2 
ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models
The rapid advancement of machine learning in computational chemistry has opened new doors for designing molecules, predicting molecular properties, and discovering novel materials. However, building scalable and robust models for molecular property prediction remains a significant challenge due to the vast size and complexity of chemical space. In this paper, we introduce ChemBERTa-3, an open-source training framework designed to train and fine-tune large-scale chemical foundation models. We explore the potential of multiple model architectures by evaluating their performance across various molecular datasets from the MoleculeNet suite. Our experiments demonstrate that pre-training on the expansive ZINC20 dataset yields models capable of performing well on both classification and regression tasks, providing valuable insights into drug discovery and materials science. For scalability, we leveraged both AWS-based Ray deployments and on-premise high-performance computing clusters to support the processing power required to train on billions of molecules. In support of reproducible and extensible science, we have open-sourced all ChemBERTa-3 models.
#RobSelects preprint of the week
#ChemRxiv: New experimental datasets and an improved machine learning model for predicting carbon-13 nuclear magnetic resonance shifts of organic molecules.
#aichem https://doi.org/10.26434/chemrxiv-2025-r8m9m 
CASCADE-2.0: Real Time Prediction of 13C-NMR Shifts with sub-ppm Accuracy
Nuclear Magnetic Resonance (NMR) is one of the most powerful structural characterization techniques in molecular sciences. However, the complexity of NMR spectra can make structural assignments prone to errors. Here we introduce a deep learning model – CASCADE-2.0 (ChemicAl Shift CAlculation with DEep learning), a practical tool designed to assist chemists in making fast, reliable, and transparent 13C-NMR chemical shift predictions. Building on our previous model, we make improvements to the model architecture and training data, while striving to enhance the model transparency. Leveraging advances in neural network potentials, a fourfold expansion of training data in terms of molecular and elemental coverage is made, resulting in a dataset containing around 170,000 experimental shifts cross-validated by DFT. To address DFT limitations, we developed an intelligent data augmentation strategy combining statistical analysis and machine learning predictions to further expand the dataset to 211,000 experimental values. With the expanded dataset and changes in model architecture, a state-of-the-art accuracy of 0.73 ppm was achieved when compared against experimental 13C-NMR shifts. The model also incorporates prediction confidence metrics using a deep-kernel learning architecture, as well as nearest-neighbor analysis, facilitated by a user-friendly web-server. Finally, we demonstrate the versatility of the final model using several real-world applications.
#RobSelects preprint of the week
#ChemRxiv: Efficient prediction of transition state geometries from molecular strings of starting materials and products via E(3)-equivariant flow-matching.
#aichem https://doi.org/10.26434/chemrxiv-2025-bk2rh 
GoFlow: Efficient Transition State Geometry Prediction with Flow Matching and E(3)-Equivariant Neural Networks
Transition state (TS) geometries of chemical reactions are key to understanding reaction mechanisms and estimating kinetic properties. Inferring these directly from 2D reaction graphs offers chemists a powerful tool for rapid and accessible reaction analysis. Quantum chemical methods for computing TSs are computationally intensive and often infeasible for larger molecular systems. Recently, deep learning–based diffusion models have shown promise in generating TSs from 2D reaction graphs for single-step reactions. However, framing TS generation as a diffusion process, by design, requires a prohibitively large number of sampling steps during inference. Here we show that modeling TS generation as an optimal transport flow problem, solved via E(3)-equivariant flow matching with geometric tensor networks, achieves over a hundredfold speedup in inference while improving geometric accuracy compared to the state-of-the-art. This breakthrough increase in sampling efficiency and predictive accuracy enables the practical use of deep learning-based TS generators in high-throughput settings for larger and more complex chemical systems. Our method, GoFlow, thus represents a significant methodological advancement in machine learning-based TS generation, bringing it closer to widespread use in computational chemistry workflows.
#RobSelects paper of the week
#JCTC: Implementing computationally efficient and precise second-order derivatives for an equivariant graph neural network architecture for molecules via automatic differentiation.
#aichem https://doi.org/10.1021/acs.jctc.4c01790
#RobSelects preprint of the week
#ChemRxiv: A universal neural network potential for excited state simulations of organic molecules.
#aichem https://doi.org/10.26434/chemrxiv-2025-j207x 
OMNI-P2x: A Universal Neural Network Potential for Excited-State Simulations
Photo-active molecular systems play an essential role in modern science and technology, finding applications in solar cells, organic light-emitting diodes (OLEDs), reaction catalysis, photodynamic therapy, and beyond. The rational design of photo-responsive molecules requires an understanding of the photophysical and photochemical processes underlying their operation. This understanding can be gained via first-principles quantum-mechanical (QM) calculations, which, however, are prohibitively expensive for high-throughput investigations. To break through this limitation, here we introduce OMNI-P2x: the first universal neural network potential for molecular excited and ground electronic states. OMNI-P2x can be used, directly or after fine-tuning, in place of quantum-mechanical methods to perform a wide range of photophysical and photochemical simulations. OMNI-P2x approaches the accuracy of time-dependent density functional theory (TD-DFT) methods at a fraction of the cost. Remarkably, this universal potential is more accurate and faster than established semiempirical QM methods, marking a watershed moment in theoretical method development for excited-state simulations. Here, we demonstrate its use in UV/Vis absorption spectroscopy, in real-time photodynamical simulations, and in the rational design of visible-light-absorbing azobenzene systems.
#RobSelects preprint of the week
#ChemRxiv: Combining semiempirical quantum chemistry with transferable neural network potentials and an atom-pairwise dispersion correction in AIQM2
#aichem https://doi.org/10.26434/chemrxiv-2024-j8pxp-v2 
AIQM2: Organic Reaction Simulations Beyond DFT
Density functional theory (DFT) is the workhorse of reaction simulations, but it suffers from either prohibitive cost or insufficient accuracy. In this work, we report AIQM2, the universal AI-enhanced QM Method 2, the first method that enables fast and accurate large-scale organic reaction simulations for practically relevant system sizes and time scales beyond what is possible with DFT. This breakthrough is based on the outstanding speed of AIQM2, orders of magnitude faster than common DFT, while its accuracy in reaction energies, transition state optimizations, and barrier heights is at least at the level of DFT and often approaches the gold-standard coupled cluster accuracy. AIQM2 can be used out of the box without any further retraining. Compared to pure machine learning potentials, AIQM2 possesses high transferability and robustness in simulations without catastrophic breakdowns. We showcase the superiority of AIQM2 compared to traditional DFT by performing an extensive reaction dynamics study overnight and revising the mechanism and product distribution reported in the previous investigation of the bifurcating pericyclic reaction.
#RobSelects preprint of the week
#ChemRxiv: Reactive machine learning potential via an automated dataset generation procedure and training via the ANI model framework.
#aichem https://doi.org/10.26434/chemrxiv-2025-m2nqq 
ANI-1xBB: an ANI based reactive potential
Reactive potentials serve as essential tools for investigating chemical reactions with moderate computational costs. However, traditional reactive potentials often depend on fixed, semi-empirical parameters, which limits their accuracy and transferability. Overcoming these limitations can significantly expand the applicability of reactive potentials, enabling the simulation of a broader range of reactions under diverse conditions and the prediction of reaction properties, such as barrier heights. This work introduces ANI-1xBB, a novel ANI-based reactive ML potential trained on off-equilibrium molecular conformers generated through an automated bond-breaking workflow. ANI-1xBB significantly enhances the prediction of reaction energetics, barrier heights, and bond dissociation energies, surpassing conventional ANI models. Our results show that ANI-1xBB improves transition state modeling and reaction pathway prediction while generalizing effectively to pericyclic reactions and radical-driven processes. Furthermore, the automated data generation strategy supports the efficient construction of large-scale, high-quality reactive datasets, reducing reliance on expensive QM calculations. This work highlights ANI-1xBB as a practical model for accelerating the development of reactive machine learning potentials, offering new opportunities for modeling reaction phenomena.
#RobSelects preprint of the week
#ChemRxiv: Relating oligopeptide sequence to aggregation propensity.
#aichem https://doi.org/10.26434/chemrxiv-2025-wjbmv 
Amino Acid Composition drives Peptide Aggregation: Predicting Aggregation for Improved Synthesis
Peptide aggregation is a long-standing challenge in chemical peptide synthesis, limiting its efficiency and reliability. Although data-driven methods have enhanced our understanding of many sequence-based phenomena, no comprehensive approach addresses so-called “non-random difficult couplings” (generally linked to aggregation) during solid-phase peptide synthesis. Here, we leverage existing peptide synthesis datasets, supplemented with newly acquired experimental data, to build a predictive model that deciphers the role of individual amino acids in triggering aggregation. First, we identified and experimentally validated composition-dependent aggregation as a stronger predictor than sequence-based patterns. This insight enabled the development of a composition vector representation, allowing insights into the aggregation propensities of individual amino acids. Applying an ensemble of trained models, we predict the aggregation properties of peptides and recommend optimized synthesis conditions. By elucidating each individual amino acid’s influence, this method holds the potential to accelerate synthesis optimization through existing data, offering a robust framework for understanding and controlling peptide aggregation.
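As a rough illustration of what a composition vector could look like, the sketch below maps a peptide to the fraction of each canonical amino acid it contains. The function name `composition_vector` and the normalization to fractions are assumptions made for this example; the abstract does not specify the exact encoding the authors use.

```python
from collections import Counter

# The 20 canonical residues as one-letter codes (ordering is an arbitrary choice here).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence: str) -> list[float]:
    """Return the fraction of each canonical amino acid in `sequence`.

    Hypothetical helper: residue order is discarded on purpose, so two
    peptides with the same composition map to the same vector.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no canonical amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two made-up peptides with identical composition but different order.
vec_a = composition_vector("GAVLGAVL")
vec_b = composition_vector("LLVVAAGG")
```

Because the encoding ignores residue order, `vec_a` and `vec_b` are identical, which is the property that separates a composition-based representation from a sequence-based one.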
#RobSelects preprint of the week
#ChemRxiv: Predicting experimental synthesis procedures from reactant and product SMILES through many narrowly fine-tuned large language models.
#aichem https://doi.org/10.26434/chemrxiv-2025-dc28b 
Collective Intelligence of Specialized Language Models Guides Realization of de novo Chemical Synthesis
While hundreds of thousands of new chemical reactions are reported annually, efficient use of this vast collection of synthetic knowledge remains a persistent challenge in modern chemistry. Recent applications of large language models (LLMs) have shown promise, but systems that reliably work for de novo compounds and molecular transformations have remained elusive. Here we introduce MOSAIC (Multiple Optimized Specialists for AI-Driven Chemical Prediction), a computational framework that enables chemists to harness the collective knowledge of millions of reaction protocols. In contrast to existing approaches relying on agentic models, MOSAIC leverages the open-source Llama3.1-8B-instruct architecture. By training 2,489 specialized chemical experts on Voronoi-clustered reaction spaces, we establish a scalable paradigm that delivers reproducible and human-readable experimental protocols for complex syntheses. Experimental validation demonstrates MOSAIC's ability to predict and execute previously unreported transformations, including challenging reactions via Buchwald-Hartwig amination, Suzuki coupling, and olefin metathesis. We validate this approach through the successful synthesis of over 35 novel compounds spanning pharmaceuticals, materials, agrochemicals, and cosmetics. This framework establishes a new relationship between computational and experimental chemistry, providing a foundation for accelerated chemical discovery across disciplines.
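One way to picture Voronoi-clustered specialists is as nearest-centroid routing: each specialist owns the region of an embedding space that lies closest to its centroid, and a query goes to whichever centroid is nearest. The sketch below assumes Euclidean distance and uses made-up 2D centroids; the function `nearest_specialist`, the embedding, and the routing rule are illustrative assumptions, not details taken from the MOSAIC abstract.

```python
import math


def nearest_specialist(embedding, centroids):
    """Return the index of the centroid closest to `embedding`.

    Assigning each point to its nearest centroid is what defines a
    Voronoi partition; Euclidean distance is an assumption here.
    """
    return min(
        range(len(centroids)),
        key=lambda i: math.dist(embedding, centroids[i]),
    )


# Hypothetical cluster centers, one per specialist model.
centroids = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]

# Hypothetical embedding of a query reaction.
query = (4.2, 0.7)
expert_id = nearest_specialist(query, centroids)
```

Here the query lies closest to the second centroid, so `expert_id` is 1 and the request would go to that specialist.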
#RobSelects paper of the week
#ChemicalScience: Variational autoencoder for property-directed inverse design of organic copolymers.
#aichem https://doi.org/10.1039/D4SC05900J