The rapid advancement of machine learning in computational chemistry has opened new doors for designing molecules, predicting molecular properties, and discovering novel materials. However, building scalable and robust models for molecular property prediction remains a significant challenge due to the vast size and complexity of chemical space. In this paper, we introduce ChemBERTa-3, an open-source framework for training and fine-tuning large-scale chemical foundation models. We explore the potential of multiple model architectures by evaluating their performance across various molecular datasets from the MoleculeNet suite. Our experiments demonstrate that pre-training on the expansive ZINC20 dataset yields models that perform well on both classification and regression tasks, providing valuable insights for drug discovery and materials science. For scalability, we leveraged both AWS-based Ray deployments and on-premise high-performance computing clusters to supply the processing power required to train on billions of molecules. In support of reproducible and extensible science, we have open-sourced all ChemBERTa-3 models.
Nuclear Magnetic Resonance (NMR) is one of the most powerful structural characterization techniques in molecular sciences. However, the complexity of NMR spectra can make structural assignments prone to errors. Here we introduce CASCADE-2.0 (ChemicAl Shift CAlculation with DEep learning), a deep learning model and practical tool designed to assist chemists in making fast, reliable, and transparent 13C-NMR chemical shift predictions. Building on our previous model, we improve the model architecture and training data while striving to enhance model transparency. Leveraging advances in neural network potentials, we expand the training data fourfold in terms of molecular and elemental coverage, resulting in a dataset containing around 170,000 experimental shifts cross-validated by DFT. To address DFT limitations, we developed an intelligent data augmentation strategy combining statistical analysis and machine learning predictions to further expand the dataset to 211,000 experimental values. With the expanded dataset and changes in model architecture, a state-of-the-art accuracy of 0.73 ppm was achieved when compared against experimental 13C-NMR shifts. The model also incorporates prediction confidence metrics using a deep-kernel learning architecture, as well as nearest-neighbor analysis, facilitated by a user-friendly web server. Finally, we demonstrate the versatility of the final model using several real-world applications.
Transition state (TS) geometries of chemical reactions are key to understanding reaction mechanisms and estimating kinetic properties. Inferring these directly from 2D reaction graphs offers chemists a powerful tool for rapid and accessible reaction analysis. Quantum chemical methods for computing TSs are computationally intensive and often infeasible for larger molecular systems. Recently, deep learning-based diffusion models have shown promise in generating TSs from 2D reaction graphs for single-step reactions. However, framing TS generation as a diffusion process requires, by design, a prohibitively large number of sampling steps during inference. Here we show that modeling TS generation as an optimal transport flow problem, solved via E(3)-equivariant flow matching with geometric tensor networks, achieves over a hundredfold speedup in inference while improving geometric accuracy compared to the state of the art. This increase in sampling efficiency and predictive accuracy enables the practical use of deep learning-based TS generators in high-throughput settings for larger and more complex chemical systems. Our method, GoFlow, thus represents a significant methodological advancement in machine learning-based TS generation, bringing it closer to widespread use in computational chemistry workflows.
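To illustrate why a flow-matching formulation can need far fewer sampling steps than a diffusion process, the following is a minimal sketch of generic conditional flow matching with straight-line (optimal-transport-style) paths. It is not the GoFlow implementation: there is no E(3) equivariance and no neural network, the function names are hypothetical, and the "model" is an oracle velocity field that stands in for a trained network.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 to target x1."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return x1 - x0


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    residual = predict_velocity(xt, t) - target_velocity(x0, x1)
    return float(np.mean(residual ** 2))


def euler_sample(predict_velocity, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with explicit Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * predict_velocity(x, k * dt)
    return x


# Toy data: 5 "atoms" in 3D. x1 plays the role of a TS geometry (assumption).
rng = np.random.default_rng(0)
x0 = rng.normal(size=(5, 3))  # noise sample
x1 = rng.normal(size=(5, 3))  # target geometry


def oracle(x, t):
    """Stand-in for a trained network: returns the exact path velocity."""
    return x1 - x0


loss = flow_matching_loss(oracle, x0, x1, t=0.3)
sample = euler_sample(oracle, x0, n_steps=4)
```

Because the target velocity along a straight path does not depend on time, a well-fitted velocity field can be integrated accurately with only a handful of Euler steps, which is the intuition behind the reduced sampling cost. A real model would learn the velocity field from data and would only approximate this idealized behavior.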
Photo-active molecular systems play an essential role in modern science and technology, finding applications in solar cells, organic light-emitting diodes (OLEDs), reaction catalysis, photodynamic therapy, and beyond. The rational design of photo-responsive molecules requires an understanding of the photophysical and photochemical processes underlying their operation. This understanding can be gained via first-principles quantum-mechanical (QM) calculations, which, however, are prohibitively expensive for high-throughput investigations. To break through this limitation, here we introduce OMNI-P2x: the first universal neural network potential for molecular excited and ground electronic states. OMNI-P2x can be used, directly or after fine-tuning, in place of QM methods to perform a wide range of photophysical and photochemical simulations. OMNI-P2x approaches the accuracy of time-dependent density functional theory (TD-DFT) methods at a fraction of the cost. Remarkably, this universal potential is more accurate and faster than established semiempirical QM methods, marking a watershed moment in theoretical method development for excited-state simulations. Here, we demonstrate its use in UV/Vis absorption spectroscopy, in real-time photodynamical simulations, and in the rational design of visible-light-absorbing azobenzene systems.
Density functional theory (DFT) is the workhorse of reaction simulations, but it suffers from either prohibitive cost or insufficient accuracy. In this work, we report AIQM2, the universal AI-enhanced QM Method 2, the first method that enables fast and accurate large-scale organic reaction simulations for practically relevant system sizes and time scales beyond what is possible with DFT. This breakthrough is based on the outstanding speed of AIQM2, which is orders of magnitude faster than common DFT, while its accuracy in reaction energies, transition state optimizations, and barrier heights is at least at the level of DFT and often approaches gold-standard coupled cluster accuracy. AIQM2 can be used out of the box without any further retraining. Compared to pure machine learning potentials, AIQM2 possesses high transferability and robustness in simulations, without catastrophic breakdowns. We showcase the advantages of AIQM2 over traditional DFT by performing an extensive reaction dynamics study overnight and revising the mechanism and product distribution reported in a previous investigation of a bifurcating pericyclic reaction.
Reactive potentials serve as essential tools for investigating chemical reactions at moderate computational cost. However, traditional reactive potentials often depend on fixed, semi-empirical parameters, which limits their accuracy and transferability. Overcoming these limitations can significantly expand the applicability of reactive potentials, enabling the simulation of a broader range of reactions under diverse conditions and the prediction of reaction properties, such as barrier heights. This work introduces ANI-1xBB, a novel ANI-based reactive machine learning (ML) potential trained on off-equilibrium molecular conformers generated through an automated bond-breaking workflow. ANI-1xBB significantly enhances the prediction of reaction energetics, barrier heights, and bond dissociation energies, surpassing conventional ANI models. Our results show that ANI-1xBB improves transition state modeling and reaction pathway prediction while generalizing effectively to pericyclic reactions and radical-driven processes. Furthermore, the automated data generation strategy supports the efficient construction of large-scale, high-quality reactive datasets, reducing reliance on expensive QM calculations. This work highlights ANI-1xBB as a practical model for accelerating the development of reactive ML potentials, offering new opportunities for modeling reaction phenomena.