#RobSelects preprint of the week #ChemRxiv: Extracting organometallic reactivity data from openly available electronic supplementary information documents. #cheminf https://doi.org/10.26434/chemrxiv-2025-ccgfs
Reaction Database for Catalysis and Organometallics via Freely Available Supplementary Information

Chemical reaction databases have become core scientific infrastructure. Most prominent datasets focus on organic reactions, or include only reactants and products rather than full reaction pathways, leaving organometallic chemistry particularly underserved despite its centrality to homogeneous catalysis. This gap limits the development of machine learning models for organometallic reactions and restricts applications in mechanism discovery, selectivity prediction, and catalyst design. This work introduces an open, reaction-centric resource derived from XXX SI across 50+ journals from seven publishers through 2025 using the Gold-DIGR (Gold-Data Integration for Generalized Reactions) workflow. Reported organometallic reactions are aggregated and reaction properties extracted or recalculated, including reactant, product, and transition-state geometries, intrinsic reaction coordinate (IRC) traces, reaction classes, and ligand/metal descriptors (coordination, valence-electron counts, bond orders). Bond–electron matrices enable electron-flow analyses along reaction coordinates, visualized as Sankey diagrams connecting local electron rearrangements to class-level patterns. The resulting corpus spans canonical classes (oxidative addition, reductive elimination, migratory insertion, β-hydride elimination, C–H activation, transmetalation, and σ-bond metathesis), enabling quantitative mechanistic analyses at scale. As a demonstration of the meta-analyses enabled by this broad-based data generation, the relationship between bond-breaking/forming events and the transition state is studied to investigate concerted versus sequential scenarios. Class-specific timing asymmetries emerge, with reductive elimination and β-atom elimination events skewed pre-transition-state, oxidative addition and migratory insertion skewed post-transition-state, and transmetalation showing the broadest dispersion. By releasing both tooling and data, this work provides a foundation for mechanistic benchmarking and data-driven catalyst design.
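
To make the bond-electron matrix idea concrete, here is a minimal sketch of how subtracting the reactant matrix from the product matrix exposes which bonds form and break. The toy reaction (H2 + Cl2 -> 2 HCl), the atom ordering, and the function names are assumptions chosen for illustration; none of this is taken from the Gold-DIGR workflow.

```python
# Toy bond-electron (BE) matrices for H2 + Cl2 -> 2 HCl.
# Atom order: H1, H2, Cl1, Cl2 (an arbitrary choice for this sketch).
# Off-diagonal entries are bond orders; diagonal entries count the
# non-bonding valence electrons on each atom.
REACTANT = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 6, 1],
    [0, 0, 1, 6],
]
PRODUCT = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 6, 0],
    [0, 1, 0, 6],
]


def reaction_matrix(before, after):
    """Element-wise difference after - before; positive means bond order gained."""
    return [[a - b for b, a in zip(row_b, row_a)]
            for row_b, row_a in zip(before, after)]


def bond_changes(r, labels):
    """Return (atom_i, atom_j, change in bond order) for every changed pair."""
    n = len(r)
    return [(labels[i], labels[j], r[i][j])
            for i in range(n) for j in range(i + 1, n) if r[i][j] != 0]


R = reaction_matrix(REACTANT, PRODUCT)
changes = bond_changes(R, ["H1", "H2", "Cl1", "Cl2"])
for atom_i, atom_j, delta in changes:
    print(f"{atom_i}-{atom_j}: {'formed' if delta > 0 else 'broken'} ({delta:+d})")
```

Because electrons are conserved, the entries of the difference matrix sum to zero, which is a cheap sanity check on any pair of matrices.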

ChemRxiv
#RobSelects preprint of the week #ChemRxiv: Molecular fingerprint for transition metal compounds that combines ligand ECFP sum and metal electron configuration. #cheminf https://doi.org/10.26434/chemrxiv-2024-vqktn
ELECTRUM: An Electron Configuration-based Universal Metal Fingerprint for Transition Metal Compounds

Machine learning has experienced a drastic rise in interest and applications in all fields of chemistry, enabling researchers to leverage large chemical datasets to gain novel insights. The success of machine learning-driven projects in chemistry hinges on three key factors: access to robust and comprehensive datasets, a well-defined objective, and effective molecular representations that convert chemical structures into machine-readable formats. Transition metal complexes have lagged behind their organic counterparts on all three of these avenues. The large diversity of structures, coordination numbers and modes have made its translation to a machine-readable format an ongoing challenge. Here we introduce ELECTRUM, an electron configuration-based universal metal fingerprint for transition metal compounds. Its lightweight implementation enables the straightforward conversion of any transition metal complex into a simple fingerprint. Utilising a novel dataset generated from the Cambridge Structural Database (CSD), we demonstrate that ELECTRUM effectively captures the structural diversity of transition metal complexes. By plotting nearest-neighbor relationships in ELECTRUM space, we reveal meaningful clustering in two-dimensional representations. Furthermore, we use the ELECTRUM encoding to train machine learning models on the prediction of metal complex coordination numbers from ligand structures and metal identity alone. We show that on a subset of this data, we can train models to predict the oxidation state of metal complexes. These case studies showcase the potential of ELECTRUM as an easy-to-implement fingerprint for metal complexes. We rely on the community to further test, validate, and improve it.
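
The post above summarises the fingerprint as a ligand ECFP sum combined with the metal electron configuration. The following is a rough sketch of that general idea only, not the ELECTRUM implementation: the ligand bit vectors are hypothetical placeholders standing in for real ECFP output, and the electron configuration uses idealised Madelung filling for the neutral atom, which ignores known exceptions (such as Cr and Cu) and the oxidation state.

```python
def madelung_subshells(max_n=7):
    """Subshells (n, l) ordered by the Madelung rule: n + l first, then n."""
    subshells = [(n, l) for n in range(1, max_n + 1) for l in range(n)]
    return sorted(subshells, key=lambda s: (s[0] + s[1], s[0]))


def electron_configuration(z):
    """Occupation per subshell for a neutral atom with z electrons (idealised)."""
    occupations = []
    remaining = z
    for _, l in madelung_subshells():
        capacity = 2 * (2 * l + 1)
        filled = min(capacity, remaining)
        occupations.append(filled)
        remaining -= filled
    return occupations


def sum_ligand_bits(ligand_bit_vectors):
    """Element-wise sum of equally long ligand bit vectors."""
    return [sum(bits) for bits in zip(*ligand_bit_vectors)]


def metal_complex_fingerprint(ligand_bit_vectors, metal_z):
    """Concatenate the summed ligand vector with the metal configuration vector."""
    return sum_ligand_bits(ligand_bit_vectors) + electron_configuration(metal_z)


# Hypothetical 4-bit ligand vectors; real ECFP vectors are far longer.
LIGAND_BITS = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]]
fp = metal_complex_fingerprint(LIGAND_BITS, metal_z=26)  # 26 = Fe
print(fp[:4], fp[4:11])
```

Summing rather than concatenating the ligand vectors keeps the length fixed regardless of coordination number, which is presumably why a sum is attractive here.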

#RobSelects preprint of the week #ChemRxiv: Ligand additivity relations to predict properties of octahedral and square pyramidal transition metal complexes. #cheminf https://doi.org/10.26434/chemrxiv-2024-m39d9
Ligand Many-Body Expansion as a General Approach for Accelerating Transition Metal Complex Discovery

Methods that accelerate the evaluation of molecular properties are essential for chemical discovery. While some degree of ligand additivity has been established for transition metal complexes, it is underutilized in asymmetric complexes, such as the square pyramidal coordination geometries highly relevant to catalysis. To develop predictive methods beyond simple additivity, we apply a many-body expansion to octahedral and square pyramidal complexes and introduce a correction based on adjacent ligands (i.e., the cis interaction model, or cis model). We first test the cis model on adiabatic spin-splitting energies of octahedral Fe(II) complexes, predicting DFT-calculated values of unseen binary complexes to within an average of 1.4 kcal/mol. We next show that the cis model infers both DFT- and CCSD(T)-calculated model catalytic reaction energies to within 1 kcal/mol on average. The cis model predicts low-symmetry complexes with reaction energies outside the range of binary complex reaction energies. We observe that trans interactions are unnecessary for most monodentate systems but can be important for some combinations of ligands, such as complexes containing a mixture of bidentate and monodentate ligands. Finally, we demonstrate that the cis model may be combined with Δ-learning to predict CCSD(T) reaction energies from exhaustively calculated DFT reaction energies and the same fraction of CCSD(T) reaction energies needed for the cis model, achieving around 30% of the error from using the CCSD(T) reaction energies in the cis model alone.
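
For orientation, the "simple additivity" baseline that the abstract sets out to improve on can be sketched as a site-weighted average of homoleptic complex values. This reading of additivity, the function name, and the numbers are assumptions for illustration; the cis correction itself is not specified in the abstract and is not attempted here.

```python
def additive_estimate(homoleptic_values, ligand_counts):
    """Estimate a mixed-ligand property as a site-weighted average.

    homoleptic_values: property of the complex with all sites occupied
        by one ligand type, keyed by ligand label.
    ligand_counts: number of coordination sites occupied by each ligand.
    """
    total_sites = sum(ligand_counts.values())
    weighted = sum(homoleptic_values[ligand] * count
                   for ligand, count in ligand_counts.items())
    return weighted / total_sites


# Hypothetical spin-splitting energies (kcal/mol) for two homoleptic
# octahedral complexes, MA6 and MB6; not values from the preprint.
HOMOLEPTIC = {"A": 12.0, "B": -6.0}
estimate = additive_estimate(HOMOLEPTIC, {"A": 4, "B": 2})
print(estimate)
```

This baseline cannot distinguish isomers such as cis- and trans-MA4B2, since it depends only on ligand counts; capturing such differences requires terms beyond simple additivity.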

#RobSelects preprint of the week #ChemRxiv: Combining template-based mechanism modeling with automated experimentation for discovering multicomponent reactions. #cheminf https://doi.org/10.26434/chemrxiv-2024-qfjh9-v3
Ideation and Evaluation of Novel Multicomponent Reactions via Mechanistic Network Analysis and Automation

Novel reactivity is paramount to accessing valuable chemical space. Chemists use mechanistic intuition in conjunction with modern reaction screening techniques to discover, invent, or optimise chemical reactions. We have codified this logic in an automated cheminformatic workflow as one approach to systematic reaction invention. Hundreds of expert-encoded elementary reaction templates were used to construct a highly connected mechanistic network. This network can be used to enumerate reaction pathways for a set of given input substrates and reagents, serving as a qualitative “virtual flask”. Our method’s predictive capability is first exemplified through the regeneration of mechanistic pathways to the main and potential side products of seven known multicomponent reactions. Then, we showcase its innovative capability in a multicomponent reaction invention pipeline that rapidly screens three component sets of starting materials for scenarios where two components form an intermediate that is captured by a third reactant. Two novel three component transformations proposed by the model were experimentally validated using robotically dosed parallel reaction plates employing a broad range of reaction conditions. We discuss the potential utility of these novel transformations and interrogate the kinetics of both reaction systems with a robot-operated assay.

#RobSelects preprint of the week #ChemRxiv: Unique fragment-based molecular identifier accounting for stereochemistry. #cheminf https://doi.org/10.26434/chemrxiv-2024-k40v5-v2
MolBar: A Molecular Identifier for Inorganic and Organic Molecules with Full Support of Stereoisomerism

Before a new molecular structure is registered to a chemical structure database, a duplicate check is essential to ensure the integrity of the database. The Simplified Molecular Input Line Entry Specification (SMILES) and the IUPAC International Chemical Identifier (InChI) stand out as widely used molecular identifiers for these checks. Notable limitations arise when dealing with molecules from inorganic chemistry or structures characterized by non-central stereochemistry. When the stereoinformation needs to be assigned to a group of atoms, widely used identifiers cannot describe axial and planar chirality due to the atom-centered description of a molecule. To address this limitation, we introduce a novel chemical identifier called the Molecular Barcode (MolBar). Motivated by the field of theoretical chemistry, a fragment-based approach is used in addition to the conventional atomistic description. In this approach, the 3D structures of fragments are normalized using a specialized force field and characterized by physically inspired matrices derived solely from atomic positions. The resulting permutation-invariant representation is constructed from the eigenvalue spectra, providing comprehensive information on both bonding and stereochemistry. The robustness of MolBar is demonstrated through duplication and permutation invariance tests on the Molecule3D dataset of 3.9 million molecules. A Python implementation is available as open source and can be installed via pip install molbar.

#RobSelects preprint of the week #ChemRxiv: Developing methods for balancing incomplete reaction equations. #cheminf https://doi.org/10.26434/chemrxiv-2024-hltm9
Reaction Rebalancing: A Novel Approach to Curating Reaction Databases

Purpose: Reaction databases are a key resource for a wide variety of applications in computational chemistry and biochemistry, including Computer-aided Synthesis Planning (CASP) and the large-scale analysis of metabolic networks. The full potential of these resources can only be realized if datasets are accurate and complete. Missing co-reactants and co-products, i.e., unbalanced reactions, however, are the rule rather than the exception. The curation and correction of such incomplete entries is thus an urgent need. Methods: The SynRBL framework addresses this issue with a dual-strategy: a rule-based method for non-carbon compounds, using atomic symbols and counts for prediction, alongside a Maximum Common Subgraph (MCS)-based technique for carbon compounds, aimed at aligning reactants and products to infer missing entities. Results: The rule-based method exceeded 99% accuracy, while MCS-based accuracy varied from 81.19% to 99.33%, depending on reaction properties. Furthermore, an applicability domain and a machine learning scoring function were devised to quantify prediction confidence. The overall efficacy of this framework was delineated through its success rate and accuracy metrics, which spanned from 89.83% to 99.75% and 90.85% to 99.05%, respectively. Conclusion: The SynRBL framework offers a novel solution for recalibrating chemical reactions, significantly enhancing reaction completeness. With rigorous validation, it achieved groundbreaking accuracy in reaction prediction. This sets the stage for future improvement in particular of atom-atom mapping techniques as well as of downstream tasks such as automated synthesis planning.
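
The rule-based idea of using atomic symbols and counts can be illustrated with a small sketch: compare element totals on both sides of a reaction and look up the deficit in a table of common small molecules. This is not the SynRBL code; the formula parser (which handles only flat formulas without parentheses or charges), the lookup table, and the function names are assumptions made for this example.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat molecular formula such as 'C2H6O'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def side_total(formulas):
    """Sum atom counts over all species on one side of a reaction."""
    total = Counter()
    for formula in formulas:
        total.update(parse_formula(formula))
    return total


def atom_deficits(reactants, products):
    """Return (atoms missing from products, atoms missing from reactants)."""
    r, p = side_total(reactants), side_total(products)
    # Counter subtraction keeps only positive differences.
    return dict(r - p), dict(p - r)


# Hypothetical lookup table of candidate non-carbon co-products.
SMALL_MOLECULES = {name: dict(parse_formula(name))
                   for name in ("H2O", "HCl", "H2", "NH3", "O2")}


def suggest(deficit):
    """Names of small molecules whose composition matches the deficit exactly."""
    return [name for name, counts in SMALL_MOLECULES.items() if counts == deficit]


# Esterification recorded without its co-product:
# acetic acid + ethanol -> ethyl acetate
missing_in_products, missing_in_reactants = atom_deficits(
    ["C2H4O2", "C2H6O"], ["C4H8O2"])
print(missing_in_products, suggest(missing_in_products))
```

An exact-match lookup like this only covers a single missing species; deficits that combine several molecules, or that contain carbon, need a different strategy, which is where the abstract's MCS-based technique comes in.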

#RobSelects paper of the week #JCIM: Open database of polar curly arrow reaction mechanisms. #cheminf https://doi.org/10.1021/acs.jcim.3c01810
#RobSelects preprint of the week #ChemRxiv: Computing the average similarity of a molecule set with linear scaling with molecule number. #cheminf https://doi.org/10.26434/chemrxiv-2023-fxlxg
iSIM: Instant Similarity

The quantification of molecular similarity has been present since the beginning of cheminformatics. Although several similarity indices and molecular representations have been reported, all of them ultimately reduce to the calculation of molecular similarities of only two objects at a time. Hence, to get the average similarity of a set of molecules, all the pairwise comparisons need to be computed, so the computational cost scales quadratically with the number of molecules. Here we propose an exact alternative to this problem: iSIM (Instant Similarity). iSIM performs comparisons of multiple molecules at the same time and yields the same value as the average pairwise comparisons of molecules represented with binary fingerprints and real-value descriptors. In this work, we introduce the mathematical framework and several applications of iSIM in chemical sampling, visualization, diversity selection, and clustering.
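
One way to see how an average over all pairs can be obtained in linear time is to count, per fingerprint bit, how many molecules have it set. The sketch below does this for the Russell-Rao index (shared on-bits divided by fingerprint length), chosen here as an assumption because the column-sum identity is exact for it; it is an illustration of the principle, not the iSIM code, and other indices are not covered.

```python
from itertools import combinations


def pairwise_average_rr(fps):
    """Average Russell-Rao similarity over all pairs: O(N^2) comparisons."""
    m = len(fps[0])
    pairs = list(combinations(fps, 2))
    total = sum(sum(x & y for x, y in zip(a, b)) / m for a, b in pairs)
    return total / len(pairs)


def column_sum_average_rr(fps):
    """Same average from per-bit column sums: one pass over the N molecules."""
    n, m = len(fps), len(fps[0])
    column_sums = [sum(column) for column in zip(*fps)]
    # A bit set in k molecules is shared by k * (k - 1) / 2 pairs.
    shared_on = sum(k * (k - 1) // 2 for k in column_sums)
    n_pairs = n * (n - 1) // 2
    return shared_on / (m * n_pairs)


# Hypothetical 4-bit fingerprints for four molecules.
FPS = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]
slow = pairwise_average_rr(FPS)
fast = column_sum_average_rr(FPS)
print(slow, fast)
```

The two functions agree because each bit's contribution to the pairwise total depends only on how many molecules have that bit set, not on which ones.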

#RobSelects preprint of the week #ChemRxiv: Introducing the Simple User-Friendly Reaction Format as a standard to document chemical reaction experiments. #cheminf https://doi.org/10.26434/chemrxiv-2023-nfq7h
Simple User-Friendly Reaction Format

Leveraging the increasing volume of chemical reaction data can enhance synthesis planning and improve success rates. However, machine learning applications for retrosynthesis planning and forward reaction prediction tools depend on having readily available, high-quality data in a structured format. While some public and licensed reaction databases are available, they frequently lack essential information about reaction conditions. To address this issue and promote the principles of findable, accessible, interoperable, and reusable (FAIR) data reporting and sharing, we introduce the Simple User-Friendly Reaction Format (SURF). SURF standardizes the documentation of reaction data through a structured tabular format, requiring only a basic understanding of spreadsheets. This format enables chemists to record the synthesis of molecules in a format that is both human- and machine-readable, making it easier to share and integrate directly into machine-learning pipelines. SURF files are designed to be interoperable, easily imported into relational databases, and convertible into other formats. This complements existing initiatives like the Open Reaction Database (ORD) and Unified Data Model (UDM). At Roche, SURF plays a crucial role in democratizing FAIR reaction data sharing and expediting the chemical synthesis process.

#RobSelects preprint of the week #ChemRxiv: A python package for automated extraction and analysis of liquid chromatography-mass spectrometry data. #cheminf https://doi.org/10.26434/chemrxiv-2023-1x288
Automated LC-MS Analysis and Data Extraction for High-Throughput Chemistry

High-throughput experimentation for chemistry and chemical biology has emerged as a highly impactful technology, particularly when applied to Direct-to-Biology. Analysis of the rich datasets which come from this mode of experimentation continues to be the rate-limiting step to reaction optimisation and the submission of compounds for biological assay. We present PyParse, an automated, accurate and accessible program for data extraction from high-throughput chemistry and provide real-life examples of situations in which PyParse can provide dramatic improvements in the speed and accuracy of analysing plate data. This software package has been made available through a GitHub repository under an open-source Apache 2.0 licence, to facilitate the widespread adoption of high-throughput chemistry and enable the creation of standardised chemistry datasets for reaction prediction.
