I just submitted a #cheminformatics preprint to ChemRxiv, based on the #RDKit count fingerprints, #chemfp, and some one-off R&D code I wrote over the last few months.

"Superimposed Coding of Count Fingerprints to Binary Fingerprints"

In short, my superimposed coding method gives k-recall@k nearest neighbor scores ~0.9 relative to using full count fingerprints and the multiset Tanimoto (aka MinMax, aka Ruzicka similarity). Recall can be over 0.95 w/ 8192 bits!

https://chemfp.com/SuperimposedCounts.pdf
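
For readers new to the terms, here is a minimal sketch of the two measures mentioned above. It is not the preprint's code. The function names are hypothetical, count fingerprints are assumed to be plain dicts mapping a feature id to a count, and "k-recall@k" is read here as the fraction of the exact top-k neighbors that also appear in the approximate top-k.

```python
def multiset_tanimoto(a, b):
    """Multiset Tanimoto (MinMax, Ruzicka): sum of min counts / sum of max counts.

    a and b are dicts mapping feature id -> positive count (an assumed layout).
    """
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Returning 0.0 for two empty fingerprints is a convention chosen for this sketch.
    return numerator / denominator if denominator else 0.0


def recall_at_k(exact_ranking, approx_ranking, k):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    exact_top = set(exact_ranking[:k])
    approx_top = set(approx_ranking[:k])
    return len(exact_top & approx_top) / k


# Shared features contribute min(count), all features contribute max(count).
fp1 = {1: 2, 5: 1}
fp2 = {1: 1, 5: 1, 9: 3}
score = multiset_tanimoto(fp1, fp2)  # (1 + 1 + 0) / (2 + 1 + 3)

# Two of the three exact neighbors ("a" and "c") survive in the approximate top 3.
recall = recall_at_k(["a", "b", "c", "d"], ["a", "c", "e", "b"], k=3)
```

In this reading, the exact ranking would come from the count fingerprints and the approximate ranking from the binary fingerprints produced by the encoding.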

#OpenBabel is dead, long live #RDKit!

https://github.com/RMeli/spyrmsd/issues/149

On a more serious note, it would be cool to have a cheminformatics library that actually works. Don't get me wrong, RDKit is very cool - but you can feel all the underlying problems it has when using it.

#Cheminformatics

Remove Open Babel support? · Issue #149 · RMeli/spyrmsd

Open Babel seems to have become abandonware. The last commit on master is from December 2024. The last release on GitHub is from 2020, and the same goes for the last release in PyPI. Open Babel is ...

Hey, @egonw - I'm working on a preprint.

How do I cite a source code file in the #rdkit and a commit message? FWIW, I use #Zotero.

"The RDKit implementation [of the multiset Tanimoto] was added in 2009, using fuzzy set operations already available for multiset Dice similarity."

"added in 2009" is commit 104efc5b607baa54ce0804c6a76d484bf9f78b57 at https://github.com/rdkit/rdkit/commit/423433a3e47df64af4a31888e835144e8b3a6c07#diff-d7a0f684fa993bfd84319df4d23b199973d13599b94ad6a4b3a6c79ed7d46719

"fuzzy set operations" is a reference to the two operations starting at https://github.com/rdkit/rdkit/blob/af4e6c05eca09efa8e8f61603937e0d997fc1499/Code/DataStructs/SparseIntVect.h#L132

Or am I overthinking?

support Tversky similarity for SparseIntVects · rdkit/rdkit@423433a

The official sources for the RDKit library. Contribute to rdkit/rdkit development by creating an account on GitHub.

#RDKit Atom Pair count fingerprints are wild! If you sum the per-record counts you'll see patterns like:

sum num_records
399 489
400 6
401 0
402 0
403 36
404 0
405 3
406 133423
407 9
408 6
409 42
410 6
411 0
412 36
...
493 83
494 1
495 3
496 116222
497 10
498 3
499 47
500 0

Not easy to plot! No doubt due to rings and chains causing highly repetitive path lengths. (406=2x7x29, 496=16x31)
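
One arithmetic observation that may be relevant, offered as an assumption rather than something the post claims: if an atom-pair count fingerprint records one count for every pair of heavy atoms, a molecule with n heavy atoms would total n(n-1)/2. Both peaks fit that form, since 406 = 29x28/2 and 496 = 32x31/2. The helper name below is hypothetical and the check is pure arithmetic. It does not call RDKit.

```python
from math import isqrt


def atoms_for_pair_total(total):
    """Return n if total == n*(n-1)/2 for some integer n >= 1, else None."""
    # Solve n*(n-1)/2 = total for n, then verify the candidate exactly.
    n = (1 + isqrt(1 + 8 * total)) // 2
    return n if n * (n - 1) // 2 == total else None


for total in (406, 496, 400):
    print(total, atoms_for_pair_total(total))
```

Under that assumption the two large bins would correspond to 29-atom and 32-atom molecules, while a sum like 400 has no single matching atom count.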

Tonight I'm taking the train to Prague for the European edition of the 2025 #RDKit UGM.
I'm really looking forward to meeting a bunch of the community there!
We don't have space for any last-minute in-person registrations, but info on joining the live streams is here:
https://github.com/rdkit/UGM_2025/

GitHub - rdkit/UGM_2025: 2025 RDKit UGM

2025 RDKit UGM. Contribute to rdkit/UGM_2025 development by creating an account on GitHub.

New pre-proof in Journal of Molecular Liquids: ML predicts NMR chemical shifts for metal complexes (45Sc, 49Ti, 89Y, 91Zr, 139La). CatBoost+RDKit ≈7% RMSE for Sc/Y/La; 9% Ti; 13% Zr. SHAP highlights cyclic motifs & electrostatics. Read: https://doi.org/10.1016/j.molliq.2025.128417 #NMR #MachineLearning #MaterialsScience #TransitionMetals #RDKit #CatBoost #SHAP

I used chembl-downloader to create some nice charts on how the number of compounds, assays, activities, and other entities in ChEMBL have grown over time

📖 https://cthoyt.com/2025/08/26/chembl-history.html

#chembl #chemistry #chemometrics #chemoinformatics #cheminformatics #rdkit #cdk #proteochemometrics

A historical analysis of ChEMBL

I’ve recently submitted an article to the Journal of Open Source Software (JOSS) describing chembl-downloader, a Python package for automating downloading and using ChEMBL data in a reproducible way. In this post, I use chembl-downloader to show how the number of compounds, assays, activities, and other entities in ChEMBL have changed over time.

Today's #RDKit blog post gets into the weeds of how inconsistent information in a common file format is handled by the RDKit.

https://greglandrum.github.io/rdkit-blog/posts/2025-08-22-interpreting-the-2d3d-flag.html

How the 2D/3D flag in Mol blocks is used – RDKit blog

Specifications meet the real world

chemfp 5.0b2 is out. Get it while it's hot! For Linux:

python -m pip install chemfp==5.0b2 -i https://chemfp.com/packages/

I'm still updating the documentation. See 'What's new in 5.0' at https://chemfp.com/docs/whats_new_in_50.html

* shardsearch - search many target files

* simhistogram - histogram all the scores

* FPB file now handles 1B+ records

* sparse count fingerprints
- new FPC format
- rdkit2fpc to make them with #RDKit
- fpc2fps to convert to binary fps
- fps2fpc for the other way

#cheminformatics

Packages from the chemfp project

It's official - the upcoming chemfp 5.0 release will have limited support for sparse count #cheminformatics fingerprints, in addition to the normal binary fingerprints.

The new format is "FPC", a variant of the FPS format. Details at https://chemfp.com/fpc_format/.

There will also be "rdkit2fpc" for the four #RDKit count fingerprint generators.

Plus "fpc2fps" with several methods to convert sparse count features -> binary.

And "fps2fpc" for the reverse (it's just a list of on-bit indices.)

FPC format specification
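
To illustrate the "list of on-bit indices" idea, here is a rough sketch that turns a hex-encoded binary fingerprint into the indices of its set bits. It is not chemfp's implementation and does not read or write FPC syntax (see the specification linked above for that). The function name is hypothetical, and the bit order is an assumption: bit 0 is taken to be the least-significant bit of the first byte.

```python
def on_bit_indices(hex_fp):
    """Return the sorted indices of the set bits in a hex-encoded fingerprint.

    Assumed bit order: index = 8 * byte position + bit position,
    with bit position 0 being the least-significant bit of each byte.
    """
    data = bytes.fromhex(hex_fp)
    return [
        8 * byte_no + bit_no
        for byte_no, byte in enumerate(data)
        for bit_no in range(8)
        if byte & (1 << bit_no)
    ]


print(on_bit_indices("03"))    # low two bits of the first byte
print(on_bit_indices("0080"))  # high bit of the second byte
```

Treating each on-bit index as a feature with count 1 would then give a sparse count representation of the same fingerprint.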