RSVP at https://luma.com/6p89x1s2?tk=TGd9ER
#chemistry #drugDiscovery #notebook #python #RDKit #sqlite
Here's an #RDKit #cheminformatics quiz for you all. What do you think this code will output?
from rdkit import Chem
mol = Chem.MolFromSmiles("C" + "C(C)(C)" * 50 + "C")
pat = Chem.MolFromSmarts("[$([CD4H0X4](-*)(-*)(-*)-*)]")
print(len(mol.GetSubstructMatches(pat)))
No cheating by actually running the code! :) Feel free to explain your reasoning in the comments.
I just submitted a #cheminformatics preprint to ChemRxiv, based on the #RDKit count fingerprints, #chemfp, and some one-off R&D code I wrote over the last few months.
"Superimposed Coding of Count Fingerprints to Binary Fingerprints"
In short, my superimposed coding method gives k-recall@k nearest neighbor scores ~0.9 relative to using full count fingerprints and the multiset Tanimoto (aka MinMax, aka Ruzicka similarity). Recall can be over 0.95 w/ 8192 bits!
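For readers unfamiliar with the terms, here is a minimal sketch of the multiset Tanimoto (MinMax / Ruzicka) similarity and of one common reading of k-recall@k. It is not the preprint's code and does not implement the superimposed coding itself. The function names, the dict-of-counts representation, and the value returned for two empty fingerprints are assumptions made for illustration.

```python
def multiset_tanimoto(a, b):
    """Multiset Tanimoto: sum of per-feature minima over sum of maxima.

    `a` and `b` map feature id -> non-negative count (hypothetical
    representation of a count fingerprint).
    """
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Returning 0.0 for two empty fingerprints is an arbitrary choice here.
    return num / den if den else 0.0


def recall_at_k(reference_ids, candidate_ids):
    """Fraction of the reference top-k found in the candidate top-k.

    `reference_ids` are the k nearest neighbors under the exact measure,
    `candidate_ids` the ranked neighbors under the approximation.
    """
    k = len(reference_ids)
    if k == 0:
        return 0.0
    return len(set(reference_ids) & set(candidate_ids[:k])) / k


fp1 = {1: 3, 2: 1}
fp2 = {1: 1, 2: 2, 3: 1}
print(multiset_tanimoto(fp1, fp2))               # min-sum 2 over max-sum 6
print(recall_at_k([1, 2, 3, 4], [2, 1, 9, 4]))   # 3 of 4 recovered
```

With counts restricted to 0 and 1 the first function reduces to the ordinary binary Tanimoto, which is why the two are directly comparable.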
#OpenBabel is dead, long live #RDKit!
https://github.com/RMeli/spyrmsd/issues/149
On a more serious note, it would be cool to have a cheminformatics library that actually works. Don't get me wrong, RDKit is very cool - but you can feel all the underlying problems it has when using it.
Hey, @egonw - I'm working on a preprint.
How do I cite a source code file in the #rdkit and a commit message? FWIW, I use #Zotero.
"The RDKit implementation [of the multiset Tanimoto] was added in 2009, using fuzzy set operations already available for multiset Dice similarity."
"added in 2009" is commit 104efc5b607baa54ce0804c6a76d484bf9f78b57 at https://github.com/rdkit/rdkit/commit/423433a3e47df64af4a31888e835144e8b3a6c07#diff-d7a0f684fa993bfd84319df4d23b199973d13599b94ad6a4b3a6c79ed7d46719
"fuzzy set operations" is a reference to the two operations starting at https://github.com/rdkit/rdkit/blob/af4e6c05eca09efa8e8f61603937e0d997fc1499/Code/DataStructs/SparseIntVect.h#L132
Or am I overthinking?
#RDKit Atom Pair count fingerprints are wild! If you sum the per-record counts you'll see patterns like:
sum num_records
399 489
400 6
401 0
402 0
403 36
404 0
405 3
406 133423
407 9
408 6
409 42
410 6
411 0
412 36
...
493 83
494 1
495 3
496 116222
497 10
498 3
499 47
500 0
Not easy to plot! No doubt due to rings and chains causing highly repetitive path lengths. (406=2x7x29, 496=16x31)
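One way to probe the peaks: both 406 and 496 are triangular numbers, 29·28/2 and 32·31/2. That would be consistent with a fingerprint that adds one count per unordered pair of atoms, so a molecule with n atoms sums to n(n-1)/2. This is only a sketch under that assumption; it ignores pairs whose topological distance falls outside the fingerprint's distance window, disconnected fragments, and any generator options. The helper names are made up, and the dictionary holds only a few rows copied from the table above.

```python
from math import isqrt


def pair_count(n_atoms):
    """Number of unordered atom pairs in a molecule with n_atoms atoms."""
    return n_atoms * (n_atoms - 1) // 2


def atoms_for_sum(total):
    """Return n if total == n(n-1)/2 for some integer n, else None."""
    # Solve n^2 - n - 2*total = 0, so n = (1 + sqrt(1 + 8*total)) / 2.
    disc = 1 + 8 * total
    root = isqrt(disc)
    if root * root != disc:
        return None
    return (1 + root) // 2


# A few (sum, num_records) rows taken from the table in the post.
observed = {399: 489, 400: 6, 403: 36, 406: 133423, 409: 42, 496: 116222, 499: 47}

for total, num_records in sorted(observed.items()):
    print(total, num_records, atoms_for_sum(total))
```

Under this reading the large bins would be molecules with 29 and 32 atoms, and the small bins would be cases where some pairs are not counted.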
I used chembl-downloader to create some nice charts on how the numbers of compounds, assays, activities, and other entities in ChEMBL have grown over time.
📖 https://cthoyt.com/2025/08/26/chembl-history.html
#chembl #chemistry #chemometrics #chemoinformatics #cheminformatics #rdkit #cdk #proteochemometrics
I’ve recently submitted an article to the Journal of Open Source Software (JOSS) describing chembl-downloader, a Python package for automating downloading and using ChEMBL data in a reproducible way. In this post, I use chembl-downloader to show how the number of compounds, assays, activities, and other entities in ChEMBL have changed over time.
Today's #RDKit blog post gets into the weeds of how inconsistent information in a common file format is handled by the RDKit.
https://greglandrum.github.io/rdkit-blog/posts/2025-08-22-interpreting-the-2d3d-flag.html