Spent most of the week writing an EState count #cheminformatics fingerprint for #chemfp.

It should have been a few hours to build on RDKit's EState code. Perhaps a bit longer to implement a faster version using the same SMARTS patterns.

I then realized the RDKit implementation and patterns had problems, e.g., not matching both atoms in "CC", and unexpected handling of explicit hydrogens, such as deuterium written as [2H]. See https://git.sr.ht/~dalke/rdkit/log

The hard part was finding good test cases.

The #cheminformatics fingerprint tool #chemfp version 5.1b1 is out! https://chemfp.com/

The big feature is integration of the new "superimposed" count simulation method to RDKit byte fingerprint generation.

Use it if you want Tanimoto similarity of count fingerprints, but don't want to toss out all of your existing byte fingerprint tools for similarity search, clustering, etc., or take a big performance loss.

Instead, use superimposed and get a ~0.95 recall using "normal" byte fps.


Porting #chemfp to #Python 3.14 took a couple of hours. Had to update Cython. Found a ctypes change I don't understand, but the unit tests pass. Had to tweak tests as a few Python error messages changed.

Here's one: the ValueError message raised for the following invalid date:

import datetime
datetime.datetime.fromisoformat("2026-02-29")

Python 3.13: "day is out of range for month"

Python 3.14: "day 29 must be in range 1..28 for month 2 in year 2026"

Nice! Not mentioned in the release notes.
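
One way to keep a test from breaking on this kind of wording change is to check the exception type plus only the words both versions share. This is a minimal sketch, not code from the chemfp test suite; the helper name invalid_date_error is made up for illustration.

```python
import datetime


def invalid_date_error(text):
    """Return the ValueError message for an unparsable ISO date, or None if it parses.

    Hypothetical helper for illustration; not part of chemfp.
    """
    try:
        datetime.datetime.fromisoformat(text)
    except ValueError as err:
        return str(err)
    return None


msg = invalid_date_error("2026-02-29")  # 2026 is not a leap year
# Check only what the 3.13 and 3.14 wordings have in common,
# rather than comparing against one exact message string.
assert msg is not None
assert "day" in msg and "month" in msg
```

The trade-off is a looser check: the test no longer notices if the message becomes less helpful, only that the right exception still mentions the right things.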

I've not been able to find published work on this topic, or even how to carry out an effective search for published work.

Pointers or ideas gladly accepted.

I discussed this topic previously on Twitter back in 2021. Decided to re-investigate it now as a one-day break from working on count fingerprints in #chemfp :) (5/5)

I just submitted a #cheminformatics preprint to ChemRxiv, based on the #RDKit count fingerprints, #chemfp, and some one-off R&D code I wrote over the last few months.

"Superimposed Coding of Count Fingerprints to Binary Fingerprints"

In short, my superimposed coding method gives k-recall@k nearest neighbor scores ~0.9 relative to using full count fingerprints and the multiset Tanimoto (aka MinMax, aka Ruzicka similarity). Recall can be over 0.95 w/ 8192 bits!

https://chemfp.com/SuperimposedCounts.pdf
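
For readers who haven't met the two measures named above, here is a small sketch of the multiset Tanimoto (MinMax) similarity on count fingerprints, and of recall between two top-k hit lists. It does not show the superimposed coding method itself. The dict-of-counts representation, the function names, the 0.0 convention for two empty fingerprints, and my reading of "k-recall@k" as "fraction of the reference top-k found in the candidate top-k" are all assumptions for illustration, not taken from the preprint or from chemfp.

```python
def minmax_similarity(counts_a, counts_b):
    """Multiset Tanimoto: sum of per-feature minima over sum of per-feature maxima.

    Fingerprints are dicts mapping a feature key to a non-negative count
    (an assumed representation for this sketch).
    """
    keys = set(counts_a) | set(counts_b)
    numerator = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    denominator = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    # Assumed convention: two empty fingerprints score 0.0.
    return numerator / denominator if denominator else 0.0


def recall_at_k(reference_ids, candidate_ids, k):
    """Fraction of the reference top-k ids that also appear in the candidate top-k."""
    reference = set(reference_ids[:k])
    candidate = set(candidate_ids[:k])
    return len(reference & candidate) / k


# Toy example: minima sum to 1 + 1 + 0 = 2, maxima sum to 3 + 1 + 2 = 6.
fp_a = {"C": 3, "N": 1}
fp_b = {"C": 1, "N": 1, "O": 2}
similarity = minmax_similarity(fp_a, fp_b)

# Reference ranking (e.g. from full counts) vs. an approximate ranking.
recall = recall_at_k([1, 2, 3, 4], [2, 1, 5, 3], k=4)
```

When every count is 0 or 1, the MinMax score reduces to the usual binary Tanimoto, which is why it is the natural target for a binary approximation.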

If you've money left in your #cheminformatics budget, #chemfp makes a nice Christmas present. :)

Any ideas how to better promote #chemfp? I'm only on the Fediverse, which has just a handful of #cheminformatics posters (including a couple of chemfp users!). Mailing lists are almost dead.

At ICCS many knew of me, but I found no leads.

There's all sorts of great stuff in chemfp, but as they say, great marketing beats great engineering ... and evidence shows I'm not a good marketer.

I've tried cold emailing relevant people. That didn't work.

Ideas? Mail physical fliers? Skywriting over Cambridge?

All together, I might get a 10-fold faster system? And I should be able to extend it to count-based fingerprints like MACCS and PubChem/CACTUS.

But ... who cares? No one has said they are performance limited with these fingerprints. No one has asked me to support more SMARTS-based fingerprints. These fingerprints exist for historical reasons, not for active science.

It certainly doesn't make me any money, and keeps me from marketing chemfp.

So ... anyone want to buy a #chemfp license?