Last week Jay Alammar interviewed me to discuss some of the tools I've been working on over the past two years.

If you haven't seen it, we discuss human-learn, doubtlab, embetter, and bulk!

Watch-able here:
https://www.youtube.com/watch?v=KRQJDLyc1uM

Tools to Improve Training Data - Talking Language AI Ep#2

The first tool, human-learn, gives you scikit-learn compatible tools to just turn your domain knowledge into classifiers/regressors/detectors/transformers.

One main feature: you can turn functions with keyword arguments into gridsearch-able components!

https://github.com/koaning/human-learn/
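
To make the "keyword arguments become gridsearch-able" idea concrete, here is a rough sketch in plain Python. It is not the human-learn API: `fare_rule`, `grid_search`, and the toy data are hypothetical names made up for illustration. The library's contribution is wrapping this pattern in scikit-learn compatible components, so the looping below would be handled by scikit-learn's own grid search instead.

```python
from itertools import product


def fare_rule(row, threshold=10):
    """Hand-written rule (hypothetical): predict 1 when the fare exceeds the threshold."""
    return int(row["fare"] > threshold)


def grid_search(rule, rows, labels, param_grid):
    """Try every keyword-argument combination and keep the most accurate one."""
    names = list(param_grid)
    best_params, best_acc = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        preds = [rule(row, **params) for row in rows]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        # Strict comparison: on ties, the first combination tried wins.
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc


# Toy data, invented for this example.
rows = [{"fare": 5}, {"fare": 12}, {"fare": 30}, {"fare": 80}]
labels = [0, 0, 1, 1]

best_params, best_acc = grid_search(
    fare_rule, rows, labels, {"threshold": [0, 10, 20, 50]}
)
print(best_params, best_acc)
```

The point is that the domain knowledge lives in an ordinary function, while the keyword argument is the part you let the search tune.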


The second tool, doubtlab, gives you a suite of tools to try and discover doubtful labels in your training data.

There are a bunch of reasons to doubt a label, and this library makes it easy to just try some.

https://github.com/koaning/doubtlab
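
As a rough illustration of what "reasons to doubt a label" can look like, the sketch below flags rows by two simple reasons and ranks rows by how many reasons fire. It is plain Python, not the doubtlab API: the function names, the `max_proba` cutoff, and the toy probabilities are all assumptions made for this example.

```python
def low_confidence(probas, labels, max_proba=0.55):
    """Doubt rows where even the most likely class gets a low probability."""
    return [max(p) < max_proba for p in probas]


def wrong_prediction(probas, labels):
    """Doubt rows where the predicted class disagrees with the given label."""
    return [p.index(max(p)) != y for p, y in zip(probas, labels)]


def rank_doubt(probas, labels, reasons):
    """Return row indices sorted by how many reasons flagged them."""
    counts = [0] * len(labels)
    for reason in reasons:
        for i, flagged in enumerate(reason(probas, labels)):
            counts[i] += int(flagged)
    # Most-doubted rows first; rows that no reason flagged are dropped.
    order = sorted(range(len(labels)), key=lambda i: -counts[i])
    return [i for i in order if counts[i] > 0]


# Toy class probabilities from some already-trained model (invented here).
probas = [[0.9, 0.1], [0.52, 0.48], [0.2, 0.8], [0.51, 0.49]]
labels = [0, 1, 1, 0]

doubtful = rank_doubt(probas, labels, [low_confidence, wrong_prediction])
print(doubtful)
```

Each reason is just a function from model output and labels to a list of flags, which is what makes it cheap to try a few and compare what they surface.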


Embetter is a utility library to make it easier to use embeddings from scikit-learn. It currently supports text and image embeddings, and it makes it super easy to build few-shot classifiers from sklearn.

Soon it will also have fine-tunable components!

https://github.com/koaning/embetter


Finally, there's bulk, which gives you a user interface to easily bulk label training data by re-using embetter with UMAP.

https://github.com/koaning/bulk


There are a bunch more features/tools in the pipeline too. But I wanted to give a shoutout to @explosion, who have been very supportive of these tools.

Also, I've seen some of the internal demos. There's a lotta cool new stuff on the way.

https://explosion.ai/
