Soda - Inria Nov 22, 2023

Gaël Varoquaux Nov 22, 2023

🎉 Tool for better documentation!! Release of sphinx-gallery, which automatically integrates narrative 🐍 examples into documentation
https://sphinx-gallery.github.io/stable/index.html

Highlight: a light recommender system to show related examples

An illustration of sphinx-gallery:
https://scikit-learn.org/dev/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html
(from @sklearn 's gallery). Note the links to function docs.

Sphinx-gallery comes with awesome features such as
◼ online execution with binder or jupyterlite
◼ mini-galleries, e.g. to link an object's docstring to its examples
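
For readers who have not set it up before, here is a minimal sketch of what enabling sphinx-gallery in a Sphinx `conf.py` might look like. The directory names and the `mypackage` module name are hypothetical placeholders, and the `recommender` entry reflects the option as I understand it from the sphinx-gallery documentation; check the docs of your installed version before relying on it. Binder and jupyterlite integration need additional settings that are not shown here.

```python
# conf.py (fragment): a sketch, not a complete Sphinx configuration.

# Register the sphinx-gallery extension with Sphinx.
extensions = [
    "sphinx_gallery.gen_gallery",
]

sphinx_gallery_conf = {
    # Where the example scripts live (hypothetical path).
    "examples_dirs": "../examples",
    # Where the generated gallery pages are written (hypothetical path).
    "gallery_dirs": "auto_examples",
    # Needed for mini-galleries: records which examples use which objects.
    "backreferences_dir": "gen_modules/backreferences",
    # Package whose objects are linked to their docs (hypothetical name).
    "doc_module": ("mypackage",),
    # Related-example recommender (assumed option name, see lead-in).
    "recommender": {"enable": True, "n_examples": 5},
}
```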

Sphinx-Gallery — Sphinx-Gallery 0.14.0-git documentation

Soda - Inria Aug 30, 2023

Gaël Varoquaux Aug 30, 2023

🎓👨‍🦱👩 Post-doc: From missing values to deep learning on sets
https://team.inria.fr/soda/job-offers

with Marine Le Morvan and me
at @Soda_Inria

Come work with us on an exciting topic across statistics and deep learning

Job offers – Soda – Computational and mathematical methods to understand health and society with data

Soda - Inria Aug 30, 2023

Guillaume Lemaitre Aug 30, 2023

If you want to see the replay of my talk at EuroSciPy 2023 regarding classifier tuning and the misconception behind class imbalance, here you go: https://youtu.be/6YnhoCfArQo

Slides are available at: https://docs.google.com/presentation/d/1IPXbEZpfrynjJMTXjI36rNGuPpOtY6_yJIrQOH6JBpI/edit?usp=sharing

EuroSciPy 2023 - Get the best from your scikit-learn classifier

Soda - Inria Jun 28, 2023

Olivier Grisel Jun 28, 2023

As a side benefit of this refactoring, the traceback of an exception raised in sequential mode (`n_jobs=1`) is now flatter.

3/4

Soda - Inria Jun 28, 2023

Olivier Grisel Jun 28, 2023

In the future, this will also be extended with `return_as="unordered_generator"`, which will optionally make it possible to aggregate results as soon as they are ready.

This release also includes a new `parallel_config` context manager. It extends `parallel_backend` so that all the arguments of the `Parallel` class, not just the backend, can be configured with a context manager idiom.

Detailed changelog:
https://github.com/joblib/joblib/blob/master/CHANGES.rst#release-130----20230628

2/4

Soda - Inria Jun 28, 2023

Olivier Grisel Jun 28, 2023

joblib 1.3.0 is out in the wild!

joblib is a library that provides a generic way to call into thread-based, process-based and distributed parallelism (via external backends), plus a way to cache expensive computations on disk across repeated function calls.

https://joblib.readthedocs.io

This new release provides several major new features, including a `return_as="generator"` argument to the `Parallel` class that makes it possible to aggregate parallel results as they become ready (preserving the submission order).

1/4

Joblib: running Python functions as pipeline jobs — joblib 1.5.3 documentation

Soda - Inria Mar 24, 2023

The team's annual report is out!
It's our first year and we are still ramping up, but our efforts reflect our vision:
https://radar.inria.fr/report/2022/soda/index.html

Next year will be even more exciting, as we have much ongoing research in statistical learning, data management, health, and education.

SODA - 2022 - Annual activity report

Soda - Inria Feb 20, 2023

Gaël Varoquaux Feb 20, 2023

dirty_cat's TableVectorizer automatically turns complex dataframes into numerical data matrices ready for learning.

Piped with sklearn's HistGradientBoosting, it gives a strong default for learning on tables. Together, they form my go-to learner
https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#example-table-vectorizer
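
A sketch of the pipeline described above, assuming a dirty_cat release that exposes `TableVectorizer` and a recent scikit-learn. The function name `build_table_learner` is hypothetical, and the imports are placed inside it only so that the snippet can be loaded even where dirty_cat is not installed.

```python
def build_table_learner():
    """Return a TableVectorizer + HistGradientBoosting pipeline (sketch)."""
    # Imports are local so defining the function does not require the
    # optional dependency; calling it does.
    from dirty_cat import TableVectorizer
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.pipeline import make_pipeline

    # TableVectorizer encodes the dataframe columns into a numerical
    # matrix; the gradient-boosting model then learns from that matrix.
    return make_pipeline(TableVectorizer(), HistGradientBoostingClassifier())
```

Typical usage would be `build_table_learner().fit(df, y)` with a pandas dataframe `df` and a target `y`; swap in `HistGradientBoostingRegressor` for regression.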

Dirty categories: machine learning with non normalized strings

Soda - Inria Feb 20, 2023

Gaël Varoquaux Feb 20, 2023

Tabular data can benefit from merging external sources of information.

The FeatureAugmenter is a sklearn transformer to augment a given dataframe by joins on reference tables.
https://dirty-cat.github.io/stable/generated/dirty_cat.FeatureAugmenter.html

fuzzy_join makes it robust to mismatches in vocabulary. Hyperparameter optimization can tune the matches for prediction.

For such external information, dirty-cat can download embeddings of Wikipedia data on millions of entities: companies, cities, geographic locations...
https://dirty-cat.github.io/stable/auto_examples/07_ken_embeddings_example.html

dirty_cat.FeatureAugmenter

Soda - Inria Feb 20, 2023

Gaël Varoquaux Feb 20, 2023

Wrangling categories/entities with typos?

dirty-cat's fuzzy_join function is similar to pandas' merge function, but caters for typos by matching with string similarities across the two tables
https://dirty-cat.github.io/stable/generated/dirty_cat.fuzzy_join.html

The deduplicate function merges multiple morphological variants of the same string, to recover a category from data with typos:
https://dirty-cat.github.io/stable/generated/dirty_cat.deduplicate.html
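
To illustrate the idea of joining on approximately equal keys, here is a toy sketch that uses only the standard library's `difflib`. It is not dirty_cat's implementation and does not reproduce its matching method; the function names, the threshold of 0.6, and the sample data are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_left_join(left_keys, right_table, threshold=0.6):
    """Map each left key to the value of its closest right key.

    Keys whose best match scores below `threshold` map to None,
    mimicking an unmatched row in a left join.
    """
    joined = {}
    for key in left_keys:
        best = max(right_table, key=lambda candidate: similarity(key, candidate))
        matched = similarity(key, best) >= threshold
        joined[key] = right_table[best] if matched else None
    return joined


capitals = {"France": "Paris", "Germany": "Berlin", "Italy": "Rome"}
dirty_keys = ["Frnace", "germany", "Itali", "Atlantis"]
matches = fuzzy_left_join(dirty_keys, capitals)
```

The threshold plays the role of the tunable match strictness: raising it rejects more borderline matches, lowering it accepts more typos at the risk of false joins.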

dirty_cat.fuzzy_join

Website: https://team.inria.fr/soda/
GitHub: https://github.com/soda-inria
Twitter: https://twitter.com/soda_INRIA