🚀 Our #PyDataGlobal 2025 tutorial recording on modern #Blosc2 & #Caterva2 features is out!
We show how compression is more than just a space saver, boosting performance for large in-memory & out-of-memory arrays via auto-chunking & parallelism.

We also cover: 🌐 Serving Blosc2/#HDF5 data online with Caterva2 ☁️ Computing directly in the cloud (no downloads needed!)

Watch here: 👉 https://www.youtube.com/watch?v=tUvSI3EpTBQ&list=PLGVZCDnMOq0qmerwB1eITnr5AfYRGm0DF&index=80

#Python #DataScience #BigData #HPC #DataHandling

My talk "Building Knowledge Graph-Based Agents with Structured Text Generation" at PyData Global is now available on YouTube: https://www.youtube.com/watch?v=94yuQKoDKkE
#PyData #PyDataGlobal
Alonso Silva - Building Knowledge Graph-Based Agents with Structured Text Generation

Really looking forward to PyData Global 2024 (online)!!

I'll be presenting
"Catching Bad Guys using open data and open models for graphs"
Thu Dec 5, 14:30-15:00 BST
https://global2024.pydata.org/cfp/talk/XMU9X9/

#PyDataGlobal #Senzing #ERKG #knowledgegraphs #AI #darkmoney #AML #entityresolution #opendata

Catching Bad Guys using open data and open models for graphs (PyData Global 2024)

Entity resolution (ER) is a complex process focused on data quality, used for constructing and updating knowledge graphs (KGs). GraphRAG is a popular way to use KGs to ground AI apps. Most GraphRAG tutorials use LLMs to build graphs automatically from unstructured data. However, what if you're working on use cases such as investigative journalism and sanctions compliance -- "catching bad guys" -- where transparency for decisions and evidence are required?

This talk shows how to construct an investigative graph about potential money laundering, using ER to merge open data from ICIJ Offshore Leaks, Open Ownership, and OpenSanctions. First we'll build a "backbone" for the graph in ways which preserve evidence and allow for audits. Next we'll use spaCy pipelines to parse related news articles, using `GLiNER` to extract entities, then the new `spacy-lancedb-linker` to link them into the graph. Finally, we'll show graph analytics that make use of the results -- tying into what's needed for use cases such as GraphRAG.

This approach uses Python open source libraries, e.g., the `KùzuDB` graph database and `LanceDB` vector database. For each NLP task we use state-of-the-art open models (mostly not LLMs), emphasizing how to tune for a domain context: _named entity recognition_, _relation extraction_, _textgraph_, and _entity linking_.

Overall, we show how to leverage open data, open models, and open source to build investigative graphs which are accountable, exploring otherwise hidden relations in the data that indicate fraud or corruption. This illustrates techniques in production use cases for anti-money laundering (AML), ultimate beneficial owner (UBO), rapid movement of funds (RMF), and other areas of sanctions compliance in general. All of the code is provided on GitHub, organized in Jupyter notebooks.
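To give a flavor of what "using ER to merge open data" means in practice, here is a deliberately tiny sketch of record merging with provenance kept for audits. The source names, company names, and the crude blocking rule are all made up for illustration; real entity resolution (e.g. Senzing's engine, used in this talk's stack) is far more sophisticated than string normalization.

```python
import re
from collections import defaultdict

def normalize(name: str) -> str:
    """Crude blocking key: lowercase, strip punctuation and corporate suffixes."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    tokens = [t for t in name.split() if t not in {"ltd", "llc", "inc", "sa"}]
    return " ".join(sorted(tokens))

# Toy records standing in for rows from three open-data sources
records = [
    {"source": "offshore_leaks", "name": "ACME Holdings Ltd."},
    {"source": "open_ownership", "name": "Acme Holdings"},
    {"source": "opensanctions",  "name": "Shady Finance LLC"},
]

# Cluster records that share a blocking key, keeping source provenance
clusters = defaultdict(list)
for rec in records:
    clusters[normalize(rec["name"])].append(rec)

merged = [
    {"key": key, "sources": sorted(r["source"] for r in recs)}
    for key, recs in clusters.items()
]
```

The point of keeping the `sources` list on every merged entity is exactly the "backbone that preserves evidence" idea: each node in the investigative graph can be traced back to the raw records that produced it.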

Today, Juan Luis Cano Rodríguez from QuantumBlack, AI by McKinsey, will give a workshop at PyData Global titled "Who needs ChatGPT? Rock solid AI pipelines with Hugging Face and Kedro", in which attendees will learn how to create a complex AI pipeline using Hugging Face transformers and turn it into a Kedro project that cleanly separates code from configuration and data.

Tune in at 16:00 UTC! https://global2023.pydata.org/cfp/talk/NFZDPN/

#python #pydata #pydataglobal #pydataglobal2023 #kedro #huggingface #aipipelines

Who needs ChatGPT? Rock solid AI pipelines with Hugging Face and Kedro (PyData Global 2023)

In this tutorial you will learn how to create a complex AI pipeline using Hugging Face transformers, turn it into a Kedro project that cleanly separates code from configuration and data, and deploy it to production so it starts delivering value. To that end, we will build a system that summarizes and classifies social media posts using several Hugging Face pre-trained models. The outline will be as follows:

1. Introduction (5m)
2. Who needs ChatGPT? Commercial vs open-source AI (5m)
3. Fighting spaghetti data science with Kedro (15m)
4. Using Hugging Face models (15m)
5. Separating code from data using the Kedro catalog (10m)
6. Refactoring the code using Kedro pipelines (20m)
7. Deploying to production (15m)
8. Conclusions
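To make the "separate code from configuration and data" idea concrete, here is a conceptual pure-Python sketch. This is not Kedro's actual API; the node functions, config keys, and catalog entries are all invented for illustration of the pattern: pure-function "nodes", a config object held apart from the code, and a catalog-like mapping holding the datasets.

```python
def summarize(posts, max_words):
    """Node: crude extractive summary -- keep the first `max_words` words."""
    return [" ".join(p.split()[:max_words]) for p in posts]

def classify(summaries, keyword):
    """Node: toy binary classifier based on a configured keyword."""
    return [keyword in s.lower() for s in summaries]

# Configuration lives apart from the code...
config = {"max_words": 5, "keyword": "pydata"}

# ...and so does the data, in a catalog-like mapping.
catalog = {"posts": ["PyData Global talks are online now", "Lunch was great today"]}

# The "pipeline" wires nodes together by dataset name.
catalog["summaries"] = summarize(catalog["posts"], config["max_words"])
catalog["labels"] = classify(catalog["summaries"], config["keyword"])
```

In a real Kedro project the config would live in YAML, the catalog would handle loading and saving datasets, and the pipeline wiring would be declarative, but the separation of concerns is the same.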

#PyDataGlobal2023 just started and the very first talk was about good practices around Jupyter notebooks: write functions, use git, don't deploy them to production, etc.

We've been on this theme for *years*, and we keep insisting. Aren't we missing some key usability issues around the workflows we propose?

For example, functions: `%autoreload` is known to be flaky https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html#caveats and yet there's no good solution for developing library code and notebooks together.
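The usual workaround is the extension itself (`%load_ext autoreload` then `%autoreload 2`), with `importlib.reload` as the manual fallback when it misbehaves. A minimal sketch of the manual route; the `mylib` module here is hypothetical, created on the fly just to simulate editing library code next to a notebook:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid stale bytecode caches interfering with the reload demo.
sys.dont_write_bytecode = True

# Stand-in for a library module being developed alongside a notebook.
tmp = Path(tempfile.mkdtemp())
(tmp / "mylib.py").write_text("ANSWER = 1\n")

sys.path.insert(0, str(tmp))
import mylib                                   # first import: ANSWER == 1

(tmp / "mylib.py").write_text("ANSWER = 2\n")  # "edit" the file on disk
importlib.reload(mylib)                        # re-executes mylib, rebinding ANSWER
```

Note the caveat that motivates the complaint above: `reload` re-executes the module, but objects already created from the old code (instances, bound references in other modules) keep their old definitions, which is much of why `%autoreload` is flaky in the first place.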

#PyDataGlobal #python

@freakboy3742 @webology I think the #Rstats and #Rladies communities have had some success with online events; @yabellini was (and probably is) passionate about them when I met her (online, of course) during the worst of the COVID-19 pandemic.

My experience has always been like yours; I have never experienced a "cool" online conference... the closest was #PyDataGlobal 2020-2021 with that RPG-like map. I think there are people out there who know how to run these things.

#ChatGPT, #PyTorch 2.0, #PyDataGlobal, and many more highlights from last week.

Check out everything at https://pedromadruga.com/newsletter/

#datascience #machinelearning #ai #newsletter

Hi! This newsletter is a way to get the content I write on my blog delivered to your inbox. Alternatively, you can subscribe to my RSS feed. I’ll write about industry-applied Artificial Intelligence, which is my main expertise. Specifically, about Information Retrieval using Generative AI, but not exclusively. I’m a big fan of AI-based product development, so you can expect a mix of research and anecdotal writings. I’ll also experiment with open-source LLMs, hack with my Raspberry Pis and Neovim, and more.

Pedro Madruga

The basic idea of #SyntheticControl for #causalinference is actually really simple.

Find out more in my #pydataglobal talk tomorrow
https://global2022.pydata.org/cfp/talk/FQBSP8/

What-if? Causal reasoning meets Bayesian Inference (PyData Global 2022)

## Core objectives

- Make the case that causal reasoning is required to answer many important questions in research and business.
- Flesh out how causal reasoning and Bayesian inference complement each other.
- Convey how some what-if questions can be answered using Synthetic Control methods.
- Illustrate how to use Synthetic Control methods in practice with a worked example, with Python code snippets (using PyMC) and empirical results.
- Introduce the new Python package [CausalPy](https://github.com/pymc-labs/CausalPy).

The talk will be a high-level overview, with very few (if any) equations. Rather, I focus on conveying the intuition and practical steps to answer what-if questions through concrete examples. I will provide references for those wishing to flesh out their understanding after the talk. This talk is aimed at a broad audience - anyone wanting to learn about the causal structure of the world, whether for fun or profit. Knowledge of causal inference is not assumed, but a beginner to intermediate knowledge of data science would be beneficial. Some familiarity with Bayesian methods would be helpful, but is not required.

## Talk structure

- I will provide an overview of ‘what-if?’ questions, including: “What would have happened to this patient if they had taken the drug rather than the placebo?” or “How much did an advertising campaign drive the change in user sign-ups?”
- Establish why we cannot solve our problems with traditional statistical and data science methods in the absence of causal reasoning.
- Describe how causal reasoning questions are complemented by the Bayesian approach, namely quantifying our uncertainty and focusing on parameter estimation instead of hypothesis testing with p-values.
- One main example will focus on how to approach the question “How did Brexit causally affect the United Kingdom’s GDP, despite this not being a randomized experiment?” I will intuitively explain how the Synthetic Control method works (by creating a synthetic United Kingdom as a weighted sum of other countries unaffected by Brexit) and how we can implement it, with PyMC code snippets.
- I will summarize by: a) outlining the bounds of Synthetic Control and when other approaches are called for, b) highlighting available Python and R packages (CausalImpact, tfcausalimpact, GeoLift, and a PyMC-based solution), and c) providing further reading and learning resources.

## References

- Cunningham, Scott. Causal Inference: The Mixtape. Yale University Press, 2021.
- Huntington-Klein, N. (2021). The Effect: An Introduction to Research Design and Causality. Chapman and Hall/CRC.
- Facure, M. (2021). Causal Inference for The Brave and True. https://github.com/matheusfacure/python-causality-handbook

## GitHub repository

A supporting GitHub repository, with notebooks, can be found at [drbenvincent/pydata-global-2022](https://github.com/drbenvincent/pydata-global-2022).
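The weighted-sum mechanics of Synthetic Control are easy to sketch in plain Python. Below is a toy illustration with made-up numbers and a single grid-searched weight over two control units: fit the weight on the pre-treatment period, project the weighted sum forward as the counterfactual, and read the effect as actual minus synthetic. CausalPy and the packages named in the abstract do the real estimation, with proper optimization and uncertainty quantification.

```python
# Toy series (invented numbers, not real GDP data).
treated_pre  = [100, 104, 108, 112]               # treated unit, before intervention
treated_post = [113, 114, 115]                    # treated unit, after intervention
control_a    = [50, 52, 54, 56, 58, 60, 62]       # control units, full period
control_b    = [200, 208, 216, 224, 232, 240, 248]

def synthetic(w, t):
    """Counterfactual at time t: convex combination of the two controls."""
    return w * control_a[t] + (1 - w) * control_b[t]

# Grid-search the single weight w that best matches the pre-treatment period.
best_w = min(
    (i / 100 for i in range(101)),
    key=lambda w: sum((treated_pre[t] - synthetic(w, t)) ** 2 for t in range(4)),
)

# Estimated causal effect: actual minus synthetic, per post-treatment step.
effects = [treated_post[i] - synthetic(best_w, 4 + i) for i in range(3)]
```

With these numbers the treated unit flattens out after the intervention while the synthetic control keeps growing, so the estimated effects come out negative, mirroring the shape of the Brexit-GDP example in the talk.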

Today I'm testing #Jupyter #notebooks on #VSCode for real during @crazy4pi314 workshop at #PyDataGlobal and to be honest I'm getting a bit grumpy.

- Selecting the kernel didn't let me enter a custom path; I had to open a Python file, add an interpreter there, and then use that as the kernel
- The autocomplete makes no sense if ipykernel is not installed; why not flag that before trying to use it? (see GIF)
- Some shortcuts, like `dd` to delete cells, don't work

Not a great DX.

Almost ready to start at #pydataglobal talking about #vscode #jupyternotebooks 💖
https://aka.ms/pydataglobal
GitHub - crazy4pi314/pydataglobal2022: PyData Global Workshop: Jupyter Notebooks in VS Code
