Research Engineering @ Turing

@hut23
16 Followers
10 Following
10 Posts
We are research software engineers and data scientists connecting research to applications at The Alan Turing Institute, the UK's national institute for data science and AI.

Shoutout to our very own Rosie Wood, who recently presented two contributions at CoSeC/CIUK 2025:
- A talk on porting Microsoft's Aurora foundation model for global weather prediction to the DAWN supercomputing cluster at Cambridge University! 🌍
- A poster on the MapReader tool and its use for the preservation & conservation of landscapes in England's national parks! 🌳

Fantastic work 🥳 ✨

Full paper with more info: https://arxiv.org/abs/2510.07192
Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
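The headline numbers can be sanity-checked with some back-of-the-envelope arithmetic (a minimal sketch; the average document length is an assumed round figure, not taken from the paper):

```python
# Illustrative arithmetic only: the paper's finding is that the poison *count*
# stays near-constant while the poisoned *fraction* of the corpus shrinks as
# the dataset scales up.
POISON_DOCS = 250        # number of poisoned documents found to suffice
TOKENS_PER_DOC = 1_000   # assumed average document length (round figure)

def poison_fraction(corpus_tokens: int) -> float:
    """Fraction of training documents that are poisoned."""
    total_docs = corpus_tokens // TOKENS_PER_DOC
    return POISON_DOCS / total_docs

# Chinchilla-optimal corpus sizes quoted in the abstract:
small = poison_fraction(6_000_000_000)    # 600M-parameter model, 6B tokens
large = poison_fraction(260_000_000_000)  # 13B-parameter model, 260B tokens

print(f"6B-token corpus:   {small:.6%} of documents poisoned")
print(f"260B-token corpus: {large:.6%} of documents poisoned")
```

Even under this rough assumption, the same 250 documents make up a vanishingly smaller share of the larger corpus, which is why a fixed attack budget is so much cheaper than a fixed percentage of the data.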


New work on poisoning LLMs with small numbers of documents! Featuring our very own Ed Chapman! Top of Hacker News this morning, great job team 🥳

Check out the post here: https://www.turing.ac.uk/blog/llms-may-be-more-vulnerable-data-poisoning-we-thought

We also have a write-up with more detailed context here: http://sites.computer.org/debull/A24june/p50.pdf

In close collaboration with the UCLH Trust, we've developed SqlSynthGen (SSG): a Python library for generating synthetic data from relational databases. SSG is designed with transparency in mind, so data owners can control and audit the data they choose to expose.

SSG is available on GitHub at https://github.com/alan-turing-institute/sqlsynthgen — check it out!

Getting data into the hands of researchers is essential for making progress in machine learning and artificial intelligence. However, in sensitive domains like health or finance, this cannot be done without compromising the privacy of the data subjects. A solution to this problem is to generate synthetic data that shares the statistical properties of the original dataset, without including any personal information. We've been working on something to do just that! (See below 👇)

Ever wondered what it's like attending the world's biggest Open Source conference? Some of our team recently attended FOSDEM 2025, and wrote up their reflections and insights from the weekend, with plenty of links to materials they enjoyed! Hear from Arielle, Markus, Rosie, David, and Jim on what they thought was worth visiting 👇

https://medium.com/@turinghut23/highlights-and-reflections-from-fosdem-2025-987fabd26932

Curious about the people behind our work? We just published two spotlight interviews with members of our team: check them out below! 👇

Carlos Gavidia-Calderon details his journey from software engineering to research, and touches on his experiences of mediating between the two disciplines in a talk at RSECon: https://buff.ly/4jwCDv5

Isabel Fenton tells us about her recent work on biodiversity and renewable energy, along with recounting her fascinating background combining fossils and climate science: https://buff.ly/4awna9Z