Do AI models help produce verified bug fixes?

"Abstract: Among areas of software engineering where AI techniques — particularly, Large Language Models — seem poised to yield dramatic improvements, an attractive candidate is Automatic Program Repair (APR), the production of satisfactory corrections to software bugs. Does this expectation materialize in practice? How do we find out, making sure that proposed corrections actually work? If programmers have access to LLMs, how do they actually use them to complement their own skills?

To answer these questions, we took advantage of the availability of a program-proving environment, which formally determines the correctness of proposed fixes, to conduct a study of program debugging with two randomly assigned groups of programmers, one with access to LLMs and the other without, both validating their answers through the proof tools. The methodology relied on a division into general research questions (Goals in the Goal-Query-Metric approach), specific elements admitting specific answers (Queries), and measurements supporting these answers (Metrics). While applied so far to a limited sample size, the results are a first step towards delineating a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.

These results caused surprise as compared to what one might expect from the use of AI for debugging and APR. The contributions also include: a detailed methodology for experiments in the use of LLMs for debugging, which other projects can reuse; a fine-grain analysis of programmer behavior, made possible by the use of full-session recording; a definition of patterns of use of LLMs, with 7 distinct categories; and validated advice for getting the best of LLMs for debugging and Automatic Program Repair.

https://www.arxiv.org/abs/2507.15822

#AI #GenerativeAI #LLMs #Debugging #Programming #APR #SoftwareDevelopment #SoftwareBugs
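
A note on the Goal-Query-Metric breakdown the abstract mentions: it is a hierarchy in which each broad Goal is refined into answerable Queries, and each Query is backed by concrete Metrics. Here is a minimal sketch of that hierarchy as data; every name and example string is invented for illustration, not taken from the paper:

```rust
// Hypothetical sketch of a Goal-Query-Metric hierarchy; all example
// strings below are invented for illustration, not from the paper.
struct Metric {
    name: &'static str, // what is measured
}

struct Query {
    question: &'static str, // a specific, answerable question
    metrics: Vec<Metric>,
}

struct Goal {
    statement: &'static str, // a broad research question
    queries: Vec<Query>,
}

fn main() {
    let goal = Goal {
        statement: "Do LLMs help programmers produce verified bug fixes?",
        queries: vec![Query {
            question: "Do LLM users reach a proof-accepted fix faster?",
            metrics: vec![Metric {
                name: "minutes until the proof tool accepts the fix",
            }],
        }],
    };
    println!("goal: {}", goal.statement);
    for q in &goal.queries {
        println!("  query: {}", q.question);
        for m in &q.metrics {
            println!("    metric: {}", m.name);
        }
    }
}
```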

🚨 Behold, the pinnacle of human achievement: a newsletter glorifying bugs like the email that couldn't travel more than 500 miles—email surely is the new frontier of epic fails! 😂 Harley Hicks brings us the riveting tales of software nightmares, because who needs stability when you can have chaos? 📉 Sign up now, because who doesn't want to relive the glory of broken systems weekly! 📬
https://500mile.email/ #humanachievement #emailfail #softwarebugs #chaosnews #newsletterfun #HackerNews #ngated

500 Mile Email - Absurd Software Bug Stories

🎉 Breaking News: CPUs are magic and can predict the future! 🧙‍♂️ Forget about pesky bugs, who needs software that works when you have a "predictor" that sounds like a character from a bad sci-fi movie? 😂
https://blog.nelhage.com/post/ittage-branch-predictor/ #CPUsareMagic #FuturePrediction #BadSciFiTech #TechHumor #SoftwareBugs #HackerNews #ngated
The ITTAGE indirect branch predictor

Modern CPUs are actually pretty good at predicting the indirect branch inside an interpreter loop, contra the conventional wisdom. We take a deep dive into the ITTAGE indirect branch prediction algorithm, which is capable of making those predictions, and draw some connections to other interests of mine in the areas of fuzzing and reinforcement learning.

Made of Bugs
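
To make "the indirect branch inside an interpreter loop" concrete, here is a minimal bytecode-interpreter sketch (my own illustration, not code from the post). The per-iteration dispatch on the opcode is the branch in question: its target changes every iteration, and ITTAGE-style predictors handle it by matching long histories of preceding branches:

```rust
// Minimal stack-machine interpreter; my own sketch, not from the post.
// The dispatch on `program[pc]` typically compiles to a jump table, i.e.
// an indirect branch whose target varies per iteration.
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Mul,
    Halt,
}

fn run(program: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    let mut pc = 0;
    loop {
        // This is the hard-to-predict indirect branch: where it goes
        // depends on the bytecode sequence, not on any one branch's
        // local history.
        match program[pc] {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Mul => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a * b);
            }
            Op::Halt => return stack.pop().unwrap(),
        }
        pc += 1;
    }
}

fn main() {
    // (2 + 3) * 4 = 20
    let prog = [Op::Push(2), Op::Push(3), Op::Add, Op::Push(4), Op::Mul, Op::Halt];
    println!("{}", run(&prog));
}
```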

Quality Control

It’s been suggested that I harp on Apple a little bit too much lately. I guess I have high expectations when it comes to paying a premium price for hardware, and when the software fails to deliver, it’s disappointing. Especially since Apple has designed their ecosystem to be completely dependent on their software. If you’re controlling the entire ecosystem, you have little excuse not to provide the highest quality.

In the latest version of iOS, I run across this bug on a daily basis.

This first appears when I go to select an emoji while conversing in Messages. I have to dismiss the screen and then request the emojis again to get them to actually appear. I don’t have any plugins and I have reduced my iPhone to a fairly minimal experience, yet things like this happen on a daily basis.

When you’re 18 versions into an OS and one of your main selling points is emojis and their derivatives, it should not be a buggy experience. End of story.

There’s a rumor out there that Apple is going to start versioning their software after the upcoming year, which I personally find idiotic, but apparently that’s what they’re going to do. If they wanted to tie the version of software to the year, logic would dictate that it should be tied to the year the software is released, not the following year. Ubuntu Linux does this: the latest version is 25.04, because it was released in April 2025. That makes sense to me.

Numbering an operating system version after the upcoming year is a marketing maneuver, not a logical choice. It also locks Apple into releasing a new version every year, which is something they’ve been doing for the past decade or so. Now they’re locked into that mindset, and they’ll be releasing new versions of all their operating systems whether they’re ready for release or not. We know Apple has gotten in the habit of overpromising new features and just slapping “beta” on stuff when they know it’s not ready for production release, but this new numbering scheme will probably amp up the buggy experience by a few notches.

Apple doesn’t need that right now. Apple needs to get back to innovation and making their premium-priced experiences as premium and bug-free as possible.

When you control the entire ecosystem (hardware and software), there’s little excuse for buggy software.

I’m really disappointed that Apple has become such a blatant marketing driven company. I was hoping they were better than that in the long run, but no.

#apple #mac #softwareBugs #userExperience #UX

Why Property Testing Finds Bugs Unit Testing Does Not

I intended this newsletter to be my thoughts without editing, and I have a new thought, so here goes. I want to respond to this discussion: But Kids These...

Computer Things
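
The gist of the argument, as I understand it: property tests generate many inputs and check an invariant, so they reach edge cases a hand-picked unit test never visits. A minimal, dependency-free sketch of the idea (my own illustration, not from the newsletter; real projects would use a crate like proptest):

```rust
// My own illustration: a hand-rolled property test catching an overflow
// bug that a unit test with small, "nice" inputs misses.

fn midpoint(a: i32, b: i32) -> i32 {
    (a + b) / 2 // buggy: `a + b` can overflow i32
}

fn main() {
    // Unit-test style check with small inputs: passes, bug undetected.
    assert_eq!(midpoint(2, 4), 3);

    // Property-test style check: many random inputs, one invariant.
    // Tiny deterministic LCG so the sketch needs no external crates.
    let mut state: u64 = 0x243f_6a88_85a3_08d3;
    let mut next = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (state >> 33) as i32 // non-negative i32
    };

    for _ in 0..10_000 {
        let (a, b) = (next(), next());
        let m = midpoint(a, b); // panics on overflow in debug builds
        // Property: the midpoint lies between the two inputs.
        assert!(
            a.min(b) <= m && m <= a.max(b),
            "property violated: a={a}, b={b}, m={m}"
        );
    }
    println!("property held for 10,000 random pairs");
}
```

With two large positive inputs, `a + b` overflows; the property check (or the debug-mode overflow panic) surfaces the bug within a few iterations, while the single hand-picked case above never would.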
The Glitch Gallery

An online exhibition of pretty software bugs, a museum of accidental art. Open for submissions!

How a 20 year old bug in GTA San Andreas surfaced in Windows 11 24H2

After over two decades, players are now forbidden from flying a seaplane, all thanks to undefined code behavior.

Silent’s Blog
Oh look, it's another genius idea: deliberately inject bugs 🐞 and cross your fingers that your tests just, you know, magically catch them. Because who needs stable code when you can have a zombie apocalypse of software errors? 🧟‍♂️🔧
https://github.com/sourcefrog/cargo-mutants #geniusideas #softwarebugs #testingfailures #codingnightmare #techhumor #HackerNews #ngated
GitHub - sourcefrog/cargo-mutants: :zombie: Inject bugs and see if your tests catch them!

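For anyone who hasn't seen mutation testing before, the idea is less absurd than the snark suggests: a tool mechanically flips operators or constants in your code ("mutants") and reruns your tests; any mutant the suite doesn't fail on marks a coverage gap. A hand-written sketch of the concept (illustrative only; cargo-mutants generates such mutants automatically and this is not its output or API):

```rust
// Hand-written illustration of the mutation-testing idea; cargo-mutants
// creates mutants like `max_mutant` itself rather than exposing this API.

fn max(a: i32, b: i32) -> i32 {
    if a > b { a } else { b }
}

// A typical mutant: `>` flipped to `<`.
fn max_mutant(a: i32, b: i32) -> i32 {
    if a < b { a } else { b }
}

fn main() {
    // Strong check: "kills" the mutant, since the mutant returns 3 here.
    assert_eq!(max(3, 7), 7);
    assert_eq!(max_mutant(3, 7), 3); // the mutant gets this wrong

    // Weak check: the mutant survives it; equal inputs hide the flip.
    assert_eq!(max(5, 5), 5);
    assert_eq!(max_mutant(5, 5), 5);

    println!("mutant killed by the strong check, survived the weak one");
}
```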

Mashable: An iOS 18.4 bug seems to be resurrecting zombie apps for some users. “If old, long-deleted apps are showing up on your iPhone after you’ve upgraded to iOS 18.4, you’re not alone. Several reports on Reddit, as well as on Apple’s Community hub, talk about apps being installed on users’ iPhones after the update. In many cases, these are apps that were once installed on the phone, but […]”

https://rbfirehose.com/2025/04/04/mashable-an-ios-18-4-bug-seems-to-be-resurrecting-zombie-apps-for-some-users/


Pew Research Center: How a glitch in an online survey replaced the word ‘yes’ with ‘forks’. “Dating back to at least early 2023, a bizarre and alarming technical glitch – and yes, a hilarious one – started popping up in some organizations’ online surveys and forms, including our 2024 survey. A few Reddit users shared screenshots from a variety of surveys, where questions that […]”

https://rbfirehose.com/2025/03/26/pew-research-center-how-a-glitch-in-an-online-survey-replaced-the-word-yes-with-forks/
