We found an undocumented bug in the Apollo 11 guidance computer code
https://www.juxt.pro/blog/a-bug-on-the-dark-side-of-the-moon/
I feel ya... and I have to admit that in the past I tried it for one article on my own blog, thinking it might help me express myself. Though when I read that post now, I don't even like it; it's just not my tone.
So I decided not to use any LLM for blogging again. Even though it takes a lot more time without one (I'm not a very motivated writer), I'd rather release something I actually wrote than some LLM stuff I wouldn't read myself.
Is it possible for a tool to know with high confidence whether something is AI-written at all? LLMs can be tuned/instructed to write in an infinite number of styles.
I don't understand how these tools can exist.
The WikiEDU project has some thoughts on this. They found Pangram good enough at detecting LLM usage while teaching editors to make their first Wikipedia edits, at least enough to intervene and nudge the student. They didn't use it punitively or expect authoritative results, however. https://wikiedu.org/blog/2026/01/29/generative-ai-and-wikipe...
They found that Pangram suffers from false positives in non-prose contexts like bibliographies, outlines, formatting, etc. The article does not touch on Pangram’s false negatives.
I personally think it's an intractable problem, but I do feel Pangram gives some useful signal, albeit not reliably.
It has Claude-isms, but it doesn't feel very Claude-written to me, at least not entirely.
What's making it even more difficult to tell now is people who use AI a lot seem to be actively picking up some of its vocab and writing style quirks.
Pangram doesn't reliably detect individual LLM-generated phrases or paragraphs among human written text.
It seems to look at sections of ~300 words, and for at least one section it reports low confidence.
I tested it by getting ChatGPT to add a paragraph to one of my sibling comments. The result is "100% human" when in fact it's only 75% human.
Pangram test result: https://www.pangram.com/history/1ee3ce96-6ae5-4de7-9d91-5846...
ChatGPT session where it added a paragraph that Pangram misses: https://chatgpt.com/share/69d4faff-1e18-8329-84fa-6c86fc8258...
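To make the windowing point concrete, here's a back-of-the-envelope Python sketch. The ~300-word window and the 0.5 decision threshold are my guesses from the behaviour above, not anything Pangram documents:

    # Why a single LLM paragraph can hide inside a windowed detector.
    # WINDOW and THRESHOLD are assumptions, not Pangram's actual parameters.
    WINDOW = 300       # assumed words scored per section
    THRESHOLD = 0.5    # assumed AI fraction needed to flip a window's verdict

    def windows(words, size=WINDOW):
        """Split a word list into fixed-size scoring windows."""
        return [words[i:i + size] for i in range(0, len(words), size)]

    human = ["human"] * 225   # stand-in for three human-written paragraphs
    ai = ["ai"] * 75          # one inserted LLM paragraph (~75 words)
    doc = human + ai          # 75% human overall, like the test above

    for i, w in enumerate(windows(doc)):
        ai_frac = w.count("ai") / len(w)
        verdict = "AI" if ai_frac > THRESHOLD else "human"
        print(f"window {i}: {ai_frac:.0%} AI -> {verdict}")

The inserted paragraph is only 25% of its window, so the window as a whole still scores as human, which matches the "100% human" verdict above.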
AI writing detectors are very unreliable. This is important to mention because they can also trigger in the opposite direction (reporting human-written text as AI-generated), which can result in false accusations.
It's becoming a problem in schools, where teachers accuse students of cheating based on these detectors, or ignore obvious signs of AI use because the detectors don't trigger on it.
Has anyone verified that this was an actual bug?
One of AI's strengths is definitely exploration, e.g. in finding bugs, but it still has a high false-positive rate. Depending on the context, that matters or it doesn't.
One also has to be aware that there are a lot of bugs that AI won't find but humans would.
I don’t have the expertise to verify this bug actually happened, but I’m curious.
It's not even clear whether AI was used to find the bug: they mention modeling the software with an "AI-native" language, whatever that means. What's also unclear is how they found themselves modeling the gyro software of the Apollo code in the first place.
But I do think their explanation of the lock acquisition and the failure scenario is quite clear and compelling.
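For anyone who hasn't read it: the failure mode they describe is essentially a check-then-act race on a shared flag. Here's a minimal Python sketch of that general shape (hypothetical names; none of this is the actual AGC code, which is interrupt-driven assembly):

    import threading
    import time

    # Two routines "acquire" a lock that is a plain flag tested and set in
    # two separate, non-atomic steps, so both can see it free and proceed.
    lock_flag = False   # shared flag standing in for the lock word
    holders = []        # routines that believe they hold the lock

    def acquire(name, barrier):
        global lock_flag
        barrier.wait()             # line the two routines up to interleave
        if not lock_flag:          # step 1: test (the other routine can run here...)
            time.sleep(0.001)      # widen the race window for demonstration
            lock_flag = True       # step 2: set (...before we reach this line)
            holders.append(name)

    barrier = threading.Barrier(2)
    threads = [threading.Thread(target=acquire, args=(n, barrier)) for n in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print("holders:", holders)     # typically ['A', 'B']: both got the "lock"

On the AGC the interleaving comes from interrupts and job scheduling rather than threads, but the test-then-set gap is the same idea.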
> It's not even clear whether AI was used to find the bug
The intro says “We used Claude and Allium”. Allium looks like a tool they’ve built for Claude.
So the article is about how they used their AI tooling and workflow to find the bug.
> It's not even clear whether AI was used to find the bug: they mention modeling the software with an "AI-native" language, whatever that means.
Could the "AI-native language" they used be Apache Drools?
The "when" syntax reminded me of it...
https://kie.apache.org/docs/10.0.x/drools/drools/language-re...
(Apache Drools is an open-source rule language and interpreter for declaratively formulating and executing rule-based specifications; it integrates easily with Java code.)
> It's not even clear whether AI was used to find the bug
It's not even clear you read the article.
How did you pick out "AI-native" and miss the rest of the SAME sentence?
> We found this defect by distilling a behavioural specification of the IMU subsystem using Allium, an AI-native behavioural specification language.
My guess is that in such low-memory regimes, program length is only loosely correlated with bug rate.
If anything, trying to cram a ton of complexity into a few KB of memory makes the likelihood of introducing bugs very high.
For anyone who liked this, I highly suggest taking a look at the CuriousMarc YouTube channel, where he chronicles many efforts to preserve and understand parts of the Apollo AGC with a team of really technically competent and passionate collaborators.
One of the more interesting things they have been working on is a potential re-interpretation of the infamous 1202 alarm. As of this writing, it is popularly described as something related to nonsensical readings from a sensor, which could be (and were) safely ignored during the actual Moon landing. However, if I remember correctly, some of their investigation revealed that there were many conditions under which that error would have been extremely critical and would likely have doomed the astronauts. It is super fascinating.