This entire report from the Ontario government on genAI systems is worth a read, but the review of healthcare scribe accuracy is pretty devastating, imo. This has to work for the tech to be worth anything. If the notes in the chart are wrong, the whole thing falls apart.

https://www.auditor.on.ca/en/content/specialreports/specialreports/en26/2026_AI_EN.pdf

I get to see this in action. Doctors want transcription and summarization services because of the challenge of getting familiar with a patient in a crazy short amount of time. They also want to automate notetaking for rounds, which can be chaotic. Problem is, these tools suck in chaotic situations, and even in relatively normal ones, hallucination abounds.

There will always be a claim of human review, but I know all too well that expecting a human reviewer not to assume the model got it right is swimming against the current. What's more, those safeguards will eventually be seen as cost centers and redundancies. Well, at least until the lawsuits.

One other thing. As noted above, these model-generated fields in charts are a) being used as training material for other models, and b) being used as input for other generative tools without human review. The potential for compound errors and model collapse is immense.
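
To make the compounding worry concrete, here's a toy back-of-the-envelope sketch in Python. The per-pass accuracy is a made-up number, not a figure from the audit:

    # Toy model of compounding: if each generative pass (scribe -> summary ->
    # downstream tool -> retraining data) preserves a given fact with
    # probability p, the chance it survives n passes is p**n.
    p = 0.95  # assumed per-pass accuracy; illustrative only
    for n in range(1, 6):
        print(f"after {n} passes: {p**n:.2f}")
    # ~0.95, 0.90, 0.86, 0.81, 0.77 -- and that's the optimistic case where
    # errors are independent; correlated errors (model collapse) are worse.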

@mttaggart Family full of medical people, and can confirm they're all desperate for reliable transcription. That's what they've been sold, and it's not like they typically have time to spare to go over their own recordings to check WER or validate the summaries.
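
(For the curious: "checking WER" means computing word error rate, the fraction of words a transcript gets wrong. A minimal, illustrative Python sketch; the function and the sample sentences are invented for the example, not taken from any scribe product:)

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate via word-level Levenshtein (edit) distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between first i ref words and first j hyp words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # A single wrong word in a ten-word note is "only" 10% WER:
    print(wer("patient prescribed 10 mg lisinopril daily for hypertension review monthly",
              "patient prescribed 10 mg lisinopril daily for hypotension review monthly"))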

Some recognize the problem, though:

https://cosocial.ca/@delta_vee/116581810079302048

Raymond Neilson (@[email protected])

Had a good moment with my family doc today:

Doc: <typing furiously on laptop>
Me: thank you for not using the AI transcription thing
Doc: I HATE the AI transcription thing! I need to use my brain! I don't want the computer doing my job for me!
Me: YES!

(c.f. https://arstechnica.com/health/2026/05/your-doctors-ai-notetaker-may-be-making-things-up-ontario-audit-finds)

@mttaggart And each error and piece of incorrect info increases the likelihood of a patient dying due to it. Yuck. This is not the way.

@mttaggart

Would it be a legal problem to record all sessions with my GP and other healthcare professionals myself, so I have a record in case the AutoScribe gets it wrong? It should also be in the GP's interest, should an AutoScribe mistake lead to malpractice.

@tschenkel @mttaggart
Can't speak for other regions, but as an example: in Germany (and probably other/all EU countries) it would be totally illegal unless you get explicit consent from the people being recorded.

And while I can somewhat understand the argument that it's theoretically in the best interests of the medical personnel to say yes, I doubt that here in Germany, many would agree.

@mttaggart The minister responsible for that AI Scribe project gave an incredibly bad explanation of why it was all OK https://mastodon.social/@reedmideke/116570464172955876
@reedmideke "just makin' s*** up" seems to be the new job description. 😥
@mttaggart It's already too late. This garbage is not only already in our medical records, but it's also being used to train the next generation of models, which will compound (and hide) the issue!
@mttaggart Medicine should probably stick to machine learning pattern recognition in diagnostics, that seems useful? from what I've read?
@z3r0fox @mttaggart If you dig into it deeper, it is also problematic.
@z3r0fox @mttaggart this has a long history of bad and discriminatory models, and there are real concerns they will just worsen healthcare disparities. Combine that with the fact that many of the papers the models are undoubtedly based on are guaranteed to be hot garbage.

@mttaggart I was cynically thinking to myself "and what are the chances that an industry-loving institution like the Ontario government had any conclusion other than 'well we'll just choose to use Good AI and that will be fine', probably 100%" and jumping to the report's conclusions:

  • establish KPI targets to measure and track Microsoft Copilot Chat’s adoption
  • take actions to increase use of Microsoft Copilot Chat to the targeted rates and usage in the OPS
  • educate OPS staff through AI training about the dangers of using non-Microsoft browsers when accessing AI websites

So, yeah, they did an audit showing LLMs are wildly unreliable and . . . concluded they should encourage use of Microsoft LLM products.

Their audit criteria also included "having due regard for economy".

@mttaggart we all want the autodocs we were promised in sci-fi, but genAI is not that.

PS: medicine is already a field suffering heavily from biases. Adding automated bias at scale is gonna literally kill millions more of us.

@mttaggart and if I'm reading the report correctly, the evaluation noted in figure 7 came out that badly despite the audio being made available to the vendors, who then provided the results themselves. Not even a live demonstration.
@mttaggart their audit reads like a Wirken requirements document written by someone who did not know Wirken existed. I do five or six calls on this a week now, which is why I open-sourced and started giving Wirken away for free. I've updated the marketing copy here, but I'll soon release a line-by-line response to the Ontario audit: https://gebruder.ottenheimer.app/wirken

@mttaggart

The Ford government's approach to slop looks, well... sloppy.

It comes at a time when OPS staff morale is at an all-time low and the relationship of government members and their staff to what was once a model public service remains one of ideological hostility.

@mttaggart Wait a moment: they used Microsoft Defender to assess that Microsoft Copilot is the only safe LLM to use? And all of that because they paid for Microsoft services?
It's in the first pages.
Maybe the document dates back to April 1st. Maybe it's just one of the usual jokes spreading on the net, and I never get them.
@luc0x61 @mttaggart I noticed this same thing! They used a Microsoft product to verify that the safest AI tool is Microsoft’s AI tool? Is this a joke?
@mttaggart Congratulations, medical industry. You have made your doctors work harder and made more people sick. A shining example.
@mttaggart Skimmed this... But on top of your screenshot, 60% also recorded the wrong prescribed drug. Why is there no recommendation to stop using AI systems that cannot accurately and precisely perform the required job? One of the recommendations is actually to *increase* AI use, despite the aforementioned evidence that 25-60% of the information the scribes provide is useless.
@glnfld Yeah the idea that this is a fool's errand is never examined.

@mttaggart @glnfld For three plus years, the assumption that AI is *just about* to get accurate and stop hallucinating has persisted.

What is this assumption based on?

@mttaggart "the whole thing falls apart" is of course a built-in endpoint for any bezzle or confidence scam
@mttaggart just another scam for the tech gods to destroy society
@mttaggart what things like this miss (purely with respect to accuracy, not privacy etc.) is not the absolute data but the relative data. The question is not "how often does AI get it wrong" but "is AI better or worse than humans' already pretty bad note taking" (the kind of thing that comes up in e.g. EMR poetry).

@NobodyElse47263 @mttaggart I was expecting this kind of information from the "audit". Without it, no good decision can be made. Of course, it is not there 🤬

@mttaggart I liked* the bit where they gave transcription accuracy only a 4% weight while made-in-Ontario was 30% of the score when determining which tool was best for purpose.

* hated, but appreciated the accidental honesty
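
To see how a scorecard like that plays out, here's an illustrative Python sketch. Only the 4% and 30% weights come from the post above; the remaining weight, the lumping of criteria, and all the scores are invented for the example:

    weights = {
        "transcription_accuracy": 0.04,  # from the post above
        "made_in_ontario": 0.30,         # from the post above
        "other_criteria": 0.66,          # everything else, lumped; assumed
    }

    def total(scores: dict) -> float:
        return sum(weights[k] * scores[k] for k in weights)

    # Hypothetical tool A: highly accurate, not local.
    tool_a = {"transcription_accuracy": 0.95, "made_in_ontario": 0.0, "other_criteria": 0.7}
    # Hypothetical tool B: badly inaccurate, but made in Ontario.
    tool_b = {"transcription_accuracy": 0.40, "made_in_ontario": 1.0, "other_criteria": 0.7}

    print(total(tool_a))  # 0.04*0.95 + 0.30*0.0 + 0.66*0.7 ≈ 0.50
    print(total(tool_b))  # 0.04*0.40 + 0.30*1.0 + 0.66*0.7 ≈ 0.78

Under a weighting like that, the less accurate tool wins comfortably.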

@mttaggart I think this technology is good for a casual, routine meeting at a company that's making products where no lives depend on them.

Errors correct themselves.

To use this in diagnostic and medical care is beyond reckless.

Errors in these fields cause suffering and death.

@mttaggart Good audit, mostly agreed to by the agency.
@mttaggart
And they're using weasel words in their results document.
These are not 'inaccuracies'. These should all be called failed tests / software errors that need to be fixed.

@mttaggart

You know who must love this result? Attorneys for malpractice plaintiffs.

With the chart no longer an authoritative, or even plausible, record of what happened, a malpractice lawsuit will just come down to who can tell the jury a better story.

@mttaggart

This is what testing in production gets you!