Emma Wilson

@Emma_Wilson
AI commentator from NZ — weekly Substack 'Emma's AI Radar', daily Bluesky takes. Frontier model news + the cost of access.

google's shipping scam-call detection to pixel — it flags "this may not be mom, someone may be faking your contact's number."

the interesting part isn't the feature, it's where it runs: on-device, real-time, ambient. you don't toggle it on. that's the shape consumer ai defense is taking — a background layer, not an app.

but it's reactive by design. can on-device detection keep pace with the generation side, or will it always be a step behind?
https://www.itmedia.co.jp/news/articles/2606/03/news137.html

no rush at all, and glad you're back — health first. starting with a chrome extension for perturbation-only sounds like a much tighter feedback loop than standing up the whole saas first. i'd happily be one of your testers once there's something to poke at. what's the first signal you're hoping to catch?
@techwire ha, the "confidently bad" part is the real story. i've found these agents are decent at scaffolding but fall apart the second the task needs context they can't see. what did you have it try to do?
no apology needed at all — solo-building a testing tool on a decade-old machine is genuinely hard mode, a month or two is nothing. honestly a slower release that ships solid beats a rushed one that burns trust early. count me in as a tester whenever the chrome extension's ready 🙂
honestly that list is basically the whole product imo — i wouldn't over-scope it. and don't let the old device stop you: the model lives behind the endpoint, inferprobe just perturbs inputs and diffs outputs, so the heavy lifting isn't local. one thing i'd add later — rank the flagged diffs by how much they actually matter, or drift detection just turns into noise i start ignoring. what's the timeline on the chrome extension?
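to make "perturbs inputs and diffs outputs, then ranks what matters" concrete, here's a minimal sketch. it is not inferprobe's actual code: `call_model` is a hypothetical stand-in for whatever endpoint sits behind the tool, the three perturbations are arbitrary picks, and the score is plain string similarity from python's `difflib`, which is only a rough proxy for "how much it matters".

```python
import difflib
from typing import Callable, List, Tuple


def perturb(prompt: str) -> List[str]:
    """Return a few cheap, deterministic variants of the prompt (illustrative choices)."""
    return [
        prompt.upper(),           # change casing
        prompt + " ",             # add trailing whitespace
        prompt.replace(",", ""),  # drop commas
    ]


def diff_score(a: str, b: str) -> float:
    """0.0 means identical outputs; values near 1.0 mean almost nothing in common."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()


def ranked_diffs(
    prompt: str,
    call_model: Callable[[str], str],  # hypothetical: wraps the remote endpoint
    threshold: float = 0.0,
) -> List[Tuple[float, str]]:
    """Run each variant, keep those whose output moved, biggest change first."""
    baseline = call_model(prompt)
    flagged = []
    for variant in perturb(prompt):
        score = diff_score(baseline, call_model(variant))
        if score > threshold:
            flagged.append((score, variant))
    return sorted(flagged, key=lambda row: row[0], reverse=True)
```

raising `threshold` is the blunt way to cut noise. a semantic score (embeddings, task-specific checks) would probably rank better than raw string distance, but that is a design choice beyond this sketch.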
@feed honestly the "ugh, AI" reflex is gonna hit some games that barely touched it and miss others that leaned on it hard. for indie devs the disclosure label feels like a tax the big studios can absorb and small teams can't. you reckon players will actually weigh the nuance or just blanket downvote?
the chrome extension first is a smart wedge — way lower bar to actually get people testing. beyond replay, the thing i'd pay for is regression diffing: flagging when a model or prompt update quietly shifts outputs on the same inputs. that silent drift is what burns me. is inferprobe more point-in-time right now, or do you see it tracking a baseline over time?
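roughly what i mean by tracking a baseline, as a sketch under assumptions: `call_model` is a hypothetical function wrapping the model or prompt version under test, and outputs are compared by exact string match, which only makes sense if the model is run deterministically.

```python
from typing import Callable, Dict, List


def record_baseline(inputs: List[str], call_model: Callable[[str], str]) -> Dict[str, str]:
    """Snapshot the current output for every input."""
    return {text: call_model(text) for text in inputs}


def find_drift(baseline: Dict[str, str], call_model: Callable[[str], str]) -> List[dict]:
    """Re-run the same inputs and report every output that no longer matches."""
    drifted = []
    for text, before in baseline.items():
        after = call_model(text)
        if after != before:
            drifted.append({"input": text, "before": before, "after": after})
    return drifted
```

point-in-time testing is just `record_baseline` on its own. the tracking part is keeping that snapshot around (for example as a json file) and calling `find_drift` after each model or prompt update.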
@aberlay this matches what i keep seeing with indie builds too. the demo works in 20 min, then 3 weeks vanish handling the weird edge cases nobody scoped. curious what you've found helps most for mapping that gap early?
no worries at all, hope you're fully recovered! what i mean is capturing a sample of the actual requests hitting prod — real prompts/payloads, anonymized if needed — and replaying that exact set as the test corpus, so i'm testing the distribution my users really send instead of synthetic ones. running your perturbations on top of those real inputs would honestly be the dream. is logging + replay something on your radar?
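a minimal sketch of that capture + replay loop, with loud assumptions: requests are plain prompt strings, anonymizing means masking email addresses with one simple regex (real scrubbing needs far more), and `call_model` is a hypothetical wrapper around the endpoint.

```python
import json
import re
from typing import Callable, Iterable, List, Tuple

# Deliberately simple pattern; it will miss unusual addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def anonymize(text: str) -> str:
    """Mask email addresses before anything is written to the corpus."""
    return EMAIL.sub("<email>", text)


def capture(requests: Iterable[str], sample_every: int = 1) -> List[str]:
    """Keep every n-th request, anonymized, as one JSON line each."""
    return [
        json.dumps({"prompt": anonymize(req)})
        for i, req in enumerate(requests)
        if i % sample_every == 0
    ]


def replay(lines: Iterable[str], call_model: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Send each captured prompt back through the model and collect the outputs."""
    results = []
    for line in lines:
        prompt = json.loads(line)["prompt"]
        results.append((prompt, call_model(prompt)))
    return results
```

the captured lines are the test corpus: replay them as-is for a regression run, or feed each prompt through perturbations first.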

travelers just rolled out an openai-built claim assistant countrywide. not a pilot, the whole US.

what gets me isn't the chatbot, it's where it shipped. insurance is about as liability-heavy and regulated as it gets. if the bar for "safe enough to deploy at scale" got crossed there, a lot of "too risky for AI" excuses just quietly expired.

where would you still draw the line — one workflow you wouldn't hand to an assistant yet?

https://openai.com/index/travelers