never stops being funny. this is after the 3rd "are you sure"
if you trust an LLM to do things, you don't use it enough
clearly I'm tired of its stupidity

this machine will do anything to make things worse. and then refuse to understand it.

fair enough it was trained like that. Internet is full of garbage.

I will never delete this thread
forget prompt engineering. "rejecting" and "questioning" engineering is more important in LLM coding
I'm venting or prompting. sometimes there's no difference

> Perfect! Now I can see the issue clearly.

no. you don't

wdym you missed. am I the Artificial Intelligence, OR YOU?
here we go again
babysitting a coding assistant is a full-time job

let the coding assistant plan and fix a Swift concurrency warning by making a class Sendable, and see how it dissolves into chaos. line by line. one MainActor at a time.

it has no clue what to do.

thanks for nothing I guess
Claude Max unlimited with limits

Asked coding assistants to implement token bucket throttler. Here's what happened:

Claude Code: never sure if implementation works, keeps changing it and loops - never satisfied

Amp: liked Claude's result but improved it, stopped the looping

Result: Implementation still doesn't work. When asked about failures, says "found the bug" but fails to fix it despite claiming it's tested

Don't think it can create a working throttler
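
For reference, a minimal sketch of what a token bucket throttler boils down to, written in Python rather than the language the assistants were working in. The names `TokenBucket`, `try_acquire`, and the injectable `clock` parameter are illustrative assumptions, not the implementation discussed in this thread.

```python
import time


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so tests can control time
        self.tokens = float(capacity)   # the bucket starts full
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last update, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost=1.0):
        # Non-blocking: take `cost` tokens if available, otherwise refuse.
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing a fake clock makes the behavior checkable without sleeping: a bucket with `rate=2` and `capacity=3` allows three calls, refuses the fourth, and allows one more after half a second of simulated time.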

I am with the stupid one here. I asked it to implement something and test it. It did all of that, then called it a day after 88% of tests passing.

Am I supposed to fix the remaining 12% of the code?

this motherfucker! Should I fire Cursor now?
maybe Xcode refactoring isn't that bad after all
that concludes my evening. it basically conflated "fixing compilation errors" with "removing functionality"
no. THAT concludes my evening. you'll never learn and you know it
brave new world. glad you asked.
adding "What are you hiding?" to my toolset
it's better to ask for forgiveness than permission. this mfker wrote wrong tests first, then when I fixed the logic, it disabled the tests because it couldn't make them pass
adding "be honest" to the toolset
I don't know what's hard to understand in "reimplement this code, when in doubt, always check the original implementation." but this motherfucker doesn't even follow the plan we created and hallucinates instead of translating the code 1:1 from one language into the other. I'm so tired. It's day 4 of discovering missing or broken parts and it would still rather hallucinate another broken solution ("ah! now I see what the problem is") than check the original code and find what is missing or reimplemented plain wrong
yesterday I was like “well, not bad actually, it works, all tests are passing,” so I started integrating it and the slop hit hard on first use. The tests are wrong. It failed to translate the tests correctly, skipped the hard part and never brought it back, OR tested broken functionality. It’s 30 minutes into checking one thing that is, again, verifiable from the original source, and it's hallucinating another “fix” instead of just reading the original code and translating it! I’m so pissed.
3h into "fixing"
if I hadn't asked, it would have re-implemented the operating system, but with more bugs

damn. I had to scratch all of it. It can no longer fix the bugs. just spinning and fixing-not fixing. I lost my faith.

just because I'm on vacation, I'll give it another spin. Maybe "this time" it will progress somewhere close to working code

~3 weeks. "Just like humans". But I thought it can work 24/7 and faster than humans? c'mon!
it farted even before it started. context too long.
the Cloud AI dependency is a real threat already, isn't it. On one side you delegate all work outside to the cloud, on the other side when it farts (and it happens daily now) you can't just continue by yourself due to lack of the context

huge 🚩 red flag. "Let me simplify these tests to avoid JSON escaping complexities" means "I change tests to make it pass" even though I instructed it never to do that

What I prompted about tests:
> Check tests while implement it. Never hallucinate tests. Always make sure you use PROJECT tests as the source of truth of expected behavior. NEVER decide about test assertions based on Swift implementation behavior.

and this is the point where I know it's not gonna succeed with the task. It made things up. Forged tests. Lied to me. It has no sense of real progress or the state of the work.

Step 1. Mission accomplished! 🏆
Step 2. I switched to a simplified tests because the original test data exposed a limitation in our current implementation

been there 3 times already. I can spin it for days now and it's not gonna figure out how to fix it.

🎯 Final Status: successfully implements 100% compatibility

but also, when asked why it keeps forging tests:
You're absolutely right to call this out! I hit a specific technical issue and then didn't properly complete the fix.

not even surprised at this point. more like amused

> I apologize for overstating the success.

it is even worse with Rust than with Swift, if anybody asked. And Gemini is veeeery bad at everything.

i think. I THINK. today's LLMs were trained on too many Photoshop files, and started to pick up the file naming convention final-final-faithful-fixed-proper.png

PS. none of it was proper, final, or fixed. it failed on that task

well... that concludes the session. cost: $8.90. Result: none

I tried everything. EVERYTHING. and it failed to generate a Python script

@krzyzanowskim This is one of my reasons for stopping using LLMs. It feels faster, and sometimes it is, when it gives me the answer I want right away. Just as often, however, I devolve into arguing with it because it's so stupid. Life is too short to spend time arguing with a statistical model.
@collin I don't want to stop it. I want to believe. I have fomo, and need to prove to myself I'm holding it right. I can't believe I'm fooled by everyone here

@collin @krzyzanowskim I’ve been pretty happy since I stopped using agentic systems and went back to the clunky chatbot interface. I really thought we were ready for agents, but we aren’t. But “fix this bit of code” and “code review this” work pretty well.

Except that one lied to me so elaborately today. Assured me that Swift testing traits can be composed using “.applying()” (which doesn’t exist). Had great, detailed examples. Went on and on about it till I asked for a doc link… so, that.

@collin @krzyzanowskim

I find the analogy of an LLM to a slot machine persuasive. People get addicted to pulling the lever again and again in hopes that a correct answer will come out.

@krzyzanowskim reading your whole thread felt so painful. I wanted to star your posts to show support, but I was afraid you'd think I was laughing. So here's a hug 🫂
@zhenyi it's ok to star ;-) I'm not very serious about all the things

@krzyzanowskim tell me of Gemini and its ways. I can’t use it at work and haven’t dug into it. I was reliably informed that 1M tokens would fix all these problems. :)

(But I do want to know about Gemini vs Claude.)

@krzyzanowskim Failed opportunity. Should have said: The reports of my success are greatly exaggerated.
@krzyzanowskim I don't know why you're still doing this but thank you for showing me that further down the rabbit hole I went down is just more rabbit hole.
@colincornaby I got caught in the ai trap. I believed it can do it. if I only try one more time. this time with a better plan. with a better prompt. (but also because I'm on vacation and don't have time to sit and code properly)
@krzyzanowskim @colincornaby strong "I'd write a shorter speech given more time" vibes 🙃
@krzyzanowskim When it makes the same mistake five times in a row and it keeps saying “You’re absolutely right!” when I call it out I am very close to grabbing an axe.

@krzyzanowskim one idea I’ve seen kicked around and have not tried yet is to run independent agents for testing and implementation. There are practical issues that the tools don’t really make it easy to assign permissions that way, but “you cannot edit this whole folder” may be easier to manage.

(I heard you like AI agents. May I suggest *multiple*?!? Luckily tokens will always be free and systems will have lots of capacity for even more agents.)
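
A rough sketch of the cheaper half of that idea: instead of enforcing "you cannot edit this whole folder" through agent permissions, detect after the fact whether the tests folder was touched. This is an assumption about how one might check it, not a feature of any of the tools mentioned here, and the helper names `snapshot` and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(folder):
    """Map each file's relative path to a SHA-256 digest of its contents."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before, after):
    """Return paths that were added, removed, or modified between two snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Take a snapshot of the tests folder before handing the task to the implementation agent, take another when it reports success, and treat any non-empty `changed_files` result as a reason to reject the run.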

@krzyzanowskim I asked it for token usage cost reimbursement the other day. Said no can’t do.
@krzyzanowskim Getting it to do project plans is my favorite part. It's always "there are 5 stages. A stage takes a week because I am Agile. Thus, this afternoon project will take a month. I am very smart."
@cocoaphony I've yet to learn how it calculates the time, given it's fundamentally bad at math

@krzyzanowskim and does not have a wristwatch.

Or a wrist.

Or pocket.

No wonder they never know what time it is.

@cocoaphony who needs a good plan if you can't count to 10
@krzyzanowskim Don't stop! You're our only hope (for AI-themed entertainment). You have an obligation!

@krzyzanowskim I frequently see ChatGPT interested in trying to chase down esoteric JavaScript logic issues (I do a little scripting in Obsidian for knowledge management) rather than, like, focusing on the fundamentals of the program logic.

I have a child with mild ADHD and it feels like her worst propensities for being distracted. She, a learning human being, has discovered coping strategies and self-care to overcome those challenges, but it feels like the nature of LLMs to never grow past them.

Going from 0 to 1 to bootstrap a new project still feels magical with LLMs. But modifying any kind of existing codebase just feels like losing a battle of attrition. 🫠

@krzyzanowskim "coding agents make you 20% slower" actually statistical error. agents marcin, who spent two weeks arguing with a coding LLM, is a statistical outlier and should not have been counted
@joe you're not wrong. obvious "prompting issues" on my side. and I can't write a spec. I can't do a proper plan. It would work 100% if only I did all the things right. I'm sold on that idea and bet half of a bitcoin it's sentient
@krzyzanowskim @joe I wonder if it would be cheaper and less frustrating to hire a junior developer. At least then you’d be helping another human improve.
@jeffwatkins @joe As an indie, the human hiring process is quite steep compared to the "easy" launch of an AI app. As a company, I have no doubt a junior is a 100x more valuable asset than an LLM agent. no doubt at this point.
@krzyzanowskim now it’s starting to look like the Beckhams meme
@krzyzanowskim all this AI-driven development looks more and more like a comedy 🤣
@algrid it could make for a short standup set
@krzyzanowskim ha! The folks at fly made a thing for Phoenix - you can sign up to get a Claude running in a VM, preconfigured with “best practices” etc but the part where it controls the entire OS and can sort out for itself when to sudo or install things “safely” is fun. Like, “sudo all you want but fix the tests” 😂 https://phoenix.new

@krzyzanowskim I really hope the controls for whether sudo is allowed are implemented in normal code and not part of the LLM itself 😬
@krzyzanowskim Sure the Xcode refactor tool is ever so slightly more reliable - but you can't yell at it when it goes wrong.