AI chatbots were tasked to run a tech company. They built software in under seven minutes — for less than $1.
This is the best summary I could come up with:
AI chatbots like OpenAI’s ChatGPT can operate a software company in a quick, cost-effective manner with minimal human intervention, a new study has found.
Based on the waterfall model — a sequential approach to creating software — the company was broken down into four different stages, in chronological order: designing, coding, testing, and documenting.
After assigning ChatDev 70 different tasks, the study found that the AI-powered company was able to complete the full software development process “in under seven minutes at a cost of less than one dollar,” on average — all while identifying and troubleshooting “potential vulnerabilities” through its “memory” and “self-reflection” capabilities.
“Our experimental results demonstrate the efficiency and cost-effectiveness of the automated software development process driven by CHATDEV,” the researchers wrote in the paper.
The study’s findings highlight one of the many ways powerful generative AI technologies like ChatGPT can perform specific job functions.
Nevertheless, the study isn’t perfect: Researchers identified limitations, such as errors and biases in the language models, that could cause issues in the creation of software.
The original article contains 639 words, the summary contains 172 words. Saved 73%. I’m a bot and I’m open source!
The study said 86.66% of the generated software systems were “executed flawlessly.”
But…
Nevertheless, the study isn’t perfect: Researchers identified limitations, such as errors and biases in the language models, that could cause issues in the creation of software. Still, the researchers said the findings “may potentially help junior programmers or engineers in the real world” down the line.
That’s a B+! Fire all our engineers immediately.
As someone who uses ChatGPT daily for boilerplate code because it’s super helpful…
I call complete bullshite
The program here will be “hello world” or something like that.
Seriously?
If I google for example:
how to do loops in c#
The first result is www.w3schools.com/cs/cs_for_loop.php
In the time it took me to get to that ChatGPT would still be writing its reply.
Right, but you can't give it the variable names you're using and have it fill them in, and if you want to do something inside that loop with…
I can ask ChatGPT "Write me a loop in C# that will add the variable value_increase to the variable current_value and exit when current_value is equal to or greater than the variable limit_value, with all the variables being floats"
You won't find that answer immediately on the Internet, and you're more likely to make errors synthesizing the new syntax.
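For reference, the loop that prompt describes comes out to just a few lines of C#. A sketch (the variable names come from the prompt; the starting values are made up):

```csharp
using System;

class LoopExample
{
    static void Main()
    {
        // Variable names as given in the prompt; values are placeholders.
        float current_value = 0.0f;
        float value_increase = 0.5f;
        float limit_value = 10.0f;

        // Add value_increase until current_value reaches limit_value.
        while (current_value < limit_value)
        {
            current_value += value_increase;
        }

        Console.WriteLine(current_value); // first value >= limit_value
    }
}
```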
But you do you, I'll keep using ChatGPT and looking like a miracle worker.
If writing simple loops with ChatGPT makes you a miracle worker then you might have other problems than AI.
And even simple things break down when you ask it about using library functions (it likes to hallucinate heavily there).
I’m a senior software developer (currently .NET backend with DevOps). Writing code is probably less than 10% of my work day. And in that 10%, Visual Studio autocomplete does most of the typing. It’s frequently wrong, but it’s good enough plenty of the time.
Actually working on software consists of writing specifications, security concerns, architecture, talking management out of dumb decisions, having meetings with stakeholders or other companies, working on automatic deployments, writing unit and integration tests, refactoring, performance optimizations, database migrations, bugfixing, …
Greenfield development, writing new code from scratch, is rare, and that’s mainly what AI can do (80% correct, maybe). Most real programming work happens on existing code.
I'm not saying AI will write entire applications, but it is really useful at writing small bits of code for a human being to assemble which can greatly improve productivity.
Though if we could get it to handle stakeholder meetings I'll never use it for programming again.
Right, but you can’t give it the variable names you’re using and have it fill them in, and if you want to do something inside that loop with…
Why are you actively trying to avoid learning how to write the loop? Are you planning to have ChatGPT fill in your loop templates for the rest of your life?
But you do you, I’ll keep using ChatGPT and looking like a miracle worker.
It’s going to be slower overall than just using the reference and learning how to do it. I really, really am skeptical that a developer at the level where they need that feature is going to seem like a miracle worker to anyone other than people who are just impressed when you can do anything with a computer.
Why are you actively trying to avoid learning how to write the loop? Are you planning to have ChatGPT fill in your loop templates for the rest of your life?
First, how is this different from having your IDE fill in your loop templates?
Second, no, of course I learn how to do it and then copy/paste from my existing code like a normal person.
Third, this is much more customizable. The example I gave is pretty simple, but you can explain algorithms to ChatGPT and have it figure it out.
Finally, I'm usually doing this for a customer in a language I'll never use again. Last week it was LabView. My role has me writing proofs-of-concept for customers frequently so I'm not going to learn something I'll never use again.
It’s going to be slower overall than just using the reference and learning how to do it.
Not when you're not familiar with the syntax and don't have an IDE set up for it.
other than people who are just impressed when you can do anything with a computer.
This happens in my job a lot more than I'm comfortable with.
First, how is this different from having your IDE fill in your loop templates?
I don’t do that actually, but I think there are some differences.
That said:
I’m usually doing this for a customer in a language I’ll never use again.
Maybe you’re the one in a million exception where this approach is a benefit. Most of the time when you talk to people on the internet, they’re going to assume you’re a reasonably typical case and not the extremely rare exception.
Absolutely I can create a code for your app.
```
void myApp(void) {
    // add the code for your app here;
    Return true;
}
```

You may need to change the code above to fit your needs. Make sure you replace the comment with the proper code for your app to work.
sudo void… (:
The difficult part of software development has always been the continuing support. Did the chatbot set up a versioning system, a build system, a backup system, a ticketing system, unit tests, and help docs for users? Did it get a conflicting request from two different customers and intelligently resolve them? Was it given a vague problem description that it then had to get on a call with the customer to figure out, hunting down what the customer actually wanted before devising/implementing a solution?
This is the expensive part of software development. Hiring an outsourced, low-tier programmer for almost nothing has always been possible; the low-tier programmer being slightly cheaper doesn’t change the game in any meaningful way.
While I do agree that management is genuinely important in software dev:
If you can rewrite the codebase quickly enough, versioning matters a lot less. It’s the idea of “is it faster to just rewrite this function/package than to debug it?” but at a much larger scale. And while I would be concerned about regressions from full rewrites of the code… have you ever used software? Regressions happen near constantly even with proper version control and testing…
As for testing and documentation: This is actually what AI-enhanced tools are good for today. These are the simple tasks you give to junior staff.
Conflicting requests and iterating on descriptions: Have you ever futzed around with ChatGPT? That is what it lives off of. Ask a question, then ask a follow-up question, and so forth.
I am still skeptical of having no humans in the loop. But all of this is very plausible even with today’s technology and training sets.
If you just let it do a full rewrite again and again, what protects against breaking changes in the API? Software doesn’t exist in a vacuum, there might be other businesses or people using a certain API and relying on it. A breaking change could be as simple as the same endpoint now being named slightly differently.
So if you now start to mark every API method as “please no breaking changes for this” at what point do you need a full software developer again to take care of the AI?
I’ve also never seen AI modify an existing code base, it’s always new code getting spit out (80% correct or so, it likes to hallucinate functions that don’t even exist). Sure, for run of the mill templates you can use it, but even a developer who told me on here they rely heavily on ChatGPT said they need to verify all the code it spits out, because sometimes it’s garbage.
In the end it’s a damn language model that uses probability on what the next word should be. It’s fantastic for what it does, but it has no consistent internal logic and the way it works it never will.
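That “probability on what the next word should be” part, for what it’s worth, boils down to one mechanical step. A toy sketch of it (made-up tokens and scores; a real model does this over a vocabulary of tens of thousands of tokens, once per generated token):

```csharp
using System;
using System.Linq;

class NextTokenSampling
{
    static void Main()
    {
        // Toy vocabulary and made-up model scores (logits).
        string[] tokens = { "code", "cat", "compile" };
        double[] logits = { 2.1, 0.3, 1.2 };

        // Softmax: convert raw scores into a probability distribution.
        double[] exps = logits.Select(Math.Exp).ToArray();
        double sum = exps.Sum();
        double[] probs = exps.Select(e => e / sum).ToArray();

        // Sample the next token according to those probabilities.
        double r = new Random().NextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < tokens.Length; i++)
        {
            cumulative += probs[i];
            if (r <= cumulative)
            {
                Console.WriteLine($"next token: {tokens[i]} (p = {probs[i]:F2})");
                break;
            }
        }
    }
}
```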
You are literally describing constraints. They can be applied to an LLM the same way they can be applied to a dev team. And if you have never had to report an API change that breaks functionality… I wish I was you.
And if your full time software engineers are just running a unit test suite all day? … are you hiring?
As for modifications: Again, have you ever used an LLM? Have a conversation with ChatGPT. It will iterate on its responses. That is iterating on code.
In the end it’s a damn language model that uses probability on what the next word should be. It’s fantastic for what it does, but it has no consistent internal logic and the way it works it never will.
And that is demonstrably false and mostly just highlights that you don’t know what you are talking about. Or what language is, for that matter.
Mate, I’ve used ChatGPT before, it straight up hallucinates functions if you want anything more complex than a basic template or a simple program. And as things are in programming, if even one tiny detail is wrong, things straight up don’t work. Also have fun putting ChatGPT answers into a real program you might have to compile, are you going to copy code into hundreds of files?
My example was public APIs, you might have an endpoint /v2/device that was generated the first time around. Now external customers/businesses built their software to access this endpoint. Next run around the AI generates /v2/appliance instead, everything breaks (while the software itself and unit tests still seem to work for the AI, it just changed a name).
If you don’t want that change you now have to tell the AI what to name things (or what to keep consistent), who is going to do that? The CEO? The intern? Who writes the perfect specification?
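For concreteness, pinning that endpoint name is one small contract test. A sketch (assuming an xUnit test project and a service reachable on localhost during CI; all names here are illustrative, taken from the example above):

```csharp
// Hypothetical contract test pinning a public endpoint name, so a
// regenerated codebase that renames /v2/device fails CI immediately.
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Xunit;

public class PublicApiContractTests
{
    private static readonly HttpClient Client = new HttpClient
    {
        BaseAddress = new Uri("http://localhost:5000")
    };

    [Fact]
    public async Task Device_endpoint_name_is_stable()
    {
        // If generation silently renamed /v2/device to /v2/appliance,
        // this request would 404 and the pipeline would block the release.
        HttpResponseMessage response = await Client.GetAsync("/v2/device");
        Assert.NotEqual(HttpStatusCode.NotFound, response.StatusCode);
    }
}
```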
Yes. ChatGPT is not perfect. Because it is a general-purpose LLM. Stuff like GitHub Copilot and other software-specific approaches are a LOT better at avoiding all the noise from bad answers on Stack Overflow and proposals.
But it can still do a remarkably good job so long as you have a human looking at it after the fact. Which… is how I would describe most software engineers I have ever worked with. Even the SSEs need someone to review their code. Which… is what is being described here. Combine that with a GitLab runner and you got yourself a stew.
As for APIs and the like: Again, it feels like nobody here has ever actually worked with public software, and people think regressions don’t exist. But these are literally constraints, and they would be put in the requirements document that you give either the dev team or the LLM.
As for who is going to make that document: The same people who already do? Management.
Management and sound technical specifications, that sounds to me like you’ve never actually worked in a real software company.
You just said what the main problem is: ChatGPT is not perfect. Code that isn’t perfect (compiles + has consistent logic) is worthless. If you need a developer to look over it you’ve already lost and it would be faster to have that developer write the code themselves.
Have you ever gotten a pull request with 10k lines of code? The AI could spit out so much code in an instant that no developer would be able to debug the mess or do a real code review. They’ll just click “Approve” and throw whatever the AI decided to spit out onto the giant garbage heap.
If there’s a bug down the line (if you even get the whole thing to run), good luck finding it if no one in your developer team even wrote the code in the first place.
Management and sound technical specifications, that sounds to me like you’ve never actually worked in a real software company.
Worked at quite a few. Once you get out of college and start engaging with companies beyond “Ugh, how dare they want me to waste my precious time by talking to people” you start to learn the value of a strong management team.
And, more importantly, where those jira tickets come from.
A bog standard development flow is “all pull requests are linked to a documented issue/ticket. All pull requests require tests to pass, code coverage to not decrease, and approval by a code owner”
How does that work in reality?
Issues/tickets (just going to say issues from here on out) are created by a combination of customer feedback, identified issues by the development team, and directives from on high (which is generally related to the overall roadmap). One or more developers work on a merge request, the person who best understands the appropriate code looks it over, it is tested, and it is merged in. After enough of those cycles happen, a release is prepared and a manager signs off on it.
How does that map to an “AI” based workflow?
Issues/tickets (just going to say issues from here on out) are created by a combination of customer feedback, identified issues by the development team, and directives from on high (which is generally related to the overall roadmap). Because LLMs can provide feedback and uncertainty measurements once you get past Google Bard. And regression testing and nightly performance testing can highlight deficiencies. The issue is put into a template that includes all existing constraints, and the LLM generates a solution. Someone who understands the code checks to make sure that looks sane, it is tested, and it is merged in. After enough of those cycles happen, a release is prepared and a manager signs off on it.
And then it becomes a question of what level you start requiring humans. Because when I do a code review prior to a Release? I am relying VERY heavily on my team to have been doing their due diligence. I skim through the MRs and look for a few hot spots but it is mostly “Well, Fred and Nancy said this was good and it passes all the tests so…”
You just said what the main problem is: ChatGPT is not perfect. Code that isn’t perfect (compiles + has consistent logic) is worthless. If you need a developer to look over it you’ve already lost and it would be faster to have that developer write the code themselves.
I VEHEMENTLY disagree with this. If you don’t have developers looking over your code then you are not a software engineer. And if it takes them the same amount of time to review code as it does to write it? You aren’t working on interesting problems and are wasting vast amounts of money.
I can farm out a general task of “improve our code coverage” to an intern. They can spend a few days (or even weeks) doing that, and I can review their MRs in a few minutes. If something looks weird, I leave a comment and wait for them to get back to me. All the time I am working on much more interesting problems… or doing the same for my SSEs.
You misunderstood, I never said management is worthless. The product managers know what customers want. The product owners keep 8 out of 10 dumb ideas away from the development team. And management again leans on the development team to find out what is actually technically possible and in what time frame.
If management just threw every customer wish into a magic black box to get code out, even if that code was perfect, you wouldn’t have a product. You’d have a pile of steaming crap.
I’ve done plenty of code reviews, they only work if they are small human readable increments. Like they say: A code review of 100 lines might take an hour. A code review of 10000 lines takes thirty minutes.
AI would spit out so much code with missing context for the developer, it would be impossible to properly review.
Again: No.
If it takes you the same amount of time to review 10k lines versus write 10k lines? Either you are bad at your job or you aren’t working on a meaningful problem.
And, again, there is no difference between assigning “Implement Feature X” ticket to Stan versus StanAI. If StanAI is writing 500x the amount of code that Stan would? StanAI sucks and needs to be retrained.
And, as it stands? Using tools like Copilot or even ChatGPT, “StanAI” tends to write more concise AND more readable code. In large part because its training data is weighted toward code that has already gone through code review, was accepted, and may even be part of the production stack on half the planet.
You really don’t get the issue. Give real developers pull requests with 10, 100, 1000 and 10000 lines of changed code. I promise you, 100% that the quality on the latter two pull requests will be abysmal. No matter how good you are as a developer, you can be the best of the best, after a few hundred lines of code you’re unfamiliar with you’ll overlook obvious issues.
And let’s be honest, most developers will try to quickly get it done, read over it, hit the approve button and go back to their own work. This is how it works in the real world.
A small pull request with 10 or at most 100 lines will get a lot more scrutiny where developers actually have the mental capacity to think and reason about the code and its context.
If you let AI write a full system, or even a full module at once, spitting that code out, you’ll get large pull requests. Too large to do a meaningful review. It’s like if I threw you a pull request right now for a software you’re not familiar with and it’s 2000 lines of code. How well do you think you’ll do?
And you know what you say if someone is submitting 10k SLOC in a pull request?
“Hey Fred, document the hell out of this and split it into multiple MRs”.
And if there is no way to accomplish that ticket without it being a 10k SLOC MR? Then it was a bad ticket and whoever made it failed.
Everything you have described applies to humans too. If anything, StanAI is less likely to throw a temper tantrum if I leave a comment on his MR.
A small pull request with 10 or at most 100 lines will get a lot more scrutiny where developers actually have the mental capacity to think and reason about the code and its context.
Hmm. If only there was a way to conserve that “mental capacity” by offloading the more banal tasks. Hmmm
It’s like if I threw you a pull request right now for a software you’re not familiar with and it’s 2000 lines of code. How well do you think you’ll do?
Horribly. I would also make it a point to never use any software you are responsible for again if you think asking someone who doesn’t understand a code base to review the MR is a sane process.
Either you have no idea what you are talking about or you are a genuinely horrible manager who has been entirely dependent on having a few “rock star developers” to do your job for you. So… yeah.
You can’t have your cake and eat it too. The entire point of AI would be to off-load the development work. You write a specification, throw it into the magic AI box, then get a working code base out.
Why the hell would you invest ten times the amount of organization work to break every feature down into small human sized parts? The AI doesn’t need bite sized tickets like humans do, you can throw a complex 100 page specification at it and get out working code an hour later. But you’ll get out 100k lines of code at once in that case.
You’re treating the AI like a junior developer, give it tiny tickets it can work on, then let a human review the work. The human will do badly because they have no context (they’d have to read the entire specification first, then read the pull request, then try to reason about code that a machine wrote). Reviewing code is always more difficult than writing it, the writing part is easy.
Again. If you are not already breaking down every feature into human sized parts, you are a horrible manager. And you seem hellbent on using a specific use case that you would never use in reality because… Frankenstein Complex?
And you continue to assume that the only people who can review a pull request are outside hires with no knowledge of the codebase or problem at all. Which… again, please never work on anything useful.
I’ll say this: If you actively sabotage your employees, they will fail. It doesn’t matter if that is Stan on the third floor or StanAI in the server room.
Yeah, I’m already quite content if I know upfront that our customer’s goal does not violate the laws of physics.
Obviously, there’s also devs who code more run-of-the-mill stuff, like yet another business webpage, but those are still coded anew (and not just copy-pasted), because customers have different and complex requirements. So, even those are still quite a bit more complex than designing just any Gomoku game.
I’m already quite content if I know upfront that our customer’s goal does not violate the laws of physics.
Haha, this is so true and I don’t even work in IT. For me there’s bonus points if the customer’s initial idea is solvable within Euclidean geometry.
Well, as per above, these are extremely complex requirements, so most don’t make for a good story.
One of the simpler examples is that a customer wanted a solution for connecting special hardware devices across the globe, which are normally only connected directly.
Then, when we talked to experts for those devices, we learnt that for security reasons, these devices expect requests to complete within a certain timeframe. No one could tell us what these timeframes usually are, but it certainly sounded like the universe’s speed limit, a.k.a. the speed of light, could get in our way (takes roughly 66 ms to go halfway around the globe).
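That 66 ms figure is straightforward to check: half of Earth’s roughly 40,000 km circumference, at the vacuum speed of light, gives

$$t \approx \frac{20\,000\ \text{km}}{300\,000\ \text{km/s}} \approx 66.7\ \text{ms}$$

and that is the hard physical floor, before fiber, routers, and the devices themselves add any latency of their own.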
Eventually, we learned that the customer was actually aware of this problem and was fine with a solution, even if it only worked across short distances. But yeah, we didn’t know that upfront…
They did do management. They modeled the whole company as individual “staff” communicating with each other: CEO-bot communicates a product direction to the CTO-bot, who communicates technical requirements to the developer-bot, who asks for a “beautiful user interface” (lol) from the “art designer” (lol).
It’s all super rudimentary and goofy, but management was definitely part of the experiment.
…and it didn’t work.
Then a human had to fix it, and it took 3x as long to fix as if a human had written it originally.
This is the state of AI at the moment. It’s a giant time waste.
I've tried to have ChatGPT help me out with some PowerShell, and it consistently wanted me to use cmdlets which do not exist for on-premises Exchange. I told it as much, it apologized, and then wanted me to use cmdlets that don't exist at all.
Large Language Models are not Artificial Intelligence.