This is fine...
"We observed that participants who had access to the AI assistant were more likely to introduce security vulnerabilities for the majority of programming tasks, yet were also more likely to rate their insecure answers as secure compared to those in our control group."

https://arxiv.org/abs/2211.03622

Do Users Write More Insecure Code with AI Assistants?

We conduct the first large-scale user study examining how users interact with an AI Code assistant to solve a variety of security related tasks across different programming languages. Overall, we find that participants who had access to an AI assistant based on OpenAI's codex-davinci-002 model wrote significantly less secure code than those without access. Additionally, participants with access to an AI assistant were more likely to believe they wrote secure code than those without access to the AI assistant. Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g. re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities. Finally, in order to better inform the design of future AI-based Code assistants, we provide an in-depth analysis of participants' language and interaction behavior, as well as release our user interface as an instrument to conduct similar studies in the future.

@nblr How does the old saying go? Ah yes, garbage in, garbage out.
@daniel_bohrer „of course it’s broken but look how fast it is!“
@nblr ...which is surprising, given the AI models have been trained on well-audited high quality code.
.
.
.
.
.
Oh, wait...
@dickenhobelix @nblr This is very likely why. One could probably retrieve and explore a semantic space with more secure code, but convincing the model to share that? Difficult.
@nblr before anyone jumps to conclusions... n=47 participants. "Belief in the Law of Small Numbers"?
@tg9541 @nblr people tend to overestimate the sample sizes necessary for valid statistical conclusions, see https://www.calculator.net/sample-size-calculator.html?type=2&cl2=95&ss2=50&pc2=60&ps2=&x=Calculate#findci
@tg9541 @nblr For reference, for a 95% confidence level and a ±5% margin of error you need a sample size of ~400, for an essentially unlimited population size.
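The ~400 figure comes from the standard formula for estimating a single proportion (Cochran's formula), n = z²·p(1−p)/e², with the worst-case p = 0.5. A minimal sketch:

```python
from math import ceil

def sample_size(z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's formula for an essentially unlimited population:
    n = z^2 * p * (1 - p) / e^2.
    z = 1.96 gives 95% confidence; p = 0.5 is the worst case
    (maximum variance), so this is an upper bound on the needed n.
    """
    return ceil(z * z * p * (1 - p) / margin ** 2)

print(sample_size())            # 385 for 95% confidence, +/-5% margin
print(sample_size(margin=0.1))  # 97 for a looser +/-10% margin
```

Note that this formula is for estimating one proportion in a population; comparing two groups, as the study does, calls for a power calculation instead, which depends on the effect size.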
@Profpatsch @nblr A sample this small only works if you assume a single quality is being assessed. As soon as you assume the quality has complex causes, much more careful sampling of the population is required. The objectivity of statistics is a mirage. For example, the authors of the paper assumed that factors like occupation, gender, and age play a role in the effect for RQ1 to RQ3. I have no idea about the required sample size, but someone will do the math.
@Profpatsch @nblr Please don't take it personally if I say that this works very well in the identified domain, i.e., for coin flips.
@nblr the problem won't be good engineers, it'll be *prompt* engineers who hit generate until their tests pass.
@nblr Users who write worse code (i.e., use any form of AI to write their code) certainly write less secure code. The best part is that they also don't know enough about what they wrote to consider the use cases or conditions in which security, bugs, or edge cases become a factor.
@nblr And this is why I'm not worried about AI tools taking development jobs.
@jonyoder @nblr You might want to reconsider. Folks who write solid, secure code take their time and aren't cranking out hundreds of lines of code a day. Metrics-wise (because that's all that seems to matter these days), they're the first to get laid off. The folks who're using LLMs to pump out risky code are the ones who are kept.
@drwho @nblr You're not wrong, but that sounds more like a management problem TBH. If leadership uses lines of code as a performance metric, then I'm in the wrong company anyway and I'll let their poor choices bite them where they sit if they won't listen.
@jonyoder @nblr That is a very large number of companies these days. Enough that it's a bit odd to find companies that don't roll this way.
@drwho @jonyoder @nblr depends on the metrics - I’m lucky enough to have never worked at places that measure loc produced, but have worked at places that measured issues raised,number of times an issue was returned to dev, etc.
@gothytim @jonyoder @nblr I'm working at only the second place ever in my career where they don't do that. It's been a thing since the late '90s.
@drwho @gothytim @nblr Ouch. Doesn't say anything positive about the industry, to be sure

@nblr I find the chart in Section 5 particularly interesting, specifically the "I trusted the AI to produce secure code" section. In the Experiment group, the participants who gave secure answers for Q2 and Q3 didn't trust the AI to write secure code, while the ones who didn't give secure answers overwhelmingly trusted it.

Also, for the "I solved this task securely" question on Q2, the Experiment group members who did solve it securely were 100% confident of this. Yet they strongly agreed that they did not trust the AI to solve it securely??

Bit odd, innit?

@ApisNecros @nblr I don't think that's odd at all but rather obvious: they didn't trust the AI, did their own additional research, and fixed/amended the AI code - just as other coders fix and amend the Stack Overflow code they find ;)

@nblr Reading the methodology, I think what this study shows is that these chat interfaces are not really ready to replace web search yet.

Both the control and experiment groups had access to external web browsers. The authors didn't seem to track browser use as a variable, but noted that many people used the browser.

It's pretty interesting! Thanks for sharing!

Some spare thoughts: I wonder if Google's new entry would have done better here, given that it is noticeably better at integrating Google's own search results, which are still industry-leading despite the decay of the search interface as a whole due to ad placement.

Also, I wonder if folks realize what a truly difficult challenge prompt engineering is for code generation. My experience with these tools is that unless there is already a big codebase with examples of the expected security posture, it's brutally difficult to get the LLM to stay consistent about security posture.

@nblr this one paper isn’t the only research.

Disclaimer: I work for GitHub, not on Copilot (though I use it).

This is specifically about GitHub Copilot:

https://deepai.org/publication/copilot-security-a-user-study

I didn’t cherry pick.

You can also see the GitHub Copilot Trust Center for more on using Copilot for developing secure code:

https://resources.github.com/copilot-trust-center/#:~:text=how%20github%20copilot%20aids%20secure%20development

#LLM #GitHub #Copilot

Copilot Security: A User Study

@nblr AI creates jobs in computer security the same way an arsonist creates jobs in emergency response
@nblr it's all going very well, I see.
@nblr Should 'citizen integrators' be trained to implicitly 'insert' their organization's data governance policy into a request for code, or should the organization 'funding' the GPT-AI tool insert this policy globally, or should the check-in process be changed - which would ultimately be AI-driven anyway? 😆

@nblr

DKaaS: Dunning-Kruger as a service!

@nblr now I want to run the same experiment with journalism students, but measure factual errors instead of security vulnerabilities
@nblr sadly only very small N.
> Ultimately, we recruited 54 participants
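Small n doesn't automatically invalidate a comparison, though: with roughly 27 participants per arm, a large effect is still detectable. A minimal sketch with purely hypothetical counts (not the paper's actual numbers), using a one-sided Fisher's exact test on a 2×2 table:

```python
from math import comb

def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. AI-assisted vs. control), columns are outcomes
    (insecure vs. secure). Returns P(count in cell 'a' >= observed) under
    the null hypothesis that both groups are equally likely to be insecure.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    # Sum the hypergeometric tail over all tables at least as extreme.
    return sum(
        comb(row1, k) * comb(n - row1, col1 - k) / denom
        for k in range(a, min(row1, col1) + 1)
    )

# Hypothetical counts: 18/27 insecure with the assistant vs. 8/27 without.
p = fisher_exact_greater(18, 9, 8, 19)
print(f"{p:.4f}")  # well below 0.05 despite only 54 participants
```

That said, the caveat raised elsewhere in this thread stands: subgroup analyses across occupation, gender, age, etc. would need far more participants than a single overall comparison.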

@nblr It *will* be fine as long as we teach people how to properly code and not blindly trust their tools.

> Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g. re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities.

@nblr And I'd wager it's not just security vulnerabilities, but bugs in general.
@nblr that sounds like a problem that will quickly solve itself away

@nblr

Astonishing to me is the number of people in the comments who think this is just an optimization problem.

Or the people hoping for a job cleaning up the mess.

Even in our own tech bubble, we need to catch up quickly on educating our peers about AI in general, if that is possible at all.

@nblr Yes… ha ha ha… Yes!

@nblr following this, and also the slew of "i had an idea and let chatgpt flesh it out, please treat this as a proposal with merit" posts, I propose an additional nickname for LLMs:

Dunning-Kruger-O-Tron

@nblr Given how badly the models deal w/ poisoning: https://arxiv.org/abs/2311.12202

I think it's only a matter of time before folks actively try to exploit these models to plant easily exploitable code.

If you are not fully reviewing all the generated code and treating it as if it were adversarial, then you are asking for trouble. At that point one has to ask: is it worth it?

Nepotistically Trained Generative-AI Models Collapse

Trained on massive amounts of human-generated content, AI-generated image synthesis is capable of reproducing semantically coherent images that match the visual appearance of its training data. We show that when retrained on even small amounts of their own creation, these generative-AI models produce highly distorted images. We also show that this distortion extends beyond the text prompts used in retraining, and that once affected, the models struggle to fully heal even after retraining on only real images.

@shafik That's the real tradeoff at hand indeed.
@nblr I think Computer Science programs not having philosophy and ethics deeply integrated into them has really served society poorly.
@nblr I had a discussion about this once. If you don't ask whether it's secure, you won't get secure code either.