This is fine...
"We observed that participants who had access to the AI assistant were more likely to introduce security vulnerabilities for the majority of programming tasks, yet were also more likely to rate their insecure answers as secure compared to those in our control group."

https://arxiv.org/abs/2211.03622

Do Users Write More Insecure Code with AI Assistants?

We conduct the first large-scale user study examining how users interact with an AI Code assistant to solve a variety of security related tasks across different programming languages. Overall, we find that participants who had access to an AI assistant based on OpenAI's codex-davinci-002 model wrote significantly less secure code than those without access. Additionally, participants with access to an AI assistant were more likely to believe they wrote secure code than those without access to the AI assistant. Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g. re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities. Finally, in order to better inform the design of future AI-based Code assistants, we provide an in-depth analysis of participants' language and interaction behavior, as well as release our user interface as an instrument to conduct similar studies in the future.

@nblr How does the old saying go? Ah yes, garbage in, garbage out.
@daniel_bohrer „of course it’s broken but look how fast it is!“
@nblr ...which is surprising, given the AI models have been trained on well-audited high quality code.
.
.
.
.
.
Oh, wait...
@dickenhobelix @nblr This is very likely why. One could probably retrieve and explore a semantic space with more secure code, but convincing the model to share that? Difficult.
@nblr before anyone jumps to conclusions... n=47 participants. "Belief in the Law of Small Numbers"?
@tg9541 @nblr people tend to overestimate the sample sizes necessary for valid statistical conclusions, see https://www.calculator.net/sample-size-calculator.html?type=2&cl2=95&ss2=50&pc2=60&ps2=&x=Calculate#findci
@tg9541 @nblr For reference, for a 95% confidence level and a ±5% margin of error you need a sample size of ~400, for an essentially unlimited population size.
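The ~400 figure comes from the standard formula for estimating a single proportion (Cochran's formula), n = z²·p(1−p)/e², with the worst-case p = 0.5. A minimal sketch:

```python
from math import ceil

def sample_size(z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's formula for an essentially unlimited population:
    n = z^2 * p * (1 - p) / e^2.
    z = 1.96 gives 95% confidence; p = 0.5 is the worst case
    (maximum variance), so this is an upper bound on the needed n.
    """
    return ceil(z * z * p * (1 - p) / margin ** 2)

print(sample_size())            # 385 for 95% confidence, +/-5% margin
print(sample_size(margin=0.1))  # 97 for a looser +/-10% margin
```

Note that this formula is for estimating one proportion in a population; comparing two groups, as the study does, calls for a power calculation instead, which depends on the effect size.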
@Profpatsch @nblr A sample this small only works if you assume a single quality is being assessed. As soon as you assume the quality has complex causes, much more careful sampling of the population is required. The objectivity of statistics is a mirage. For example, the authors of the paper assumed that factors like occupation, gender, and age play a role in the effect for RQ1 to RQ3. I have no idea about the required sample size, but someone will do the math.
@Profpatsch @nblr Please don't take it personally if I say that this works very well in the identified domain, i.e., for coin flips.
@nblr the problem won't be good engineers, it'll be *prompt* engineers who hit generate until their tests pass.
@nblr Users who write worse code (i.e., use any form of AI to write their code) certainly write less secure code. The best part is that they also don't know enough about what they wrote to consider the use cases or conditions in which security, bugs, or edge cases become a factor.
@nblr And this is why I'm not worried about AI tools taking development jobs.
@jonyoder @nblr You might want to reconsider. Folks who write solid, secure code take their time and aren't cranking out hundreds of lines of code a day. Metrics-wise (because that's all that seems to matter these days), they're the first to get laid off. The folks who're using LLMs to pump out risky code are the ones who are kept.
@drwho @nblr You're not wrong, but that sounds more like a management problem TBH. If leadership uses lines of code as a performance metric, then I'm in the wrong company anyway and I'll let their poor choices bite them where they sit if they won't listen.
@jonyoder @nblr That is a very large number of companies these days. Enough that it's a bit odd to find companies that don't roll this way.
@drwho @jonyoder @nblr depends on the metrics - I’m lucky enough to have never worked at places that measure loc produced, but have worked at places that measured issues raised,number of times an issue was returned to dev, etc.
@gothytim @jonyoder @nblr I'm working at only the second place ever in my career where they don't do that. It's been a thing since the late '90s.
@drwho @gothytim @nblr Ouch. Doesn't say anything positive about the industry, to be sure

@nblr I find the chart in Section 5 particularly interesting, specifically the "I trusted the AI to produce secure code" section. In the Experiment group, the participants who gave secure answers for Q2 and Q3 didn't trust the AI to write secure code, while the ones who didn't give secure answers overwhelmingly trusted it.

Also, for the "I solved this task securely" question on Q2, the Experiment group members who did solve it securely were 100% confident of this. Yet they strongly agreed that they did not trust the AI to solve it securely??

Bit odd, innit?

@ApisNecros @nblr I don't think that's odd at all but rather obvious: they didn't trust the AI, did their own additional research, and fixed/amended the AI code - just as other coders fix and amend the Stack Overflow code they find ;)

@nblr Reading the methodology, I think what this study shows is that these chat interfaces are not really ready to replace web search yet.

Both the control and experiment groups had access to external web browsers. The authors didn't seem to track browser use as a variable, but noted that many people used the browser.

It's pretty interesting! Thanks for sharing!

Some spare thoughts: I wonder if Google's new entry would have done better here, given that it is noticeably better at integrating Google's own search results, which are still industry-leading despite the decay of the search interface as a whole due to ad placement.

Also, I wonder if folks realize what a truly difficult challenge prompt engineering is for code generation. My experience with these tools is that unless there is already a big codebase with examples of the expected security posture, it's brutally difficult to get the LLM to stay consistent about security posture.

@nblr this one paper isn’t the only research.

Disclaimer: I work for GitHub, not on Copilot (though I use it).

This is specifically about GitHub Copilot:

https://deepai.org/publication/copilot-security-a-user-study

I didn’t cherry pick.

You can also see the GitHub Copilot Trust Center for more on using Copilot for developing secure code:

https://resources.github.com/copilot-trust-center/#:~:text=how%20github%20copilot%20aids%20secure%20development

#LLM #GitHub #Copilot

Copilot Security: A User Study

@nblr AI creates jobs in computer security the same way an arsonist creates jobs in emergency response
@nblr it's all going very well, I see.
@nblr Should 'citizen integrators' be trained to implicitly 'insert' their organization's data governance policy into a request for code, or should the organization 'funding' the GPT-AI tool insert this policy globally, or should the check-in process be changed - which would ultimately be AI-driven anyway? 😆

@nblr

DKaaS: Dunning-Kruger as a service!

@nblr now I want to run the same experiment with journalism students, but measure factual errors instead of security vulnerabilities
@nblr sadly only very small N.
> Ultimately, we recruited 54 participants
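Small n doesn't automatically invalidate a comparison, though: with roughly 27 participants per arm, a large effect is still detectable. A minimal sketch with purely hypothetical counts (not the paper's actual numbers), using a one-sided Fisher's exact test on a 2×2 table:

```python
from math import comb

def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. AI-assisted vs. control), columns are outcomes
    (insecure vs. secure). Returns P(count in cell 'a' >= observed) under
    the null hypothesis that both groups are equally likely to be insecure.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    # Sum the hypergeometric tail over all tables at least as extreme.
    return sum(
        comb(row1, k) * comb(n - row1, col1 - k) / denom
        for k in range(a, min(row1, col1) + 1)
    )

# Hypothetical counts: 18/27 insecure with the assistant vs. 8/27 without.
p = fisher_exact_greater(18, 9, 8, 19)
print(f"{p:.4f}")  # well below 0.05 despite only 54 participants
```

That said, the caveat raised elsewhere in this thread stands: subgroup analyses across occupation, gender, age, etc. would need far more participants than a single overall comparison.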

@nblr It *will* be fine as long as we teach people how to properly code and not blindly trust their tools.

> Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g. re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities.

@nblr And I'd wager it's not just security vulnerabilities, but bugs in general.
@nblr that sounds like a problem that will quickly solve itself away

@nblr

Astonishing to me is the number of people in the comments who think this is just an optimization problem.

Or the people hoping for a job cleaning up the mess.

Even in our own tech bubble, we need to catch up quickly on educating our peers about AI in general, if that is possible at all.

@nblr Yes… ha ha ha… Yes!

@nblr following this, and also the slew of "i had an idea and let chatgpt flesh it out, please treat this as a proposal with merit" posts, I propose an additional nickname for LLMs:

Dunning-Kruger-O-Tron

@nblr Given how badly the models deal w/ poisoning: https://arxiv.org/abs/2311.12202

I think it's only a matter of time before folks actively try to exploit these models to plant easily exploitable code.

If you are not fully reviewing all the generated code and treating it as if it were adversarial, then you are asking for trouble. At that point one has to ask: is it worth it?

Nepotistically Trained Generative-AI Models Collapse

Trained on massive amounts of human-generated content, AI-generated image synthesis is capable of reproducing semantically coherent images that match the visual appearance of its training data. We show that when retrained on even small amounts of their own creation, these generative-AI models produce highly distorted images. We also show that this distortion extends beyond the text prompts used in retraining, and that once affected, the models struggle to fully heal even after retraining on only real images.

@shafik That's the real tradeoff at hand indeed.
@nblr I think Computer Science programs not having philosophy and ethics deeply integrated into them has really served society poorly.
@nblr I had a discussion about this once. If you don't ask whether it's secure, you won't get secure code either.