"We conduct the first large-scale user study examining how users interact with an AI Code assistant to solve a variety of security related tasks across different programming languages. Overall, we find that participants who had access to an AI assistant... wrote significantly less secure code than those without access. ...participants with access to an AI assistant were more likely to believe they wrote secure code than those without access to the AI assistant. "

https://arxiv.org/abs/2211.03622

Do Users Write More Insecure Code with AI Assistants?

We conduct the first large-scale user study examining how users interact with an AI Code assistant to solve a variety of security related tasks across different programming languages. Overall, we find that participants who had access to an AI assistant based on OpenAI's codex-davinci-002 model wrote significantly less secure code than those without access. Additionally, participants with access to an AI assistant were more likely to believe they wrote secure code than those without access to the AI assistant. Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g. re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities. Finally, in order to better inform the design of future AI-based Code assistants, we provide an in-depth analysis of participants' language and interaction behavior, as well as release our user interface as an instrument to conduct similar studies in the future.

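For context, "adjusting temperature" refers to the sampling-temperature control participants had in the study's UI. The study's actual harness isn't reproduced below; this is just a minimal sketch of what that knob looks like against the model the study used (codex-davinci-002, since retired), via the legacy pre-1.0 OpenAI Python SDK, with illustrative prompt and parameter values:

    # Sketch only: query code-davinci-002 at a chosen sampling temperature.
    # Assumes the legacy SDK (pip install "openai<1.0") and OPENAI_API_KEY set;
    # the model itself has been retired, so treat this as historical.
    import openai

    def complete(prompt: str, temperature: float) -> str:
        resp = openai.Completion.create(
            model="code-davinci-002",
            prompt=prompt,
            max_tokens=256,
            temperature=temperature,  # 0.0 ~ deterministic; higher = more varied samples
        )
        return resp["choices"][0]["text"]

    # e.g. a low-temperature request for security-sensitive code:
    print(complete("# Python: encrypt a string with AES-GCM\n", temperature=0.2))

The participants the study links to fewer vulnerabilities were the ones who treated this loop skeptically: re-phrasing the prompt and adjusting temperature rather than accepting the first completion.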

One thing I've never liked about this diagram is that it assumes there's an upward slope out of the "trough of disillusionment".

There's no guarantee that any of these AI bots will ever become any good.

Another study showed that ChatGPT's performance varied significantly from month to month, and that over the period studied it got worse at many tasks.

https://arxiv.org/abs/2307.09009

How is ChatGPT's behavior changing over time?

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenity to follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March in this task. GPT-4 became less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better at multi-hop questions in June than in March, while GPT-3.5's performance dropped on this task. Both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the "same" LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.

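The paper's bottom line is "continuous monitoring". A minimal sketch of what that could look like, in the spirit of their prime-vs-composite task: pin a fixed question set, re-run it on a schedule, and log the score. query_model is a hypothetical stand-in for whatever completion call you actually make (e.g. the snippet above):

    # Sketch only: a longitudinal drift check on a frozen eval set.
    import datetime
    import json
    import random

    def is_prime(n: int) -> bool:
        # Trial division; fine for the small n used below.
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

    def query_model(question: str) -> str:
        # Hypothetical stand-in: wire this to a real API call and return the
        # model's raw text answer.
        raise NotImplementedError

    def run_eval(seed: int = 0, n_questions: int = 50) -> float:
        rng = random.Random(seed)  # fixed seed => identical questions every run
        correct = 0
        for _ in range(n_questions):
            n = rng.randrange(1_000, 20_000)
            answer = query_model(f"Is {n} a prime number? Answer Yes or No.")
            truth = "yes" if is_prime(n) else "no"
            correct += answer.strip().lower().startswith(truth)
        return correct / n_questions

    if __name__ == "__main__":
        record = {"date": datetime.date.today().isoformat(),
                  "accuracy": run_eval()}
        with open("llm_drift_log.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")

Each run appends one dated accuracy snapshot; diffing that log over a few months is exactly the kind of monitoring the authors argue for.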
@solidangle If one AI bot can't be "good", make ten; if ten can't be good, make ten thousand... and so on. One of them will be good.
Now you know why corporations are all-in on the AI hype and need all the compute power they can get.
Even one successful result out of a trillion AI bots beats "hiring another human and expecting them to perform".