1/

Recent commentary [1]:
escalating concern over the use of the more powerful #chatbots to go beyond the #knowledge of the human expert who uses them, rather than for simple, controlled pre-processing within the domain of human-expert knowledge.

1. What is often called "hallucination/confabulation" (i.e., severe #extrapolation #uncertainty and #overfitting by the chatbot model) is apparently becoming increasingly realistic, with a declining human ability to detect it

2/

Another potential key point:

2. Apparently, an emerging cognitive bias in humans, who may tend to over-trust #chatbots, especially the more advanced ones.

In particular [1]:
"bigger, more-refined versions of #LLMs are, as expected, more accurate [...] But they are less reliable: among all the non-accurate responses, the fraction of wrong answers has increased [...] because the models are less likely to avoid answering a question — for example, by saying they don’t know"

3/

"As expected, the accuracy of the answers increased as the refined models became larger and decreased as the questions got harder [...]
The fraction of wrong answers among those that were either incorrect or avoided rose as the models got bigger, and reached more than 60 % for several refined models" [1]

The study "found that all the models would occasionally get even easy questions wrong, meaning there is no ‘#SafeOperatingRegion’ in which a user can have high confidence in the answers"

4/

The research [2] noted how "the percentage of incorrect results increases markedly from the raw to the shaped-up models, as a consequence of substantially reducing avoidance [...]
Where the raw models tend to give non-conforming outputs that cannot be interpreted as an answer [...], shaped-up models instead give seemingly #PlausibleButWrong answers [...]
This does not match the expectation that more recent #LLMs would more successfully avoid answering outside their operating range"

5/

A subtle tradeoff: "whether avoidance increases for more difficult instances, as would be appropriate for the corresponding lower level of correctness"

Alas, the
"percentage of avoidant answers rarely rises quicker than the percentage of incorrect ones":
"an involution in #reliability: there is no difficulty range for which #errors are improbable, either because the questions are so easy that the model never fails or because they are so difficult that the model always avoids giving an answer"

6/

#References

[1] Jones, N., 2024. Bigger AI chatbots more inclined to spew nonsense — and people don’t always realize. Nature. https://doi.org/10.1038/d41586-024-03137-3

[2] Zhou, L., Schellaert, W., Martínez-Plumed, F., Moros-Daval, Y., Ferri, C., Hernández-Orallo, J., 2024. Larger and more instructable language models become less reliable. Nature. https://doi.org/10.1038/s41586-024-07930-y

#DOI #LargeLanguageModels #chatbots #CognitiveBias

7/

On the inaccurate, analogy-based terminology used to describe #LLM #chatbot errors and unreliability, one may recall [3]:

#LargeLanguageModels (#LLMs): "#MachineLearning systems which produce human-like text and dialogue"

"Large language models simply aim to replicate human speech or writing. This means that their primary goal, insofar as they have one, is to produce human-like text. They do so by estimating the likelihood that a particular word will appear next, given the text that has come before"

8/

On #LLMs "hallucinations" [3]:

"Given this process, it’s not surprising that LLMs have a problem with the truth [...]
The problem here isn’t that large language models #hallucinate, lie, or misrepresent the world in some way. It’s that they are not designed to represent the world at all; instead, they are designed to convey convincing lines of text.
So when they are provided with a database of some sort, they use this, in one way or another, to make their responses more convincing"

9/

On #analogy misuse for #LLMs [3]:

"Calling chatbot inaccuracies ‘hallucinations’ [...] could lead to unnecessary consternation among the general public. It also suggests solutions to the inaccuracy problems which might not work, and could lead to misguided efforts at AI alignment amongst specialists"

#References

[3] Hicks, M.T., Humphries, J., Slater, J., 2024. ChatGPT is bullshit. Ethics and Information Technology 26 (2), 38+. https://doi.org/10.1007/s10676-024-09775-5


10/

A large study by the European Broadcasting Union (#EBU) and the #BBC on #LargeLanguageModels vs. public information:

"22 Public Service Media (PSM) organizations – across 18 countries and 14 languages – assessed how leading AI assistants answer questions about news and current affairs"

"errors remain at high levels, and [...] are systemic, spanning all languages, assistants and organizations involved. Overall, 45 % of responses contained at least one significant issue of any type" [4]

11/

The study [4] concludes:

"A concerning proportion of assistant responses fall short on basic criteria like #accuracy and providing adequate context – things which are essential editorial values [...]

These problems are exacerbated by the ways assistants make them hard to spot and hard to check, including the confidence with which assistants answer (giving a false sense of #quality and certainty), sources which do not lead to relevant news content, or a complete lack of any sources at all"

12/

The #EBU and #BBC study on #LargeLanguageModels also notes:

"A single case of #misinformation on a news story can be highly impactful, for instance, on issues such as #health, #security or #conflict, or stories with legal implications. All media organizations make occasional errors of the type investigated here, but they also have robust processes to identify, acknowledge and correct those errors. It is important to make sure that the same accountability exists for AI assistants" [4]