“I’d created 2000 free-text responses and labelled them ‘UK’. Then I copied and pasted the exact same 2000 responses but labelled these ‘US’. Finally, I combined them to create a dataset of 4000 total responses, and jumbled them up.

Despite the responses being identical for the UK and US, Copilot produced a rich, detailed summary of how US and UK respondents differed.”

https://kucharski.substack.com/p/real-signals-or-artificial-stereotypes

H/T @sinalana.eurosky.social

Real signals or artificial stereotypes? Adventures with a cultural Copilot (Understanding the unseen)
@gregeganSF @sinalana.eurosky.social
I think that is a rather beautiful bit of work, even if it is a fairly standard null test that any scientist or software engineer worth their salt would run. It demonstrates very clearly that you shouldn't use LLMs to analyze a dataset if you don't want that analysis to be polluted by data from outside that dataset. This more or less confirms what I have felt all along: that you should only use LLMs as a source of suggestions, and that you still need to apply your own intelligence to decide whether or not to go with those suggestions.
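
For anyone who wants to try the same kind of null test, here is a minimal sketch of how the dataset described in the quoted post could be built. It is not the author's actual code: the function names, the seed, and the placeholder responses are all assumptions for illustration. The idea is simply to give every response both labels, shuffle, and confirm the two groups really are identical before handing the data to any analysis tool.

```python
import random
from collections import Counter


def build_null_dataset(responses, labels=("UK", "US"), seed=0):
    """Pair every response with every label, then shuffle the rows.

    Hypothetical helper: by construction, each label group holds
    exactly the same responses.
    """
    rows = [(label, text) for label in labels for text in responses]
    random.Random(seed).shuffle(rows)  # seeded so the shuffle is reproducible
    return rows


def groups_are_identical(rows, labels=("UK", "US")):
    """Return True if every label has the same multiset of responses."""
    per_label = {label: Counter() for label in labels}
    for label, text in rows:
        per_label[label][text] += 1
    first = per_label[labels[0]]
    return all(per_label[label] == first for label in labels)


# Placeholder text standing in for the 2000 real free-text responses.
responses = [f"placeholder response {i}" for i in range(2000)]
rows = build_null_dataset(responses)
```

Any difference an analysis tool then reports between the two labels cannot have come from the data, because `groups_are_identical(rows)` holds by construction.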