People have been posting glaring examples of ChatGPT’s gender bias, like arguing that attorneys can't be pregnant. So @sayashk and I tested ChatGPT on WinoBias, a standard gender bias benchmark. Both GPT-3.5 and GPT-4 are about 3 times as likely to answer incorrectly if the correct answer defies gender stereotypes — despite the benchmark dataset likely being included in the training data. https://aisnakeoil.substack.com/p/quantifying-chatgpts-gender-bias
Quantifying ChatGPT’s gender bias

Benchmarks allow us to dig deeper into what causes biases and what can be done about it

AI Snake Oil

@randomwalker @sayashk This is worth investigating with Anthropic’s Claude and all the new open source LLMs (LlaMa, Dolly, Hugging Face’s thing etc) that are blooming as well.

Perhaps a weekend project if I have the time!