#AI and #ML science have a lot to learn from #HCI, #Psychology, and #economics about evaluation methods.
Datasets are curated with assumptions, often implicit, about what the use case is. E.g., #GenAI systems (#MedFlamingo, #MedPalm) are evaluated on QA from #USMLE. The questions in these datasets are designed to test a clinician's knowledge and memory. A model's performance on these datasets tells us NOTHING about whether it is a good information source for laypersons. (2/n)