The paper describes an interesting methodology. We study whether #GenerativeAI can support the information needs patients have when navigating the healthcare system, particularly when they are trying to understand their scans and reports to educate themselves. We studied patient-provider interactions to identify 10 types of information needs patients have. From these, we generated evaluation datasets to measure how well #ChatGPT and #MedFlamingo address those needs. (3/n)

#AI and #ML science has a lot to learn from #HCI, #Psychology, and #economics on evaluation methods.

Datasets are curated with often-implicit assumptions about the use case. E.g., #GenAI systems (#MedFlamingo, #MedPalm) are evaluated with QA from the #USMLE. The questions in these datasets are designed to test a clinician's knowledge and memory. A model's performance on them tells us NOTHING about whether it is a good information source for lay persons. (2/n)