Can Large Language Models (#ChatGPT) transform Computational Social Science?

Our recent work (with @Held, @omar, @diyiyang) shows how they might (in partnership w/ experts).

We evaluate on 24 #CSS tasks + draw a roadmap πŸš—πŸ—ΊοΈ to guide #LLM-augmented social science πŸš€

Paper: https://calebziems.com/assets/pdf/preprints/css_chatgpt.pdf

🧡 thread

1️⃣ Can #LLMs augment human annotation to increase quality + save time?

βœ… Yes! #LLMs have fair agreement w/ humans on 12/17 tasks (0.2 < kappa < 0.7).

LLMs can join humans in a #MajorityVote to reliably label text w/ 7%-50% less human effort (so invest savings in #experts!)
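The human+LLM majority-vote setup above can be sketched in a few lines — here with a hypothetical toxicity label set and made-up annotations (the paper's own tasks and data are assumed away), plus a from-scratch Cohen's kappa to illustrate the agreement statistic the thread cites:

```python
from collections import Counter

def majority_vote(labels):
    """Most common label among the annotators (humans + the LLM)."""
    return Counter(labels).most_common(1)[0][0]

def cohen_kappa(a, b):
    """Cohen's kappa between two annotators' label lists."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n)          # chance agreement
             for l in set(a) | set(b))
    return (po - pe) / (1 - pe)

# Hypothetical annotations: two humans + one LLM per document
annotations = [
    ["hate",    "hate",    "hate"],
    ["neutral", "neutral", "neutral"],
    ["hate",    "hate",    "neutral"],
    ["neutral", "neutral", "neutral"],
]
final = [majority_vote(row) for row in annotations]
# final == ["hate", "neutral", "hate", "neutral"]

human = [row[0] for row in annotations]   # first human's labels
llm   = [row[2] for row in annotations]   # the LLM's labels
print(cohen_kappa(human, llm))            # 0.5 -- inside the 0.2 < kappa < 0.7 band
```

Swapping the LLM in for one human seat in the vote is what yields the reduced human effort; the savings depend on how many annotators the task budgeted for.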

2️⃣ Can #LLMs replace human annotation?

❌ not for expert taxonomies (#ImplicitHate) or parsing tasks (#ArgumentExtraction) [<40% acc.]

❔ maybe where objective ground truth (#misinfo) or common definitions (#emotions) exist [>70% acc.]

(but human-in-the-loop is recommended)

3️⃣ Can LLMs help humans code unstructured text w/ open-ended generations?

βœ… yes, humans prefer #ChatGPT explanations just as often as gold references

4️⃣ Can LLMs replace human inductive analysis?

❌ no, LLMs don't outperform humans; experts should instead curate #LLM outputs

5️⃣ Are #LLMs better at some scientific fields than others?

❌ we don't see any systematic bias against any field

πŸ€” instead, performance varies more by the complexity of the input --- document-level analysis is the most challenging!

6️⃣ How should I decide which model to use?

Keep the following in mind:

πŸ“ˆ performance scales w/ model size
🎡 #FLAN lets you tune w/ your own labels
πŸ’² #ChatGPT is often cheapest
πŸ’¬ #ChatGPT is best for generation
πŸ’» code-instructed #GPT3 excels at parsing

7️⃣ How can I get the most out of my model?

We recommend these best practices…

πŸ”  enumerate options with multiple-choice
↩️ separate options with new lines
⚠️ give instructions and repeat constraints *after* the context
πŸ€– ask for machine-parseable JSON

8️⃣ What's missing?

πŸ† reliable auto-metrics for CSS performance
🌱 model grounding
πŸ“° real-time analysis of late-breaking events
↔️ causal explanations

9️⃣ What's next?

πŸ’‘ a new #CSS paradigm that blurs supervised / unsupervised methods

πŸš€ a unified #LLM approach to human-#AI hypotheses formation, data coding, and hypothesis testing via text analysis