I'm creating #syntheticdata for teaching in the social sciences & find that #SDG with LLMs isn't suited to my small-scale use case. While there are workflows that combine LLMs to generate more credible output ( https://link.springer.com/chapter/10.1007/978-3-031-93418-6_9 ), general-purpose models often produce results that are too diverse & reflexive, even when imitating oral communication. Such data reminds me of journalism scandals à la Stephen Glass. High-quality data in my case is messier and duller. Just look at YouTube comment sections.