I've spent the last two years trying to understand how LLMs might improve middle-school math education. I just published an article in the Journal of Educational Data Mining describing some of that work: "Designing Safe and Relevant Generative Chats for Math Learning in Intelligent Tutoring Systems" https://jedm.educationaldatamining.org/index.php/JEDM/article/view/840

Large language models (LLMs) are flexible, personalizable, and widely available, which makes their use within Intelligent Tutoring Systems (ITSs) appealing. However, their flexibility creates risks: inaccuracies, harmful content, and non-curricular material. Ethically deploying LLM-backed ITSs requires designing safeguards that ensure positive experiences for students. We describe the design of a conversational system integrated into an ITS that uses safety guardrails and retrieval-augmented generation to support middle-grade math learning. We evaluated this system using red-teaming, offline analyses, an in-classroom usability test, and a field deployment. We present empirical data from more than 8,000 student conversations designed to encourage a growth mindset, finding that the GPT-3.5 LLM rarely generates inappropriate messages and that retrieval-augmented generation improves response quality. The student interaction behaviors we observe carry implications for designers (treat student inputs as a content moderation problem) and for researchers (focus on subtle forms of bad content and on creating metrics and evaluation processes). Code and data are available at https://www.github.com/DigitalHarborFoundation/chatbot-safety and https://www.github.com/DigitalHarborFoundation/rag-for-math-qa.
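To give a flavor of the pipeline the abstract describes (guardrails on student input, then retrieval-augmented generation over curriculum content), here is a minimal, hypothetical sketch. The blocklist moderation, the toy word-overlap retriever, and all names here are illustrative assumptions, not the paper's implementation; see the linked repositories for the actual code.

```python
# Sketch of a guarded RAG flow: (1) moderate the student's message,
# (2) retrieve curriculum passages, (3) build a grounded LLM prompt.
# The moderation and retrieval logic below are deliberately toy stand-ins.

BLOCKLIST = {"badword"}  # stand-in for a real content moderation model

def moderate(message: str) -> bool:
    """Return True if the student message passes the safety check."""
    return not any(term in message.lower() for term in BLOCKLIST)

CURRICULUM = [  # toy passage store standing in for a retrieval index
    "A ratio compares two quantities by division.",
    "Equivalent fractions name the same value.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank passages by word overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    scored = sorted(CURRICULUM, key=lambda p: -len(q & set(p.lower().split())))
    return scored[:k]

def build_prompt(message: str) -> str:
    """Route unsafe inputs to a canned response; otherwise ground the prompt."""
    if not moderate(message):
        return "SAFE_FALLBACK"
    context = "\n".join(retrieve(message))
    return f"Use only this context:\n{context}\n\nStudent: {message}"

print(build_prompt("What does a ratio compare?"))
```

Moderating the *input* before generation reflects the abstract's design implication: much of the content-safety problem sits on the student-message side, not only in the model's outputs.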