A new note from my website: “Generative themes without Generative AI”: https://meaningmaking.it/generative-themes-without-generative-ai-nutella-orangutans-and-a-dialogue-with-my-son/
This afternoon I observed a third-year undergrad class in a module teaching the use of GenAI for the creative industries. The session was the briefing on the assessment.
It's the first time the module has been taught, so it's, uh... finding its footing. But the students demand exemplars.
So the prof made some, though clearly without investing the very effort expected of the students. Then again, she acknowledges that the module exists as a performative bow to market pressures.

LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study evaluates the performance of leading foundation models (FMs, i.e., generative pre-trained base LLMs) on out-of-distribution (OOD) tasks drawn from the teaching and learning of schoolchildren. Across all FMs, models' behaviors on disparate tasks correlate more strongly with one another than with expert human behavior on the target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often negatively aligned with learning outcomes. We also find that multi-model ensembles, whether via unanimous model voting or expert weighting by benchmark performance, exacerbate the misalignment with learning. We measure that 50% of the variation in misalignment error is shared across foundation models, suggesting that common pretraining accounts for much of the misalignment on these tasks. We demonstrate methods for robustly measuring alignment on complex tasks and provide unique insights both into educational applications of foundation models and into the limitations of the models themselves.

Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark built from real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA categories, they struggle most on questions about assessing student error. Thus, while VLMs may be optimized to be math problem-solving experts, our results suggest they require alternative development incentives to adequately support educational use cases.

When students use GenAI to skip the hard parts of learning, they miss the productive struggle that builds genuine expertise. This post explores why the "AI is just like a calculator" argument falls short, and why novices are most at risk of outsourcing the thinking that matters most.
New post: GenAI makes it easy for students to skip the struggle, but productive struggle is where real learning happens. The students most likely to over-rely on it have the most to lose. What's the "minimum viable struggle" we need to protect? #ArtificialIntelligence #AIEducation #AIEd #AIInEd

GenAI in education is a sprawling topic, so each January I try to distill it into a single post: what's changed, what's most important, and what you can actually do with the technology. This is 2026's introduction to GenAI: I'll dig deeper into each section throughout the year. #AI #AIedu #AIEd
https://leonfurze.com/2026/01/15/everything-educators-need-to-know-about-genai-in-2026/
