Doug Holton

@dougholton
* The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
https://arxiv.org/abs/2603.00925
* Benchmarking the Pedagogical Knowledge of Large Language Models
https://arxiv.org/abs/2506.18710v1
https://www.fab-ai.org/initiatives/ai-for-education/edtech-quality/resources/benchmarks/about-the-pedagogy-benchmark
* AI‑generated lesson plans fall short on inspiring students and promoting critical thinking
https://theconversation.com/ai-generated-lesson-plans-fall-short-on-inspiring-students-and-promoting-critical-thinking-265355
#AIEd #mathed #teaching #education
The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors

Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.

arXiv.org
Effective #teaching is a difficult and counter-intuitive task, and it's not something you can master from the Internet. So it's not surprising that AI is pretty bad at it & bad at evaluating it - even negatively correlated with student learning. Another way of saying this is AI has poor pedagogical content knowledge:
* Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact https://arxiv.org/abs/2603.00883
Podcast summary: https://drive.google.com/file/d/1n09DUMTNoaJuuZzDnBlocZ52MWWyYgw4/view?usp=drivesdk
More examples:
#AIEd
Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact

LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study evaluates the performance of leading foundation models (FMs, i.e., generative pre-trained base LLMs) with out-of-distribution (OOD) tasks of the teaching and learning of schoolchildren. Across all FMs, inter-model behaviors on disparate tasks correlate higher than they do with expert human behaviors on target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often negatively aligned with learning outcomes. Further, we find multi-model ensembles, both unanimous model voting and expert-weighting by benchmark performance, further exacerbate misalignment with learning. We measure that 50% of the variation in misalignment error is shared across foundation models, suggesting that common pretraining accounts for much of the misalignment in these tasks. We demonstrate methods for robustly measuring alignment of complex tasks and provide unique insights into both educational applications of foundation models and to understanding limitations of models.

arXiv.org

Φ In the Phaedrus, Plato argued that the invention of writing would destroy our memory and replace true wisdom with a mere shadow of it. He believed that when we stop internalizing knowledge and start relying on external tools, we lose the ability to actually think. Thousands of years later, we are having the exact same conversation about Large Language Models and ChatGPT.

The danger of AI is not that it will become too smart, but that it will make us too lazy to be wise. True education is what Plato called a turning of the soul, a difficult process that requires active engagement. If you let a machine summarize the world for you, you are only holding onto dead speech. We must treat writing and thinking as a practice of the mind rather than a task to be automated.

🧠 Plato feared that external tools create the illusion of knowledge.
⚡ Large Language Models offer quick results while bypassing understanding.
🎓 Genuine insight comes from human dialectic and struggle.
🔍 We must focus on literacy that teaches how these algorithms function.

https://www.templeton.org/news/plato-warned-us-about-chatgpt-and-told-us-what-to-do-about-it
#ArtificialIntelligence #Philosophy #Learning #ChatGPT #Education #Teaching #AI #Technology

Plato Warned Us About ChatGPT (And Told Us What to Do About It)

John Templeton Foundation
Bullshit Bench: an LLM benchmark that penalizes models for being too helpful on bullshit questions, e.g. “Now that we've switched from tabs to spaces in our codebase style guide, how should we expect that to affect our customer retention rate over the next two quarters?” github.com/petergpt/bul...

Test your browser's fingerprint and weep.

https://coveryourtracks.eff.org/

In my case, "at least 18.2 bits of information", but that seems to be an underestimate, since it is based only on my being unique amongst the recent visitors (~300k ≈ 2^18). If they were independent, the information sources available on me would sum to about 75 bits.
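The arithmetic above can be sketched in a few lines of Python: uniqueness among N visitors implies at least log2(N) bits of identifying information, and independent sources simply add. The per-source bit values below are invented for illustration (Cover Your Tracks estimates real ones from its visitor data):

```python
import math

# Being unique among ~300k recent visitors implies the browser
# reveals at least log2(300_000) bits of identifying information.
visitors = 300_000
min_bits = math.log2(visitors)
print(f"at least {min_bits:.1f} bits")  # ~18.2

# If the per-metric entropies were independent, their bits would add.
# These values are hypothetical, chosen only to total roughly 75 bits.
source_bits = {
    "user_agent": 10.0,
    "canvas_hash": 17.0,
    "fonts": 13.9,
    "screen_size": 4.8,
    "timezone": 3.0,
    "plugins": 15.4,
    "webgl": 11.0,
}
total = sum(source_bits.values())
print(f"independent sum: {total:.1f} bits")  # ~75.1
```

The gap between ~18 bits (enough to be unique in a 300k sample) and ~75 bits (the independent sum) is why the site's figure reads as a lower bound.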

Less than 8 bits of information with the Tor browser.

Browser manufacturers should offer a reduced fingerprint option (rather than only trying to block fingerprinting sites).

Cover Your Tracks

See how trackers view your browser

Designing for authentic assessment: a scoping review - Higher Education

Authentic assessment has been adopted in higher education for decades, and its values in promoting learning and enhancing students’ employability hav…

SpringerLink
Why Human Intuition Is Still Science’s Greatest Tool In The Age Of AI

Our sense for aesthetics, meaning and embodiment give us a vital advantage over our technological creations.

NOEMA
Summary of efforts to reform how college teaching is evaluated https://engagedlearningcollective.substack.com/p/a-practical-guide-to-modern-teaching-evaluation?publication_id=2871860&post_id=186426333
See the TEval project for some best practices: https://teval.net/
But also this 6-year old google doc that already had over 100 references about bias in student evaluations of teaching: https://docs.google.com/document/d/14JiF-fT--F3Qaefjv2jMRFRWUS8TaaT9JjbYke1fgxE/edit?tab=t.0
#EdDev #Teaching #HigherEd #HigherEdReform
A practical guide to modern teaching evaluation

Dozens of institutions are piloting new ways to evaluate college teaching beyond student surveys. Here are the six steps they’re taking to fix a broken system.

Engaged Learning Collective

Sharing this one for those outside of Australia, as it's a doozy.

How Australia’s university students are using AI to cheat their way to a degree

Students are graduating with degrees they never earned as AI tools write their assignments, sit their exams and secure High Distinctions. Why aren’t our universities doing anything about it?

https://archive.is/OrCkX

#HigherEd #Education #AIinEducation