AI Research

Paper Abstract

As AI tools become increasingly integrated into educational contexts, questions arise about both their stability over time and their responsiveness to prompt engineering techniques. This longitudinal study focused on different AI tools' ability to use the Task Analysis Guide (TAG; Stein & Smith, 1998) to classify the cognitive demand of mathematics tasks. In particular, it examined whether this classification ability changed with (1) model version updates over time and (2) few-shot prompting using exemplar tasks. We tested a general-purpose AI tool (Gemini) and an education-specific AI tool (Coteach). The specific tools were selected because of their relatively high performance on relevant published benchmarks and prior task-specific tests. Models were tested at baseline, retested with model version updates, and then tested again using few-shot prompting (two exemplar tasks for each cognitive demand category). Results revealed that newer model versions alone produced mixed effects: Gemini's accuracy remained stable at 58%, while Coteach's accuracy decreased from 75% to 50%. However, few-shot prompting improved both models' performance: Gemini increased to 67% and Coteach recovered to 75% accuracy. These findings demonstrate that prompt engineering techniques can have larger and more reliable effects than passive model improvements, and that version updates may not always improve performance on specialized educational tasks. The study has important implications for how educators and researchers should approach AI tool selection, evaluation, and implementation in educational contexts.

Temporal Stability and Few-Shot Prompting in Math Task Assessment

As AI tools become increasingly common in classrooms, educators and researchers need to know whether these systems remain reliable over time and how to get the best performance out of them. This study investigates how well AI models can classify the "cognitive demand" of mathematics tasks, a key consideration when selecting tasks for instruction, using the Task Analysis Guide (TAG). The researchers examined whether AI performance shifts when models are updated and whether "few-shot prompting" (providing the AI with a few examples before asking it to perform a task) can improve accuracy.

Testing AI in the Classroom

The researchers evaluated two types of AI: a general-purpose tool (Gemini) and an education-specific tool (Coteach). These tools were chosen because they had performed relatively well on relevant published benchmarks and in prior task-specific tests. The study followed a longitudinal design with three test points: at baseline, after model version updates, and after applying few-shot prompting, which involved providing two exemplar tasks for each category of cognitive demand.
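To make the few-shot setup concrete, here is a minimal sketch of how a prompt with two exemplars per category might be assembled. The category names are the four TAG levels from Stein & Smith (1998). Everything else is an assumption for illustration: the exemplar tasks are hypothetical placeholders rather than the study's exemplars, and `build_few_shot_prompt` and the prompt wording are invented here, since the source does not give the actual prompt.

```python
# The four cognitive demand levels of the Task Analysis Guide (Stein & Smith, 1998).
TAG_CATEGORIES = [
    "Memorization",
    "Procedures without connections",
    "Procedures with connections",
    "Doing mathematics",
]

# HYPOTHETICAL exemplar tasks, two per category. These are placeholders,
# not the exemplars used in the study.
EXEMPLARS = {
    "Memorization": [
        "What is 7 x 8?",
        "State the formula for the area of a rectangle.",
    ],
    "Procedures without connections": [
        "Solve 3x + 5 = 20.",
        "Convert 3/8 to a decimal using long division.",
    ],
    "Procedures with connections": [
        "Use an area model to explain why 12 x 15 = 180.",
        "Use a 10 x 10 grid to show why 3/5 equals 60%.",
    ],
    "Doing mathematics": [
        "Find and justify a rule for the number of diagonals of an n-sided polygon.",
        "Explore which rectangles with perimeter 24 have the largest area and explain why.",
    ],
}


def build_few_shot_prompt(task, exemplars=EXEMPLARS, per_category=2):
    """Return a prompt that shows labeled exemplars, then the task to classify."""
    lines = [
        "Classify the cognitive demand of the mathematics task below.",
        "Answer with exactly one of these levels: " + "; ".join(TAG_CATEGORIES) + ".",
        "",
    ]
    for category in TAG_CATEGORIES:
        for example in exemplars[category][:per_category]:
            lines.append(f"Task: {example}")
            lines.append(f"Category: {category}")
            lines.append("")
    # The unlabeled task comes last so the model completes the final "Category:".
    lines.append(f"Task: {task}")
    lines.append("Category:")
    return "\n".join(lines)


PROMPT = build_few_shot_prompt("Explain two different ways to compare 5/8 and 2/3.")
print(PROMPT)
```

The resulting string would be sent to whichever tool is under test; the call to the tool itself is left out because the source does not describe how the models were accessed.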

The Impact of Model Updates

The study found that simply waiting for newer versions of an AI model does not guarantee better results. The effects of version updates were mixed: the general-purpose model (Gemini) held steady at 58% accuracy, while the education-specific model (Coteach) fell by 25 percentage points, from 75% to 50%, after an update. This suggests that passive improvements from developers do not always translate to better performance on specialized educational tasks.

The Power of Prompt Engineering

In contrast to the mixed results from model updates, few-shot prompting improved the performance of both tools. With just two exemplar tasks per category, Gemini’s accuracy rose from 58% to 67%, and Coteach’s rose from 50% back to its baseline of 75%.
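The reported accuracies can be restated as percentage-point changes between test points. The sketch below uses only the figures given in the abstract; the dictionary layout and the `changes` helper are illustrative, and it assumes the few-shot run is compared against the updated model version, which the source implies but does not state outright.

```python
# Accuracy (in percent) at each test point, as reported in the abstract.
ACCURACY = {
    "Gemini": {"baseline": 58, "updated": 58, "few_shot": 67},
    "Coteach": {"baseline": 75, "updated": 50, "few_shot": 75},
}


def changes(scores):
    """Percentage-point changes between consecutive test points."""
    return {
        "update_effect": scores["updated"] - scores["baseline"],
        "few_shot_effect": scores["few_shot"] - scores["updated"],
        "net_vs_baseline": scores["few_shot"] - scores["baseline"],
    }


SUMMARY = {tool: changes(scores) for tool, scores in ACCURACY.items()}
for tool, delta in SUMMARY.items():
    print(tool, delta)
```

Read this way, few-shot prompting added 9 points for Gemini and 25 for Coteach relative to the updated versions, but Coteach's net change from its original baseline is zero.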

Key Takeaways for Educators

The findings suggest that prompt engineering, the deliberate design of how we ask AI to perform tasks, can have a larger and more reliable effect on performance than waiting for model updates. For those implementing AI in educational settings, this indicates that how a tool is prompted deserves as much attention as which tool is chosen. Educators and researchers should test and refine their prompts, and re-check a tool after each version update, rather than assuming that newer AI versions will be better at specialized tasks.
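One way to act on this is to keep a small, fixed set of expert-labeled tasks and re-score a tool whenever its version or prompt changes. The following is a minimal sketch under stated assumptions, not the study's procedure: `accuracy`, `regressed`, the demo tasks, and the two stub classifiers are all hypothetical, and a real check would replace the stubs with calls to the tool being evaluated.

```python
from typing import Callable, Sequence, Tuple

LabeledTask = Tuple[str, str]  # (task text, expected TAG category)


def accuracy(classify: Callable[[str], str], labeled: Sequence[LabeledTask]) -> float:
    """Fraction of tasks whose predicted category matches the expected label."""
    if not labeled:
        raise ValueError("need at least one labeled task")
    correct = sum(
        1
        for task, label in labeled
        if classify(task).strip().lower() == label.lower()
    )
    return correct / len(labeled)


def regressed(old_accuracy: float, new_accuracy: float, tolerance: float = 0.0) -> bool:
    """True if the new score is lower than the old one by more than `tolerance`."""
    return new_accuracy < old_accuracy - tolerance


# HYPOTHETICAL labeled set; a real one would come from expert-coded tasks.
DEMO_SET = [
    ("What is 7 x 8?", "Memorization"),
    ("Solve 3x + 5 = 20.", "Procedures without connections"),
    ("Use an area model to explain why 12 x 15 = 180.", "Procedures with connections"),
    ("Find and justify a rule for the number of diagonals of an n-sided polygon.",
     "Doing mathematics"),
]


def stub_old_version(task: str) -> str:
    """Stand-in for an earlier model version: returns the expected label."""
    return dict(DEMO_SET)[task]


def stub_new_version(task: str) -> str:
    """Stand-in for an updated model version: always answers 'Memorization'."""
    return "Memorization"


OLD = accuracy(stub_old_version, DEMO_SET)
NEW = accuracy(stub_new_version, DEMO_SET)
print(f"old={OLD:.2f} new={NEW:.2f} regressed={regressed(OLD, NEW)}")
```

Keeping the labeled set fixed is the design choice that matters here: it lets a drop like Coteach's show up as a change in the tool rather than a change in the test.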
