Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria
This research explores how to build automated grading systems for introductory C++ programming courses that better reflect how human instructors actually grade. General-purpose AI models are often trained on professional coding standards and frequently struggle with the specific conventions taught in introductory classes, such as writing "using namespace std". This paper investigates whether fine-tuning transformer models with course-specific rubrics and multitask learning can produce more accurate and realistic grade predictions that align with instructor intent.

A Multitask Approach to Grading

The researchers developed a system using a BART encoder-decoder model, which is well-suited for understanding input sequences and generating structured outputs. Instead of training the model to perform a single task, they used a multitask framework that simultaneously predicts a numeric grade and a letter-grade bucket (A–F). By combining these tasks, the model learns from shared information, which helps it avoid the common pitfall of "collapsing" all predictions into the most frequent grade (usually an A). The model also includes a distribution-matching term to ensure the overall spread of predicted grades matches the actual distribution of grades given by instructors.
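The combined objective described above can be illustrated with a small, framework-free sketch. This is not the paper's implementation: the function name multitask_loss, the loss weights, the use of mean squared error for the numeric head, and the KL-based form of the distribution-matching term are all assumptions chosen for illustration. Scores are assumed to be scaled to the range 0 to 1, and the five buckets are ordered A, B, C, D, F.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multitask_loss(pred_scores, true_scores, bucket_logits, bucket_targets,
                   w_reg=1.0, w_cls=1.0, w_dist=0.5, eps=1e-12):
    """Hypothetical combined loss: regression + classification + distribution matching.

    All weights (w_reg, w_cls, w_dist) are illustrative assumptions.
    """
    n = len(pred_scores)

    # Numeric-grade head: mean squared error (assumed choice of regression loss).
    reg = sum((p - t) ** 2 for p, t in zip(pred_scores, true_scores)) / n

    # Letter-bucket head: cross-entropy against hard or soft target distributions.
    probs = [softmax(z) for z in bucket_logits]
    cls = -sum(
        t * math.log(p + eps)
        for pr, tg in zip(probs, bucket_targets)
        for p, t in zip(pr, tg)
    ) / n

    # Distribution matching (assumed form): KL divergence from the batch-level
    # target grade distribution to the batch-level predicted distribution.
    k = len(bucket_targets[0])
    pred_marginal = [sum(pr[j] for pr in probs) / n for j in range(k)]
    true_marginal = [sum(tg[j] for tg in bucket_targets) / n for j in range(k)]
    dist = sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(true_marginal, pred_marginal)
    )

    return w_reg * reg + w_cls * cls + w_dist * dist

# One A submission and one F submission; numeric predictions are perfect in both cases.
scores = [0.95, 0.40]
targets = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 1]]
spread = multitask_loss(scores, scores, [[5, 0, 0, 0, 0], [0, 0, 0, 0, 5]], targets)
collapsed = multitask_loss(scores, scores, [[5, 0, 0, 0, 0], [5, 0, 0, 0, 0]], targets)
print(spread < collapsed)
```

In this toy batch, predicting an A for every submission is penalized by both the classification term and the distribution-matching term, which is the kind of pressure against collapse that the summary describes.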

The Role of Rubrics and Soft Labels

A key part of the study was determining how to best represent grading uncertainty. The researchers compared traditional "hard" labels (one-hot encoding) against "soft" labels (fuzzy membership). They found that using "fuzzy boundary" labels—which only soften the classification when a student’s grade is near a cutoff point—produced the most realistic grade distributions. Additionally, the study incorporated the assignment rubric directly into the model’s input. By providing the model with the specific criteria used for evaluation, the system could better distinguish between submissions based on course expectations rather than just surface-level code patterns.
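One way to picture boundary-based soft labels is the sketch below. The cutoffs (90/80/70/60), the margin of 2 points, the linear interpolation, and the function names are assumptions for illustration; the summary does not state the exact scheme the researchers used.

```python
BUCKETS = ["A", "B", "C", "D", "F"]
# Assumed lower bounds for A, B, C, D on a 0-100 scale; anything below 60 is F.
CUTOFFS = [90.0, 80.0, 70.0, 60.0]

def hard_bucket(score):
    """Index of the letter bucket for a numeric score (one-hot position)."""
    for i, cutoff in enumerate(CUTOFFS):
        if score >= cutoff:
            return i
    return len(CUTOFFS)

def boundary_soft_label(score, margin=2.0):
    """Soft label that is one-hot except within `margin` points of a cutoff.

    Near a cutoff, probability mass is split linearly between the two
    adjacent buckets. Assumes cutoffs are more than 2 * margin apart.
    """
    label = [0.0] * len(BUCKETS)
    for i, cutoff in enumerate(CUTOFFS):
        if abs(score - cutoff) < margin:
            # Weight on the higher bucket goes from 0 at (cutoff - margin)
            # to 1 at (cutoff + margin).
            upper = (score - (cutoff - margin)) / (2 * margin)
            label[i] = upper
            label[i + 1] = 1.0 - upper
            return label
    label[hard_bucket(score)] = 1.0
    return label

print(boundary_soft_label(95.0))  # far from any cutoff: one-hot A
print(boundary_soft_label(89.0))  # just below the A/B cutoff: mass shared by A and B
```

Scores far from a cutoff keep their one-hot label, so only borderline submissions contribute uncertainty to training.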

Key Findings and Performance

The experiments showed that the multitask BART model with boundary-based soft labels and rubric context outperformed single-task and hard-label baselines. While one-hot encoding often achieved high accuracy by simply over-predicting the majority grade (A), the researchers prioritized the Jensen-Shannon Divergence (JSD) metric to measure how well the model’s output matched the true distribution of grades. The multitask approach with boundary-based labels achieved the lowest JSD, meaning it successfully captured the nuances of the grading scale without sacrificing accuracy.
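To see why accuracy alone can be misleading here, the following sketch computes the Jensen-Shannon Divergence between two letter-grade histograms. The grade lists are made-up toy data, not results from the study, and the helper names are hypothetical. Base-2 logarithms are used so that the value lies between 0 and 1.

```python
import math
from collections import Counter

def grade_distribution(labels, buckets="ABCDF"):
    """Relative frequency of each letter grade, in bucket order."""
    counts = Counter(labels)
    n = len(labels)
    return [counts.get(b, 0) / n for b in buckets]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon Divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy data: instructor grades versus a model that predicts A for everything.
true_grades = ["A"] * 6 + ["B"] * 2 + ["C", "F"]
collapsed_preds = ["A"] * 10

accuracy = sum(t == p for t, p in zip(true_grades, collapsed_preds)) / len(true_grades)
divergence = jsd(grade_distribution(true_grades), grade_distribution(collapsed_preds))
print(accuracy, divergence)
```

The collapsed predictor is right on every A submission, yet its JSD against the instructor distribution is clearly above zero, while any predictor that reproduces the same grade histogram would score a JSD of 0.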

Considerations for Implementation

The study highlights that while automated grading can reduce instructor workload, the way a model is trained significantly impacts its behavior. For instance, adding strict constraints to force consistency between numeric and letter grades often led to worse results, as it tended to push predictions toward the middle of the scale and ignore minority classes like D or F. Ultimately, the findings suggest that for AI to act as a reliable assistant in the classroom, it must be calibrated to match the specific, often subjective, grading behaviors of human instructors rather than just being optimized for raw point accuracy.
