Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

Key Takeaways

  • This research explores how to build automated grading systems for introductory C++ programming courses that better reflect how human instructors actually grade.
  • Using multi-semester CS1 data, student submissions are paired with numeric scores, letter-grade buckets, and assignment rubrics, then preprocessed into unified sequences for transformer input.
  • Experiments compare single-task and multitask training, hard one-hot versus fuzzy and boundary-based soft labels, and rubric versus no-rubric conditions, with additional T5 and pairwise-pretrained variants.
  • Results show that multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines.
  • Fully fine-tuned T5 further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity.
Paper Abstract

This paper investigates rubric-aware, multitask fine-tuning of transformer models for automated grading of introductory C++ programming assignments, with the goal of producing grade predictions that better reflect instructor grading behavior than general-purpose LLMs. Using multi-semester CS1 data, student submissions are paired with numeric scores, letter-grade buckets, and assignment rubrics, then preprocessed into unified sequences for transformer input. A BART encoder-decoder with LoRA adaptation is trained to jointly predict numeric grades and grade buckets, augmented with a distribution-matching term to align predicted and empirical grade distributions, an evaluation dimension often overlooked in prior work. Experiments compare single-task and multitask training, hard one-hot versus fuzzy and boundary-based soft labels, and rubric versus no-rubric conditions, with additional T5 and pairwise-pretrained variants. Results show that multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines. Fully fine-tuned T5 further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity. Collectively, the findings suggest that calibration-aware, rubric-guided training produces more instructor-like grading behavior than accuracy-optimized alternatives.

Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria
This research explores how to build automated grading systems for introductory C++ programming courses that better reflect how human instructors actually grade. General-purpose AI models are often trained on professional coding standards, so they frequently struggle with the specific conventions taught in introductory classes, such as writing "using namespace std;" at the top of a program. This paper investigates whether fine-tuning transformer models with course-specific rubrics and multitask learning can produce more accurate and realistic grade predictions that align with instructor intent.

A Multitask Approach to Grading

The researchers developed a system using a BART encoder-decoder model, which is well-suited for understanding input sequences and generating structured outputs. Instead of training the model to perform a single task, they used a multitask framework that simultaneously predicts a numeric grade and a letter-grade bucket (A–F). By combining these tasks, the model learns from shared information, which helps it avoid the common pitfall of "collapsing" all predictions into the most frequent grade (usually an A). The training objective also includes a distribution-matching term that encourages the overall spread of predicted grades to match the actual distribution of grades given by instructors.
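
As a rough illustration of how such an objective might be assembled, the Python sketch below combines a numeric-grade loss, a bucket cross-entropy, and a distribution-matching penalty. The function name multitask_loss, the loss weights, the use of mean absolute error on scores scaled to 0–1, and the choice of a KL divergence between the empirical grade distribution and the batch-averaged prediction are all assumptions made for illustration; this summary does not give the paper's exact formulation.

```python
import math


def softmax(logits):
    """Convert raw bucket logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multitask_loss(pred_scores, true_scores, bucket_logits, bucket_targets,
                   empirical_dist, w_reg=1.0, w_cls=1.0, w_dist=0.5):
    """Weighted sum of three terms (weights are hypothetical defaults)."""
    n = len(true_scores)
    eps = 1e-12
    probs = [softmax(row) for row in bucket_logits]

    # Task 1: numeric grade, here mean absolute error (assumed choice).
    regression = sum(abs(p - t) for p, t in zip(pred_scores, true_scores)) / n

    # Task 2: letter bucket, cross-entropy against hard or soft targets.
    classification = -sum(
        t * math.log(p + eps)
        for row_p, row_t in zip(probs, bucket_targets)
        for p, t in zip(row_p, row_t)
    ) / n

    # Distribution matching (assumed form): KL(empirical || mean prediction).
    k = len(empirical_dist)
    mean_pred = [sum(row[j] for row in probs) / n for j in range(k)]
    distribution = sum(
        q * math.log(q / (p + eps))
        for q, p in zip(empirical_dist, mean_pred) if q > 0
    )

    total = w_reg * regression + w_cls * classification + w_dist * distribution
    parts = {"regression": regression,
             "classification": classification,
             "distribution": distribution}
    return total, parts
```

In this sketch the third term grows when the batch as a whole over-predicts one bucket, even if each individual prediction looks plausible, which is the kind of pressure against majority-class collapse described above.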

The Role of Rubrics and Soft Labels

A key part of the study was determining how best to represent grading uncertainty. The researchers compared traditional "hard" labels (one-hot encoding) against "soft" labels (fuzzy membership). They found that boundary-based soft labels, which soften the classification only when a student’s grade is near a cutoff point, produced the most realistic grade distributions. Additionally, the study incorporated the assignment rubric directly into the model’s input. By providing the model with the specific criteria used for evaluation, the system could better distinguish between submissions based on course expectations rather than just surface-level code patterns.
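
The idea of softening labels only near a cutoff can be sketched in a few lines of Python. The cutoffs (90/80/70/60), the two-point margin, and the linear interpolation below are hypothetical choices for illustration; the summary does not state the values or the exact membership function used in the paper.

```python
# Hypothetical grading scale and margin, not taken from the paper.
BUCKETS = ["A", "B", "C", "D", "F"]
CUTOFFS = [90.0, 80.0, 70.0, 60.0]  # lower bounds for A, B, C, D
MARGIN = 2.0  # half-width of the softened zone around each cutoff


def hard_bucket(score):
    """Index of the letter bucket for a numeric score (0 = A ... 4 = F)."""
    for i, cutoff in enumerate(CUTOFFS):
        if score >= cutoff:
            return i
    return len(CUTOFFS)


def boundary_soft_label(score):
    """One-hot away from cutoffs; a two-way split within MARGIN of a cutoff."""
    label = [0.0] * len(BUCKETS)
    for i, cutoff in enumerate(CUTOFFS):
        if abs(score - cutoff) < MARGIN:
            # Weight on the higher bucket rises linearly from 0 to 1
            # as the score moves from (cutoff - MARGIN) to (cutoff + MARGIN).
            upper = (score - cutoff + MARGIN) / (2 * MARGIN)
            label[i] = upper
            label[i + 1] = 1.0 - upper
            return label
    label[hard_bucket(score)] = 1.0
    return label
```

Under these assumptions a score of 95 stays a pure A, while a score of 89 is labeled mostly B with some weight on A, so a one-point difference at the cutoff no longer flips the training target entirely.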

Key Findings and Performance

The experiments showed that the multitask BART model with boundary-based soft labels and rubric context outperformed single-task and hard-label baselines. While one-hot encoding often achieved high accuracy by simply over-predicting the majority grade (A), the researchers prioritized the Jensen-Shannon Divergence (JSD) metric to measure how well the model’s output matched the true distribution of grades. The multitask approach with boundary-based labels achieved the lowest JSD, meaning it successfully captured the nuances of the grading scale without sacrificing accuracy.
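
To make the JSD comparison concrete, the following Python sketch computes the base-2 Jensen-Shannon divergence between a true and a predicted letter-grade distribution. The helper names and the toy grade lists are invented for illustration and are not data from the study; with base-2 logarithms the value ranges from 0 (identical distributions) to 1 (no overlap).

```python
import math
from collections import Counter

BUCKETS = ("A", "B", "C", "D", "F")


def grade_distribution(labels, buckets=BUCKETS):
    """Fraction of labels falling in each bucket, in bucket order."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[b] / n for b in buckets]


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy example (made-up labels): a model that predicts A for everyone
# versus one whose predictions are spread across buckets.
true_labels = ["A"] * 6 + ["B"] * 2 + ["C", "F"]
collapsed_preds = ["A"] * 10
spread_preds = ["A"] * 5 + ["B"] * 3 + ["C", "F"]

true_dist = grade_distribution(true_labels)
jsd_collapsed = jsd(true_dist, grade_distribution(collapsed_preds))
jsd_spread = jsd(true_dist, grade_distribution(spread_preds))
```

In this toy setup the all-A predictor still matches six of ten labels, yet its JSD is much larger than that of the spread-out predictor, which mirrors why accuracy alone can hide majority-class over-prediction.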

Considerations for Implementation

The study highlights that while automated grading can reduce instructor workload, the way a model is trained significantly impacts its behavior. For instance, adding strict constraints to force consistency between numeric and letter grades often led to worse results, as it tended to push predictions toward the middle of the scale and ignore minority classes like D or F. Ultimately, the findings suggest that for AI to act as a reliable assistant in the classroom, it must be calibrated to match the specific, often subjective, grading behaviors of human instructors rather than just being optimized for raw point accuracy.
