Gumbel Machine: Counterfactual Student Writing Gene... | AI Research

Key Takeaways

  • An effective method of teaching across disciplines is to provide examples of high-quality work.
  • However, an example may be significantly different from a student's current work, making it challenging for them to emulate.
  • An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own.
  • Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications.
Paper Abstract

An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, $\beta$-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.

What it covers

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

Hunter McNichols (1), Alexander Scarlatos (1), Mihai Dascalu (2), Danielle McNamara (3), Andrew Lan (1)
(1) University of Massachusetts Amherst, (2) University Politehnica of Bucharest, (3) Arizona State University
1 Introduction

Providing feedback on student work, especially open-ended work, is an essential aspect of education since it helps learners identify specific areas for improvement (Hattie and Timperley, 2007). There are two common ways to provide feedback: using a set of rubrics to score student work and highlight areas for improvement, and giving an example of model work. The latter faces a problem: while examples show students what exceptional work looks like, the model example might differ considerably from the student's work in terms of style, approach, or personal tendencies. This potential mismatch can make it difficult for students to see how the example can be applied to their work, thereby limiting the effectiveness of such feedback. Ideally, when receiving feedback, a student would receive a personalized, counterfactual version of their work, which indicates small but pedagogically informative changes they can make.

Figure 1: Overview of counterfactual student writing generation. Our approach produces a revised version of student work that improves a specific rubric criterion while remaining similar to the original.

While personalized counterfactual student work can be pedagogically helpful, manually crafting it can be labor-intensive and hard to scale. However, when student work is textual, recent advances in natural language processing, particularly Large Language Models (LLMs), provide opportunities for automation. Recent work leverages LLMs to generate counterfactual examples for a wide range of purposes, such as story rewriting (Chen et al., 2022), augmenting training data (Qiu et al., 2024; Zhang et al., 2025), or enhancing model interpretability (Ross et al., 2021). These findings show the promise of using LLMs to automate the generation of counterfactual student work.

Existing LLM-based approaches to automated counterfactual generation must address a central design challenge: balancing the tradeoff between task performance and similarity to the original factual text. Since these two objectives are often in opposition, a variety of approaches exist to address this tradeoff. Some approaches optimize a language model for this joint objective (Jung et al., 2022; Hu and Li, 2021; Dathathri et al., 2019), thereby encoding the tradeoff in its weights. Another common pattern is to decompose the problem into identification and replacement steps, where one model identifies words of the factual text that should be revised and a second replaces them for the counterfactual task. The models used in each step are either LLMs trained for the specific step on a given task (Ross et al., 2021; Treviso et al., 2023) or large pretrained LLMs (Chen et al., 2023; Dixit et al., 2022; Sachdeva et al., 2024; Wang et al., 2025). This decomposition-based approach enforces similarity implicitly: it assumes that changing only the words with maximum influence will minimize the total number of words changed.

While existing approaches provide promising directions towards generating counterfactual student work, significant work remains to make them adoptable in real classrooms (Filighera et al., 2022b). One limitation is the tight coupling between task alignment and similarity control: many approaches rely on training task-specific classifiers or control codes, which jointly define what constitutes a valid counterfactual and how to generate one. As a result, adapting to a new task often requires retraining or system redesign, which limits generalizability. A second limitation is that most existing works frame counterfactual generation in terms of nominal labels, such as sentiment, topic, or toxicity. In these settings, success is defined by crossing a decision boundary, since counterfactuals are evaluated by the resulting change in label. In education, however, student work is often scored in an ordinal way. Therefore, counterfactual work needs to reflect nuanced adjustments in student work that lead to continuously improved quality.

Contributions. In this paper, we propose the Gumbel Machine, a modular approach for counterfactual generation that decouples similarity control from counterfactual task alignment. We first introduce $\beta$-Hindsight control, a model-agnostic, inference-time controlled decoding mechanism that recovers and reuses the stochastic state implied by a reference to induce similarity during generation. Furthermore, we detail how to combine this mechanism with a preference optimization strategy to improve an LLM's adherence to a rubric. Together, these components form our counterfactual generation approach, which enables both specific control over task validity and adjustable control over similarity to a reference. We evaluate our combined approach on two real-world datasets of scored student-written essay summaries using an ordinal rubric, showing flexibility beyond standard nominal evaluation settings. We show that our approach yields counterfactuals that are more similar and valid compared to multiple baselines, on both open-weight and proprietary models. In addition, we hire teachers to evaluate our approach and find that they consider our counterfactuals more similar to the original student work and, in such cases, prefer them for pedagogical use.
Finally, we perform ablation studies on our approach to study the conditions under which the recovered stochastic state serves as an effective control signal.

2 Background and Related Work

2.1 Gumbel-Max Trick

In our approach to generating counterfactual student work, we leverage the Gumbel-Max trick, which we briefly detail below. The Gumbel-Max trick is a procedure to sample from an arbitrary categorical distribution that involves adding noise sampled from a Gumbel distribution (Maddison et al., 2014). The samples from this procedure are shown to be equivalent to directly sampling from the categorical distribution, but the trick has the added benefit of decomposing the sampling into a reusable noise artifact that can be used to replay the randomness of the sampling process.

Figure 2: Overview of the Gumbel Machine. In Stage 1, we recover noise that would have produced the observed tokens. In Stage 2, we intervene on the prompt with a new target score and replay the noise, steering generation toward a modified score while preserving similarity to the reference. In this example, “exposed” is selected, while the original token “in,” is a close alternative. A tunable $\beta$ scales the influence of noise-based similarity on replay.

An LLM samples tokens from a softmax distribution over its logits:

$$P(Y=v)=\frac{\exp(\ell_{v})}{\sum_{v'\in V}\exp(\ell_{v'})}.$$

Here, $Y$ is a random variable for the next output token; $V$ is the model's vocabulary, the set of all tokens the model is trained to output; and $\ell_{v}$ are the logits for each token $v\in V$. Since we express $Y$ in terms of logits, we can use the Gumbel-Max trick to sample the next token instead of sampling from $P(Y)$ directly. We draw a noise sample from the Gumbel distribution for each token and add it to the logits. We then take the index of the maximum perturbed logit and find the corresponding token:

$$y=\arg\max_{v\in V}(\ell_{v}+g_{v}).$$

Here, $g_{v}\sim\text{Gumbel}(0,1)$ is an independent draw from a standard Gumbel distribution. This simple procedure is mathematically equivalent to directly sampling from $P(Y=v)$. Moreover, the Gumbel noise values can be stored to preserve the random state of the sampling process. In our approach, we use the recovered Gumbel noise for counterfactual generation, as it enables us to keep the exogenous component of sampling fixed after applying an intervention to a language model. Concretely, we record the sampled Gumbel values, update the LLM input, and then replay the Gumbel-Max trick with the recorded “factual” noise values on the counterfactual logits.

2.2 Hindsight Gumbel Sampling

The Gumbel-Max trick provides a mechanism for preserving and replaying the random noise used in discrete random processes. However, in practice, we often do not have access to the random process and can only observe samples. For example, the random process that generates student writing is controlled by the students themselves, not a computational system that produces logits. Therefore, we need to approximate the random process and recover the noise implied by the observations. Ravfogel et al. (2025) refer to the task of recovering noise from a proxy as “Hindsight Gumbel Sampling.” In their work, the authors use an LLM as a proxy for the textual data generation process and outline an algorithmic procedure to recover the Gumbel noise implied by data. We build on this foundation and demonstrate that recovered noise can serve as a tunable similarity-control mechanism.

2.3 Related Work

Counterfactual Generation. Counterfactual text generation has been widely studied within NLP for a variety of tasks, including sentiment analysis (Kaushik et al., 2020; Madaan et al., 2021), story rewriting (Chen et al., 2022), question answering (Sachdeva et al., 2024; Paranjape et al., 2022), relation extraction (Miao et al., 2024), and domain adaptation (Calderon et al., 2022; Wu et al., 2021). Recent work has also explored applications of LLMs for counterfactual generation, particularly for data augmentation and model explainability (Goethals et al., 2025; Zhang et al., 2025; Wang et al., 2025; Gat et al., 2023; Wen et al., 2022).

Counterfactuals in Education. Extensive work has been done in education to provide feedback for short answer writing (Filighera et al., 2022a; McNichols et al., 2024; Scarlatos et al., 2024). Closely related to counterfactual generation is the task of paraphrase generation (Ouahrani and Bennouar, 2024), but existing work on counterfactual generation remains limited, highlighting the need for further development in this domain.

Gumbel Noise for LLM Generation. Recent work has explored the use of Gumbel noise for consistency during LLM generation (De Mijolla et al., 2025). Subsequent work explores the application of Gumbel noise to counterfactual generation for analyzing language model bias (Chatzi et al., 2025). Ravfogel et al. (2025) introduce a procedure for reference-based recovery and replay of Gumbel noise for counterfactual generation. Our approach builds on this line of work and introduces several key innovations. First, this work demonstrates that the recovered Gumbel noise can serve as a tunable reference-similarity control signal at generation time. Second, our work is the first to extensively evaluate the quality of Gumbel noise-generated counterfactuals for an applied downstream task. Finally, we introduce a practical variant of hindsight Gumbel sampling that does not require computing the normalization constant.

3 Methodology

In this section, we formalize the task of counterfactual generation and our approach to this task.
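Before the formal setup, it may help to see the Gumbel-Max trick of Section 2.1 in code. The following is a minimal NumPy sketch, not the authors' implementation: the four-token logit vector, the seed, and the helper names `softmax` and `gumbel_max_sample` are illustrative assumptions. It samples with $y=\arg\max_{v}(\ell_{v}+g_{v})$, compares the empirical token frequencies with the softmax probabilities, and shows that replaying stored noise on the same logits returns the same token.

```python
import numpy as np

def softmax(logits):
    """Softmax over a 1-D logit vector, shifted by the max for stability."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def gumbel_max_sample(logits, rng):
    """Sample one token index via the Gumbel-Max trick; also return the noise."""
    g = rng.gumbel(loc=0.0, scale=1.0, size=logits.shape)  # g_v ~ Gumbel(0, 1)
    return int(np.argmax(logits + g)), g

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0, -1.0])  # toy 4-token vocabulary (assumption)

# Empirical frequencies of Gumbel-Max samples should approach softmax(logits).
draws = [gumbel_max_sample(logits, rng)[0] for _ in range(20000)]
empirical = np.bincount(draws, minlength=logits.size) / len(draws)

# Storing the noise lets us replay the exact same sampling decision later.
y, g = gumbel_max_sample(logits, rng)
replayed = int(np.argmax(logits + g))
```

The stored vector `g` is the "reusable noise artifact" referred to above: the method replays it on logits computed from a modified prompt rather than on the original logits.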
3.1 Counterfactual Generation Task

In our setting, we treat both factual and counterfactual outputs as token sequences generated by an LLM in response to an instruction prompt. Formally, we assume a pretrained autoregressive LLM $f_{\theta}:\mathbf{x}\rightarrow\mathbf{y}$, parameterized by weights $\theta$. We define $\mathbf{x}=(x_{1}\dots x_{N})$ as a sequence of tokens forming the instruction (prompt) to the model, and $\mathbf{y}=(y_{1}\dots y_{M})$ as the output tokens sampled from the model. We define a counterfactual as an alternative output $f_{\theta}(\mathbf{x}')=\mathbf{y}'$ obtained after an intervention on the input prompt, which yields $\mathbf{x}'$.

In counterfactual generation, our goal is to create an output that reflects a category change intervention in the input while remaining similar to a factual reference. Formally, given a reference $\mathbf{y}_{r}$ in category $z$, we need to output $\mathbf{y}'$ in target category $z'$, where $z\neq z'\in\{0,\dots,k\}$ for a task with $k$ distinct categories. In tasks such as counterfactual topic generation, $z$ is interpreted as a nominal label, but in our counterfactual student work generation task, we interpret $z$ as an ordinal score. To determine the category of an arbitrary output, we introduce a scoring model $s:\mathbf{y}\rightarrow\hat{z}$ that estimates its score, and we say a counterfactual is valid when $s(\mathbf{y}')=z'$.

3.2 $\beta$-Hindsight Similarity Control

We now detail our inference-time similarity control approach for counterfactual generation, outlined in Algorithm 1. There are two phases: recovering the noise under a base LLM and replaying the noise values to sample from an intervened LLM.
To recover the Gumbel noise that corresponds to each token of a reference, for each autoregressive timestep $t$, we derive a noise vector $\mathbf{g}_{t}\in\mathbb{R}^{|V|}$ that satisfies the condition:

$$\ell[y_{t}]+\mathbf{g}_{t}[y_{t}]>\ell[v]+\mathbf{g}_{t}[v]\quad\forall v\in V,\ v\neq y_{t}. \tag{1}$$

Here, $V$ is the LLM's vocabulary, $y_{t}$ indexes the factual reference token at timestep $t$, and $\ell[y_{t}]$ is the associated logit. We omit the subscript $r$ of the reference $\mathbf{y}_{r}$ for clarity in time-based indexing. This condition ensures that the observed factual token $y_{t}$ is selected during the Gumbel-Max prediction. In order to satisfy Equation 1, we form a noise vector by first sampling the target noise from a lower-truncated Gumbel distribution:

$$\mathrm{LowerTruncG}(\tau_{min})\coloneqq P\bigl(G\mid G>\tau_{min}\bigr).$$

Here, $\tau_{min}$ denotes the lower bound of a truncated standard Gumbel distribution. We set $\tau_{min}=\ell_{max}-\ell[y_{t}]$, the difference between the maximum observed logit and the reference token logit, so that the augmented logit for the reference token will be greater than the logit for any other token at each timestep.
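As a worked illustration of this step, the following assumes inverse transform sampling with the standard Gumbel CDF $F(g)=\exp(-e^{-g})$; the numeric values ($\ell_{max}=2$, $\ell[y_{t}]=0$, and a uniform draw of $0.5$) are made up for the example.

```latex
% Lower-truncated Gumbel draw via the inverse CDF, with U ~ Uniform(0, 1):
\begin{align*}
u &= F(\tau_{min}) + U\,\bigl(1 - F(\tau_{min})\bigr), &
g &= F^{-1}(u) = -\log(-\log u) > \tau_{min}. \\
\intertext{Toy numbers: $\ell_{max}=2$, $\ell[y_t]=0$, so $\tau_{min}=2$ and $F(2)=\exp(-e^{-2})\approx 0.8734$. With $U=0.5$:}
u &\approx 0.8734 + 0.5 \times 0.1266 = 0.9367, &
g &\approx -\log(-\log 0.9367) \approx 2.73. \\
\intertext{The perturbed reference logit then exceeds the largest raw logit:}
\ell[y_t] + g &\approx 0 + 2.73 = 2.73 > \ell_{max} = 2.
\end{align*}
```

In floating point, $F(\tau_{min})$ rounds to 1 when the gap $\ell_{max}-\ell[y_{t}]$ is large, so a practical implementation would likely need a more careful, log-domain formulation; that detail is outside this sketch.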
Algorithm 1: $\beta$-Hindsight Control

    def BetaHindsight($\mathbf{x},\mathbf{y},\mathbf{x}',\theta,\beta$):
        $\mathbf{G}\leftarrow\varnothing$
        for $t=1,\dots,|\mathbf{y}|$:                                  ▷ Recover noise
            $\ell\leftarrow\mathrm{LLM}(\mathbf{x},\mathbf{y}_{<t};\theta)$
            $\ell_{max}\leftarrow\max_{v\in V}\ell[v]$
            $\mathbf{g}_{t}[y_{t}]\sim\mathrm{LowerTruncG}(\ell_{max}-\ell[y_{t}])$
            $T_{t}\leftarrow\ell[y_{t}]+\mathbf{g}_{t}[y_{t}]$
            for $v\in V\setminus\{y_{t}\}$:
                $\mathbf{g}_{t}[v]\sim\mathrm{UpperTruncG}(T_{t}-\ell[v])$
            store $\mathbf{g}_{t}$ in $\mathbf{G}$
        $\mathbf{y}'\leftarrow\varnothing$
        $y'_{1}\leftarrow\mathrm{BOS},\; t\leftarrow 1$
        while $y'_{t}\neq\mathrm{EOS}$:                                ▷ Replay noise
            $\ell\leftarrow\mathrm{LLM}(\mathbf{x}',\mathbf{y}'_{<t};\theta)$
            if $t\leq|\mathbf{y}|$:
                $\mathbf{g}_{t}\leftarrow\mathbf{G}[t]$
                $y'_{t}\leftarrow\arg\max_{v\in V}\bigl(\ell[v]+\beta\cdot\mathbf{g}_{t}[v]\bigr)$
            else:
                $\mathbf{u}_{t}\sim\mathrm{Gumbel}(0,1)^{|V|}$
                $y'_{t}\leftarrow\arg\max_{v\in V}\bigl(\ell[v]+\mathbf{u}_{t}[v]\bigr)$
            append $y'_{t}$ to $\mathbf{y}'$, $t\leftarrow t+1$
        return $\mathbf{y}'$

The sampled target noise establishes a ceiling, $T_{t}=\ell[y_{t}]+\mathbf{g}_{t}[y_{t}]$, the new logit value for the reference token.
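Algorithm 1 can be sketched end to end on a toy problem. This is a hedged illustration rather than the authors' code: `toy_logits` is a hypothetical stand-in for the LLM call, the vocabulary has 8 tokens with token 0 playing the role of EOS, truncated Gumbel draws use inverse-CDF sampling, and the `max_len` cap is an addition so the toy loop always terminates.

```python
import numpy as np

VOCAB, EOS = 8, 0  # toy vocabulary size and end-of-sequence id (assumptions)

def gumbel_cdf(g):
    return np.exp(-np.exp(-g))

def gumbel_inv_cdf(u):
    return -np.log(-np.log(u))

def toy_logits(prompt_id, prefix):
    """Hypothetical stand-in for LLM(x, y_<t; theta): deterministic pseudo-logits."""
    shared = np.random.default_rng([len(prefix), *prefix]).normal(size=VOCAB)
    shift = np.random.default_rng([1000 + prompt_id, len(prefix)]).normal(size=VOCAB)
    return 2.0 * shared + 0.5 * shift  # prompt changes the logits only mildly

def recover_noise(x, y, rng):
    """Stage 1: recover one Gumbel noise vector per reference token."""
    noise = []
    for t, y_t in enumerate(y):
        logits = toy_logits(x, y[:t])
        g = np.empty(VOCAB)
        # Reference token: lower-truncated at l_max - l[y_t].
        lo = gumbel_cdf(logits.max() - logits[y_t])
        g[y_t] = gumbel_inv_cdf(rng.uniform(lo, 1.0))
        ceiling = logits[y_t] + g[y_t]  # T_t in Algorithm 1
        # Other tokens: upper-truncated so perturbed logits stay below T_t.
        others = np.arange(VOCAB) != y_t
        hi = gumbel_cdf(ceiling - logits[others])
        g[others] = gumbel_inv_cdf(rng.uniform(0.0, hi))
        noise.append(g)
    return noise

def replay(x_prime, noise, beta, rng, max_len=20):
    """Stage 2: decode under the intervened prompt, replaying scaled noise."""
    out = []
    while len(out) < max_len:
        t = len(out)
        logits = toy_logits(x_prime, out)
        if t < len(noise):
            tok = int(np.argmax(logits + beta * noise[t]))
        else:  # past the reference length: fall back to fresh Gumbel noise
            tok = int(np.argmax(logits + rng.gumbel(size=VOCAB)))
        out.append(tok)
        if tok == EOS:
            break
    return out

rng = np.random.default_rng(0)
y_ref = [3, 5, 1, 4, EOS]                    # toy reference sequence
G = recover_noise(0, y_ref, rng)             # noise under the factual prompt (id 0)
same_prompt = replay(0, G, beta=1.0, rng=rng)
counterfactual = replay(1, G, beta=1.0, rng=rng)  # intervened prompt (id 1)
```

With the unchanged prompt and $\beta=1$, the replay reproduces the reference exactly, which follows from condition (1). How closely `counterfactual` tracks the reference under a changed prompt depends on the logits and on $\beta$, so the toy example makes no claim about it.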
Subsequently, for all other tokens $v\neq y_{t}$, we sample their corresponding noise values from an upper-truncated Gumbel distribution, such that no resulting token logit exceeds the ceiling:

$$\mathrm{UpperTruncG}(\tau_{max})\coloneqq P\bigl(G\mid G<\tau_{max}\bigr).$$

Here, $\tau_{max}$ denotes the upper bound of a truncated standard Gumbel distribution. We compute $\tau_{max}=T_{t}-\ell[v]$, the difference between the ceiling logit value and the actual logit value, for each of the other tokens in the vocabulary, so that none of the resulting logit values would exceed the ceiling. To efficiently sample from the truncated Gumbel distributions, we perform inverse transform sampling (Devroye, 1986), mapping the uniform samples to the truncated support range before applying the inverse CDF. The recovered Gumbel noise values satisfy Equation 1 by construction and represent a stochastic state that would have led the LLM to output the token in the reference, given the prompt and preceding tokens.

Given the recovered noise values, we then replay the noise to generate the counterfactual output under a modified prompt as input. The replay procedure follows from the Gumbel-Max trick, in which we additionally modulate the noise distribution with a scaling hyperparameter $\beta$. This hyperparameter controls the strength of similarity during sampling, with higher values inducing LLM outputs more similar to the reference.

3.3 Instruction Alignment for Task Validity

We now detail our approach to encourage validity in generated counterfactual student work. We frame this validity alignment problem as an instruction-following task, in which we prompt a language model with explicit instructions on how to generate a valid counterfactual. In each prompt, we include a rubric detailing the criterion to change, the qualitative aspects of each possible ordinal score $z$, and the reference student work $\mathbf{y}_{r}$. We conclude the prompt with instructions for the model to generate a revised version $\mathbf{y}'$ that adheres closely to the reference while reflecting a specified score $z'$. We include the prompts in Appendix A.2.

We find that training with preference alignment substantially improves task validity. We first perform supervised fine-tuning (SFT) (Ouyang et al., 2022) using examples derived from a dataset of $(\mathbf{y},z)$ tuples, where $z$ is used in forming the prompt $\mathbf{x}$ and $\mathbf{y}$ is the target. During training, we use the same prompt format but exclude the reference and related adherence instructions. After conducting SFT, we further align the model to the counterfactual task using direct preference optimization (DPO) (Rafailov et al., 2023). To construct preference pairs, we use student examples that achieve a target, ground-truth score on a given criterion as a preferred example, and those that receive an alternative score for the same criterion as a negative example. For each example in the training dataset, we randomly select one negative sample for each alternative score on the same prompt, yielding $k-1$ preference pairs per example. Together, instruction-based prompting, supervised fine-tuning, and preference alignment enable reliable task validity and are compatible with $\beta$-Hindsight similarity control.

4 Experimental Setup

In this section, we detail our experiments to evaluate the effectiveness of our approach.

4.1 Datasets

We evaluate our approach on two datasets of scored student writing samples. The first dataset is Common Lit Augmented Student Summary Evaluation (CLASSE), a corpus of student-authored summaries of short reading passages (Crossley et al., 2024).
The dataset contains 4,689 summaries of 101 unique passages authored by students in grades 3-12. The second dataset is Dataset for Rubric-based Essay Scoring on EFL Writing (DREsS), a corpus of student-authored responses to open-ended writing prompts (Yoo et al., 2025). We use $\text{DREsS}_{\text{New}}$, which contains 2,279 essays from 65 essay writing prompts authored by university undergraduate students enrolled in an English as a foreign language (EFL) writing course. The student writings in both datasets are graded by experts with ordinal rubrics on multiple criteria for writing quality, 5 for CLASSE and 3 for DREsS. We further detail criterion definitions, token statistics, and scoring rubrics for the datasets in Appendix A.6.

4.2 Metrics

We measure two competing properties of the generated counterfactuals, (i) similarity to a factual example and (ii) task validity, following existing work on counterfactual generation (Wang et al., 2024).

Similarity. In our experiments, we measure the similarity between the reference student summary and the counterfactual student summary. We report the complement of the normalized character-level Levenshtein distance, i.e., one minus the normalized distance, averaged across all $N$ counterfactual summaries for a given setting:

$$\text{Similarity}=\frac{1}{N}\sum_{i=1}^{N}\left(1-\frac{\text{Lev}(y_{i},y_{i}')}{\max(\lVert y_{i}\rVert,\lVert y_{i}'\rVert)}\right).$$

Here, $y_{i}$ is the reference summary and $y_{i}'$ the counterfactual summary. This metric outputs a score in $[0,1]$, where 1 indicates maximum similarity.

Validity. A generated counterfactual is valid if its score, given by the scoring model, reflects the intended score for the specified criterion. Existing evaluation metrics for validity, such as flip rate (Ross et al., 2021), are for nominal tasks (e.g., sentiment or topic classification). Since scores in our task are ordinal, we use average Quadratic Weighted Kappa (QWK) across scoring criteria between the intended score $z'$ and an estimated score from the scoring model, $s(\mathbf{y}')=\hat{z}'$. QWK produces a score in $[-1,1]$, with higher values indicating higher task validity. We use an LLM-as-a-judge (Li et al., 2025) system to compute validity using a scoring rubric. Specifically, we fine-tune GPT-4.1-mini for each criterion across both datasets. We evaluate the average QWK agreement across scoring criteria between this model and human scores for CLASSE and DREsS, which are 0.6242 and 0.6012, respectively. This agreement is very close to the 0.642 inter-rater agreement for CLASSE (DREsS is single-scored), indicating that our scoring model is a good proxy of human scoring. We detail experiments on multiple approaches for automated validity scoring in Appendix A.4.

| Approach  | CLASSE Sim ↑        | CLASSE Val ↑           | DREsS Sim ↑            | DREsS Val ↑            |
|-----------|---------------------|------------------------|------------------------|------------------------|
| ME        | 0.314               | 0.183                  | 0.305                  | 0.188                  |
| ICL       | 0.308               | $\underline{0.327}$    | 0.330                  | 0.264                  |
| I&R       | 0.498               | 0.103                  | 0.308                  | 0.325                  |
| GM (Ours) | $\underline{0.391}$ | $\textbf{0.668}^{*}$   | $\textbf{0.477}^{*}$   | $\textbf{0.404}^{*}$   |

Table 1: Comparing GM to other counterfactual generation methods on both datasets (ME - Minimal-Edit, ICL - In-Context Learning, I&R - Identify-and-Replace, GM - Gumbel Machine).

4.3 Experimental Settings and Details

In this section, we detail the settings and baseline approaches we use to evaluate our approach.

Counterfactual Generation Baselines. We compare our approach, Gumbel Machine (GM), to three other LLM-based counterfactual text generation strategies.

  • Minimal-Edit (ME): Following Gat et al. (2023), we include in the prompt instructions for the LLM to minimally edit the reference as a zero-shot prompting approach.
  • In-Context Learning (ICL): Following Li et al. (2024), we further include in-context demonstrations of each score level in the prompt.
  • Identify-and-Replace (I&R): Following Nguyen et al. (2024), we instruct the LLM to perform a series of steps to (i) identify word spans leading to the original score, (ii) change the identified spans to reflect the desired score, and (iii) construct a new summary with the modified spans. These instructions resemble a “Chain-of-Thought (CoT)” for our task (Wei et al., 2022).

| Model  | Method    | CLASSE Sim ↑         | CLASSE Val ↑         | DREsS Sim ↑          | DREsS Val ↑         |
|--------|-----------|----------------------|----------------------|----------------------|---------------------|
| Haiku  | ME        | 0.280                | 0.144                | 0.312                | 0.093               |
| Haiku  | ICL       | 0.281                | 0.312                | 0.314                | 0.125               |
| Haiku  | I&R       | 0.451                | 0.388                | 0.384                | $\underline{0.441}$ |
| Sonnet | ME        | 0.376                | 0.344                | 0.375                | 0.439               |
| Sonnet | ICL       | 0.359                | $\underline{0.604}$  | 0.418                | 0.484               |
| Sonnet | I&R       | $\underline{0.554}$  | 0.434                | $\underline{0.477}$  | 0.405               |
| Llama  | GM (Ours) | 0.391                | $\textbf{0.668}^{*}$ | $\underline{0.477}$  | 0.404               |
| Qwen   | GM (Ours) | $\textbf{0.575}^{*}$ | 0.566                | $\textbf{0.593}^{*}$ | 0.315               |

Table 2: Comparing counterfactual LLM approaches across proprietary and open-weight models (ME - Minimal-Edit, ICL - In-Context Learning, I&R - Identify-and-Replace, GM - Gumbel Machine). GM translates across open-weight models and has comparable performance to larger proprietary models.

LLMs. We experiment with two open-weight models and two proprietary ones. We use LLaMA-3.1-8B-Instruct as our primary open-weight model due to its strong instruction-following behavior and wide community adoption (Grattafiori et al., 2024), and Qwen3-4B-Instruct (Yang et al., 2025), which uses a different training pipeline, to evaluate whether our approach generalizes across different models. We also use Claude Haiku 4.5 (Anthropic, 2025a), a lightweight prompting-only baseline at a comparable capability tier to our open-weight models, and Claude Sonnet 4.6 (Anthropic, 2025b), a larger and more capable proprietary model. This model choice for proprietary, prompting-based baselines mitigates bias when an LLM evaluator and a generator share a model family (Panickssery et al., 2024), since our scoring model is based on GPT-4.1-mini.

| Dimension         | Likert GM (1–3) | Likert ME (1–3) | Wilcoxon $p$ | Pairwise GM/ME/Tie | Pairwise GM win% | Binomial $p$ | Conditional on sim=GM: GM win% (GM/ME/Tie) |
|-------------------|-----------------|-----------------|--------------|--------------------|------------------|--------------|--------------------------------------------|
| Score Improvement | 2.58            | 2.64            | 0.41         | 83/82/35           | 50%              | 1.00         | 72% (44/17/3)                              |
| Similarity        | 1.77            | 1.67            | 0.03         | 64/39/97           | 62%              | 0.02         | -                                          |
| Utility           | 2.43            | 2.46            | 0.75         | 88/78/34           | 53%              | 0.49         | 75% (47/16/1)                              |

Table 3: Human evaluation results across four evaluators comparing GM (Gumbel Machine) to the ME (Minimal-Edit) baseline. Evaluators give GM counterfactuals higher similarity ratings in comparative and standalone evaluations. In those cases, generated feedback is also rated more useful. Bold indicates statistical significance.

Data Sampling Strategy. We generate counterfactuals for a stratified subset of score changes sampled from a held-out validation set for each dataset. Specifically, we perform an 80-20 train-validation split and sample uniformly from the validation split across all ordered score transitions (e.g., $4\rightarrow 1$, $2\rightarrow 3$), capping the total at 400 examples per rubric crit
