Who Defines "Best"? Towards Interactive, Us... | AI Research

Key Takeaways

  • LLM leaderboards are widely used to compare models and guide deployment decisions.
  • However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations.
  • A single aggregate score often obscures how models behave across different prompt types and compositions.
  • Our analysis reveals that the dataset is heavily skewed toward certain topics, that model rankings vary across prompt slices, and that preference-based judgments are used in ways that blur their intended scope.
Paper Abstract

LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe. Our analysis reveals that the dataset is heavily skewed toward certain topics, that model rankings vary across prompt slices, and that preference-based judgments are used in ways that blur their intended scope. Building on this analysis, we introduce a visualization interface that allows users to define their own evaluation priorities by selecting and weighting prompt slices and to explore how rankings change accordingly. A qualitative study suggests that this interactive approach improves transparency and supports more context-specific model evaluation, pointing toward alternative ways to design and use LLM leaderboards.

What it covers

Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

Minji Jung, Minjae Lee, Yejin Kim, Sarang Choi, and Minsuk Kahng
Yonsei University, Seoul, South Korea (2026)

Keywords: LLM evaluation, disaggregated evaluation, LLM leaderboards, interactive data visualization, human-computer interaction, responsible AI

Conference: The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25–28, 2026, Montreal, QC, Canada

CCS concepts: Computing methodologies → Natural language processing; Human-centered computing → Interactive systems and tools; Human-centered computing → Visualization

1. Introduction

Large Language Models (LLMs) are increasingly evaluated, compared, and selected through benchmark datasets and leaderboards. These leaderboards play a central role in shaping research narratives, deployment decisions, and public perceptions of model quality. Yet, despite their influence, most evaluation frameworks are designed and fixed by a small group of benchmark creators, while their results are consumed by a much broader and more diverse audience.

This asymmetry raises a fundamental concern: evaluation outcomes are largely determined by data composition and aggregation choices made by others, and then inherited by users whose goals, contexts, and use cases may differ substantially. Leaderboard rankings implicitly reflect which prompts are included, how frequently they appear, and how results are aggregated into a single score. Global aggregate rankings obscure substantial variation across topics, domains, or task types. For many real-world scenarios, performance on particular data subsets may matter far more than overall averages, yet such distinctions are rarely visible or actionable.

One common response to this limitation is to disaggregate evaluation results (Barocas et al., 2021; Madaio et al., 2022), for example by reporting performance across predefined slices or by using automated methods to identify meaningful subgroups in the data. While valuable, these approaches typically assume that the structure and relative importance of slices are fixed in advance. In practice, however, users often need to explore how different parts of the data contribute to overall results, and to decide for themselves which subsets deserve greater emphasis based on their context.

In this paper, we argue that LLM evaluation should support this interactive exploration and reweighting process. Rather than asking users to accept a fixed global ranking, or a fixed set of disaggregated views, we propose treating leaderboard evaluation as a sensemaking activity, where users can examine data slices, adjust their relative importance, and observe how rankings change as a result.

To ground this argument, we conduct an in-depth analysis of the dataset used in the popular LMArena (formerly Chatbot Arena) benchmark (Chiang et al., 2024). The analysis reveals substantial skews in prompt coverage and pronounced variability in model rankings across different data slices. These findings demonstrate how a single aggregate ranking can reflect the dominant composition of the dataset, while concealing trade-offs that emerge when attention shifts to specific subsets.

Building on this analysis, we design an interactive visualization tool as a design probe to explore how users might engage with disaggregated leaderboard results. The interface exposes the composition of the evaluation dataset and allows users to define important prompt slices and adjust their relative emphasis, making it possible to observe how model rankings change as users shift their attention across different subsets of data. The visualization serves as a concrete instantiation of our analysis to examine how user interaction can support more contextual interpretation of leaderboard outcomes.

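The reweighting idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text here does not specify how the interface aggregates slices, so the weighted average of per-slice win rates, the function name `reweighted_ranking`, and all model names and numbers below are hypothetical.

```python
from typing import Dict, List, Tuple


def reweighted_ranking(
    slice_win_rates: Dict[str, Dict[str, float]],
    slice_weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank models by a weighted average of their per-slice win rates.

    Assumption: a plain weighted mean stands in for whatever aggregation
    the actual interface uses. Slices with weight 0 are ignored.
    """
    active = {s: w for s, w in slice_weights.items() if w > 0}
    total = sum(active.values())
    if total == 0:
        raise ValueError("at least one slice needs a positive weight")
    scores = {
        model: sum(w * rates[s] for s, w in active.items()) / total
        for model, rates in slice_win_rates.items()
    }
    # Highest weighted score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-slice win rates for two made-up models.
win_rates = {
    "model_x": {"coding": 0.70, "math": 0.45, "chitchat": 0.50},
    "model_y": {"coding": 0.50, "math": 0.55, "chitchat": 0.65},
}

coding_heavy = reweighted_ranking(win_rates, {"coding": 3, "math": 1, "chitchat": 0})
chat_heavy = reweighted_ranking(win_rates, {"coding": 0, "math": 1, "chitchat": 3})
print(coding_heavy[0][0], chat_heavy[0][0])
```

With these made-up numbers the top-ranked model flips when emphasis moves from coding to casual conversation, which is the kind of shift a fixed aggregate ranking hides.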
We conducted a small qualitative study (N=10) to understand how practitioners would use this design in their model selection process. Participants used our system to inspect different prompt types and to explore how emphasizing different slices affected rankings given their own needs and constraints. These observations suggest that interactive, slice-based views can help users move beyond global rankings and interpret leaderboard results more critically.

This work makes three primary contributions:

• A reframing of LLM leaderboard evaluation as an interactive process of exploring and reweighting data slices, rather than relying on fixed aggregate evaluations.

• An empirical analysis of the LMArena leaderboard data, demonstrating how dataset composition can affect model rankings.

• An interactive visualization interface as a design probe, with qualitative findings illustrating how more contextual and transparent interpretation of leaderboard results could work in practice.

2. Related Work

2.1. Benchmarking Practices in LLM Evaluation

While benchmarking has become the de facto standard for AI evaluation, its dominance has introduced significant practical limitations. Current benchmarks cannot represent the full scope of real-world use; they act as finite samples rather than measures of general intelligence (Raji et al., 2021). The overemphasis on leaderboard competition has transformed evaluation into what some call "AI as a sport," where beating benchmark scores becomes the goal (Orr and Kang, 2024). Research efforts tend to cycle around a small number of datasets that are repeatedly reused and recycled, encouraging optimization for specific benchmarks at the expense of broader applicability (Koch et al., 2021). These practices assume that benchmark test sets are representative of the real world, but when datasets reflect biased or incomplete representations, the evaluations themselves become systematically biased (Buolamwini and Gebru, 2018).

To assess diverse capabilities of language models, the field has increasingly adopted large-scale, multi-task benchmarks (Hendrycks et al., 2021; Srivastava et al., 2023; Chung et al., 2024; Suzgun et al., 2023). In parallel, evaluation practices are shifting toward holistic approaches that aim to capture a broader range of metrics and model behaviors (Shevlane et al., 2023; Liang et al., 2022). For example, HELM (Liang et al., 2022) assesses model performance beyond accuracy, incorporating measures such as calibration, robustness, fairness, toxicity, and efficiency. However, as benchmarks expand to encompass more tasks and dimensions, their aggregate rankings become increasingly unstable and sensitive to seemingly minor changes (Zhang and Hardt, 2024).

2.2. Preference-based Evaluation and Leaderboards

The evaluation of LLMs has shifted from reference-based metrics toward preference-based judgments, reflecting the open-ended and subjective nature of language generation tasks. The LMSYS group popularized this approach by using pairwise preference comparisons, initially employing strong LLMs as judges to provide a scalable proxy for model quality (Team, 2023; Zheng et al., 2023; Li et al., 2024b; Dubois et al., 2024). Building on this work, the same group introduced Chatbot Arena (Chiang et al., 2024), later rebranded as LMArena, where users compare two model responses side-by-side and vote for their preferred response. (Footnote 1: The platform was again rebranded from LMArena to Arena in late January 2026. In this paper, we use LMArena to reflect the name when most of this research was conducted.) These crowdsourced judgments are aggregated using an Elo-based rating system to produce a continuously updated leaderboard intended to reflect real-world human utility.

Recent studies have revealed fundamental limitations in LMArena-style platforms and preference-based evaluations. Aggregation mechanisms such as Elo obscure performance heterogeneity across different tasks (Rofin et al., 2023; Lanctot et al., 2023; Boubdir et al., 2024; Singh et al., 2025), and evaluations occur disproportionately for queries that are very easy or highly objective, meaning ties are driven more by intrinsic query properties than by model capability (Tang et al., 2025). The systems are also vulnerable to adversarial manipulation (Min et al., 2025; Huang et al., 2025; Frick et al., 2025). Perhaps most critically, preference-based evaluation tends to prioritize perceived helpfulness, while weakly constraining other important dimensions, like factual correctness, honesty, and safety (Feuer et al., 2024; Chen et al., 2024; Wu and Aji, 2025).
These limitations stem not from any specific evaluation method, but from the structural reliance on aggregated preference signals, motivating our focus on disaggregated and transparent evaluation approaches.

2.3. Subgroup Analysis and Interactive Slicing

Global aggregate metrics often obscure performance heterogeneity across diverse domains, demographic groups, and user intents. Failing to disaggregate performance data erases minority group experiences and reinforces systemic social inequities (Buolamwini and Gebru, 2018), which can be particularly consequential in high-stakes domains like healthcare (Obermeyer et al., 2019; Seyyed-Kalantari et al., 2021). Aggregate metrics are structurally incapable of ensuring fairness and responsibility in AI systems, creating a critical need for disaggregated evaluation (Herlihy et al., 2024; Diaz and Madaio, 2024; Khodak et al., 2024; Pfohl et al., 2025). Behavioral testing methodologies similarly emphasize the importance of systematic, fine-grained probing to expose localized failure modes (Ribeiro et al., 2020; Wu et al., 2019). These approaches can ensure that the benefits of AI are distributed equitably across a range of stakeholders and use cases (Barocas et al., 2021; Madaio et al., 2022).

A key insight from the human-computer interaction (HCI) community is that disaggregated evaluation must be interactive. Rather than presenting users with a fixed set of slices, interactive tools allow users to define, explore, and compare slices that matter for their specific context (Cabrera et al., 2023; Sivaraman et al., 2025; Kahng et al., 2025). These systems leverage visual analytics techniques that help users make sense of complex patterns through iterative, human-guided exploration, emphasizing user agency (Cabrera et al., 2019; Zhang et al., 2022; Wang et al., 2024). This interactive philosophy extends beyond model evaluation to documentation.
Interactive Model Cards suggest that static documentation is insufficient for diverse stakeholders, advocating for tools that allow users to interactively probe model boundaries (Crisan et al., 2022). Our work extends this interactive slicing approach to leaderboard evaluation, a setting that has received little attention in prior work. While existing tools focus on analyzing individual models or pairwise comparisons, we enable users to interactively define evaluation priorities and to analyze the relative standing of many models simultaneously.

2.4. Interactive Ranking and User-Defined Evaluation

While most leaderboards present fixed rankings that apply uniformly to all users, recent work has begun exploring more flexible approaches. The Prompt-to-Leaderboard framework optimizes for personalized, prompt-conditioned leaderboards (Frick et al., 2025), and Arena-Hard-Auto prioritizes more informative or challenging prompts through automated selection (Li et al., 2024b). However, these approaches still rely on predetermined notions of what makes a good evaluation. As Jury Learning demonstrates in the context of annotation (Gordon et al., 2022), disagreement and diverse perspectives are not noise to be eliminated but signal to be preserved (Aroyo and Welty, 2015). Similarly, the composition of the prompt dataset shapes what counts as "best," and this is inherently a value-laden process.

To address the need for context-specific evaluation, we draw on principles from multi-attribute ranking visualizations (Gratzl et al., 2013; Seo and Shneiderman, 2005; Pajer et al., 2016; Wall et al., 2017). Systems like LineUp (Gratzl et al., 2013) demonstrate the power of visualization for interactively exploring trade-offs across multiple dimensions. Similarly, Dynaboard (Ma et al., 2021) enables users to customize a scoring function by adjusting weights across metrics such as accuracy, robustness, and efficiency.
While recent work has explored user-defined criteria for LLM evaluation (Kim et al., 2024; Shankar et al., 2024), we take a complementary approach. We focus on leveraging existing evaluation data, rather than asking users to define abstract criteria from scratch. We aim to support the fundamental visualization principle of revealing patterns through interaction, so that users can discover insights about model behavior (North, 2006). This transforms leaderboard evaluation from passive consumption into active sensemaking, supporting more transparent and context-specific model selection.

3. Dataset Analysis

3.1. Dataset Properties and Topic Distribution

3.1.1. Dataset Overview

We analyze the Human Preference 140K dataset (https://huggingface.co/datasets/lmarena-ai/arena-human-preference-140k), the latest release from the LMArena platform. The dataset comprises preference-based judgments for 53 LLMs, collected over about three months from April to July 2025. On the LMArena website, users see responses from two different LLMs side-by-side and vote for their preference: Model A wins, Model B wins, Tie, or Both Bad. The dataset contains 135,634 judgments: Model A won 35.8%, Model B won 36.7%, and Ties (including Both Bad) accounted for 27.5% of cases. It also includes metadata for each prompt, such as category tags (e.g., whether it contains code, or is about mathematics) and language.

Figure 1. Treemap visualization of the topic distribution of the LMArena dataset. We construct a three-level topic hierarchy and visualize the top two levels in this figure; rectangle areas are proportional to the number of prompts in each category. The distribution reveals a clear skew toward developer- and AI-related topics, which together account for 30% of the dataset, reflecting the interests of a specific population.

3.1.2. Semantic Topic Hierarchy

To analyze the semantic composition of the dataset, we construct a three-level topic hierarchy through a two-stage pipeline: low-level grouping via clustering, followed by higher-level organization via LLMs with manual refinement.

Low-level clustering. To avoid grouping prompts by lexical similarity rather than semantic intent, we first generate a short English topic description for each prompt using GPT-5 mini (prompt in Appendix A.1.1), following Lam et al. (2024) and Tamkin et al. (2024). We then compute embeddings using OpenAI's text-embedding-3-small model and apply k-means clustering. The choice of k involves a trade-off. As k increases, clusters become more descriptively specific (clusters at k ≥ 400 contain significantly more unique terms than clusters at smaller k values), but each cluster's smaller sample size widens confidence intervals and increases the probability that any two models' score distributions overlap. We evaluated k ∈ {100, 200, …, 600} and found that this probability grows nearly linearly. We selected k = 400 as an empirical middle ground that preserves descriptive specificity while maintaining reasonable statistical power.

Higher-level categories. We organize the 400 clusters into top- and mid-level categories following LLM-based topic clustering methods in which an LLM proposes candidate groupings, similar to Pham et al. (2024) and Wang et al. (2023). We use the more capable GPT-5.2 for this more abstract grouping task (see Appendix A.1.3 for the prompt). Since fully automated hierarchy construction (Tamkin et al., 2024) produced categories that were not consistently coherent at the same level of abstraction, we refined the hierarchy through manual review. Specifically, 10% of higher-level categories were manually added, and 13% of clusters were reassigned.
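The sample-size side of the trade-off in choosing k can be made concrete with a rough calculation. The sketch below is not the authors' overlap-probability analysis: it assumes equal-sized clusters, treats every judgment in a cluster as evidence about a single win rate (an optimistic upper bound on the data available per model), and uses a normal-approximation 95% interval at a win rate of 0.5. The total of 135,634 judgments comes from the dataset overview; the function names are illustrative.

```python
import math

TOTAL_JUDGMENTS = 135_634  # reported size of the dataset
CANDIDATE_KS = [100, 200, 300, 400, 500, 600]


def ci_half_width(n: float, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def half_width_by_k(total: int, ks: list) -> dict:
    """Half-width per k, assuming (hypothetically) equal-sized clusters."""
    return {k: ci_half_width(total / k) for k in ks}


widths = half_width_by_k(TOTAL_JUDGMENTS, CANDIDATE_KS)
for k, w in widths.items():
    print(f"k={k}: about {TOTAL_JUDGMENTS / k:.0f} judgments per cluster, "
          f"95% CI half-width of roughly {100 * w:.1f} percentage points")
```

Because the half-width scales with the square root of k under these assumptions, quadrupling k from 100 to 400 doubles the interval width. Real clusters are uneven and judgments are split across 53 models, so per-model intervals are considerably wider than this bound.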
To assess robustness, we repeated the pipeline twice for each of the six values of k and compared which prompts fell into high- vs. low-divergence mid-level categories (top and bottom 20%); prompt-level agreement was 86% (Cohen's κ = 0.71), suggesting that divergence patterns are reasonably consistent across clustering configurations.

The final hierarchy consists of 8 top-level, 53 mid-level, and 400 fine-grained categories. Figure 1 presents a treemap visualization of the distribution of the top- and mid-level categories, with rectangle areas proportional to the number of prompts in each category. We observe three notable patterns:

(1) Programming and software development prompts dominate the dataset, accounting for about 30% of all prompts across multiple categories. This reflects overrepresentation of a specific population (e.g., software developers and AI practitioners) whose needs and preferences may differ substantially from the broader user base. This pattern is consistent with prior analyses of LMArena and similar corpora (Zhao et al., 2024; Tamkin et al., 2024).

(2) The dataset contains over 1,000 prompts that consist only of simple greetings (e.g., "hi there") and appear to be system tests (Li et al., 2024a). Interestingly, despite minimal differences between model outputs in many cases, users selected a winner in 79% of these cases, suggesting arbitrary choices that add noise to aggregate rankings.

(3) We observe many highly repetitive prompts that sometimes form a distinct cluster in our hierarchy. For example, "How many 'r's are there in strawberry?" appears 205 times with only minor wording variations. This question has been used within the user community to test model behavior and is submitted by many users, which can distort rankings.

Figure 2. Heatmap visualization of model performance across mid-level prompt categories. Rows represent models with at least 4,000 evaluations, sorted by overall win rate. The first column shows each model's overall win rate, and the remaining columns correspond to mid-level categories ordered by their Spearman correlation with the overall ranking, from left (high correlation) to right (low correlation), with Data Processing & Analysis appearing at the far right. Notably, in this lowest-correlation category, Claude-family models achieve higher win rates than their overall performance would suggest. Each cell encodes the model's win rate for the corresponding category; deviations from surrounding colors highlight category-specific performance differences, such as the high win rate of minimax-m1 in Math, visible as a dark green cell toward the middle right.

3.2. Category-Specific Ranking Differences

Model performance can vary substantially across different prompt categories. To understand this variability, we analyze how rankings change across the mid-level categories in our topic hierarchy.

3.2.1. Which Categories Show Different Rankings?

For each mid-level category, we compute per-model win rates and compare the resulting ranking with the overall ranking using Spearman's rank correlation. We use Bayesian smoothing based on the Beta-Binomial model to account for uncertainty with small sample sizes (Murphy, 2012), and exclude ties. The three categories with the highest correlations are Applied Engineering, Home Maintenance, and Server & Storage (all over 0.93); the three with the lowest are Data Processing & Analysis (0.60), Web Frontend (0.70), and Simple Queries & Quick Facts (0.76).

To better understand these divergences, we examine Data Processing & Analysis, the lowest-correlation category, in detail. This category shows dramatic ranking differences. The Claude family has eight models in the dataset, none ranking in the overall top 20, but in this category three Claude models jump into the top 5. To understand what drives these wins, we prompt an LLM to generate rationales for why users preferred winning responses (detailed prompt in Appendix A.2). The LLM consistently attributes Claude's wins to correctness and precision in handling structured data, but prompts requiring such complexity are rare in a dataset dominated by lighter queries.

3.2.2. Which Models Perform Differently in Specific Categories?

The above analysis identified categories where rankings differ from the overall ranking. We now examine a finer-grained question: which specific models show unusually strong or weak performance in particular categories, even when the category as a whole may not deviate substantially?

We first visualize win rates across models and categories as a heatmap (shown in Figure 2), where rows represent models and columns represent mid-level categories sorted by their Spearman correlation values, with each cell showing the model's win rate. While many models perform consistently well across categories, some show isolated spikes. We quantify these deviations using a two-proportion z-test (Fleiss et al., 2013), comparing each model's win rate within a category to its win rate across all other categories. Higher absolute z-scores indicate larger differences.

The strongest deviation is observed for minimax-m1 in the math category, where the z-score exceeds 8. The model achieves 138 wins and only 30 losses. At the fine-grained subcategory level, it excels particularly in algebra question solving (95% win rate, z = 5.43) and advanced number theory (92%, z = 4.16), which indicates that the model performs well on difficult computational tasks, rather than on simple routine problems.

These findings reveal that models can excel on specific task types, such as complex coding tasks that emphasize correctness and precision. However, because such tasks constitute a small portion of the dataset, strong performance on them is underrepresented in aggregate rankings.
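The two statistics used in this section, a Beta-Binomial smoothed win rate and a two-proportion z-test, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the uniform Beta(1, 1) prior and the pooled-variance form of the z-test are assumptions, since the text names the methods but not their parameters. The in-category record (138 wins, 30 losses) is taken from the text; the out-of-category counts are hypothetical placeholders.

```python
import math


def smoothed_win_rate(wins: int, losses: int,
                      prior_wins: float = 1.0, prior_losses: float = 1.0) -> float:
    """Posterior mean win rate under a Beta prior (ties excluded).

    Assumption: Beta(1, 1) by default; the prior actually used is not stated.
    """
    return (wins + prior_wins) / (wins + losses + prior_wins + prior_losses)


def two_proportion_z(wins_in: int, total_in: int,
                     wins_out: int, total_out: int) -> float:
    """Pooled two-proportion z statistic: in-category vs. all other categories."""
    p_in = wins_in / total_in
    p_out = wins_out / total_out
    pooled = (wins_in + wins_out) / (total_in + total_out)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_in + 1 / total_out))
    if se == 0:
        raise ValueError("z is undefined when the pooled win rate is 0 or 1")
    return (p_in - p_out) / se


# In-category record reported in the text: 138 wins, 30 losses.
math_wins, math_losses = 138, 30
# Hypothetical record across all other categories (not from the paper).
other_wins, other_total = 1100, 2000

rate = smoothed_win_rate(math_wins, math_losses)
z = two_proportion_z(math_wins, math_wins + math_losses, other_wins, other_total)
print(f"smoothed math win rate: {rate:.3f}, z = {z:.2f}")
```

Smoothing matters most for sparse categories: a model with 3 wins and 0 losses gets a smoothed rate of 0.8 rather than 1.0 under this prior, which keeps tiny samples from dominating a rank correlation.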
This raises a fundamental question: can preference-based evaluation meaningfully aggregate across tasks with fundamentally different characteristics, namely those with deterministic answers, those involving subjective preferences, and those requiring value-dependent judgments?

3.3. Limits of Preference-Based Evaluation Across Deterministic and Pluralistic Tasks

To better understand when aggregate preference-based evaluation can misrepresent model behavior, we examine settings where such aggregation is most likely to break down. Rather than aiming for exhaustive coverage of all prompt types, we focus on two representative extremes: (1) prompts with deterministic, objectively correct answers; and (2) prompts involving inherently pluralistic, value-dependent judgments.

3.3.1. Objective Tasks with Deterministic Answers

We analyze 8,035 prompts from the mathematics category of the LMArena dataset, using the category metadata provided in the dataset. However, not all prompts in this category have deterministic answers. We exclude prompts with ambiguous meaning or no deterministic answer by asking GPT-5.2 to mark mathematical questions that have a well-defined, verifiable answer (detailed prompt in Appendix A.3.1). This yields a total of 2,773 prompts classified as mathematical problems with a definitive answer.

Answer Correctness. We use GPT-5.2 and Gemini 3 Flash to classify model responses by correctness, asking each LLM to assess whether a response gives a correct mathematical answer to the prompt (prompt in Appendix A.3.2). We acknowledge that LLM-based judgments can be incorrect. To mitigate this limitation, we employ two independent models and analyze only the 2,143 cases (4,286 responses) where both models agreed on correctness. In 23% of the cases, one model response is correct and the other is incorrect.
While correctness correlates relatively highly with human preference (Spearman's rank correlation coefficient 0.71), humans selected the correct answer as the winner only 74% of the time. Our inspection of the data reveals that users made incorrect selections on problems involving complex formulas or precision-sensitive calculations, such as discount rates, taxes, or square roots.

In 67% of the cases, both models' responses are correct for the same prompt. Even when both models answered correctly, humans selected a winner 56% of the time. This indicates that getting the correct answer alone does not lead to winning, and that other factors influence user judgments. We hypothesize that explanation style affects preference judgments, which we investigate below.

Name | Description | Count
Conciseness | Delivers the core answer or solution directly with minimal explanation | 2,256
Elaboration | Expands beyond minimal answers with additional context or explanations | 2,259
Structure Richness | Organizes the response using clear formal structure, e.g., steps or tables | 2,788
Reasoning with Derivation | Shows step-by-step reasoning, intermediate steps, and derivations | 3,022
Rigorous Assumption Handling | Carefully examines assumptions, constraints, edge cases and potential ambiguities | 3,069
User-oriented Interaction | Actively engages the user with follow-up questions, clarification, etc. | 898

Table 1. Explanation style characteristics in model responses related to the mathematics category. Six characteristics were identified through analysis using GPT-5.2.

Explanation Styles. To understand how explanation style influences preference, we identify style characteristics with GPT-5.2 (see Appendix A.3.3 for the prompt). We extract and refine recurring traits by crafting a prompt with 100 randomly sampled examples, repeating for ten iterations, and merging the results. This process yields six distinct characteristics, as listed in Table 1.
We then use GPT-5.2 to compare model response pairs and tag each response for the presence of these characteristics (see Appendix A.3.4 for the prompt). For response pairs where both models provide correct answers, we measure how much the two responses' sets of tagged characteristics overlap using Jaccard similarity. High similarity indicates that the two responses share largely the same stylistic characteristics, whereas low similarity suggests that different stylistic factors may have influenced human judgments.

When humans selected a winner despite both model responses being correct, the average Jaccard similarity between response pairs was 0.3, which is lower than that of randomly paired responses (0.5). This pattern suggests that human preferences in these cases might arise not because both responses succeed in similar ways, but because one response stands out along different stylistic dimensions. Among winning responses, the most frequent characteristics are conciseness (50%), elaboration (48%), and structure richness (39%). These results indicate that, in math-related problems, human preferences are shaped by specific combinations of explanation styles rather than correctness alone. Consistent with this interpretation, the minimax-m1 model, which balances concise answers with clear structure and derivational reasoning, rises from 19th globally to 1st in math, while also maintaining a high accuracy rank of 4th.

Prompt type | Total Count | Non-pluralistic | Refusal | Ratio of Non-pluralistic
Geopolitical conflicts | 162 | 18 | 2 | 11.1%
Value judgments | 88 | 19 | 1 | 21.5%
Political or historical issues in China | 249 | 41 | 17 | 16.4%
Human rights issues | 42 | 3 | 0 | 7.1%
Political or historical evaluation | 444 | 29 | 1 | 6.5%
Future predictions | 173 | 10 | 0 | 5.7%

Table 2. Category classification of political dispute related prompts, illustrating variation across prompt types in the proportion of model responses that take a specific position (non-pluralistic) rather than presenting pluralistic viewpoints.
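The Jaccard comparison of style tags described in the Explanation Styles analysis can be sketched as below. The tag names come from Table 1, but the example response pairs are invented for illustration, and returning 1.0 when both tag sets are empty is an assumption about how that edge case is handled.

```python
from typing import Iterable, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two tag sets."""
    if not a and not b:
        return 1.0  # assumption: two untagged responses count as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Average Jaccard similarity over (response A tags, response B tags) pairs."""
    sims = [jaccard(a, b) for a, b in pairs]
    if not sims:
        raise ValueError("need at least one response pair")
    return sum(sims) / len(sims)


# Hypothetical style tags for two response pairs (not taken from the dataset).
example_pairs = [
    ({"Conciseness", "Structure Richness"},
     {"Structure Richness", "Elaboration"}),
    ({"Reasoning with Derivation"},
     {"Reasoning with Derivation"}),
]
print(mean_pairwise_jaccard(example_pairs))
```

In the first pair only one of three distinct tags is shared, giving a similarity of 1/3; the second pair is identical, giving 1.0.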
3.3.2. Value-Dependent Tasks with Pluralistic Judgments

Unlike tasks with a single correct answer, some prompts involve inherently value-dependent judgments, where reasonable disagreement arises from
