## ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

Paper Abstract

As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

As AI systems increasingly handle sensitive tasks like healthcare triage and hiring, it is critical to ensure their ethical reasoning is stable and resistant to manipulation. The Ethical Robustness Testing System (ERTS) is a new framework designed to stress-test AI models by deliberately manipulating the ethical framing of decision-making scenarios. Unlike traditional testing, which focuses on raw data or text, ERTS evaluates whether an AI’s ethical judgment remains consistent when the underlying ethical variables of a situation are altered.

How the Framework Works

The core of ERTS is the Ethical Consequence Space (ECS), a 22-dimensional vector representation grounded in established philosophical traditions such as utilitarianism, deontology, and virtue ethics. Each dimension represents a specific ethical factor, such as "harm to others," "fairness impact," or "transparency."
To test a model, ERTS applies 17 different "semantic perturbation functions." These functions simulate real-world adversarial tactics, such as "authority injection" (mimicking government mandates) or "fairness corruption" (introducing systemic bias). To ensure these tests remain realistic, the system enforces six validity constraints, including a "semantic coherence" rule that prevents logically impossible scenarios, such as simultaneously increasing harm and welfare.
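
The perturb-then-validate step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the index layout of the 22 dimensions, the [0, 1] bounds, the `strength` parameter, and all function names are hypothetical, since the summary names only a handful of dimensions and does not define the perturbation functions.

```python
from typing import Callable, List, Optional

Vector = List[float]

ECS_DIM = 22  # dimensionality stated in the source

# Hypothetical index layout: the source names these factors but not their positions.
HARM, FAIRNESS, TRANSPARENCY, WELFARE = 0, 1, 2, 3


def fairness_corruption(ecs: Vector, strength: float = 0.3) -> Vector:
    """Illustrative perturbation: lower the fairness score, clipped to [0, 1]."""
    out = list(ecs)  # copy so the original scenario is left untouched
    out[FAIRNESS] = min(1.0, max(0.0, out[FAIRNESS] - strength))
    return out


def is_in_bounds(ecs: Vector) -> bool:
    """Assumed bound constraint: 22 components, each within [0, 1]."""
    return len(ecs) == ECS_DIM and all(0.0 <= x <= 1.0 for x in ecs)


def is_semantically_coherent(before: Vector, after: Vector, tol: float = 1e-9) -> bool:
    """Reject perturbations that raise harm and welfare at the same time."""
    harm_up = after[HARM] - before[HARM] > tol
    welfare_up = after[WELFARE] - before[WELFARE] > tol
    return not (harm_up and welfare_up)


def apply_perturbation(
    ecs: Vector, perturb: Callable[..., Vector], **kwargs: float
) -> Optional[Vector]:
    """Return the perturbed vector, or None if it violates a validity check."""
    candidate = perturb(ecs, **kwargs)
    if is_in_bounds(candidate) and is_semantically_coherent(ecs, candidate):
        return candidate
    return None
```

Calling `apply_perturbation([0.5] * ECS_DIM, fairness_corruption, strength=0.3)` returns a copy of the scenario with only the fairness component lowered, while a perturbation that increased both `HARM` and `WELFARE` would be rejected with `None`. Only two of the six constraint classes are sketched here; the others are not described in enough detail to reproduce.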

Measuring Ethical Stability

When a model is subjected to these perturbations, ERTS measures the resulting change in its decision using the Ethical Instability Index (EII). The index combines four components: whether the final action flipped, how much the model's confidence dropped, how far the underlying ethical scores shifted, and whether the ranking of preferred outcomes changed. Based on these results, the system produces a domain-adaptive assessment, issuing a "Cleared," "Conditional," or "Failed" verdict tailored to the specific requirements of fields like healthcare or finance.
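
To make the four components concrete, the sketch below combines them into a single score and maps that score to a verdict. The summary does not give the actual EII formula, component weights, or domain thresholds, so the equal weights, the threshold values, the `Decision` structure, and the way each component is computed are all assumptions chosen for illustration. Only the four component names and the three verdict labels come from the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Decision:
    action: str               # the action the model finally chose
    confidence: float         # model confidence in [0, 1]
    scores: Dict[str, float]  # ethical score in [0, 1] per candidate action


# Assumed equal weighting of the four components (not from the source).
WEIGHTS = {"flip": 0.25, "confidence": 0.25, "score": 0.25, "rank": 0.25}

# Hypothetical per-domain thresholds: (max EII for Cleared, max EII for Conditional).
THRESHOLDS = {"healthcare": (0.10, 0.25), "finance": (0.15, 0.30)}


def ranking(d: Decision) -> List[str]:
    """Actions ordered from highest to lowest score, ties broken by name."""
    return sorted(d.scores, key=lambda a: (-d.scores[a], a))


def eii(before: Decision, after: Decision) -> float:
    """Weighted sum of the four instability components, in [0, 1].

    Assumes both decisions score the same set of candidate actions.
    """
    flip = 1.0 if before.action != after.action else 0.0
    conf_drop = max(0.0, before.confidence - after.confidence)
    actions = sorted(before.scores)
    score_shift = sum(abs(before.scores[a] - after.scores[a]) for a in actions) / len(actions)
    rb, ra = ranking(before), ranking(after)
    rank_change = sum(x != y for x, y in zip(rb, ra)) / len(rb)
    return (
        WEIGHTS["flip"] * flip
        + WEIGHTS["confidence"] * conf_drop
        + WEIGHTS["score"] * score_shift
        + WEIGHTS["rank"] * rank_change
    )


def verdict(value: float, domain: str) -> str:
    """Map an EII value to a verdict using the domain's thresholds."""
    cleared, conditional = THRESHOLDS[domain]
    if value <= cleared:
        return "Cleared"
    if value <= conditional:
        return "Conditional"
    return "Failed"
```

For example, if a perturbation changes a decision from `Decision("treat_a", 0.9, {"treat_a": 0.8, "treat_b": 0.6})` to `Decision("treat_b", 0.7, {"treat_a": 0.5, "treat_b": 0.6})`, the action flips, confidence drops by 0.2, the mean score shift is 0.15, and both rank positions change, giving an EII of about 0.5875 under these assumed weights and a "Failed" verdict for the hypothetical healthcare thresholds.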

Key Findings and Model Performance

In an evaluation of 50 ethical scenarios across 8 domains, comprising 1,500 adversarial test cases, researchers tested four structured baseline models and two production LLMs (Gemini 2.0 Flash and Llama 3.2). Only 33% of the six models achieved assessment clearance.
Gemini 2.0 Flash performed best, achieving an Ethical Robustness Score (ERS) of 0.940 with zero failures. In contrast, the locally deployed Llama 3.2 model scored an ERS of 0.737 and proved particularly vulnerable to fairness corruption and information degradation attacks. With a 22% failure rate and 30 critical failures, the Llama model was deemed unsuitable for the tested deployment contexts. These findings suggest that model scale and training methodology play a major role in how reliably an AI maintains its ethical reasoning under pressure.