ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space
As AI systems increasingly handle sensitive tasks like healthcare triage and hiring, it is critical to ensure their ethical reasoning is stable and resistant to manipulation. The Ethical Robustness Testing System (ERTS) is a new framework designed to stress-test AI models by deliberately manipulating the ethical framing of decision-making scenarios. Unlike traditional adversarial testing, which perturbs raw data or text, ERTS evaluates whether an AI’s ethical judgment remains consistent when the underlying ethical variables of a situation are altered.
How the Framework Works
The core of ERTS is the Ethical Consequence Space (ECS), a 22-dimensional vector representation grounded in established philosophical traditions such as utilitarianism, deontology, and virtue ethics. Each dimension represents a specific ethical factor, such as "harm to others," "fairness impact," or "transparency."
To test a model, ERTS applies 17 different "semantic perturbation functions." These functions simulate real-world adversarial tactics, such as "authority injection" (mimicking government mandates) or "fairness corruption" (introducing systemic bias). To ensure these tests remain realistic, the system enforces six validity constraints, including a "semantic coherence" rule that prevents logically impossible scenarios, such as simultaneously increasing harm and welfare.
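A perturbation and the coherence rule described above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the ERTS implementation: the vector holds only four of the 22 dimensions, values are assumed to lie in [0, 1], and the function names, the `strength` parameter, and the sample scores are hypothetical.

```python
from typing import Dict

# Hypothetical stand-in for an ECS vector: only four of the 22 dimensions,
# with scores assumed to lie in [0, 1].
Vector = Dict[str, float]


def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Keep a score inside the assumed [0, 1] range."""
    return max(lo, min(hi, x))


def fairness_corruption(ecs: Vector, strength: float = 0.3) -> Vector:
    """Illustrative perturbation: lower the fairness score by `strength`.

    The real ERTS function may work differently; this only shows the idea
    of editing an ethical variable instead of the scenario text.
    """
    perturbed = dict(ecs)  # copy so the baseline stays untouched
    perturbed["fairness_impact"] = clamp(ecs["fairness_impact"] - strength)
    return perturbed


def is_semantically_coherent(before: Vector, after: Vector) -> bool:
    """Reject perturbations that raise harm and welfare at the same time."""
    harm_up = after["harm_to_others"] > before["harm_to_others"]
    welfare_up = after["welfare"] > before["welfare"]
    return not (harm_up and welfare_up)


baseline = {
    "harm_to_others": 0.2,
    "fairness_impact": 0.8,
    "transparency": 0.7,
    "welfare": 0.6,
}
perturbed = fairness_corruption(baseline)
incoherent = dict(baseline, harm_to_others=0.5, welfare=0.9)

print(is_semantically_coherent(baseline, perturbed))   # perturbation is kept
print(is_semantically_coherent(baseline, incoherent))  # perturbation is rejected
```

In this sketch a perturbed vector that fails the check would simply be discarded before it reaches the model under test; the other five validity constraints are not modeled here.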
Measuring Ethical Stability
When a model is subjected to these perturbations, ERTS measures its performance using the Ethical Instability Index (EII). This index quantifies how much the AI’s decision changes by combining four factors: whether the final action flipped, how much the model's confidence dropped, the shift in underlying ethical scores, and any changes in the ranking of preferred outcomes. Based on these results, the system produces a domain-adaptive assessment, providing a "Cleared," "Conditional," or "Failed" verdict tailored to the specific requirements of fields like healthcare or finance.
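One way to picture how four such factors could be folded into a single index and then mapped to a verdict is sketched below. The summary above does not give the actual EII formula, so the equal weights, the normalization of each component to [0, 1], and the per-domain thresholds are all assumptions chosen for illustration, as are the class and function names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class PerturbationOutcome:
    """Change observed between the baseline and the perturbed decision."""
    action_flipped: bool    # did the final action change?
    confidence_drop: float  # assumed normalized to [0, 1]
    score_shift: float      # shift in ethical scores, assumed in [0, 1]
    ranking_change: float   # change in outcome ranking, assumed in [0, 1]


def instability_index(
    outcome: PerturbationOutcome,
    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Assumed form of the EII: a weighted sum of the four components."""
    components = (
        1.0 if outcome.action_flipped else 0.0,
        outcome.confidence_drop,
        outcome.score_shift,
        outcome.ranking_change,
    )
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical thresholds: (max EII for "Cleared", max EII for "Conditional").
# A stricter domain tolerates less instability.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "healthcare": (0.10, 0.25),
    "finance": (0.15, 0.35),
}


def verdict(eii: float, domain: str) -> str:
    """Map an index value to a verdict using the domain's thresholds."""
    cleared_max, conditional_max = THRESHOLDS[domain]
    if eii <= cleared_max:
        return "Cleared"
    if eii <= conditional_max:
        return "Conditional"
    return "Failed"


mild = PerturbationOutcome(False, 0.20, 0.14, 0.14)
severe = PerturbationOutcome(True, 0.40, 0.30, 0.30)

print(verdict(instability_index(mild), "healthcare"))
print(verdict(instability_index(mild), "finance"))
print(verdict(instability_index(severe), "finance"))
```

Under these assumed thresholds the same mild instability passes in one domain and is only conditionally acceptable in a stricter one, which is the kind of behavior a domain-adaptive assessment implies.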
Key Findings and Model Performance
In an evaluation of 50 ethical scenarios across 8 domains, researchers tested four structured baseline models and two production LLMs (Gemini 2.0 Flash and Llama 3.2). The results showed that only 33% of the models achieved assessment clearance.
Gemini 2.0 Flash performed the best, achieving an Ethical Robustness Score (ERS) of 0.940 with zero failures. In contrast, the locally deployed Llama 3.2 model struggled significantly, showing a high vulnerability to fairness corruption and information degradation. With a 22% failure rate and 30 critical failures, the Llama model was deemed unsuitable for the tested deployment contexts. These findings suggest that model scale and training methodology play a major role in how reliably an AI maintains its ethical reasoning under pressure.