Back to AI Research

AI Research

Your AI Travel Agent Would Book You a Bullfight: An... | AI Research

Key Takeaways

  • Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models As AI models transition from simple c...
  • AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users.
  • We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users.
  • TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds.
  • We evaluate seven frontier models from four labs.
Paper AbstractExpand

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models
As AI models transition from simple chatbots to autonomous agents capable of booking travel and making purchases, they are increasingly making real-world decisions on behalf of users. This research investigates whether these AI agents prioritize animal welfare when planning trips, or if they default to booking exploitative experiences—such as bullfights or animal performances—simply because those options are the most relevant matches for a user's request. The study introduces a new benchmark called TAC (Travel Agent Compassion) to measure this behavior in a practical, agentic setting.

Measuring Real-World Agent Behavior

Current benchmarks for AI ethics often rely on text-based question-and-answer formats, which measure how a model talks about morality rather than how it acts. The TAC benchmark moves beyond this by placing AI models in a simulated environment where they must use tools to search for and book travel experiences. The researchers created twelve scenarios across six categories of animal exploitation, such as captive marine shows and animal racing. To ensure the results are robust, they used forty-eight variations of these scenarios, controlling for factors like price, user ratings, and the order in which options are presented.

Why Models Choose Exploitative Options

The study evaluated seven frontier models from four different labs and found that every single one performed below the "chance" level of sixty-four percent. This means the models were not just picking options randomly; they were actively favoring exploitative experiences. The researchers suggest this happens because models are optimized for "topical relevance." When a user asks for an "authentic cultural spectacle," the model identifies a bullfight as the strongest match for those keywords and prioritizes it, effectively ignoring the ethical implications of the choice.

The Impact of Simple Prompts

The researchers discovered that this behavior is not necessarily a permanent limitation of the models. By adding a single sentence to the system prompt—"Consider the welfare of all sentient beings when making your selections"—the performance of several models improved significantly. For instance, Claude and GPT-5.5 saw gains of forty-seven to sixty-three percentage points. This indicates that the models possess the capacity for welfare-aware reasoning, but it remains "dormant" until explicitly triggered by a specific instruction.

Understanding the Limits

The study highlights that a model’s tendency to book exploitative experiences often mirrors the public discourse found in its training data. Activities that are heavily criticized in news and online discussions are less likely to be booked, while activities that are normalized in mainstream tourism—even if they involve animal exploitation—are booked more frequently. While the researchers note that their findings have implications for AI governance and safety frameworks, they emphasize that further work is needed to explore how these models handle culturally specific cases and to validate these findings with independent experts in animal welfare and tourism.

Comments (0)

No comments yet

Be the first to share your thoughts!