Vulnerability of Natural Language Classifiers to Evolutionary Generated Adversarial Text
Deep learning models are increasingly used to analyze text, but they remain susceptible to "adversarial attacks": small, calculated changes to the input text that cause a model to misclassify it. This paper introduces GAversary, a method that uses a genetic algorithm to generate such adversarial examples. GAversary treats the target model as a "black box": it needs no knowledge of the model's internal structure and relies only on the model's output scores to guide its search, which lets it attack a variety of natural language processing (NLP) models.
How GAversary Works
GAversary functions like an evolutionary process. It maintains a population of potential adversarial examples, which are variations of an original text. The algorithm iteratively improves these examples by selecting the most effective ones, recombining them, and applying mutations.
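The select, recombine, and mutate loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `genetic_attack`, `score_fn`, and `candidates_fn`, the population size, the uniform crossover, the elitism step, and the 0.5 stopping threshold are all hypothetical choices made for the example. The only thing it takes from the description is that the search is guided solely by the model's output score.

```python
import random

def genetic_attack(words, score_fn, candidates_fn,
                   pop_size=20, generations=30, seed=0):
    """Search for a variant of `words` that minimizes `score_fn`, the
    black-box model's confidence in the original label (hypothetical API)."""
    rng = random.Random(seed)

    def mutate(individual):
        # Replace one randomly chosen position with a candidate word.
        child = list(individual)
        pos = rng.randrange(len(child))
        options = candidates_fn(words, pos)  # candidates for the original word
        if options:
            child[pos] = rng.choice(options)
        return child

    def crossover(a, b):
        # Uniform crossover: each position is inherited from one parent.
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    population = [mutate(words) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score_fn)        # lowest confidence first
        best = population[0]
        if score_fn(best) < 0.5:             # assumed binary decision threshold
            return best
        parents = population[: max(2, pop_size // 2)]
        children = [best]                    # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = children
    return min(population, key=score_fn)

# Toy stand-ins for a real classifier and a real candidate generator.
POSITIVE = {"great", "wonderful", "superb"}
SUBSTITUTES = {"great": ["decent", "fine"], "wonderful": ["okay"], "superb": ["solid"]}

def toy_score(ws):
    # Pretend confidence that the text is positive.
    hits = sum(w in POSITIVE for w in ws)
    return min(1.0, 0.3 + 0.3 * hits)

def toy_candidates(original, pos):
    return SUBSTITUTES.get(original[pos], [])

text = "a great film with wonderful acting".split()
adv = genetic_attack(text, toy_score, toy_candidates)
print(" ".join(adv), toy_score(adv))
```

In a real attack, `toy_score` would be replaced by a call to the target model and `toy_candidates` by the replacement-suggestion step described next.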
The key innovation is how it chooses word replacements. While other methods might pick words at random or rely on simple synonym lists, GAversary uses GloVe embeddings, which represent words as vectors learned from co-occurrence statistics, to suggest replacements that are semantically similar to the original word. By masking a word and looking at the surrounding context, the algorithm identifies replacements that are more likely to fit naturally into the sentence while still pushing the model toward an incorrect classification.
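To illustrate the embedding-similarity part of this step, the sketch below ranks candidate replacements by cosine similarity. The three-dimensional vectors are made up for the example and merely stand in for real GloVe vectors; `nearest_words` is a hypothetical helper, and the masking and context step is not modeled here.

```python
import math

# Hand-made vectors standing in for real GloVe embeddings (illustrative only).
TOY_VECTORS = {
    "good":  (0.9, 0.1, 0.0),
    "great": (0.8, 0.2, 0.1),
    "fine":  (0.7, 0.3, 0.0),
    "bad":   (-0.8, 0.1, 0.1),
    "movie": (0.0, 0.9, 0.3),
    "film":  (0.1, 0.8, 0.4),
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_words(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

print(nearest_words("good", TOY_VECTORS))        # ['great', 'fine']
print(nearest_words("movie", TOY_VECTORS, k=1))  # ['film']
```

Note that a pure nearest-neighbor lookup has no notion of sentence context, so on its own it cannot guarantee that a replacement reads naturally in place.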
Performance and Effectiveness
The researchers compared GAversary with established attack methods, such as BAE and A2T, on common datasets such as movie reviews and news articles. The results show that GAversary is highly effective at reducing the accuracy of target models. In its best-performing scenario, GAversary reduced a model's accuracy from 76.8% to 5.8%, whereas BAE reduced it only to 27.6%. This indicates that a genetic algorithm is a powerful tool for identifying vulnerabilities in NLP classifiers.
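One way to put the two results on the same scale is the relative drop in accuracy, that is, the share of originally correct predictions that the attack flips. This derived figure is not reported in the summary above; it is computed here only from the three quoted accuracies, and `relative_accuracy_drop` is a hypothetical helper name.

```python
def relative_accuracy_drop(original_acc, attacked_acc):
    """Fraction of the original accuracy removed by an attack."""
    return (original_acc - attacked_acc) / original_acc

# Accuracies in percent, as quoted for the best-performing scenario.
gaversary_drop = relative_accuracy_drop(76.8, 5.8)
bae_drop = relative_accuracy_drop(76.8, 27.6)

print(f"GAversary: {gaversary_drop:.1%}")  # GAversary: 92.4%
print(f"BAE:       {bae_drop:.1%}")        # BAE:       64.1%
```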
Trade-offs and Considerations
GAversary lowers model accuracy more than competing methods, but this comes with trade-offs. Because the algorithm prioritizes finding the most effective adversarial perturbations, it tends to modify more of the original text, typically changing nearly twice as many words as other methods. The resulting adversarial text also has slightly lower semantic similarity to the original than text produced by other techniques, and the process requires about 5% more run-time. Together, these factors suggest that GAversary is a potent tool for testing model robustness, but that it achieves its high success rate by modifying the text more aggressively.
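The "words changed" trade-off can be measured with a simple perturbation rate. The sketch below assumes whitespace tokenization and word-for-word substitution, so both texts have the same number of words; `perturbation_rate` is a hypothetical helper, and the way the paper itself counts modified words may differ.

```python
def perturbation_rate(original, adversarial):
    """Fraction of word positions that differ between two texts."""
    orig_words = original.split()
    adv_words = adversarial.split()
    if not orig_words or len(orig_words) != len(adv_words):
        raise ValueError("texts must be non-empty and have the same word count")
    changed = sum(o != a for o, a in zip(orig_words, adv_words))
    return changed / len(orig_words)

rate = perturbation_rate(
    "a great film with wonderful acting",
    "a decent film with okay acting",
)
print(f"{rate:.2f}")  # 0.33
```

A lower rate means the adversarial text stays closer to the original, which is the dimension on which the more conservative methods come out ahead.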