Vulnerability of Natural Language Classifiers to Evolutionary Generated Adversarial Text

Key Takeaways

Deep learning models have achieved impressive performance across various fields but remain vulnerable to adversarial inputs, particularly in NLP, where such attacks can have significant real-world consequences.
This paper proposes GAversary, a hybrid Genetic Algorithm (GA) to generate adversarial attacks on natural language models.
The GA is able to treat the target model as a black box, requiring only the logit value output by the model to guide the search.
GAversary differs from GAs previously proposed for this problem by using GloVe embeddings to propose word replacements (the mutation operator) to improve the semantic similarity of the adversarial examples.

Paper Abstract

Deep learning models have achieved impressive performance across various fields but remain vulnerable to adversarial inputs, particularly in NLP, where such attacks can have significant real-world consequences. Adversarial attacks often involve small, semantically similar token replacements to fool NLP models, and recent methods have become more precise by targeting specific vulnerable words, often by exploiting some level of access to the model's internal structure. This paper proposes GAversary, a hybrid Genetic Algorithm (GA) to generate adversarial attacks on natural language models. The GA is able to treat the target model as a black box, requiring only the logit value output by the model to guide the search. GAversary differs from GAs previously proposed for this problem by using GloVe embeddings to propose word replacements (the mutation operator) to improve the semantic similarity of the adversarial examples. GAversary is applied to several benchmark data sets and well-known target models. GAversary is able to substantially reduce the target model's accuracy on test data compared to the BAE and A2T attacks compared against (in the best case, reducing a 76.8% accuracy to 5.8%, compared to BAE's 27.6%). The trade-off is that GAversary perturbs just under twice as many words as the other two methods, with a slightly lower semantic similarity to the original text and around a 5% increase in run-time.

Vulnerability of Natural Language Classifiers to Evolutionary Generated Adversarial Text
Deep learning models are increasingly used to analyze text, but they remain susceptible to "adversarial attacks"—small, calculated changes to input text that cause a model to misclassify its meaning. This paper introduces GAversary, a new method that uses a genetic algorithm to generate these adversarial examples. By treating the target model as a "black box," GAversary can successfully fool various natural language processing (NLP) models without needing to know their internal structure, requiring only the model's output scores to guide its search.

How GAversary Works

GAversary functions like an evolutionary process. It maintains a population of potential adversarial examples, which are variations of an original text. The algorithm iteratively improves these examples by selecting the most effective ones, recombining them, and applying mutations.
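The loop described above can be sketched in a few lines of Python. This is a minimal, generic genetic algorithm and not the paper's implementation: the function name `evolve`, the parameters (`pop_size`, `generations`, `mutation_rate`), the uniform crossover, the elitism step, and the toy scorer and substitution table are all assumptions made for illustration. The only property taken from the source is that the search is guided by a single score returned by a black-box model.

```python
import random
from typing import Callable, Dict, List, Sequence

Tokens = List[str]


def evolve(
    original: Tokens,
    score: Callable[[Tokens], float],       # black-box score for the true class; lower is better for the attacker
    propose: Callable[[str], Sequence[str]],  # hypothetical helper: candidate replacements for one word
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.2,
    seed: int = 0,
) -> Tokens:
    """Return the lowest-scoring variant of `original` found by the search."""
    if pop_size < 2:
        raise ValueError("pop_size must be at least 2")
    rng = random.Random(seed)

    def mutate(tokens: Tokens) -> Tokens:
        out = list(tokens)
        for i, word in enumerate(original):
            if rng.random() < mutation_rate:
                options = list(propose(word))
                if options:
                    out[i] = rng.choice(options)
        return out

    def crossover(a: Tokens, b: Tokens) -> Tokens:
        # Uniform crossover: each position is inherited from one parent at random.
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    # Start from the original text plus mutated copies of it.
    population = [list(original)] + [mutate(original) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=score)
        parents = ranked[: max(2, pop_size // 2)]  # selection: keep the better half
        children = [
            mutate(crossover(*rng.sample(parents, 2))) for _ in range(pop_size - 1)
        ]
        population = [ranked[0]] + children  # elitism: the best candidate survives
    return min(population, key=score)


# Toy stand-ins (assumptions): a "model" that counts positive words, and a fixed substitution table.
SUBS: Dict[str, List[str]] = {"great": ["fine", "decent"], "loved": ["liked", "watched"]}


def toy_score(tokens: Tokens) -> float:
    return float(sum(t in {"great", "loved"} for t in tokens))


def toy_propose(word: str) -> Sequence[str]:
    return SUBS.get(word, [])


if __name__ == "__main__":
    text = "i loved this great film".split()
    print(" ".join(evolve(text, toy_score, toy_propose)))
```

In a real attack, `score` would wrap a call to the target classifier and return its logit for the correct label, and `propose` would be backed by a word-embedding lookup rather than a hand-written table.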
The key innovation is how it chooses word replacements, which is the mutation step of the genetic algorithm. While other methods might pick words at random or rely on simple synonyms, GAversary uses GloVe embeddings (pre-trained word vectors learned from word co-occurrence statistics) to suggest replacements that are semantically similar to the original. By masking a word and looking at the surrounding context, the algorithm identifies replacements that are more likely to fit naturally into the sentence while still pushing the model toward an incorrect classification.
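One common way to turn word embeddings into replacement candidates is a nearest-neighbour lookup by cosine similarity. The sketch below illustrates that idea only; the source does not say exactly how GAversary queries GloVe, so the lookup strategy, the function names `cosine` and `nearest_words`, and the three-dimensional `TOY_VECTORS` table (a stand-in for real GloVe vectors, which have tens to hundreds of dimensions) are all assumptions.

```python
import math
from typing import Dict, List

# Toy vectors standing in for pre-trained GloVe embeddings (values are invented).
TOY_VECTORS: Dict[str, List[float]] = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "fine": [0.7, 0.3, 0.0],
    "bad": [-0.8, 0.1, 0.1],
    "film": [0.0, 0.9, 0.2],
    "movie": [0.1, 0.8, 0.3],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(word: str, vectors: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the k vocabulary words most similar to `word`, excluding `word` itself."""
    if word not in vectors:
        return []  # out-of-vocabulary words get no candidates
    target = vectors[word]
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(target, vectors[w]), reverse=True)
    return others[:k]


if __name__ == "__main__":
    print(nearest_words("good", TOY_VECTORS))       # ['great', 'fine']
    print(nearest_words("film", TOY_VECTORS, k=1))  # ['movie']
```

A function like `nearest_words` could serve as the mutation operator's source of candidates: each mutation swaps a word for one of its close neighbours in embedding space, which keeps the perturbed text near the original in meaning.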

Performance and Effectiveness

The researchers tested GAversary against established attack methods, such as BAE and A2T, using common datasets like movie reviews and news articles. The results show that GAversary is highly effective at reducing the accuracy of target models. In its best-performing scenario, GAversary reduced a model's accuracy from 76.8% down to 5.8%, significantly outperforming the BAE method, which reduced accuracy to 27.6%. This demonstrates that the genetic algorithm approach is a powerful tool for identifying vulnerabilities in NLP classifiers.
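To put the best-case figures side by side, the short calculation below converts the reported accuracies into absolute drops (percentage points) and relative drops. Only the three accuracy values come from the abstract; the helper names `drop_points` and `relative_drop` are illustrative.

```python
# Accuracies (%) reported in the abstract for the best-case scenario.
BASELINE = 76.8   # target model accuracy before any attack
GAVERSARY = 5.8   # accuracy under the GAversary attack
BAE = 27.6        # accuracy under the BAE attack


def drop_points(before: float, after: float) -> float:
    """Absolute accuracy drop in percentage points."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Fraction of the original accuracy removed by the attack."""
    return (before - after) / before


if __name__ == "__main__":
    for name, after in [("GAversary", GAVERSARY), ("BAE", BAE)]:
        print(
            f"{name}: {drop_points(BASELINE, after):.1f} points, "
            f"{relative_drop(BASELINE, after):.1%} relative"
        )
```

Under these figures GAversary removes about 71.0 percentage points of accuracy (roughly 92.4% of the baseline), against about 49.2 points (roughly 64.1%) for BAE.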

Trade-offs and Considerations

While GAversary is more successful at lowering model accuracy than the competing methods, it comes with trade-offs. It modifies more of the original text, perturbing just under twice as many words as BAE and A2T. The resulting adversarial text also has slightly lower semantic similarity to the original, and the attack takes around 5% longer to run. In short, GAversary is a potent tool for testing model robustness, but it achieves its high success rate by being more aggressive with its modifications.
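The "words perturbed" trade-off can be made concrete with a simple count. The sketch below assumes a substitution-only attack, so the original and adversarial texts have the same number of tokens and can be compared position by position; the function name `perturbation_rate` and the whitespace tokenisation are illustrative choices, and the paper's exact metric may differ.

```python
from typing import List


def perturbation_rate(original: List[str], adversarial: List[str]) -> float:
    """Fraction of token positions whose word was changed by the attack."""
    if len(original) != len(adversarial):
        raise ValueError("substitution-only attacks keep the token count unchanged")
    if not original:
        return 0.0
    changed = sum(o != a for o, a in zip(original, adversarial))
    return changed / len(original)


if __name__ == "__main__":
    before = "i loved this great film".split()
    after = "i liked this fine film".split()
    print(perturbation_rate(before, after))  # 0.4 (2 of 5 words changed)
```

An attack that perturbs roughly twice as many words would show about double this rate on the same inputs, which is the cost the summary describes.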
