Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs
This paper addresses the challenge of detecting Personally Identifiable Information (PII) leakage in web and mobile application network traffic. Traditional detection methods often rely on manual labeling and are "locked" into specific, fixed definitions of PII, making them difficult to adapt when privacy regulations or data definitions change. The authors propose a flexible, LLM-based pipeline that can identify and extract PII values based on any privacy taxonomy provided at runtime, effectively removing the need for constant retraining.
A Multi-Stage Annotation Pipeline
The researchers developed a modular system that breaks down the complex task of PII detection into smaller, manageable steps. First, the system performs deterministic pre-processing to normalize HTTP message bodies, such as decoding URL-encoded characters or resolving HTML entities.
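The pre-processing step can be illustrated with a minimal sketch. The function name `normalize_body` and the single-pass decoding order are assumptions made for illustration; the summary does not specify how the authors implement normalization.

```python
import html
from urllib.parse import unquote


def normalize_body(body: str) -> str:
    """Deterministically normalize an HTTP message body (illustrative only).

    Assumption: one pass of percent-decoding followed by one pass of
    HTML-entity resolution. A real pipeline may need content-type checks,
    since entity resolution can misread raw form bodies such as "a=1&copy=2".
    """
    percent_decoded = unquote(body)        # "%40" -> "@"
    return html.unescape(percent_decoded)  # "&lt;" -> "<"
```

Because the step is deterministic, the same input always yields the same normalized text, so later LLM stages see a consistent representation of each value.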
The core annotation process follows a two-stage approach:
1. Label-level classification: The system identifies which PII types (e.g., email, phone number) are present in the message.
2. Instance-level extraction: Once the relevant labels are identified, the system extracts the specific values associated with those labels.
Finally, a review stage acts as a quality control layer, allowing the system to correct potential errors like missed values or incorrect boundaries, ensuring higher accuracy without requiring a full re-run.
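The classification, extraction, and review stages described above could be wired together roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the `Annotation` type, the `annotate` function, and the three stage signatures are hypothetical, and each stage is passed in as a plain callable that would wrap an LLM call in a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Annotation:
    label: str  # PII type taken from the taxonomy supplied at runtime
    value: str  # exact substring of the message carrying that PII


# Hypothetical stage signatures; each would wrap an LLM call in practice.
Classifier = Callable[[str, list[str]], list[str]]
Extractor = Callable[[str, list[str]], list[Annotation]]
Reviewer = Callable[[str, list[Annotation]], list[Annotation]]


def annotate(
    message: str,
    taxonomy: list[str],
    classify: Classifier,
    extract: Extractor,
    review: Reviewer,
) -> list[Annotation]:
    """Run label classification, value extraction, then review."""
    # Stage 1: keep only labels that belong to the runtime taxonomy.
    labels = [label for label in classify(message, taxonomy) if label in taxonomy]
    if not labels:
        return []  # nothing to extract, so later stages are skipped

    # Stage 2: extract values for the detected labels only.
    candidates = extract(message, labels)

    # Assumption: drop values that do not literally occur in the message.
    grounded = [a for a in candidates if a.label in labels and a.value in message]

    # Stage 3: the reviewer may fix boundaries or add missed values.
    return review(message, grounded)
```

Passing the stages as arguments keeps the taxonomy a runtime input and lets each stage be replaced or tested in isolation.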
Enhancing Reliability with LLM Harnessing
To ensure the LLM performs consistently, the authors implemented a "harness" that wraps around the model. This component uses Retrieval-Augmented Generation (RAG) to provide the model with relevant examples of how to handle specific data structures or label types. By retrieving examples that match the structure of the current HTTP message, the system improves its ability to correctly identify PII boundaries. Additionally, the harness includes a validation layer that checks the model's output for structural errors and ensures that all labels used are part of the provided taxonomy, automatically correcting minor variations in formatting or spelling.
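The label check in the validation layer might look like the sketch below. The helper names `normalize_label` and `_key`, the fuzzy-matching approach via Python's `difflib`, and the 0.8 similarity cutoff are all assumptions chosen for illustration; the summary only states that minor formatting or spelling variations are corrected automatically.

```python
import difflib
import re
from typing import Optional


def _key(label: str) -> str:
    """Reduce a label to a comparison key: lowercase, underscores only."""
    return re.sub(r"[^a-z0-9]+", "_", label.strip().lower()).strip("_")


def normalize_label(raw: str, taxonomy: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Map a model-produced label onto the taxonomy, or return None.

    Exact matches (after case and separator normalization) win first;
    otherwise the closest taxonomy entry above `cutoff` is used.
    """
    canonical = {_key(entry): entry for entry in taxonomy}
    key = _key(raw)
    if key in canonical:
        return canonical[key]
    close = difflib.get_close_matches(key, list(canonical), n=1, cutoff=cutoff)
    return canonical[close[0]] if close else None
```

Returning `None` for unknown labels lets the caller decide whether to drop the annotation or ask the model to retry, rather than silently accepting a label outside the taxonomy.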
Synthetic Data for Evaluation
Because real-world PII is sensitive and difficult to use for testing, the authors created a synthetic HTTP traffic generator. This tool uses an LLM to produce realistic, fake network traffic that includes ground-truth PII annotations. This allows developers to test and evaluate their privacy auditing tools in a controlled environment without exposing actual user data. By generating diverse scenarios, the researchers were able to test their pipeline across three different PII taxonomies, demonstrating that the approach is flexible enough to handle varying levels of detail and different privacy domains.
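With ground-truth annotations available, pipeline output can be scored automatically. The sketch below assumes exact-match comparison of (label, value) pairs and reports precision, recall, and F1; the summary does not state which metrics the authors use, so both the metric choice and the `score` function are illustrative.

```python
def score(
    predicted: set[tuple[str, str]],
    truth: set[tuple[str, str]],
) -> dict[str, float]:
    """Compare predicted (label, value) pairs against ground truth.

    Assumption: a prediction counts only if both label and value
    match a ground-truth pair exactly.
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running the same scoring once per taxonomy gives comparable numbers for each, which is one way to check that a pipeline adapts to a different set of labels.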
Key Findings
The study indicates that LLMs can serve as a foundation for privacy auditing. The pipeline identified and extracted PII values under three different taxonomies, which suggests that the system is not tied to a single, static definition of sensitive data. By combining task decomposition, RAG-based prompting, output validation, and synthetic data generation, the authors offer an approach for organizations that need to adapt their privacy monitoring to evolving legal frameworks and changing application requirements.