Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification
Accurate classification of Harmonized Tariff Schedule (HTS) codes is a critical but difficult task in maritime logistics. These codes determine duty rates, underpin regulatory compliance, and feed trade statistics. However, because product descriptions are often vague or incomplete, and because classification must follow strict, hierarchical legal rules, automated systems frequently struggle to achieve high accuracy. This paper introduces an agentic framework that uses Large Language Models (LLMs) to classify products by mimicking the evidence-based reasoning of human experts, rather than relying on single-step predictions.
How the Framework Works
The system functions as a multi-agent workflow that treats HTS classification as a structured, evidence-grounded task. Instead of asking a model to predict a code in a single pass, the framework breaks the process into three steps:
Evidence Gathering: The system uses multi-agent information retrieval to search for details about the product and cross-references them with official tariff documents.
Hierarchical Reasoning: Because HTS codes are built in layers, from broad chapters down to specific statistical suffixes, the model validates each level of the code to ensure it remains consistent with the overall hierarchy.
Consensus-Based Validation: The framework uses "element-wise voting" and self-consistency checks. By comparing multiple reasoning paths, the system can identify when its own reasoning is in conflict.
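As a rough illustration of how element-wise voting could respect the code hierarchy, the Python sketch below votes on one level at a time and keeps only the candidates that agree with the winning prefix. It is not the paper's implementation: the function name element_wise_vote, the agreement ratios it returns, and the sample digit strings are assumptions made for illustration, and the 2/4/6/8/10-digit level split is assumed for a 10-digit code.

```python
from collections import Counter

# Cumulative prefix lengths assumed for a 10-digit code.
LEVELS = [
    ("chapter", 2),
    ("heading", 4),
    ("subheading", 6),
    ("tariff_item", 8),
    ("statistical_suffix", 10),
]


def element_wise_vote(candidates):
    """Pick a code by majority vote at each level, top down.

    `candidates` holds one 10-digit string per reasoning path.
    Returns the winning code and, per level, the share of all
    well-formed candidates that agree with the winning prefix.
    """
    pool = [c for c in candidates if len(c) == 10 and c.isdigit()]
    if not pool:
        raise ValueError("no well-formed 10-digit candidates")

    total = len(pool)
    agreement = {}
    for name, length in LEVELS:
        counts = Counter(c[:length] for c in pool)
        winner, votes = counts.most_common(1)[0]
        agreement[name] = votes / total
        # Only candidates consistent with the winning prefix may vote
        # on deeper levels, so the result is always a code that some
        # reasoning path actually proposed.
        pool = [c for c in pool if c[:length] == winner]
    return pool[0], agreement


# Illustrative digit strings only, not verified classifications.
paths = ["8471300010", "8471300010", "8471300090", "8471410010", "8517620000"]
code, agreement = element_wise_vote(paths)
print(code, agreement)
```

Because each level only counts candidates that survived the previous one, the agreement ratio can only stay flat or fall with depth. A low ratio at the tariff-item or suffix level is one possible signal that the reasoning paths conflict.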
Managing Uncertainty and Human Oversight
A key feature of this framework is its ability to recognize its own limitations. The system calculates a confidence score for its predictions. If the model determines that the product description is too ambiguous or the legal requirements are too complex to resolve with high certainty, it triggers a "human-in-the-loop" escalation. In these cases, the system generates specific questions for a human user, asking for the missing attributes (such as material composition or intended use) needed to finalize the correct code. Escalating in this way keeps the process accountable and auditable.
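The escalation rule described above can be sketched as a simple threshold check. This is a minimal sketch under stated assumptions, not the authors' method: the Decision type, the decide function, the question templates, and the 0.7 threshold are all hypothetical, and the source does not say how the confidence score is computed.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical question templates keyed by missing attribute.
QUESTION_TEMPLATES = {
    "material_composition": "What is the material composition of the product?",
    "intended_use": "What is the intended use of the product?",
}


@dataclass
class Decision:
    proposed_code: str
    confidence: float
    escalate: bool
    questions: List[str] = field(default_factory=list)


def decide(proposed_code, confidence, missing_attributes, threshold=0.7):
    """Accept the code when confidence is high enough, else escalate.

    The threshold value is an assumption chosen for illustration.
    """
    if confidence >= threshold:
        return Decision(proposed_code, confidence, escalate=False)

    # Turn each missing attribute into a targeted question for the reviewer.
    questions = [
        QUESTION_TEMPLATES.get(attr, f"Please provide: {attr}")
        for attr in missing_attributes
    ]
    if not questions:
        questions = ["Please confirm or correct the proposed code."]
    return Decision(proposed_code, confidence, escalate=True, questions=questions)


result = decide("8471300010", 0.4, ["material_composition", "intended_use"])
for question in result.questions:
    print(question)
```

Keeping the proposed code and its confidence on the returned record, even when escalating, gives a reviewer the full context and leaves a trail that can be audited later.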
Key Findings and Performance
The researchers tested the framework on a private dataset of 3,300 expert-labeled Canadian HTS records. The results show that even advanced LLMs find 10-digit HTS classification a significant challenge. The study observed a clear trend: models are generally better at predicting broad categories (like chapters), but their accuracy drops as they move toward the more granular tariff items and statistical suffixes.
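One way to measure accuracy at each level of the hierarchy is exact match on the code prefix. The sketch below assumes that metric; the summary does not state how the paper scores each level, the function name prefix_accuracy is hypothetical, and the toy predictions are made up and are not results from the study.

```python
# Cumulative prefix lengths assumed for a 10-digit code.
LEVELS = [
    ("chapter", 2),
    ("heading", 4),
    ("subheading", 6),
    ("tariff_item", 8),
    ("statistical_suffix", 10),
]


def prefix_accuracy(predictions, labels):
    """Share of records whose predicted prefix matches the label, per level."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty lists of equal length")
    scores = {}
    for name, length in LEVELS:
        hits = sum(p[:length] == g[:length] for p, g in zip(predictions, labels))
        scores[name] = hits / len(labels)
    return scores


# Toy data for illustration only.
predicted = ["8471300010", "8471300090", "8471410010", "8517620000"]
expected = ["8471300010"] * 4
print(prefix_accuracy(predicted, expected))
```

Under prefix matching, a record that is wrong at one level is also wrong at every deeper level, so accuracy can never rise with depth. The size of the drop between levels is the informative part.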
These findings suggest that fully autonomous, single-step AI classification is risky for customs compliance. Instead, the authors argue that the industry should shift toward workflows that prioritize evidence-based reasoning, uncertainty detection, and human collaboration to ensure that trade documentation remains accurate and legally sound.