A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions
The Harmonized System (HS) is the global standard for classifying traded goods, but assigning the correct code is a complex, high-stakes task. Experts must navigate competing rules regarding a product's material, form, and function, often while adhering to strict legal notes. This paper addresses why standard AI models struggle with this task: they often focus on one aspect of a product while ignoring the complex, hierarchical priority rules required by customs law. The authors introduce a "deterministic agentic workflow" that replaces open-ended AI planning with a fixed, step-by-step process to ensure accurate and legally defensible classifications.
Why Standard AI Models Fail
The primary challenge in HS classification is "multi-dimensional rule reasoning." A product might be made of plastic (material), be in the form of a film (form), and be used for a phone screen (function). Customs rules often dictate that one of these factors must take priority over the others. When large language models are asked to classify a product in one go, they frequently resolve one dimension correctly but ignore the priority constraints of the others. Furthermore, these models often lack access to the specific, structured legal text required to make a correct decision, leading them to fabricate codes that do not exist.
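To make the priority problem concrete, here is a minimal sketch of how a priority rule picks one dimension over the others. Everything in it is an assumption for illustration: the `Candidate` type, the `resolve` function, the placeholder heading labels, and the priority order are hypothetical and are not taken from the paper or from the HS legal text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    dimension: str  # "material", "form", or "function"
    heading: str    # placeholder label, NOT a real HS heading


def resolve(candidates, priority):
    """Return the candidate whose dimension ranks earliest in `priority`."""
    rank = {dim: i for i, dim in enumerate(priority)}
    eligible = [c for c in candidates if c.dimension in rank]
    if not eligible:
        raise ValueError("no candidate matches a prioritized dimension")
    return min(eligible, key=lambda c: rank[c.dimension])


# The phone-screen film example: one candidate heading per dimension.
candidates = [
    Candidate("material", "HEADING_PLASTICS"),
    Candidate("form", "HEADING_FILM"),
    Candidate("function", "HEADING_PHONE_PARTS"),
]

# Hypothetical priority order; in practice the applicable notes decide this.
chosen = resolve(candidates, priority=["function", "material", "form"])
print(chosen.dimension, chosen.heading)
```

A model that classifies in one pass may find all three candidates correctly and still return the wrong one, because the choice depends on the priority order and not on any single dimension.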
A Fixed, Step-by-Step Workflow
Instead of allowing an AI to decide its own path, the authors created a rigid, six-stage pipeline that mirrors the structure of the HS tariff itself. The process begins by extracting key product attributes, then moves through candidate retrieval, shortlisting, and deep ranking based on specific chapter and section notes. Because the model must follow a fixed sequence, moving from the chapter level down to the subheading, every decision is grounded in the correct legal context. This design makes the AI’s reasoning "interpretable by construction": the system provides verbatim citations from legal notes for every classification it makes.
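The idea of a fixed sequence can be sketched as a tuple of stage functions applied in a hard-coded order, with each stage recorded in a trace. This is only an illustration under stated assumptions: it covers the four stages named above, not the full six; the stage bodies are stubs; and the keyword index, the `CODE_A`-style labels, and the note text are hypothetical placeholders, not the authors' implementation.

```python
def extract_attributes(state):
    # Stub: a real system would use a model; here we split on whitespace.
    state["attributes"] = state["description"].lower().split()
    return state


def retrieve_candidates(state):
    # Hypothetical toy index from keywords to placeholder codes.
    index = {"plastic": ["CODE_A", "CODE_B"], "film": ["CODE_B", "CODE_C"]}
    found = []
    for attr in state["attributes"]:
        for code in index.get(attr, []):
            if code not in found:
                found.append(code)
    state["candidates"] = found
    return state


def shortlist(state):
    # Stub: keep the first two candidates.
    state["shortlist"] = state["candidates"][:2]
    return state


def deep_rank(state):
    # Hypothetical note store; codes backed by a note are ranked first,
    # and the note text is returned verbatim as the citation.
    notes = {"CODE_B": "Placeholder note text supporting CODE_B"}
    state["ranking"] = sorted(state["shortlist"], key=lambda c: c not in notes)
    state["citations"] = [notes[c] for c in state["ranking"] if c in notes]
    return state


# The order is fixed in code; the model never chooses the next step.
PIPELINE = (extract_attributes, retrieve_candidates, shortlist, deep_rank)


def classify(description):
    state = {"description": description, "trace": []}
    for stage in PIPELINE:
        state = stage(state)
        state["trace"].append(stage.__name__)
    return state


result = classify("Plastic film for phone screen")
print(result["ranking"], result["citations"], result["trace"])
```

Because the order lives in `PIPELINE` and not in a model's plan, two runs on the same input visit the same stages, and the trace and citations can be inspected afterward.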
Performance and Accuracy
The researchers evaluated their workflow on the HSCodeComp benchmark. Using the Qwen3.6-plus model, the system achieved 64.2% top-1 accuracy at the six-digit level. Notably, even a smaller, open-weight 27B-class model achieved results closely aligned with those of larger frontier models. This suggests that the system's accuracy comes largely from the structured, deterministic workflow rather than solely from the raw reasoning power of a single massive AI model.
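For readers unfamiliar with the metric, top-1 accuracy at the six-digit level can be computed roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the benchmark's actual scoring script may normalize codes differently, and the codes in the example are arbitrary digit strings, not real classifications.

```python
def six_digit(code):
    """Strip punctuation and keep the first six digits of a code string."""
    digits = "".join(ch for ch in code if ch.isdigit())
    if len(digits) < 6:
        raise ValueError(f"code has fewer than six digits: {code!r}")
    return digits[:6]


def top1_accuracy(predictions, labels):
    """Fraction of items whose top prediction matches the label at six digits."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(six_digit(p) == six_digit(y) for p, y in zip(predictions, labels))
    return hits / len(labels)


# Arbitrary digit strings for illustration only.
predictions = ["1234.56.78", "123456", "654321"]
labels = ["123456", "1234.57", "6543.21.00"]
accuracy = top1_accuracy(predictions, labels)
print(f"{accuracy:.3f}")
```

Only the first six digits are compared, so longer national extensions on either side do not affect the score.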
Insights from Manual Audits
A significant finding emerged during a manual audit of 226 cases where the system disagreed with the benchmark’s ground-truth labels. The authors discovered that a non-trivial portion of the existing benchmark labels appeared to deviate from the official HS general rules. By releasing their full adjudication records, the authors provide a resource for the community to review these findings, highlighting that even expert-level benchmarks may contain errors that require careful, rule-based verification.