What the paper is about
Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise knowledge bases, context clearing, and cross-model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long-term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50-round question-answering setting, we compare signal-to-noise ratios of 1:5 and 1:200+, and distinguish CSV auto-judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.
What it covers
A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

Authors: Zhao Yang, Wang Huan, Li Yingshuo, Tu Haomiao, Lin Hujite

Affiliations:
1. School of Electronic Information, Zhongshan Institute, University of Electronic Science and Technology of China, No. 1 Xueyuan Road, Zhongshan 528402, Guangdong, China
2. Changchun Kelaile Technology Co., Ltd, Qinglong Road, Luyuan District, Changchun 130000, Jilin, China

Corresponding author: Wang Huan. Zhao Yang and Wang Huan contributed equally as co-first authors.

Abstract

Large language models (LLMs) often suffer from fact loss, timeline confusion, persona continuity drift, and reduced stability during long-range interactions, especially under high-noise knowledge bases and cross-model transfer settings. To address these issues, we introduce ARPM (Analysis-Based Role-Playing with Memory), an external temporal memory governance framework for long-term dialogue. ARPM physically separates static knowledge memory from dynamic dialogue experience memory. The framework employs a multi-stage pipeline: a retrieval layer that combines vector retrieval, BM25, and RRF fusion; a ranking layer that introduces two temporal coordinates, physical time and dialogue round; and a reading layer that preserves the original semantics and source information of retrieved evidence while organizing candidate content chronologically. This allows the model to access the date of the most recent round and relative round cues. At the generation layer, a controlled <analysis> protocol is used to verify, rerank, and bind candidate evidence to the answer.
Unlike approaches that encode persona consistency directly into model weights, ARPM treats long-term continuity as a traceable, auditable, and transferable external governance problem. Using complete engineering logs, we construct three types of experiments. First, under the same 50-round structured question-answering setting, we compare two signal-to-noise ratio conditions, 1:5 and 1:200+, while distinguishing between raw CSV auto-judgment and manual review criteria. Under the 1:5 condition, the original CSV rolling recall accuracy is 54.0%, whereas manual review raises it to 100.0%. Under the 1:200+ condition, the corresponding values are 44.0% and 80.0%, respectively. These results indicate that automatic rules substantially underestimate truly effective recall when evidence has entered the Prompt and is correctly used by the model. Second, the ablation study shows that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%. Disabling BM25 reduces strict accuracy to 80.0%, indicating that pure semantic retrieval is insufficient for long-chain correction and precise tracing. The main role of the dual-temporal mechanism lies in temporal organization and anomaly suppression, rather than in the surface-level correctness of single-turn answers. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff conditions, ARPM maintains relatively high semantic continuity, boundary continuity, and persona consistency across multiple general-purpose model stages, while explicitly exposing boundaries such as insufficient protocol compliance in small models, frame of reference drift, and theatrical drift in dedicated role-playing models. These findings suggest that ARPM’s advantage does not lie in “tuning” a single model into a particular role. 
Rather, by combining heterogeneous external memory, dual-temporal reranking, temporal evidence unfolding, and analysis-driven evidence verification, ARPM provides a set of transferable continuity conditions for different foundation models. This study demonstrates that long-term persona consistency can be decomposed into engineering components and evaluated in a white-box manner, without relying entirely on the parametric memory of a single model or dedicated role fine-tuning. Systematic memory governance offers a feasible path toward transferable and stable interaction with large language models.

Keywords: large language models; external memory governance; dual-temporal reranking; high-noise retrieval; persona consistency; cross-model transfer

1 Introduction

As applications of large language models (LLMs) move from single-turn instruction following and question answering toward long-term agents, personalized assistants, and social chatbot systems, the primary system bottleneck is no longer limited to the parameter scale of the foundation model. It increasingly lies in whether a stable, transferable, and precisely retrievable external memory layer can be built around user interactions, project backgrounds, and dynamic experiences. The technical foundations of modern LLMs are commonly traced to the Transformer architecture, whose self-attention mechanism substantially improved sequence modeling, contextual representation, and large-scale parallel training capabilities [11]. Yet the ability to accept longer contexts does not necessarily imply stable use of key evidence within those contexts. Prior work has shown that the position of relevant information in long contexts can substantially affect model reading performance; when evidence appears in the middle of the context, performance may decline markedly [16].
This issue is especially salient in long-term companionship, personalized assistant, and complex collaboration scenarios, where consistency in facts, time, identity, and boundaries is challenged simultaneously: the foundation model may be replaced, the context window may be periodically cleared, the external knowledge base may contain large amounts of noise, and users may continually revise previously stated facts. Without appropriate external memory governance, the system is prone to old facts overriding new facts, static knowledge interfering with recent experiences, timeline misalignment, and partial semantic hits that still lead to globally inaccurate answers.

Retrieval-augmented generation (RAG) effectively alleviates hallucinations in static knowledge scenarios [1] and has developed into a technical framework centered on the coordinated optimization of retrieval, augmentation, and generation [23]. Before RAG, open-domain question answering had already established a “retrieval–reading comprehension” paradigm. For example, DrQA combined Wikipedia retrieval with machine reading comprehension to answer open-domain factual questions [24]. REALM further introduced a retriever into language model pretraining, enabling explicit document retrieval for external knowledge support [12]. DPR showed that dense dual-encoder retrieval can effectively replace traditional sparse retrieval in open-domain question answering and improve candidate passage recall [13]. However, these lines of work still typically optimize for whether a single question can be answered correctly, and pay relatively limited attention to two core challenges in long-term continuous interaction. First, memory types are inherently heterogeneous: objective knowledge, chat experiences, user corrections, style boundaries, and task states operate at different temporal scales and reliability levels.
Second, evidence use before and after generation can diverge: in real systems, even when supporting evidence has already entered the Prompt, automatic scoring rules may still mark a round as a “retrieval error” because of a top-1 miss, field-matching failure, or overly strict criteria, thereby underestimating the system’s actual capability. The problem is therefore not merely whether a particular model resembles a certain persona. It is whether, under the simultaneous presence of a noisy knowledge base, context discontinuity, and model replacement, the system can continue the same narrative, the same set of key experiences, and the same interaction boundaries. To address this problem, this paper proposes ARPM. Rather than relying on the parametric memory of a single model for long-term continuity, ARPM decomposes the problem into four governable components: heterogeneous dual-source memory decoupling, dual-temporal ranking, analysis-driven evidence verification, and white-box logging with manual review. In this formulation, the system’s “memory” is no longer a single hit in a vector database, but a complete closed loop from candidate retrieval, evidence entry, and generation constraints to post-hoc auditing. Building on this methodological formulation, the paper implements the main line of “heterogeneous decoupling + dual-temporal modeling + metacognitive gate” through reproducible engineering logs and quantifiable experimental design. The method is therefore not only a conceptual framework, but also a traceable and testable process for continuity governance. Rather than making the abstract claim that “the system is more human-like,” this study uses real data to answer three specific questions: (1) Can ARPM go beyond mechanical top-1 rules and complete secondary utilization after evidence enters the Prompt? 
(2) Without semantic compression or graph-based rewriting, does unfolding retrieved evidence in chronological order and explicitly exposing the date of the most recent round help the model establish a more stable frame of reference and improve its analysis process? (3) Can long-term continuity transfer across models with the support of external memory and temporal anchoring, and what clearly observable boundaries remain?

1.1 Main Contributions

The contributions of this work are fourfold:

1. A heterogeneous dual-source external memory governance framework. This paper physically separates static knowledge memory from dynamic dialogue experience memory and merges them at the candidate retrieval stage. This structure alleviates retrieval competition among static knowledge, old experiences, and recent facts during long-range interactions.
2. A coordinated mechanism for dual-temporal ranking, temporal evidence unfolding, and hybrid retrieval. By jointly modeling physical time and dialogue round, this paper combines vector retrieval, BM25, and RRF fusion ranking. When candidate evidence enters the Prompt, the framework preserves the original semantics and source information, organizes retrieved content chronologically, and explicitly exposes the date of the most recent round and relative round cues. This enables the system to obtain more stable evidence candidates and a clearer frame of reference under high-noise knowledge bases, recent fact recovery, and time-sensitive follow-up questions.
3. An analysis-driven evidence verification mechanism. This paper defines the <analysis> protocol as a controlled verification and binding mechanism after retrieval and before generation, rather than as an open-ended long Chain-of-Thought. This mechanism helps explain why, in some rounds, automatic rules fail to hit while the evidence that has entered the Prompt is still correctly used by the model.
4. A cross-model continuity validation process.
This paper conducts multi-stage handoff experiments under high-noise, context-clearing, and foundation-model switching conditions. The results show that long-term persona consistency is not fully bound to the weights of a single model, but can achieve relatively strong cross-model transfer through the joint effect of external memory, temporal anchoring, and generation constraints, while explicitly exposing its capability boundaries.

2 Related Work

2.1 Retrieval-Augmented Generation and Long-Dialogue Retrieval Problems

Conventional RAG primarily targets static knowledge question answering, with optimization focused on improving external fact retrieval and answer correctness [1]. In the broader development of open-domain question answering, early systems such as DrQA combined large-scale document retrieval with reading comprehension, establishing the basic “retrieve first, then read” paradigm [24]. REALM introduced retrieval augmentation into language model pretraining, making external knowledge access part of the model’s capability [12]. DPR further improved candidate passage recall in open-domain question answering through dense dual-encoder representations [13]. In recent years, RAG has evolved from simple retrieval-augmented question answering into more complex system forms, including Advanced RAG and Modular RAG [23]. However, when the task shifts to long-term dialogue, the central challenge is no longer merely whether a relevant document can be retrieved, but which type of memory should be prioritized at the current moment. If static knowledge, recent user experiences, historical task states, and role-boundary texts are mixed into the same index, the system is prone to retrieval competition: older knowledge that appears semantically closer may override a newly corrected user fact.
Term-matching methods such as BM25 remain effective for hard fact localization [2], while ranking fusion methods such as RRF can integrate heterogeneous retrieval signals [3]. Building on this line of work, this paper further emphasizes that the challenges of long-term memory systems arise not only at retrieval time, but also propagate into evidence-usage bias during the reading and generation stages.

2.2 Memory-Augmented Large Language Models

To address long-context limitations, prior work has proposed various external memory mechanisms, including hierarchical memory management, multi-level paging, event extraction, temporal decay, and dialogue summarization [4, 5, 6, 7, 8]. These methods demonstrate the feasibility of externalizing memory. However, in scenarios involving long-term persona and temporal continuity, a single timeline and a mixed index are often still insufficient. Moreover, long context itself does not necessarily constitute reliable memory: previous studies have shown that language models are sensitive to the position of evidence in long contexts. Relevant information is typically easier to use when it appears at the beginning or end of the context, whereas performance may decline when it appears in the middle [16]. This indicates that long-term memory systems need to address not only indexing and retrieval, but also how candidate evidence is organized, ranked, and read in the Prompt. Under conditions such as user-corrected facts, cross-day dialogue, and periodic context clearing, relying solely on similarity-based retrieval can easily lead to frame of reference misalignment. ARPM differs in that it simultaneously introduces two temporal dimensions, physical time and dialogue round, and uses them for ranking competition and evidence organization in the reading phase, rather than treating them merely as metadata for post-hoc archiving.
2.3 Persona Consistency and Cross-Model Continuity

Research on persona consistency generally follows two paths. One line of work solidifies role style and boundaries into model weights through supervised fine-tuning or preference optimization. The other relies on a system prompt or setting document for continuous control [9, 10]. In open-domain dialogue research, works such as PersonaChat use persona profile-conditioned generation, enabling chatbots to produce more specific and consistent responses around given identity information [9]. Profile Consistency Identification further focuses on identifying persona-profile consistency in open-domain dialogue [10]. In addition, some studies formulate dialogue consistency as a natural language inference problem and construct the Dialogue NLI dataset to determine whether persona settings and model responses stand in entailment, contradiction, or neutrality relations [21]. From the broader perspective of social chatbot research, user engagement, emotional connection, and persona continuity in long-term interaction have long been important goals in dialogue system design [25]. However, methods based on role fine-tuning have limited transferability when the foundation model changes, while methods based on prompts or setting documents tend to degrade as dialogue length increases and context is cleared. In contrast, ARPM addresses the problem of external continuity governance: whether the system can maintain the same fact chain, task chain, and role boundary chain after model replacement. This paper separates “remembering” from “being like the same persona” and argues that long-term persona consistency is not a single memory accuracy score, but the joint outcome of fact retrieval, temporal judgment, protocol compliance, language style, and boundary behavior.

3 Method

3.1 Overall Design Objective

ARPM is not intended to make the model “better at performing” across all scenarios.
Instead, it decomposes long-term continuous interaction into an engineering process that is governable, observable, and reproducible. The core system consists of four components: heterogeneous dual-source memory decoupling, dual-temporal ranking and temporal evidence unfolding, analysis-driven evidence verification, and white-box log write-back. The basic workflow is shown in Figure 1. A user query first undergoes lightweight query augmentation and is then routed separately to the knowledge memory route and the experience memory route. Candidates from the two routes are merged under dual-temporal constraints, and the retrieved content is organized chronologically while preserving its original semantics. After entering the Prompt, the evidence is verified and bound to the answer through a controlled protocol. Finally, the response and intermediate states are written into structured logs.

Figure 1: Overview of the ARPM method: heterogeneous dual-source retrieval, dual-temporal ranking, temporal evidence unfolding, and the analysis-driven evidence verification pipeline.

3.2 Heterogeneous Dual-Source Memory Decoupling

ARPM divides external memory into two types. The first is knowledge memory, which stores relatively stable, low-frequency objective knowledge or task background. The second is experience memory, which stores high-frequency user facts, interaction states, and local topic chains. Knowledge memory adopts a parent-child chunk structure: child chunks are used for fine-grained semantic matching, while parent chunks provide more complete context slices to the generation stage. Experience memory directly supports recent topic continuation and the recovery of user-corrected facts. This physical separation does not mean that the two routes never interact; rather, it prevents them from mixing prematurely during retrieval.
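The parent-child chunk structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that child hits arrive already ranked and that each child stores the id of its parent. The names `ChildChunk`, `expand_to_parents`, and `limit` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildChunk:
    child_id: str
    parent_id: str  # id of the larger parent chunk this child was cut from
    text: str


def expand_to_parents(ranked_children, parents, limit=3):
    """Map ranked child hits to unique parent texts, preserving rank order.

    ranked_children: ChildChunk objects, best match first.
    parents: dict mapping parent_id -> full parent text.
    limit: maximum number of parent chunks handed to generation (assumed knob).
    """
    seen = set()
    selected = []
    for child in ranked_children:
        if len(selected) >= limit:
            break
        if child.parent_id in seen:
            continue  # another child of this parent already matched
        seen.add(child.parent_id)
        selected.append(parents[child.parent_id])
    return selected


# Hypothetical data: two children point at the same parent.
parents = {
    "p1": "Full passage about the user's project deadline and its revisions.",
    "p2": "Full passage about the assistant's agreed interaction boundaries.",
}
hits = [
    ChildChunk("c1", "p1", "deadline moved to Friday"),
    ChildChunk("c2", "p1", "original deadline was Monday"),
    ChildChunk("c3", "p2", "no medical advice"),
]
context = expand_to_parents(hits, parents)
```

Matching on small children keeps the semantic comparison precise, while returning the deduplicated parents gives the generation stage the fuller context slice that the text attributes to parent chunks.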
The knowledge route introduces hybrid retrieval with BM25 and vector retrieval [2], followed by RRF fusion ranking to handle high-noise factual question answering [3]. The vector retrieval route follows the same basic idea as dense retrieval methods such as DPR: candidate evidence is obtained through nearest-neighbor search in a semantic vector space [13]. In the engineering implementation, FAISS provides a mature tool foundation for large-scale vector similarity search, index construction, and approximate nearest-neighbor retrieval [15]. The experience route remains sensitive to colloquial short references, local topic chains, and recent fact updates. What ultimately enters the Prompt is not a single top-1 result, but a combination of candidates after reranking in the two routes.

To keep the method description from remaining purely conceptual, the main formulas corresponding to the current engineering implementation are given below. The knowledge route first computes the vector inner-product score:

$$s_{\mathrm{vec}}(q,d)=q\cdot d \tag{1}$$

It then computes a BM25+-style keyword score:

$$s_{\mathrm{BM25}}(q,d)=\sum_{t\in q}\operatorname{idf}(t)\cdot\left[\frac{\operatorname{tf}(t,d)\,(k_{1}+1)}{\operatorname{tf}(t,d)+k_{1}\left(1-b+b\cdot|d|/\operatorname{avgdl}\right)}\right] \tag{2}$$

where

$$\operatorname{idf}(t)=\max\left(0,\log\frac{N-\operatorname{df}(t)+0.5}{\operatorname{df}(t)+0.5}\right)+\delta.$$

This implementation uses a BM25 variant with positively biased IDF, denoted as a BM25+-style score.
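As a rough illustration of Eq. (2), the following sketch scores tokenized documents with the floored-and-biased IDF defined above. It is a minimal sketch rather than the paper's code: the parameter values `k1=1.5`, `b=0.75`, and `delta=0.25` are assumptions (the source does not report them), the function name is hypothetical, and the corpus is assumed to be non-empty and pre-tokenized.

```python
import math
from collections import Counter


def bm25_plus_style_scores(query_tokens, docs_tokens, k1=1.5, b=0.75, delta=0.25):
    """Score every tokenized document against the query following Eq. (2).

    k1, b, and delta are assumed values, not taken from the paper.
    """
    n_docs = len(docs_tokens)
    avgdl = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))

    def idf(term):
        n_t = df.get(term, 0)
        # Floor the classic IDF at 0, then add the positive bias delta.
        return max(0.0, math.log((n_docs - n_t + 0.5) / (n_t + 0.5))) + delta

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # Length normalization term k1 * (1 - b + b * |d| / avgdl).
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query_tokens:
            freq = tf.get(term, 0)
            score += idf(term) * freq * (k1 + 1) / (freq + length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus.
docs = [
    ["user", "likes", "green", "tea"],
    ["weather", "report", "rain"],
    ["user", "moved", "to", "berlin"],
]
scores = bm25_plus_style_scores(["green", "tea"], docs)
```

Because the IDF is floored at zero before `delta` is added, a term that occurs in most documents still contributes a small positive weight instead of a negative one, which is the "positively biased IDF" behaviour the text describes.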
Vector retrieval and BM25 retrieval are fused through Reciprocal Rank Fusion:

$$s_{\mathrm{RRF}}(d)=\sum_{i}\frac{1}{k_{\mathrm{rrf}}+\operatorname{rank}_{i}(d)+1} \tag{3}$$

The base score of the knowledge route can be written as:

$$s_{\mathrm{kb}}^{0}(d)=\operatorname{norm}(s_{\mathrm{RRF}}(d))+b_{\mathrm{user}}+b_{\mathrm{character}}+b_{\mathrm{source}} \tag{4}$$

The dialogue history route uses normalized cosine similarity as the semantic base score:

$$s_{\mathrm{sem}}(q,d)=\operatorname{clip}(\cos(q,d),0,1) \tag{5}$$

$$s_{\mathrm{chat}}^{0}(d)=s_{\mathrm{sem}}(q,d)+b_{\mathrm{session}}+b_{\mathrm{user}}+b_{\mathrm{character}} \tag{6}$$

3.3 Lightweight Query Augmentation, Dual-Temporal Ranking, and Temporal Evidence Unfolding

In real logs, ARPM applies lightweight role-prefix enhancement to the query, namely:

$$q^{\prime}=[\text{User }u][\text{Assistant }a]+q$$

This design does not perform the main task of “understanding.” Its purpose is to provide the minimum necessary interaction-role information for the knowledge route and the experience route. The more critical processing occurs in the ranking and reading stages. The system maintains two temporal coordinates for each memory chunk: absolute physical time and relative dialogue round. The former indicates when an event occurred, while the latter indicates how far it is from the current topic. ARPM performs dual-temporal reranking separately in the two routes. In the knowledge route, retrieval fusion and role weighting are completed first, followed by temporal weighting. In the dialogue history route, semantic matching and role weighting are completed first, followed by temporal weighting.
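The Reciprocal Rank Fusion step in Eq. (3) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: ranks are taken to be 0-based (which is what the extra `+ 1` in the denominator suggests), the default `k_rrf=60` mirrors the RRF_k value reported in the experimental configuration, and the function and variable names are hypothetical. The normalization and bias terms of Eq. (4) are not reproduced because the source does not specify them.

```python
from collections import defaultdict


def rrf_fuse(rankings, k_rrf=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    rankings: list of lists of doc ids, each ordered best first.
    Returns (doc_id, fused_score) pairs sorted by descending fused score.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):  # rank is 0-based (assumption)
            fused[doc_id] += 1.0 / (k_rrf + rank + 1)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidate lists from the two retrievers.
vector_ranking = ["d1", "d2", "d3"]
bm25_ranking = ["d3", "d1", "d4"]
fused = rrf_fuse([vector_ranking, bm25_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists accumulate two reciprocal terms, so `d1` and `d3` outrank `d2` and `d4` even though each of the latter holds a reasonable position in a single list. Only rank positions are used, which is why the fusion does not need the raw vector and BM25 scores to share a scale.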
The round decay term and physical-time decay term are defined as follows:

$$w_{\mathrm{round}}(d)=\exp\left(-\frac{|r_{\mathrm{current}}-r_{d}|}{\lambda_{\mathrm{round}}}\right) \tag{7}$$

$$w_{\mathrm{clock}}(d)=\exp\left(-\frac{\Delta h_{d}}{\lambda_{\mathrm{hours}}}\right) \tag{8}$$

The dual-temporal retention weight is the product of the two:

$$w_{\mathrm{temporal}}(d)=w_{\mathrm{round}}(d)\cdot w_{\mathrm{clock}}(d) \tag{9}$$

In the two routes, temporal reranking acts on their respective base scores:

$$s_{\mathrm{kb}}(d)=s_{\mathrm{kb}}^{0}(d)\cdot w_{\mathrm{temporal}}^{\mathrm{kb}}(d) \tag{10}$$

$$s_{\mathrm{chat}}(d)=s_{\mathrm{chat}}^{0}(d)\cdot w_{\mathrm{temporal}}^{\mathrm{chat}}(d) \tag{11}$$

In the current engineering implementation, the two routes perform temporal reranking separately. Under the default configuration, the same decay constants are often used, with $\lambda_{\mathrm{round}}=20$ and $\lambda_{\mathrm{hours}}=168$ hours. However, their weighting and competition processes are independent and can therefore be adjusted separately according to the experimental setting.

3.4 Analysis-Driven Evidence Verification and Reranking

We treat the <analysis> protocol as a controlled pre-generation verification mechanism rather than an open-ended reasoning chain. Prior work on Chain-of-Thought prompting has shown that explicit intermediate reasoning steps can improve the performance of LLMs on complex tasks [17], and self-consistency reasoning further suggests that selecting among multiple candidate reasoning paths based on consistency can reduce incidental errors introduced by any single path [18].
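Returning briefly to the decay terms of Section 3.3, a small numeric sketch may help show how Eqs. (7) to (11) change a ranking. It is a hedged illustration, not the authors' code: the default constants follow the reported values of 20 rounds and 168 hours, while the candidate dictionaries, their field names, and the base scores are invented for the example.

```python
import math


def temporal_weight(current_round, chunk_round, hours_elapsed,
                    lambda_round=20.0, lambda_hours=168.0):
    """Dual-temporal retention weight of Eq. (9)."""
    w_round = math.exp(-abs(current_round - chunk_round) / lambda_round)  # Eq. (7)
    w_clock = math.exp(-hours_elapsed / lambda_hours)                     # Eq. (8)
    return w_round * w_clock


def temporal_rerank(candidates, current_round):
    """Multiply each base score by its temporal weight (Eqs. 10-11) and sort."""
    rescored = [
        (c["id"],
         c["base_score"] * temporal_weight(current_round, c["round"], c["hours_elapsed"]))
        for c in candidates
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: an older fact with a higher base score and a
# recent user correction with a lower base score.
candidates = [
    {"id": "old_fact", "base_score": 0.9, "round": 80, "hours_elapsed": 168.0},
    {"id": "corrected_fact", "base_score": 0.6, "round": 98, "hours_elapsed": 1.0},
]
ranking = temporal_rerank(candidates, current_round=100)
```

In this example the older chunk is 20 rounds and 168 hours away, so its weight is exp(-1) times exp(-1), roughly 0.135, and its score falls to about 0.12. The recent correction keeps a weight of roughly 0.90 and a score of about 0.54, so it ranks first despite the lower base score.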
However, the <analysis> protocol in ARPM is specifically constrained as a post-retrieval and pre-generation mechanism for evidence verification, implicit reranking, and answer binding. This design is methodologically related to reasoning-and-acting frameworks such as ReAct, where the model does not rely solely on internal parameters, but establishes a more interpretable closed loop among explicit evidence access, local reasoning, and output constraints [19]. Consequently, the <analysis> stage does not replace retrieval. Instead, after candidate evidence has entered the Prompt and has been organized chronologically, it performs three core operations:

1. Assessing evidence requirements: determining whether the current question relies more on knowledge evidence, experience evidence, or a combination of both.
2. Semantic verification and implicit reranking: performing semantic verification and implicit reranking among multiple parent chunks that have entered the Prompt.
3. Answer binding: binding the final answer to interpretable evidence sources.

The “CoT noise-resistance capability” emphasized by ARPM does not mean allowing the model to reason without grounding. Rather, it aims to reduce erroneous dependencies and improve evidence utilization within the existing candidate context. This design helps explain the discrepancy observed in the high-noise experiment: although the original CSV-based automatic rules frequently marked certain rounds as 0, manual review found that the corresponding supporting parent chunks had already entered the Prompt, and that the model’s answer was semantically consistent with the evidence. This phenomenon directly reflects the joint effect of the <analysis> protocol and temporal evidence unfolding.

3.5 Real-Time Atomic Write-Back and White-Box Traceability Mechanism

Real-time atomic distillation and the white-box traceability mechanism constitute one of the key engineering components of this paper.
In the current implementation, the system does not simply persist raw dialogue as monolithic logs. Instead, for each interaction round, it writes the query, candidate parent chunks, pre-generation analysis, final response, and source markings into logs and indexes as structured atomic units, thereby providing traceable candidates for retrieval in the next round. In other words, ARPM’s external memory is not a long text summarized after the fact, but a set of atomic records with source, temporal, and path markings. These records preserve their original semantics when read, without compression-based rewriting or graph-based transformation.

To ensure that the above heterogeneous decoupling and temporal governance mechanisms can be strictly executed and quantitatively evaluated, ARPM is instantiated at the engineering level as a modular monolithic Web service architecture. The frontend provides a multi-turn dialogue interface and experimental intervention interface, while the backend completes the data closed loop from query augmentation, dual-source retrieval, and dual-temporal ranking to response generation and write-back auditing. To address the difficulty of fine-grained attribution in long-range LLM evaluation, which often relies heavily on black-box logs from commercial APIs, the system writes all key fields to disk in real time using a standardized JSON structure. These fields include chunk_id, session_id, user_input, assistant_reply, source_type, physical timestamp, and relative dialogue round. This design is partly consistent with the idea of tool-augmented language models: the model does not rely entirely on internal parametric capabilities, but obtains more reliable task-execution conditions through external tools, retrieval systems, APIs, or structured logging mechanisms [22]. ARPM’s external memory layer therefore supports not only retrieval augmentation, but also evidence tracing, error attribution, and state write-back.
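The per-round write-back of the listed fields could look roughly like the sketch below. It is an assumption-laden illustration, not the system's actual schema: the field names `chunk_id`, `session_id`, `user_input`, `assistant_reply`, and `source_type` come from the text, whereas `timestamp`, `dialogue_round`, the JSON Lines layout, the `"experience"` label, and the helper names `write_back` and `replay` are hypothetical. Candidate parent chunks and the pre-generation analysis, which the text also mentions, are left out for brevity.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class AtomicRecord:
    chunk_id: str
    session_id: str
    user_input: str
    assistant_reply: str
    source_type: str     # assumed labels, e.g. "knowledge" or "experience"
    timestamp: str       # physical time as ISO 8601 (assumed field name)
    dialogue_round: int  # relative dialogue round (assumed field name)


def write_back(log_path, session_id, dialogue_round, user_input,
               assistant_reply, source_type="experience"):
    """Append one interaction round to a JSON Lines log and return the record."""
    record = AtomicRecord(
        chunk_id=uuid.uuid4().hex,
        session_id=session_id,
        user_input=user_input,
        assistant_reply=assistant_reply,
        source_type=source_type,
        timestamp=datetime.now(timezone.utc).isoformat(),
        dialogue_round=dialogue_round,
    )
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")
    return record


def replay(log_path):
    """Read the log back as a list of dicts for offline auditing."""
    with open(log_path, encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]
```

Writing one self-describing record per round, rather than a summary produced later, is what makes each memory unit individually addressable: a reviewer can filter by `session_id` or `dialogue_round` and compare what was stored with what later entered the Prompt.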
This traceability paradigm enables offline replay and graphical attribution, providing a white-box auditing basis for memory drift, changes in temporal weights, and error-path identification. Baseline sampling shows that the system has recorded more than 1,000 atomic memory blocks, covering 6 independent session segments and 9 ablation session segments. The main session depth is approximately 180 rounds, and the number of atomic chunks in a single session is approximately 50–180. In complex long-range contexts, responses containing <analysis> tags account for approximately 85%–95%. These logs are not supplementary materials, but the foundation of the experimental design and the conclusions of this paper.

4 Experimental Design

4.1 Implementation and Logging Baseline

ARPM was configured using parameters drawn from real operational logs. In a representative setting, knowledge_k was set to 5, chat_history_k to 10, similarity_threshold to 0.5, and RRF_k to 60, with both BM25 and RRF enabled. Role enhancement and temporal decay were applied independently to the knowledge route and the experience route. Top-K was not fixed across all settings: the high-noise retrieval experiment used a forced 3+3 retrieval configuration, while the other experiments generally used 5+10. BM25 was retained in the knowledge route to support precise factual localization under high-noise conditions. The experience route, by contrast, was designed to