A Pipeline for Generating Longitudinal Synthetic Clinical Notes Using Large Language Models
This paper introduces a new method for creating synthetic clinical notes to help researchers develop and test healthcare AI tools. Because real patient data is highly sensitive and difficult to access due to strict privacy regulations, the author developed a modular pipeline that uses Large Language Models (LLMs) to generate realistic, longitudinal patient records from scratch. By simulating entire hospital journeys—from admission to discharge—the pipeline provides a safe, privacy-preserving alternative for training systems like summarization tools, decision support software, and automated coding models.
How the Pipeline Works
The generation process is broken down into five distinct stages to ensure the data is both realistic and internally consistent. First, the system uses a tool called Synthea to create basic patient demographics, which are then expanded by an LLM to include more personal details. Next, the pipeline generates admission reasons and simulates a "patient journey," which maps out a series of clinical events over time. Finally, the system generates the actual clinical notes for each event. To add a layer of realism, the pipeline assigns different "clinical personas" to staff members, ensuring that writing styles—such as bullet points, narrative prose, or shorthand notes—remain consistent throughout a patient's record.
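The final stage described above, turning a patient journey into notes while keeping each staff member's style fixed, can be sketched roughly as follows. This is a minimal illustration and not the author's code: the `Persona`, `Event`, and `Note` types, the prompt wording, and the `echo_llm` stand-in for a real model call are all assumptions made for this example, and the patient dictionary stands in for demographics that would come from Synthea and the LLM expansion step.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for an LLM client: takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class Persona:
    staff_name: str
    role: str
    style: str  # e.g. "bullet points", "narrative prose", "shorthand"


@dataclass(frozen=True)
class Event:
    day: int
    description: str
    author: str  # staff_name of the persona documenting the event


@dataclass(frozen=True)
class Note:
    day: int
    author: str
    style: str
    text: str


def build_prompt(patient: dict, history: list, event: Event, persona: Persona) -> str:
    # Earlier notes are passed back in so each new note can stay consistent with them.
    earlier = " | ".join(f"day {n.day}: {n.text}" for n in history) or "none"
    return "\n".join([
        f"Patient: {patient['name']}, admitted for {patient['admission_reason']}",
        f"Earlier notes: {earlier}",
        f"Write in {persona.style} as a {persona.role}: {event.description}",
    ])


def generate_notes(patient: dict, journey: list, personas: dict, llm: LLM) -> list:
    notes = []
    for event in sorted(journey, key=lambda e: e.day):
        # The style comes from the author's persona, never from the event,
        # so one staff member writes the same way across the whole record.
        persona = personas[event.author]
        text = llm(build_prompt(patient, notes, event, persona))
        notes.append(Note(event.day, persona.staff_name, persona.style, text))
    return notes


def echo_llm(prompt: str) -> str:
    # Toy model so the sketch runs offline: returns the instruction line.
    return prompt.splitlines()[-1]


personas = {
    "Dr. Patel": Persona("Dr. Patel", "consultant", "narrative prose"),
    "Nurse Lee": Persona("Nurse Lee", "nurse", "bullet points"),
}
journey = [
    Event(2, "post-operative review", "Dr. Patel"),
    Event(1, "admission assessment", "Nurse Lee"),
    Event(3, "discharge planning", "Dr. Patel"),
]
patient = {"name": "Jane Doe", "admission_reason": "hip fracture"}
notes = generate_notes(patient, journey, personas, echo_llm)
```

Keying the style on the persona rather than on the individual event is one simple way to get the consistency the paper describes; the actual pipeline may enforce it differently.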
Ensuring Quality and Realism
To prevent the common issue of "hallucinations" (where an AI generates factually incorrect information), the pipeline incorporates several validation mechanisms. After a note is generated, an LLM validator reviews it to ensure it remains faithful to the patient’s history and the specific event being described. The system also includes an augmentation stage that introduces common real-world elements, such as typos, medical abbreviations, and staff sign-offs. These steps help bridge the gap between perfectly generated text and the messy, varied nature of actual clinical documentation.
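As a rough illustration of these two steps, the sketch below pairs a validate-and-retry loop with a small augmentation pass. It is written under several assumptions that do not come from the paper: the retry behaviour on a failed validation, the abbreviation table, the adjacent-letter typo model, and all function names are hypothetical, and the `is_faithful` callable stands in for the LLM validator.

```python
import random
import re
from typing import Callable

# Hypothetical abbreviation table; the paper does not list the ones it uses.
ABBREVIATIONS = {
    "shortness of breath": "SOB",
    "blood pressure": "BP",
    "history of": "h/o",
}


def generate_validated(generate: Callable[[], str],
                       is_faithful: Callable[[str], bool],
                       max_attempts: int = 3) -> str:
    # Assumed policy: regenerate until the validator accepts a draft.
    for _ in range(max_attempts):
        draft = generate()
        if is_faithful(draft):
            return draft
    raise ValueError("no draft passed validation")


def abbreviate(text: str) -> str:
    for phrase, abbr in ABBREVIATIONS.items():
        text = re.sub(re.escape(phrase), abbr, text, flags=re.IGNORECASE)
    return text


def add_typos(text: str, rate: float, rng: random.Random) -> str:
    # Swap two adjacent letters in a fraction of the longer alphabetic words.
    words = text.split(" ")
    for i, word in enumerate(words):
        if len(word) > 3 and word.isalpha() and rng.random() < rate:
            j = rng.randrange(len(word) - 1)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)


def augment(text: str, sign_off: str, rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)  # seeded so the same note is always augmented the same way
    return f"{add_typos(abbreviate(text), rate, rng)}\n{sign_off}"
```

Running augmentation only after validation keeps the validator working on clean text, which is one plausible ordering and matches the order in which the steps are described here.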
Dataset Availability and Use
The project provides a dataset of 70 synthetic patients, each with between 20 and 50 clinical notes. The data is organized into three tables covering patient demographics, admission details, and the clinical notes themselves. The author plans to release this data in three tiers of validation: Bronze, Silver, and Gold. The "Silver" dataset is currently available; it was generated through the validated pipeline but has not yet undergone final manual review by clinicians. This tiered approach allows researchers to choose the level of validation that best fits their specific project needs.
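To make the three-table layout concrete, here is a minimal sketch of how the tables might relate and how the stated range of 20 to 50 notes per patient could be checked after loading. The field names and the linking keys (`patient_id`, `admission_id`) are assumptions for illustration only; the released files may use different columns.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    patient_id: str  # hypothetical key
    age: int
    sex: str


@dataclass(frozen=True)
class Admission:
    admission_id: str  # hypothetical key
    patient_id: str    # links the admission to a patient
    reason: str


@dataclass(frozen=True)
class ClinicalNote:
    note_id: str
    admission_id: str  # links the note to an admission
    day: int
    text: str


def notes_per_patient(admissions: list, notes: list) -> Counter:
    # Resolve each note to its patient through the admissions table.
    owner = {a.admission_id: a.patient_id for a in admissions}
    return Counter(owner[n.admission_id] for n in notes)


def patients_outside_range(counts: Counter, low: int = 20, high: int = 50) -> list:
    # Patients whose note count falls outside the range stated for the dataset.
    return sorted(pid for pid, c in counts.items() if not low <= c <= high)
```

A check like this is a cheap sanity test to run before using the data for training or evaluation.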
Important Considerations
While this pipeline offers a powerful way to generate data without privacy risks, users should be aware of a few limitations. The author notes that some special characters may be incorrectly decoded during the generation process, so basic data cleaning is recommended before use. Additionally, while the pipeline is designed to be highly adaptable, it is currently optimized for secondary care settings. As with any synthetic data, the utility of the dataset depends on the specific requirements of the AI system being developed, and users are encouraged to evaluate the data against their own performance metrics.
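The summary does not say which characters are affected or how they were mis-decoded, so the cleaning step below is only one possibility. It assumes the common case where UTF-8 text was read as Windows-1252 (so a curly apostrophe appears as three stray characters) and tries to reverse that; the function name `repair_mojibake` is hypothetical, and text that does not fit this pattern is returned unchanged.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252.

    If the round trip fails, the text was probably not damaged in this
    particular way, so it is returned as-is.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:  # covers both encode and decode failures
        return text


# "patient's" with a curly apostrophe (U+2019) that was mis-decoded.
damaged = "patient\u00e2\u20ac\u2122s history"
cleaned = repair_mojibake(damaged)
```

Because the function works on the whole string at once, a note that mixes damaged and correctly decoded non-ASCII characters will be left untouched; such cases would need a finer-grained approach.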