A Pipeline for Generating Longitudinal Synthetic Clinical Notes Using Large Language Models

Key Takeaways

  • Synthetic data is increasingly used to enable the development and evaluation of AI systems in domains where access to real-world data is restricted.
  • In healthcare, clinical documentation presents particular challenges due to its sensitivity.
  • This work introduces a synthetic clinical notes pipeline and dataset designed to support the development of clinical AI tools while avoiding the privacy risks associated with real patient data.
  • The dataset is generated using a modular pipeline that combines structured patient generation, semi-structured patient journey simulation, and unstructured clinical note generation using large language models.
Paper Abstract

Synthetic data is increasingly used to enable the development and evaluation of AI systems in domains where access to real-world data is restricted. In healthcare, clinical documentation presents particular challenges due to its sensitivity. This work introduces a synthetic clinical notes pipeline and dataset designed to support the development of clinical AI tools while avoiding the privacy risks associated with real patient data. The dataset is generated using a modular pipeline that combines structured patient generation, semi-structured patient journey simulation, and unstructured clinical note generation using large language models. The pipeline is designed to prioritise internal consistency across longitudinal patient records, while also capturing variation in writing style, note structure, and clinical detail. Additional mechanisms, including LLM-based validation and augmentation steps, are used to improve faithfulness, realism, and diversity of the generated notes. We release a dataset of 70 synthetic patients, each associated with 20-50 clinical notes spanning a full hospital journey. The dataset is provided at multiple levels of validation, enabling users to balance realism and scalability depending on their use case. This dataset supports the development, testing, and evaluation of clinical AI systems, including summarisation tools, coding models, and decision support systems, without reliance on real patient data.

A Pipeline for Generating Longitudinal Synthetic Clinical Notes Using Large Language Models
This paper introduces a new method for creating synthetic clinical notes to help researchers develop and test healthcare AI tools. Because real patient data is highly sensitive and difficult to access due to strict privacy regulations, the author developed a modular pipeline that uses Large Language Models (LLMs) to generate realistic, longitudinal patient records from scratch. By simulating entire hospital journeys—from admission to discharge—the pipeline provides a safe, privacy-preserving alternative for training systems like summarization tools, decision support software, and automated coding models.

How the Pipeline Works

The generation process is broken down into five distinct stages to ensure the data is both realistic and internally consistent. First, the system uses a tool called Synthea to create basic patient demographics, which are then expanded by an LLM to include more personal details. Next, the pipeline generates admission reasons and simulates a "patient journey," which maps out a series of clinical events over time. Finally, the system generates the actual clinical notes for each event. To add a layer of realism, the pipeline assigns different "clinical personas" to staff members, ensuring that writing styles—such as bullet points, narrative prose, or shorthand notes—remain consistent throughout a patient's record.
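
The stages above can be sketched roughly as follows. This is a minimal illustration rather than the author's implementation: the `llm` callable, the `Persona`, `Event`, and `PatientRecord` classes, the prompt wording, and the round-robin persona assignment are all assumptions made for the example, and the demographics dictionary simply stands in for Synthea output.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns generated text.
LLM = Callable[[str], str]


@dataclass
class Persona:
    name: str
    role: str
    style: str  # e.g. "bullet points", "narrative prose", "shorthand"


@dataclass
class Event:
    day: int
    description: str
    author: Persona


@dataclass
class PatientRecord:
    demographics: dict
    background: str = ""
    admission_reason: str = ""
    journey: List[Event] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def run_pipeline(demographics: dict, personas: List[Persona],
                 llm: LLM, n_events: int = 3) -> PatientRecord:
    record = PatientRecord(demographics=demographics)
    # Expand the structured demographics with more personal detail.
    record.background = llm(f"Expand patient details: {demographics}")
    # Generate the reason for admission.
    record.admission_reason = llm(f"Admission reason for: {record.background}")
    # Simulate the journey as an ordered list of events, each with an author.
    for day in range(n_events):
        persona = personas[day % len(personas)]  # assumed assignment rule
        description = llm(f"Event on day {day} after: {record.admission_reason}")
        record.journey.append(Event(day, description, persona))
    # Write one note per event, in the style fixed by the author's persona.
    for event in record.journey:
        prompt = (f"Write a note as {event.author.role} "
                  f"using {event.author.style} for: {event.description}")
        record.notes.append(llm(prompt))
    return record


# Usage with a stub in place of a real model.
stub_llm = lambda prompt: f"<text for: {prompt[:40]}>"
staff = [Persona("Dr. A", "consultant", "narrative prose"),
         Persona("Nurse B", "nurse", "bullet points")]
record = run_pipeline({"age": 72, "sex": "F"}, staff, stub_llm, n_events=4)
print(len(record.notes), "notes generated")
```

Because each persona carries a fixed `style`, every note written by the same staff member is prompted with the same style, which is one simple way to keep writing style consistent across a record.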

Ensuring Quality and Realism

To reduce "hallucinations" (cases where an AI generates factually incorrect information), the pipeline incorporates several validation mechanisms. After a note is generated, an LLM validator reviews it to check that it remains faithful to the patient’s history and the specific event being described. The system also includes an augmentation stage that introduces common real-world elements, such as typos, medical abbreviations, and staff sign-offs. These steps help bridge the gap between clean machine-generated text and the messy, varied nature of actual clinical documentation.
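
A rough sketch of how validation and augmentation might fit together is shown below. It is illustrative only: the summary does not say what happens when the validator rejects a note, so the retry loop is an assumption, as are the function names, the small abbreviation table, and the adjacent-letter swap used to imitate typos. The `generate` and `validate` callables stand in for LLM calls.

```python
import random
from typing import Callable, Optional


def generate_validated_note(generate: Callable[[], str],
                            validate: Callable[[str], bool],
                            max_attempts: int = 3) -> Optional[str]:
    """Regenerate a note until the validator accepts it (assumed behaviour)."""
    for _ in range(max_attempts):
        note = generate()
        if validate(note):
            return note
    return None  # no draft passed validation


# Illustrative abbreviation table, not taken from the paper.
ABBREVIATIONS = {
    "blood pressure": "BP",
    "shortness of breath": "SOB",
    "history": "hx",
}


def augment(note: str, signoff: str, rng: random.Random,
            typo_rate: float = 0.05) -> str:
    """Add abbreviations, occasional typos, and a staff sign-off."""
    for phrase, abbreviation in ABBREVIATIONS.items():
        note = note.replace(phrase, abbreviation)
    words = note.split(" ")
    for i, word in enumerate(words):
        # Swap two adjacent letters in some longer alphabetic words.
        if len(word) > 3 and word.isalpha() and rng.random() < typo_rate:
            j = rng.randrange(len(word) - 1)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words) + "\n" + signoff


# Usage: a seeded generator keeps the augmentation reproducible.
rng = random.Random(0)
print(augment("Patient reports shortness of breath overnight", "RN J. Smith", rng))
```

Passing an explicit `random.Random` instance rather than using the module-level functions makes it possible to regenerate exactly the same augmented dataset from a seed.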

Dataset Availability and Use

The project provides a dataset of 70 synthetic patients, with each patient having between 20 and 50 clinical notes. The data is organized into three tables covering patient demographics, admission details, and the clinical notes themselves. The author plans to release this data in three tiers of validation: Bronze, Silver, and Gold. Currently, the "Silver" dataset is available, which has been generated through the validated pipeline but does not yet include final manual review by clinicians. This tiered approach allows researchers to choose the level of validation that best fits their specific project needs.
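
As a rough illustration of working with the three tables, the sketch below groups notes by patient and checks the note counts against the stated range. The column names (`patient_id`, `admission_id`, `note_id`, `text`) and the assumption that notes link to patients through admissions are hypothetical; the released files may be organised differently.

```python
from collections import defaultdict
from typing import Dict, List

# Tiny hypothetical rows standing in for the three released tables.
patients = [{"patient_id": "p1", "age": 67}]
admissions = [{"admission_id": "a1", "patient_id": "p1", "reason": "chest pain"}]
notes = [
    {"note_id": "n1", "admission_id": "a1", "text": "Admitted via ED."},
    {"note_id": "n2", "admission_id": "a1", "text": "Reviewed on ward round."},
]


def notes_per_patient(admissions: List[dict], notes: List[dict]) -> Dict[str, List[dict]]:
    """Group note rows by patient, joining through the admissions table."""
    admission_to_patient = {a["admission_id"]: a["patient_id"] for a in admissions}
    grouped = defaultdict(list)
    for note in notes:
        grouped[admission_to_patient[note["admission_id"]]].append(note)
    return dict(grouped)


def patients_outside_range(grouped: Dict[str, List[dict]],
                           low: int = 20, high: int = 50) -> List[str]:
    """Return patient ids whose note count falls outside [low, high]."""
    return [pid for pid, rows in grouped.items() if not low <= len(rows) <= high]


grouped = notes_per_patient(admissions, notes)
print(patients_outside_range(grouped))  # the toy patient has only two notes
```

A check like this is a cheap way to confirm that a downloaded copy matches the described shape of the dataset before building anything on top of it.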

Important Considerations

While this pipeline offers a powerful way to generate data without privacy risks, users should be aware of a few limitations. The author notes that some special characters may be incorrectly decoded during the generation process, so basic data cleaning is recommended before use. Additionally, while the pipeline is designed to be highly adaptable, it is currently optimized for secondary care settings. As with any synthetic data, the utility of the dataset depends on the specific requirements of the AI system being developed, and users are encouraged to evaluate the data against their own performance metrics.
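
The cleaning step could look something like the sketch below. It assumes one specific failure mode, UTF-8 text that was mis-read as Windows-1252, which is a common cause of garbled apostrophes and dashes; the summary does not say which characters are affected or why, so this should be checked against the actual data before use. The function name is made up for the example.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mis-decoded as Windows-1252 (assumed cause).

    If the text does not round-trip cleanly, it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "patient's" with a curly apostrophe, garbled into three characters.
garbled = "patient\u00e2\u20ac\u2122s history"
print(repair_mojibake(garbled))
```

Text that is already correct fails the round trip and is left as it is, so the function can be applied to every note without first detecting which ones are damaged.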
