It is difficult to find patient-level data of sufficient size for research, modeling, or software development. This is largely due to HIPAA concerns and the overall lack of interoperability in the US healthcare system.

Synthetic data has potential in those areas, but much of the generated data is non-medical. For example, the R and Python programming languages can generate non-medical datasets for supervised and unsupervised learning. Generative adversarial networks (GANs) can also generate synthetic non-medical data.

In 2017, the MITRE Corporation developed the SyntheticMass project, which emulates the healthcare data of the residents of Massachusetts. More than 1 million synthetic medical records are available, and they can be accessed in multiple ways, including an HL7 FHIR STU3 API. The data are fictitious but realistic because they are based on the medical and demographic features of the residents of Massachusetts. [1]

There are datasets ready for download on their website in multiple formats (CSV, FHIR, and C-CDA):

  • 10,000 and 100,000 patients with COVID-19
  • 6,357 obese children
  • 1,000 generic patients
  • 1 million patients
  • More data on Google Cloud Console

A significant aspect of the SyntheticMass project is Synthea™, the synthetic patient generator with the following features:

  • Birth to death lifecycle, so there are deceased patients in the datasets
  • Longitudinal data
  • Configuration-based statistics and demographics (defaults with Massachusetts Census data)
  • About 90 modules covering the most common medical diseases and conditions
  • Users can customize modules and create their own
  • Encounters: Primary Care, Emergency Room, and Symptom-Driven
  • File folders created: see the description below
  • Formats: HL7 FHIR (STU3 v3.0.1, DSTU2 v1.0.2, and R4), C-CDA, and CSV

Synthea™ can be downloaded and operated from any computer terminal with simple commands and the use of Gradle. This permits users to generate their own customized patient data. Installation and setup instructions are found here. The typical steps are as follows:

  • Be sure to install JDK 14
  • On macOS, download the jar file to, e.g., the Downloads folder
  • Using the terminal, change the directory to where the file was downloaded with the following command: cd /Users/name/Downloads
  • Type at the terminal: java -jar synthea-with-dependencies.jar
  • Type a request: java -jar synthea-with-dependencies.jar -p 100 -g F -m Asthma --exporter.csv.export true (-p is the population size, -g is the gender, and -m is the disease/condition module)
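For repeated runs, the request above can be assembled programmatically instead of retyped. The following is a minimal Python sketch, not part of Synthea itself: the helper name synthea_command and its parameters are hypothetical, and it assumes the jar sits in the current working directory. It only builds the argument list; actually running it requires Java and the downloaded jar.

```python
import shlex


def synthea_command(population, gender=None, module=None, csv_export=True,
                    jar="synthea-with-dependencies.jar"):
    """Build the argument list for a Synthea run (hypothetical helper)."""
    cmd = ["java", "-jar", jar, "-p", str(population)]
    if gender:
        cmd += ["-g", gender]          # gender filter, e.g. "F"
    if module:
        cmd += ["-m", module]          # disease/condition module filter
    if csv_export:
        cmd += ["--exporter.csv.export", "true"]  # turn on CSV output
    return cmd


cmd = synthea_command(100, gender="F", module="Asthma")
print(shlex.join(cmd))

# To launch the run (assumes Java and the jar are available):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later passed to subprocess.run.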

Keep in mind that this will not generate 100 asthmatics. If the prevalence of asthma in Massachusetts is 5%, then only about 5 of the 100 generated patients will be asthmatic. Other common diseases are generated automatically in addition to asthma. Every request for synthetic data automatically generates 16 files (CSV or FHIR). The data dictionary for each file is located here and the file categories are as follows:

[Screenshot: the patients.csv file]
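To see how many generated patients actually have asthma, the CSV output can be tallied directly. The sketch below is illustrative only: it assumes the conditions file has PATIENT and DESCRIPTION columns (check the data dictionary for the version in use), it runs on a tiny hand-made sample rather than real Synthea output, and the condition codes in the sample are placeholders. The second helper turns the 5% illustration around and estimates how large a population would be needed to expect a given number of cases.

```python
import csv
import io
import math


def patients_with_condition(conditions_csv, keyword):
    """Return the set of patient IDs with a condition matching keyword."""
    reader = csv.DictReader(conditions_csv)
    return {
        row["PATIENT"]
        for row in reader
        if keyword.lower() in row["DESCRIPTION"].lower()
    }


def population_needed(target_cases, prevalence_percent):
    """Population size whose expected case count reaches target_cases."""
    return math.ceil(target_cases * 100 / prevalence_percent)


# Hand-made stand-in for a conditions file (assumed column layout,
# placeholder codes).
sample = io.StringIO(
    "START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION\n"
    "2015-01-01,,p1,e1,C1,Asthma\n"
    "2016-03-02,,p1,e2,C2,Childhood asthma\n"
    "2017-05-09,,p2,e3,C3,Diabetes\n"
)

asthmatics = patients_with_condition(sample, "asthma")
print(len(asthmatics))              # one distinct patient (p1)
print(population_needed(100, 5))    # 2000 patients for ~100 cases at 5%
```

With real output, the io.StringIO sample would be replaced by an opened conditions file, e.g. open(path, newline="").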

Little has been written in the medical literature about Synthea™ and its possible applications to healthcare. A 2019 article compared Synthea™ data against four clinical quality measures (e.g., colorectal cancer screening) and concluded that demographics and services were accurately reflected in the data, but the health outcome measures did not accurately reflect state or national statistics. [2] Rossmiller used this application to generate realistic cancer patients for a nurse-facing app. [3] Kaul used Synthea™ to model healthcare data in a TigerGraph graph database. [4]

The most recent (November 2020) article about Synthea™ was written by the originators of the SyntheticMass project and in the article they discussed synthetic COVID-19 data. The datasets were based on three published studies and included 88,000+ COVID patients and 18,000+ hospitalizations. There was a 4.1% mortality rate, 20.6% hospitalization rate, and 209 ventilator cases. The data and code are available on GitHub. [5]

The longitudinal nature of the data is a huge plus for those studying the natural progression of diseases. Recurrent neural networks would be a favored deep learning tool to mine that type of data. There are multiple years of data, so some patients may have more than 100 test results recorded. On the other hand, some might want a cross-sectional view of the data, so the data would have to be re-configured. Synthea™-generated data has already been used for hackathons and other data competitions. It seems likely that it will be used for benchmarking to compare various models. Could it be used for transfer learning, where a deep learning model trains on a large synthetic dataset and then trains further on real data? Many more studies are needed to adequately evaluate this new data science innovation.
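One way to re-configure the longitudinal data into the cross-sectional view mentioned above is to keep only each patient's most recent result for a given test. The following is a minimal sketch under stated assumptions: it assumes an observations file with DATE, PATIENT, DESCRIPTION, and VALUE columns and ISO-formatted dates, and it uses a small hand-made sample rather than real Synthea output. The function name is hypothetical.

```python
import csv
import io


def latest_per_patient(observations_csv, description):
    """Map each patient ID to their most recent value for one test."""
    latest = {}
    for row in csv.DictReader(observations_csv):
        if row["DESCRIPTION"] != description:
            continue
        pid = row["PATIENT"]
        # ISO 8601 timestamps in one format sort chronologically as strings.
        if pid not in latest or row["DATE"] > latest[pid]["DATE"]:
            latest[pid] = row
    return {pid: float(row["VALUE"]) for pid, row in latest.items()}


# Hand-made stand-in for an observations file (assumed column layout).
sample_obs = io.StringIO(
    "DATE,PATIENT,DESCRIPTION,VALUE\n"
    "2018-01-10T09:00:00Z,p1,Body Mass Index,26.0\n"
    "2020-06-15T09:00:00Z,p1,Body Mass Index,27.5\n"
    "2019-03-01T09:00:00Z,p2,Body Mass Index,31.0\n"
    "2019-03-01T09:00:00Z,p2,Body Height,170.0\n"
)

snapshot = latest_per_patient(sample_obs, "Body Mass Index")
print(snapshot)
```

The result has one row per patient, which is the shape most cross-sectional models expect; other summaries (first value, mean, count of results) could be substituted for "most recent" in the same loop.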


  1. Walonoski J, Kramer M, Nichols J, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic healthcare record. JAMIA. March 2018.
  2. Chen J, Chun D, Patel M, et al. The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures. BMC Med Inform Decis Mak. 2019 Mar 14;19(1):44.
  3. Rossmiller R. Synthetic Health Data Generation: My first experience with Synthea. July 31, 2019.
  4. Kaul A. Modeling Healthcare Data with Graph Databases Using TigerGraph and Synthea to Create a Synthetic Healthcare System. July 6, 2020.
  5. Walonoski J, Klaus S, Granger E, et al. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. Volumes 1–2, November 2020, 100007.

Dr. Hoyt is a physician data scientist who is also an author and editor of several books. His most recent textbook is Data Preparation and Exploration.