Synthetic Tabular Data Created by AI
Robert E. Hoyt
David Patrishkoff
Introduction
Synthetic data is artificial data generated with AI and other methods when too little real-world data is available to train a predictive model, or when privacy is a concern. Synthetic data is intended only to expand the training dataset, not to replace test data, which should always be real-world data. Experimenting with various model training strategies is fully legitimate as long as the test data is not used in any way during model training. [1] Synthetic data generated by artificial intelligence (AI) is a relatively new innovation that has gained attention primarily because competing methods have been less successful in generating realistic data. Gartner, the research and market prediction organization, stated “Through 2030, for data used to train artificial intelligence (AI) models, synthetic tabular data will grow at least three times as fast as real structured data.” [2] This article specifically addresses synthetic tabular data, that is, data that fits a table or spreadsheet, and not synthetic images.
Synthetic Data Advantages
In healthcare and other industries there are multiple reasons why synthetic data is appealing:
- Data quality and quantity are lacking in most fields
- Artificial intelligence and, to a lesser extent, machine learning require large datasets
- In healthcare, HIPAA regulations severely limit the ability to use and share medical data
- Synthetic data would make research and development data much easier to create and access
- It would facilitate sharing clinical trial data with journals, pharmaceutical companies and regulators while avoiding patient privacy concerns
- There are potential cost savings if synthetic data can augment clinical trials, drug trials, etc. It could also decrease the cost of labeling image data.
- Synthetic data does not require special disclosures or IRB approval for research
- Synthetic data could be used to create new educational datasets of satisfactory quality and quantity
- Synthetic data could augment datasets where the target class is imbalanced
- Synthetic data could augment imbalanced predictive features (e.g., gender, race, etc.). This could lead to less model bias and better prediction.
- Synthetic data could be used for “innovative sandboxes” [3]
Synthetic Data Disadvantages
- The real data used to generate synthetic data must meet certain requirements. One synthetic data vendor will not generate synthetic data unless there are at least 500 rows of real data.
- The larger the real dataset is the more accurate synthetic data are likely to be
- It is conceivable that synthetic data could worsen bias if data preprocessing and balancing are not carefully conducted
- Synthetic data is new and has not been widely accepted yet. Local IRBs may not have experience with synthetic data.
Synthetic Data Generation Methodologies
Earlier approaches to generating synthetic data focused on anonymization, pseudonymization, and statistical methods. Additionally, three commonly cited Python packages generate synthetic data: “faker”, “synthetic data vault (SDV)” and “gretel.” [4] The R package “conjurer” can also generate synthetic data. [5]
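To make the idea of a purely statistical generator concrete, the following is a minimal sketch and not the method used by any of the packages named above. It assumes a tiny hypothetical table with one numeric column (`age`) and one categorical column (`sex`), fits each column separately, and samples new rows from those fits. The function names and the toy values are our own illustrations.

```python
import random
import statistics
from collections import Counter

def fit_marginals(rows, numeric_cols, categorical_cols):
    """Fit one simple distribution per column (columns are treated independently)."""
    model = {}
    for col in numeric_cols:
        values = [r[col] for r in rows]
        model[col] = ("normal", statistics.mean(values), statistics.stdev(values))
    for col in categorical_cols:
        counts = Counter(r[col] for r in rows)
        model[col] = ("categorical", list(counts.keys()), list(counts.values()))
    return model

def sample_rows(model, n, seed=0):
    """Draw n synthetic rows from the fitted per-column distributions."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        row = {}
        for col, spec in model.items():
            if spec[0] == "normal":
                row[col] = rng.gauss(spec[1], spec[2])
            else:
                row[col] = rng.choices(spec[1], weights=spec[2], k=1)[0]
        out.append(row)
    return out

# Hypothetical toy records, invented for illustration only.
real_rows = [
    {"age": 63, "sex": "M"}, {"age": 37, "sex": "M"}, {"age": 41, "sex": "F"},
    {"age": 56, "sex": "M"}, {"age": 57, "sex": "F"}, {"age": 44, "sex": "M"},
]
model = fit_marginals(real_rows, ["age"], ["sex"])
synthetic_rows = sample_rows(model, n=1000)
```

Because each column is sampled on its own, any relationship between columns (for example, between age and sex) is lost, which illustrates one way simple statistical approaches can fall short of realistic data.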
Synthea is perhaps the best-known synthetic data generator not based on AI. In 2017 the MITRE Corporation developed the SyntheticMass project, which emulated the healthcare data of the residents of Massachusetts. The synthesized data are fictitious but realistic, because they are based on the medical and demographic features of the residents of Massachusetts and also mirror clinical practice guidelines and expert opinions. There are multiple synthetic longitudinal datasets ready for download on their website in multiple formats (CSV, FHIR, and C-CDA). Synthea™ is a downloadable synthetic patient generator that outputs sixteen synthetic CSV files per query. [6–7] For more information about Synthea, we refer you to a Medium.com article by one of the authors (RH). [8]
The earlier approaches have largely been replaced by AI. A variety of artificial neural network types are capable of generating synthetic data. The most common AI methodology is the generative adversarial network (GAN), which has more than 30 variations. Other methods include autoencoders, long short-term memory (LSTM) networks, and ensembles. Figure 1 displays an organizational schema of synthetic tabular data generation (STDG) proposed by Hernandez et al. [9]
Figure 1. STDG Schema Proposed by Hernandez et al. [9]
Synthetic Data and HIPAA
Is synthetic data considered exempt under the HIPAA Privacy Rule? HIPAA has two pathways by which protected health information (PHI) can be considered de-identified. The Safe Harbor pathway involves the removal of 18 identifiers. The Expert Determination pathway involves applying statistical principles so that the data is no longer identifiable. For this reason, synthetic data is felt to be HIPAA exempt (figure 2). [10–11]
Figure 2. HIPAA Privacy Rules De-identification Methods
How to Evaluate Synthetic Data
The most pressing question is how closely synthetic data simulates the original (real) data. Several authors recommend judging synthetic data by three criteria: resemblance, utility and privacy:
Resemblance: There are a variety of statistical tests for resemblance to evaluate how closely synthetic data mimics real data. The following is only a partial list of metrics to compare real data with synthetic data:
- Accuracy is largely determined by statistical hypothesis testing. For example, the mean cholesterol for original and synthetic data should be similar enough that the p value is > 0.05
- T-tests or Wilcoxon signed-rank tests for continuous data and chi-square tests for categorical data should be used and should show no statistically significant difference (p value > 0.05)
- Pearson and Spearman correlations should be very similar
- Scatterplot relationships between continuous variables should be very similar
- Confidence intervals for means, or quantiles, should overlap
- Mutual information between independent and dependent variables should be similar
- Distributions should be similar. Kolmogorov-Smirnov tests should be used to compare distributions [12–14].
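As an illustration of two of the checks listed above, here is a minimal sketch that compares distributions with a two-sample Kolmogorov-Smirnov statistic and compares Pearson correlations between a real and a synthetic table. It uses only the Python standard library so it runs anywhere; in practice a statistics package such as `scipy.stats.ks_2samp` would typically be used. The cholesterol and age values are simulated stand-ins we invented, and the large-sample critical value is an approximation rather than an exact test.

```python
import math
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the critical value for the KS statistic."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

def simulate(n, rng, chol_shift=0.0):
    """Hypothetical age/cholesterol pairs; chol_shift mimics a poor generator."""
    age = [rng.gauss(54, 9) for _ in range(n)]
    chol = [150 + 1.8 * a + rng.gauss(0, 45) + chol_shift for a in age]
    return age, chol

rng = random.Random(42)
real_age, real_chol = simulate(300, rng)
syn_age, syn_chol = simulate(300, rng)                  # stand-in for good synthetic data
_, shifted_chol = simulate(300, rng, chol_shift=60.0)   # stand-in for poor synthetic data

threshold = ks_critical_value(300, 300)
d_similar = ks_statistic(real_chol, syn_chol)
d_shifted = ks_statistic(real_chol, shifted_chol)
corr_gap = abs(pearson(real_age, real_chol) - pearson(syn_age, syn_chol))

print(f"KS similar: {d_similar:.3f}, KS shifted: {d_shifted:.3f}, threshold: {threshold:.3f}")
print(f"Absolute difference in age/cholesterol correlation: {corr_gap:.3f}")
```

A KS statistic below the threshold is consistent with the two columns sharing a distribution at the 0.05 level, and a small correlation gap suggests the relationship between the two variables was preserved.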
We tested resemblance using the AI platform Gretel.ai. [15] There are several reasons we elected to use Gretel for this demonstration:
- The platform offers more than one algorithm to generate synthetic data (LSTM and GAN)
- The user interface is intuitive
- The platform includes a command line interface, a Python gretel package, and a REST API
- Gretel is capable of synthesizing time-series data, unstructured data, relational data and images. Privacy settings are customizable
- The free version allows 15 credits per month
Gretel synthetic data were generated to simulate the heart disease prediction dataset [16] using long short-term memory (LSTM) methodology. [17] This platform can also generate synthetic data using a generative adversarial network (GAN). [18] For this dataset, LSTM outperformed the GAN method.
Gretel produced a Synthetic Data Report that gave an overall quality score and privacy protection level. Of the 14 variables synthesized, the quality score on 13 was rated as excellent and one variable was rated as good.
A heatmap was generated that compared real data with synthetic data, as seen in figure 3.
Figure 3. Heatmaps of real and synthetic data
There is also a plot comparing real and synthetic principal components (not shown) and an individual display of how well variables match, as seen in figure 4.
Figure 4. Real (purple) data compared to synthetic (green)
Table 1 displays a comparison of the original data with Gretel synthetic data using the data science platform Orange. [19] Data was transferred from Orange to Excel to create a comparison table. The differences between the original and synthetic datasets were not statistically significant (p > 0.05) for these numerical variables as determined by the Wilcoxon signed-rank test. [20]
Table 1. Comparison of numerical values in the real and synthetic datasets
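A side-by-side summary of this kind can also be produced programmatically. The sketch below is only an illustration of the idea, not the Orange and Excel workflow we used; the column names and values are invented for the example.

```python
import statistics

def summarize(values):
    """Mean, median and sample standard deviation for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }

def compare_columns(real, synthetic):
    """Return one row per column: (name, real mean, synthetic mean,
    real median, synthetic median, real SD, synthetic SD)."""
    table = []
    for col in real:
        r, s = summarize(real[col]), summarize(synthetic[col])
        table.append((col, r["mean"], s["mean"], r["median"], s["median"], r["sd"], s["sd"]))
    return table

# Hypothetical values, invented for illustration only.
real = {"age": [52, 61, 45, 58, 49], "chol": [210, 245, 198, 260, 232]}
synthetic = {"age": [50, 63, 47, 56, 51], "chol": [215, 240, 205, 255, 228]}

table = compare_columns(real, synthetic)
print(f"{'column':<8}{'mean R':>9}{'mean S':>9}{'med R':>9}{'med S':>9}{'SD R':>9}{'SD S':>9}")
for row in table:
    print("{:<8}{:>9.1f}{:>9.1f}{:>9.1f}{:>9.1f}{:>9.1f}{:>9.1f}".format(*row))
```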
In addition to the above exercise, we generated 5000 synthetic heart disease prediction patients from the original real dataset of 303 patients and achieved excellent resemblance, utility and privacy scores.
Utility: Utility means that a model trained on synthetic data shows results similar to one trained on real data. Figure 5 displays the classification model performance using logistic regression on real heart disease prediction data vs synthetic data using Orange. The Orange results were transferred to Excel so a bar chart could be created comparing real versus synthetic data results. The results are similar, so there is good utility. In addition, models trained on the synthetic data and on the real data were both evaluated on the hold-out test set of 90 patients derived from the initial 70/30 split of the original 303 patients. The two models produced highly similar results on this real test set. Evaluating a model trained on synthetic data against real test data is known as TSTR (train on synthetic, test on real). [9]
Figure 5. Classification model performance on real vs synthetic heart disease prediction datasets
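The TSTR idea can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the Orange workflow described above: the "real" patients are simulated with two made-up standardized features, the logistic regression is a small hand-written gradient descent, and `stand_in_generator` is a hypothetical placeholder for an AI generator such as Gretel. The key point it shows is that the real test set is held out before any synthetic data is generated.

```python
import math
import random
import statistics

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_logistic(rows, epochs=300, lr=0.1):
    """Fit logistic regression by batch gradient descent; rows are (features, label)."""
    n_features = len(rows[0][0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in rows:
            err = predict((w, b), x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b

def accuracy(model, rows):
    hits = sum((predict(model, x) >= 0.5) == (y == 1) for x, y in rows)
    return hits / len(rows)

def make_real(n, rng):
    """Hypothetical 'real' patients: two standardized features, binary outcome."""
    rows = []
    for i in range(n):
        y = i % 2
        center = 1.5 if y == 1 else -1.5
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], y))
    rng.shuffle(rows)
    return rows

def stand_in_generator(train_rows, n, rng):
    """Placeholder for an AI generator: per-class Gaussians fitted to train_rows."""
    params = {}
    for label in (0, 1):
        columns = list(zip(*[x for x, y in train_rows if y == label]))
        params[label] = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [([rng.gauss(m, s) for m, s in params[i % 2]], i % 2) for i in range(n)]

rng = random.Random(7)
real = make_real(300, rng)
real_train, real_test = real[:210], real[210:]               # 70/30 split, test stays real
synthetic_train = stand_in_generator(real_train, 1000, rng)  # never sees real_test

trtr = accuracy(train_logistic(real_train), real_test)       # train on real, test on real
tstr = accuracy(train_logistic(synthetic_train), real_test)  # train on synthetic, test on real
print(f"TRTR accuracy: {trtr:.3f}  TSTR accuracy: {tstr:.3f}")
```

If the two accuracies are close, the synthetic data has preserved enough signal to be useful for training, which is what utility is meant to capture.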
Privacy: Privacy refers primarily to ensuring that synthetic data does not include exact copies of real data. All HIPAA identifiers should be stripped from the real data before generating synthetic data. A small dataset is at greater privacy risk than a larger dataset. Synthetic data platforms typically generate a privacy report. In Gretel a privacy protection level is reported and several options are available to maintain privacy:
- Outlier filter to ensure no synthetic record is an outlier, which would make it more identifiable
- Similarity filter to ensure no synthetic record is too similar to a training record
- Overfitting option to ensure the model training stops before overfitting occurs
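To illustrate what a similarity filter does in principle, the following is a minimal sketch and not Gretel's implementation. It assumes numeric records, measures the Euclidean distance from each synthetic record to its nearest training record, and drops any synthetic record closer than a chosen `min_distance`, a hypothetical parameter that also catches exact copies. The records are invented for the example.

```python
import math

def nearest_distance(record, training_records):
    """Euclidean distance from one record to its closest training record."""
    return min(math.dist(record, t) for t in training_records)

def similarity_filter(synthetic, training, min_distance):
    """Split synthetic records into those kept and those dropped for being
    too close to (or identical with) a real training record."""
    kept, dropped = [], []
    for rec in synthetic:
        if nearest_distance(rec, training) >= min_distance:
            kept.append(rec)
        else:
            dropped.append(rec)
    return kept, dropped

# Hypothetical (age, cholesterol) records. In practice features would be
# scaled first so that no single column dominates the distance.
training = [(63.0, 233.0), (41.0, 204.0), (57.0, 354.0)]
synthetic = [
    (63.0, 233.0),   # exact copy of a training record
    (41.5, 204.2),   # near copy of a training record
    (50.0, 280.0),   # clearly distinct
]

kept, dropped = similarity_filter(synthetic, training, min_distance=1.0)
print("kept:", kept)
print("dropped:", dropped)
```

Here the exact copy and the near copy are dropped, and only the clearly distinct record is kept.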
Synthetic Tabular Data AI Vendors
The following is a list of companies that generate synthetic tabular data using AI. While most of these programs generate a quality report reflecting synthetic data quality, other aspects vary greatly. Several offer a free trial and several have free monthly limits. The following list is not exhaustive and should not be construed as a recommended list:
- MDClone
- YData
- Gretel
- Hazy
- Mostly.ai
- Statice
- GenRocket
- ClearBox.ai
- Synthesized.io
- NVIDIA LaunchPad
- Nbsynthetic.data (open-source GAN)
Conclusions
The potential for synthetic data applications is huge in multiple domains, particularly in healthcare. It should impact medical education, biomedical data science, research and clinical trials. Like most areas related to AI, it is likely to improve over time. Unlike many new technologies, this one is likely to actually save money by decreasing labeling costs and by expanding clinical trial data, which is expensive to collect.
Synthetic data appears to offer very good resemblance to real data, along with good utility and privacy, but more studies are needed to define best practices and methods. Similarly, approximately thirty GAN variations are available with no clear-cut winner, so this is also an area for future research.
References
1. Training and test sets: Splitting data [Internet]. Google Developers. [cited 2023 Feb 5]. Available from: https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data
2. Gartner. Emerging Tech: Top Use Cases for Tabular Synthetic Data. Sept 2022 [Internet]. Available from: https://k2view.com
3. Azizi Z, Zheng C, Mosquera L, Pilote L, El Emam K. Can synthetic data be a proxy for real clinical trial data? A validation study [Internet]. Vol. 11, BMJ Open. 2021. p. e043497. Available from: http://dx.doi.org/10.1136/bmjopen-2020-043497
4. Wijaya CY. Top 3 python packages to generate synthetic data [Internet]. Towards Data Science. 2022 [cited 2023 Feb 4]. Available from: https://towardsdatascience.com/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c
5. Macherla S. Generate synthetic data using R [Internet]. R-bloggers. 2020 [cited 2023 Feb 4]. Available from: https://www.r-bloggers.com/2020/01/generate-synthetic-data-using-r/
6. Synthea [Internet]. Available from: https://synthea.mitre.org/
7. GitHub. Synthea [Internet]. Available from: https://github.com/synthetichealth/synthea/wiki/Basic-Setup-and-Running
8. Hoyt R. Synthea: Do-it-yourself data [Internet]. Medium. 2020 [cited 2023 Jan 15]. Available from: https://medium.com/@rehoyt/synthea-do-it-yourself-data-6ebe4d850db6
9. Hernandez M, Epelde G, Alberdi A, Cilla R, Rankin D. Standardized Metrics and Methods for Synthetic Tabular Data Evaluation [Internet]. Available from: http://dx.doi.org/10.36227/techrxiv.16610896.v1
10. Data Privacy [Internet]. Available from: https://datavant.com/resources/blog/synthetic-data-healthcare
11. Office for Civil Rights (OCR). Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule [Internet]. HHS.gov. US Department of Health and Human Services; 2012 [cited 2023 Feb 4]. Available from: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
12. Foraker RE, Yu SC, Gupta A, Michelson AP, Pineda Soto JA, Colvin R, et al. Spot the difference: comparing results of analyses from real patient data and synthetic derivatives [Internet]. Vol. 3, JAMIA Open. 2021. p. 557–66. Available from: http://dx.doi.org/10.1093/jamiaopen/ooaa060
13. Benaim AR, Almog R, Gorelik Y, Hochberg I, Nassar L, Mashiach T, et al. Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies [Internet]. Vol. 8, JMIR Medical Informatics. 2020. p. e16492. Available from: http://dx.doi.org/10.2196/16492
14. Haddad F. How to evaluate the quality of the synthetic data: measuring from the perspective of fidelity, utility, and privacy [Internet]. AWS Machine Learning Blog. December 2022. Available from: https://aws.amazon.com/blogs/machine-learning/how-to-evaluate-the-quality-of-the-synthetic-data-measuring-from-the-perspective-of-fidelity-utility-and-privacy/
15. Gretel [Internet]. Available from: https://gretel.ai
16. UCI machine learning repository: Heart disease data set [Internet]. [cited 2023 Feb 4]. Available from: https://archive.ics.uci.edu/ml/datasets/heart+disease
17. Brownlee J. A Gentle Introduction to Long Short-Term Memory Networks by the Experts [Internet]. MachineLearningMastery.com. Machine Learning Mastery; 2017 [cited 2023 Feb 4]. Available from: https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/
18. Brownlee J. A gentle introduction to generative adversarial networks (GANs). Machine Learning Mastery [Internet]. 2019;17. Available from: https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/
19. Bioinformatics Laboratory, University of Ljubljana. Data Mining [Internet]. [cited 2023 Feb 4]. Available from: https://orangedatamining.com/
20. Wilcoxon Signed Rank Test. Stats Kingdom [Internet]. Available from: https://www.statskingdom.com/175wilcoxon_signed_ranks.html