Can Large Language Models Create Tabular Synthetic Data?
They can, but they face a variety of challenges
I have previously written on Medium.com about my experience creating synthetic data with programs such as Synthea and with AI tools such as Mostly AI, which clone data primarily using generative adversarial networks (GANs).
Large language models (LLMs) can now generate text, images, audio, and video from human text prompts, so why not see if they can generate tabular data (data that fits into a table)?
This project had three goals.
Goal 1: test multiple LLMs to see if they could generate tabular datasets that could be used for data science education or research.
Goal 2: determine whether the created dataset could be downloaded, either as Python code that could be copied and pasted into a notebook or as data that could be copied and pasted into a spreadsheet without any code.
Goal 3: determine whether the LLM-generated data had realistic relationships between the independent and dependent variables.
Five experiments (trials) were conducted to see if LLM-generated data could satisfy the three goals.
All LLMs tested were accessed on Poe.com, except Gemini Advanced. Many were free to access, but some required a subscription, as noted below:
- GPT-4 — Subscription
- Claude 3 Opus — Subscription
- Claude 3 Sonnet — Free
- Claude 3 Haiku — Free
- Gemini Pro — Free
- Gemini Advanced — Subscription
- Claude Instant — Free
- Mistral Large — Subscription
Trial #1 Few-shot learning using an uploaded dataset
Strategy: Upload the Framingham heart disease prediction dataset (Kaggle.com) to serve as an example of what is desired.
Prompt: “Based on the uploaded file, create 100 new rows with the same column variables in the dataset and keep the same relationships between dependent and independent variables. Create a table that can be downloaded as a CSV file or Python code.”
The generated data were uploaded into Orange Data Mining, a free, open-source data science platform. The Rank and Correlations widgets were used to compare relationships in the synthetic data against the original data: Rank used information gain to determine the strongest predictors of the outcome, and Correlations used the Pearson method.
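For readers who prefer a notebook over Orange, roughly the same checks can be run in Python. The sketch below is a minimal approximation under a few assumptions of mine: the file names are hypothetical, scikit-learn's mutual information stands in for Orange's information gain, and TenYearCHD is the outcome column in the Kaggle Framingham dataset.

```python
# Rough Python equivalent of the Orange checks (hypothetical file names):
# rank features by mutual information (a close cousin of information gain)
# and compare each variable's Pearson correlation with the outcome.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

original = pd.read_csv("framingham.csv").dropna()      # hypothetical path
synthetic = pd.read_csv("llm_synthetic.csv").dropna()  # hypothetical path
target = "TenYearCHD"  # outcome column in the Kaggle Framingham dataset

def rank_features(df: pd.DataFrame, target: str) -> pd.Series:
    X, y = df.drop(columns=[target]), df[target]
    scores = mutual_info_classif(X, y, random_state=0)
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)

print(rank_features(original, target))   # strongest predictors, original
print(rank_features(synthetic, target))  # strongest predictors, synthetic

# Pearson correlations with the outcome, side by side
print(pd.DataFrame({
    "original": original.corr(method="pearson")[target],
    "synthetic": synthetic.corr(method="pearson")[target],
}))
```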
Results
- GPT-4 — It created only 10 demonstration rows, too few to analyze.
- Claude 3 Opus — It stated it could not complete the task as written. When asked whether it could create a table of 100 subjects, it created a table of 10 subjects using random data, again too few to analyze.
- Claude 3 Sonnet — It created a table of 60 subjects, comma separated with a copy button. These data were pasted into a spreadsheet and uploaded to Orange. Rank and correlations did not match the original data.
- Gemini Pro — It created a table with only 31 rows. Not analyzed due to small size.
- Gemini Advanced — Not tested in this trial because it does not accept tabular data uploads.
- Claude Instant — It wrote Python code to append new data to the original dataset, so it did not follow the instructions.
- Mistral Large — It used Python code and SMOTE to create new data. The code was pasted into Google Colab but failed; the error message stated that SMOTE could not deal with missing data (see the sketch below).
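SMOTE (Synthetic Minority Over-sampling Technique) interpolates between existing minority-class rows, so it fails on missing values. The usual workaround is to impute first. Below is a minimal sketch of that fix, assuming the imbalanced-learn library and the Kaggle Framingham column names; it is not Mistral's actual code.

```python
# Minimal sketch of the usual workaround (not Mistral's actual code):
# SMOTE interpolates between real minority-class rows, so it raises an
# error on NaNs; impute (or drop) missing values before resampling.
import pandas as pd
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

df = pd.read_csv("framingham.csv")  # hypothetical path
X, y = df.drop(columns=["TenYearCHD"]), df["TenYearCHD"]

X_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(X), columns=X.columns
)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_imputed, y)
print(X_res.shape, pd.Series(y_res).value_counts())
```

Even working SMOTE code would arguably not have satisfied the prompt, since SMOTE oversamples the minority class of the original data rather than producing a fully new dataset.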
Trial #2 Few-shot learning with no data example uploaded
Strategy: No data was uploaded, just instructions to create a realistic dataset based on the Framingham study.
Prompt: “Create a new dataset to predict heart disease. Model the dataset after the Framingham study. Create 100 rows with column headers of age, gender, education, smoking (0,1), hypertension (0,1), diabetes (0,1), stroke (0,1), HDL cholesterol, systolic_blood_pressure, diastolic_blood_pressure, BMI, heart rate and the target heart disease (0,1). The dataset created should be able to be downloaded as a CSV file or copied from a table.”
Results
- GPT-4 — It used Python to create randomly generated data and produced a table of only 10 rows. Not analyzed due to small size.
- Claude 3 Opus — Created 72 rows, comma separated. I used the “copy message” button to paste the data into a spreadsheet, which was then uploaded to Orange. Rank and correlations were not similar to the Framingham dataset.
- Claude 3 Sonnet — Created a comma-separated dataset of 60 subjects with a copy button. Data was pasted into a spreadsheet and then uploaded to Orange. Rank and correlations did not reflect the Framingham dataset.
- Claude 3 Haiku — Created 58 comma-separated examples. The data were pasted into a spreadsheet and then uploaded to Orange. Rank and correlations did not reflect the Framingham dataset.
- Gemini Pro — Created a table with 40 subjects. This was pasted into a spreadsheet and uploaded to Orange. Rank and correlations did not reflect the Framingham dataset.
- Gemini Advanced — Initially it showed only 10 subjects, and after two more prompts it still displayed only 18. The table was exportable to Google Sheets and was uploaded to Orange. Rank and correlations did not reflect the Framingham dataset.
- Claude Instant — Python code was generated with random variables (see the sketch after this list), and the output CSV file was uploaded to Orange. Rank and correlations did not reflect the Framingham dataset.
- Mistral Large — Python code with random variables was generated. The code was copied and pasted into Google Colab but failed with several error messages.
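The "random variables" pattern that kept appearing in the generated Python code looked roughly like the reconstruction below (not any model's verbatim output). Because every column, including the outcome, is sampled independently, the data cannot contain realistic relationships no matter how many rows are produced.

```python
# Reconstruction of the typical pattern in the returned code (not any
# model's verbatim output): every column, including the outcome, is
# sampled independently, so no real relationships can exist by design.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "age": rng.integers(30, 70, n),
    "smoking": rng.integers(0, 2, n),
    "systolic_blood_pressure": rng.normal(130, 15, n).round(),
    "BMI": rng.normal(26, 4, n).round(1),
    "heart_disease": rng.integers(0, 2, n),  # outcome drawn independently
})

# Correlations with the outcome hover near zero, which is why Rank and
# Correlations in Orange never matched the Framingham data.
print(df.corr(method="pearson")["heart_disease"])
```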
Trial #3 Zero-shot learning
Strategy: Ask the LLMs to create data with realistic risk factors but no data examples. This presumed the LLMs could determine heart disease risk factors on their own.
Prompt: “Create a dataset of 100 patients with typical risk factors for heart disease with the outcome of heart disease (yes = 1, no = 0). Create a table or use code.”
Results:
- GPT-4 — It created Python code and a table of 10 demonstration rows. Not analyzed due to small size.
- Claude 3 Opus — Table of 8 columns and 60 rows. This was pasted into a spreadsheet and then uploaded to Orange. Rank and correlations yielded unrealistic results in predicting heart disease.
- Claude 3 Sonnet — Python code was generated with random variables. The code was pasted into Google Colab and exported as a CSV file that was uploaded to Orange. Rank and correlations yielded unrealistic results in predicting heart disease.
- Claude 3 Haiku — A table of 52 subjects was created; a second request again yielded 52. This was pasted into a spreadsheet and uploaded to Orange. Rank and correlations yielded unrealistic results in predicting heart disease.
- Gemini Pro — It created a table with 7 columns and 40 subjects, plus Python code with random variables. The table was copied into a spreadsheet and uploaded to Orange. Rank and correlations yielded unrealistic results in predicting heart disease.
- Gemini Advanced — Python code and a table of 18 subjects were generated. The code was copied into Google Colab, exported to a CSV file, and uploaded to Orange. Rank and correlations yielded unrealistic results in predicting heart disease.
- Claude Instant — An 8-column table of 30 subjects. Not analyzed due to small size.
- Mistral Large — Python code and a table of 10 rows with random variables. The Python code failed because it tried to append data rather than create new data.
Trial #4 Zero-shot learning
Strategy: Ask the LLMs to create data with specific risk factors, modeled after the Framingham study.
Prompt: “Create a dataset of 100 subjects. The dependent variable is heart disease (0,1) and the independent variables are male (0,1), age, smoker (0,1), history of stroke (0,1), history of diabetes (0,1), total cholesterol, hypertension (0,1), systolic blood pressure, diastolic blood pressure, glucose, and heart rate. Model the relationships between the dependent and independent variables after the Framingham study. Create a comma-separated format and or Python code.”
Results:
- GPT-4 — It created Python code for 100 subjects that was pasted into Google Colab and generated a CSV file. This was uploaded to Orange. Rank and correlations yielded unrealistic results for heart disease compared to the actual Framingham study.
- Claude 3 Opus — It created 83 subjects in a comma-separated format that was pasted into a spreadsheet. This was uploaded to Orange. Rank and correlations yielded unrealistic results for heart disease compared to the original Framingham study data.
- Claude 3 Sonnet — It created Python code that was pasted into Google Colab, saved as a CSV file, and uploaded to Orange. Rank and correlations yielded unrealistic results for heart disease compared to the original Framingham study data.
- Gemini Pro — A table of 39 subjects was created and pasted into a spreadsheet. This was uploaded to Orange. Rank and correlations yielded unrealistic results for heart disease compared to the original Framingham study data.
- Gemini Advanced — Python code was generated to create 100 subjects, but the variables were random (contrast with the sketch after this list). Instructions were given to adjust the model to more closely resemble the Framingham study.
- Claude Instant — Only 10 sample rows were created. Python code was also generated, but with random variables. This was pasted into Google Colab, exported as a CSV file, and then uploaded to Orange. Rank and correlations yielded unrealistic results for heart disease compared to the original Framingham study data.
- Mistral Large — Python code was generated with the warning that the relationships would not mirror the Framingham study. This was pasted into Google Colab, exported as a CSV file, and then uploaded to Orange. Rank and correlations yielded unrealistic results for heart disease compared to the original Framingham study data.
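What the prompt asked for, and none of the models delivered, is data in which the outcome actually depends on the risk factors. One minimal way to encode such relationships is to draw the outcome from a logistic function of the covariates. The sketch below uses made-up coefficients of my own, not the published Framingham risk equations.

```python
# Sketch of outcome-dependent generation with made-up coefficients (NOT
# the published Framingham risk equations): sample covariates, then draw
# heart disease from a logistic function of those covariates.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
age = rng.integers(35, 75, n)
male = rng.integers(0, 2, n)
smoker = rng.integers(0, 2, n)
sbp = rng.normal(130, 18, n)

# Log-odds rise with age, male sex, smoking, and systolic pressure.
logit = -9.0 + 0.07 * age + 0.5 * male + 0.6 * smoker + 0.02 * sbp
p = 1 / (1 + np.exp(-logit))
heart_disease = rng.binomial(1, p)

df = pd.DataFrame({"age": age, "male": male, "smoker": smoker,
                   "systolic_blood_pressure": sbp.round(),
                   "heart_disease": heart_disease})
print(df["heart_disease"].mean())   # plausible prevalence (roughly 10-15%)
print(df.corr()["heart_disease"])   # non-zero, graded correlations
```

With a generating process like this, Rank and Correlations would at least approximately recover the intended ordering of the risk factors.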
Trial #5 Few-shot learning using a data example with a numerical outcome
Strategy: Upload a dataset on predicting insurance cost and ask the LLMs to create new subjects based on the known relationships in the data.
Prompt: “Analyze the uploaded CSV file and create 50 subjects with the same relationships between the dependent and independent variables. The outcome is insurance cost. Create a comma-separated dataset that can be copied or Python code.”
Results:
- GPT-4 — Python code was created and pasted into Google Colab. This generated a syntax error.
- Claude 3 Opus — Python code was created and pasted into Google Colab. This produced multiple syntax errors.
- Claude 3 Sonnet — It created 50 subjects, but the feature ranking was the opposite of the original data: age was the most important variable in the original data and the least important in the synthetic data. Correlations in the synthetic dataset were very strongly positive and entirely different from those in the original dataset.
- Gemini Pro — A table of 74 subjects was created, copied into a spreadsheet, and uploaded to Orange. Compared with the original data, the correlations were similar in direction but differed in magnitude, and the ranks were quite different.
- Claude Instant — It generated 100 subjects in a comma-separated format (output like this can be loaded straight into pandas; see the sketch after this list). Compared with the original data, the correlations were similar in direction but differed in magnitude, and the ranks were quite different.
- Mistral Large — Python code was created and pasted into Google Colab, which produced multiple syntax errors.
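A side note on the copy-and-paste workflow used throughout these trials: comma-separated text copied from a chat reply can be read into pandas directly, skipping the spreadsheet step. A minimal sketch, with a hypothetical pasted string:

```python
# Sketch: reading comma-separated text copied from a chat reply straight
# into pandas, skipping the spreadsheet step. The string below is a
# stand-in for the actual copied output.
import io
import pandas as pd

copied = """age,bmi,smoker,insurance_cost
45,27.3,1,14200
31,22.1,0,4300
58,30.8,1,21900
"""

df = pd.read_csv(io.StringIO(copied))
df.to_csv("synthetic_insurance.csv", index=False)  # file ready for Orange
print(df.head())
```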
Conclusions:
- When asked for Python code, the LLMs created 100 subjects as directed but always used random variables that did not mirror the uploaded data or real-world examples. In Trial 5, Gemini Pro and Claude Instant generated more subjects than requested. The Python code frequently contained syntax errors.
- Claude 3 Sonnet created the most subjects, and in a downloadable format.
- The results were similar, regardless of whether the outcome was categorical or continuous.
- Neither the Python code nor the comma-separated data produced realistic relationships between variables, as directed. This was true even when an example dataset was uploaded.
- Different prompts might yield different and better results. More trials are warranted.
- Based on the author's experience, cloning an existing dataset with a dedicated AI synthesizer rather than an LLM appears to be the best option for creating realistic synthetic data.