Evidence-Based Data Science — We Aren’t There Yet

Part 2

In part 1 of this series, I addressed the problem that reported machine learning (ML) and artificial intelligence (AI) studies often do not adhere to existing evidence-based guidelines and, as a result, are frequently judged to be of low quality. This is a significant issue given the proliferation of predictive analytical studies. For example, almost 800 predictive studies have been reported on cardiovascular disease alone. I mentioned several guidelines, such as TRIPOD and CHARMS, that outline how these studies should be conducted and reported. There are also “risk of bias” guidelines that I did not mention, such as the Prediction model Risk Of Bias ASsessment Tool (PROBAST). This tool evaluates the risk of bias and the applicability of predictive diagnostic and prognostic studies. Bias, in this context, means any study factor that negatively impacts the model outcome or performance.

I did not mention specific deficiencies related to ML/AI modeling, which I will cover in part 2. The significant deficiencies can be organized into the following categories:

  • Model evaluation
  • Model performance
  • Generalizability
  • Data challenges
  • Lack of power analysis

Model evaluation. The term validation causes confusion because it is used imprecisely. Validation, in this context, means model evaluation. Some have used the term validation for a small hold-out dataset used to tune algorithm parameters; recent guidance suggests that a better term would be the “tuning” dataset. The most common method for model validation is a train/test split of the data. For example, 70% of the data can be used for training and the remainder for testing model performance. This approach is also called internal validation because the data used for validation were derived from the original dataset. The problem is that this produces overly optimistic results because the test data are so similar to the training data. While k-fold cross-validation is an acceptable alternative for internal validation, many experts now recommend bootstrapping, in which new random samples are repeatedly drawn, with replacement, from the original dataset. Bootstrapping helps with limited data, reduces overfitting, and can produce confidence intervals. Experts also recommend using external data for validation. That could mean data derived from a different source (another hospital or clinic) or from another time frame. For example, a model could be trained on the NHANES 2015–2016 dataset but tested on the 2017–2018 dataset. The reality is that few studies report external validation; a 2019 study of AI used for medical imaging reported that only 6% of studies used external validation. In addition, some authorities recommend that testing ideally be conducted by an independent group.
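
To make the bootstrapping approach concrete, here is a minimal sketch using scikit-learn on a synthetic dataset. The dataset, the logistic regression model, and the choice of 200 resamples are illustrative assumptions rather than recommendations from any guideline; the idea is simply to fit on each bootstrap sample and evaluate on the rows left out of it, then summarize the results with a confidence interval.

```python
# Minimal bootstrap validation sketch (synthetic data, illustrative only)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

aucs = []
for i in range(200):  # 200 bootstrap resamples
    # Draw row indices with replacement for training ...
    idx = resample(np.arange(len(y)), replace=True, random_state=i)
    oob = np.setdiff1d(np.arange(len(y)), idx)  # ... evaluate on the out-of-bag rows
    if len(np.unique(y[oob])) < 2:
        continue  # skip resamples whose out-of-bag set contains a single class
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    aucs.append(roc_auc_score(y[oob], model.predict_proba(X[oob])[:, 1]))

print(f"Bootstrap AUC {np.mean(aucs):.3f} "
      f"(95% CI {np.percentile(aucs, 2.5):.3f}-{np.percentile(aucs, 97.5):.3f})")
```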

Model performance. Experts recommend reporting both calibration and discrimination for all ML and AI modeling studies, although the two are rarely reported together.

  • Calibration refers to the agreement between the observed outcomes and the predictions. It should be verified with a calibration plot and/or a Hosmer-Lemeshow test; for example, a calibration plot might compare observed and predicted mortality at 1 year for heart failure.
  • Discrimination means how well the model differentiates between subjects with the condition of interest and those without it. Discrimination is reported as the area under the receiver operating characteristic curve (the C-statistic or AUC) with confidence intervals. If a classification model is created, the measures should include, at a minimum, sensitivity and specificity. A sketch of how these measures can be computed appears after this list.
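
Below is a minimal sketch, using scikit-learn, of how the calibration points, the AUC, and sensitivity/specificity might be computed from a fitted binary classifier. The function name, the 10-bin calibration curve, and the 0.5 threshold are illustrative assumptions; the Hosmer-Lemeshow test is not part of scikit-learn and would require a separate implementation or a statistics package.

```python
# Minimal sketch: reporting discrimination and calibration together
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import confusion_matrix, roc_auc_score

def report_performance(y_true, y_prob, threshold=0.5):
    # Discrimination: area under the ROC curve (C-statistic)
    auc = roc_auc_score(y_true, y_prob)

    # Calibration: observed event rate vs. mean predicted probability per bin
    # (these are the points a calibration plot displays)
    obs_rate, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

    # Sensitivity and specificity at a chosen classification threshold
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    return {
        "auc": auc,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "calibration_points": list(zip(mean_pred, obs_rate)),
    }
```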

Generalizability. The data used to create and test a model must reflect the population of interest. Many models use data that are simulated or synthetic (known as “in silico”) and therefore may not generalize to the group you are studying. In addition, some datasets are readily available but inappropriate for all settings. For example, many researchers have used the de-identified MIMIC-III dataset of 63,532 ICU stays to create models, but ICU patients are different from regular hospital inpatients and outpatients.

Data challenges. There is a shortage of high-quality, high-volume patient-level data that can be shared, most commonly because of privacy concerns associated with HIPAA regulations. Lack of appropriate data is often the reason researchers opt to use simulated data.

  • Limited choices — much of machine learning modeling and education is based on a small number of datasets that have been used repeatedly, such as the iris, Titanic, and MNIST datasets, to mention a few. We need more modern, larger, and better datasets.
  • Small datasets — a 2016 study on predictive analytics reported that the median number of subjects in the reviewed studies was 445. Keep in mind that when the classic train/test split is used, this results in a small test dataset (445 × 0.3 ≈ 133). If the minority class (e.g., cancer) makes up 5% of the total dataset, the test set would contain only about 7 examples of the minority class (133 × 0.05 ≈ 7). Machine learning produces better results, with less chance of overfitting, on larger datasets (see the sketch after this list).
  • Low quality — most AI studies use supervised learning, which means the images must be labeled before training the model (for example, “malignant pulmonary nodule, left upper lobe”). These labels are not always verified, and the origin of the interpretation may be unknown.
  • Lack of data transparency — only a minority of datasets are public, so researchers or reviewers cannot independently run their own models and corroborate the results. Some experts have opined that if the data cannot be shared, then the source code should be shared instead. Many authors do include their code on sites such as GitHub.
  • Lack of prospective data — the majority of data used for modeling is retrospective and observational, not prospective, so the results are less likely to be valid. An ML or AI model should be confirmed by a prospective controlled trial and reported in a peer-reviewed medical journal. Nagendran et al. reported finding only one randomized trial of AI medical imaging registered in the US, in spite of the fact that 16 deep learning algorithms for medical imaging had been approved by the FDA.
  • Test data — in studies where AI accuracy was compared with human accuracy for image interpretation, the test set was frequently different between the two arms of the study. The test data should be the same for both humans and AI.
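
As a concrete illustration of the small-dataset arithmetic above, the sketch below builds a synthetic 445-row dataset with roughly a 5% minority class and performs a stratified 70/30 split; the dataset and split parameters are illustrative assumptions. Stratification preserves the class proportion but cannot create more minority examples.

```python
# Minimal sketch: how few minority-class cases survive a 70/30 split
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic dataset: 445 subjects, roughly 5% in the minority (positive) class
X, y = make_classification(n_samples=445, weights=[0.95], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print("test-set size:", len(y_test))                      # about 445 * 0.3 ≈ 134 rows
print("minority cases in test set:", int(y_test.sum()))   # about 134 * 0.05 ≈ 7 cases
```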

Lack of power analysis. High-quality studies are “powered” to show a specific effect size based on a calculated sample size. Most studies related to ML and AI do not mention a power analysis.
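
As a point of reference, here is a minimal sketch of a conventional power analysis using statsmodels, computing the sample size per group needed to detect a small effect at 80% power in a two-group comparison. The effect size, alpha, and power values are illustrative assumptions.

```python
# Minimal sketch of a sample-size calculation for a two-group comparison
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.2,  # small effect (Cohen's d)
                                    alpha=0.05,       # significance level
                                    power=0.8)        # desired power
print(f"Required subjects per group: {n_per_group:.0f}")  # roughly 394
```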

There are many other areas that need improvement in conducting and reporting ML and AI modeling studies for prediction. Many of these topics are complex and beyond the scope of this article. Readers are referred to the various guidelines for further details.

We will continue to see new guidelines published in this area. In June 2020, the MINimum Information for Medical AI Reporting (MINIMAR) guidance for medical AI reporting was published. As mentioned, TRIPOD will release more detailed guidance regarding ML and AI, and other guidelines such as CONSORT and SPIRIT plan to extend their guidance to AI. This will always be a work in progress, given the dynamic nature of data science.

As with any new scientific field, it takes time for practice to catch up with national or international guidelines. Medical journals need to insist on adherence to ML/AI guidelines for publication.
