Vizly: AI-Enabled Data Analytics

--

Part 2

In part 1 of this series, I explored Vizly’s features and demonstrated several data visualizations and basic analyses. In part 2 I will evaluate two datasets and use more sophisticated techniques. Vizly results will be verified by two external data science resources.

In part 1 of this series, I reviewed the general features of Vizly and performed some basic data science tasks to demonstrate exploratory data analysis. I made the point that all datasets should first be explored in a spreadsheet before uploading to Vizly. The second point was that Vizly is best used by individuals with an intermediate data science background, so they know how to ask appropriate questions and can judge the accuracy of the responses. I should add that Vizly could be used in a data analytics course where faculty can educate students on the optimal ways to leverage it.

Part 2 will be a deeper dive into two datasets to see how well Vizly performs. I will check its performance against two open-source data science programs (Orange and JASP).

WHO Life Expectancy Dataset

Let’s continue with the WHO Life Expectancy dataset we used in Part 1, as there are many insights to be gained. If you have a dataset where no outcome has been identified and you want to determine whether any interesting groups/clusters exist, unsupervised learning is warranted. K-means clustering “uncovers” clusters or groups in your data. There is an art to determining the optimal number of clusters to analyze.

Initially, Vizly created a plot with 3 clusters, but I thought 2 might be more appropriate. One approach to determining the optimal number of clusters is to request silhouette scores, which are displayed in the figure below. You select the number of clusters with the highest silhouette score, which here is k = 2.
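
For readers who want to reproduce this cluster-count comparison outside Vizly, here is a minimal Python sketch using scikit-learn. It assumes the WHO data is already loaded into a pandas DataFrame named df; the column names are illustrative and may differ from the actual file.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative column names; the WHO file may label these differently
features = df[["Life expectancy", "GDP"]].dropna()
X = StandardScaler().fit_transform(features)  # scale so GDP does not dominate the distances

# Compute a silhouette score for several candidate values of k
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```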

Choosing k = 2 produces an image below with two clusters. The y-axis is gross domestic product (GDP) and the x-axis is life expectancy. Cluster 1 is lighter green and Cluster 0 is dark green. The plot suggests that Cluster 1 is associated with a low GDP and Cluster 0 is associated with a high GDP.

I asked Vizly to describe both clusters. Below you can see the descriptive statistics associated with the two clusters, with only the top 5 characteristics displayed.

These insights provide a clear differentiation between the two clusters. Cluster 0 generally represents countries with higher life expectancy, lower adult mortality, lower infant deaths, higher alcohol consumption, and greater health expenditure. Not displayed are better immunization rates, higher GDP, and better schooling. What does this mean? Cluster 0 represents developed countries and Cluster 1 represents developing countries.

The correlation between increased alcohol consumption and life expectancy was interesting and a bit surprising. The estimated alcohol consumption is measured in liters of pure alcohol per person aged 15 or older, per year. For example, wine is about 12% pure alcohol by volume, so 50 liters of wine ≈ 6 liters of pure alcohol. I asked Vizly to display alcohol consumption for developing and developed countries and determine if there was a threshold where alcohol consumption began to be deleterious to life expectancy. The plot below suggests that life expectancy diminishes once consumption exceeds 11 liters of pure alcohol per person per year.
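
As a cross-check on that threshold, a simple binning of the same variables can show where the mean life expectancy starts to fall. This is only a sketch and assumes the same DataFrame df with the illustrative column names "Alcohol" and "Life expectancy".

```python
import pandas as pd
import matplotlib.pyplot as plt

# Bin alcohol consumption into 1-liter intervals and average life expectancy in each bin
bins = pd.cut(df["Alcohol"], bins=range(0, 21))
life_by_alcohol = df.groupby(bins, observed=True)["Life expectancy"].mean()

life_by_alcohol.plot(kind="bar")
plt.xlabel("Pure alcohol consumed (liters per person per year)")
plt.ylabel("Mean life expectancy (years)")
plt.tight_layout()
plt.show()
```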

The relationship between alcohol and life expectancy is therefore more complicated than it first appears, because higher alcohol consumption correlates with higher education, income, GDP, and health expenditure. In other words, alcohol is more dangerous when imbibed in countries already facing multiple health challenges.

Framingham Heart Disease Prediction Dataset

This dataset was selected because it is a public dataset with public health implications. The study followed subjects to determine whether they developed coronary heart disease by year 10 (TenYearCHD). I asked Vizly to plot those people who developed heart disease vs those who did not. In the diagram below you can see that those who did not develop heart disease (group = 0, count = 3596) greatly exceeded those who did (group = 1, count = 644). In other words, only about 15% developed heart disease over the ten years. When the outcome or class you are interested in is in the minority, this is known as “class imbalance” and must be addressed because it affects prediction model results.
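
The same counts and percentage are easy to verify directly; here is a quick sketch, assuming the Framingham data is loaded from a CSV (the file name is an assumption).

```python
import pandas as pd

df = pd.read_csv("framingham.csv")  # file name is an assumption

# Check the class balance of the outcome column
counts = df["TenYearCHD"].value_counts()
print(counts)
print(f"Positive rate: {100 * counts[1] / counts.sum():.1f}%")  # roughly 15%
```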

Before I asked Vizly to deal with the class imbalance, I cleaned and processed the data. This consisted of the following steps:

  1. Summarize this data — confirmed by external sources
  2. Provide descriptive statistics — confirmed by external sources
  3. Generate a correlation matrix and heatmap — confirmed by external sources
  4. Impute missing data (median for continuous data and mode for categorical data)
  5. Examine the data for multicollinearity by determining whether any VIFs exceed 10 — external sources confirmed that the results were incorrect
  6. I asked Vizly to look for challenges related to the data, and it emphasized the importance of handling missing data, looking for multicollinearity, and addressing class imbalance. In addition, it noted that education has 4 categories, so it should be handled with dummy encoding, which was done and visualized (a Python sketch of steps 4–6 appears after this list).
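
For readers who want to mirror steps 4 through 6 outside Vizly, here is a hedged sketch using pandas and statsmodels. The column names follow the commonly distributed Framingham CSV and are assumptions about what Vizly actually saw.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("framingham.csv")  # file name is an assumption

# Step 4: impute medians for continuous columns and modes for categorical ones
continuous = ["age", "cigsPerDay", "totChol", "sysBP", "diaBP", "BMI", "heartRate", "glucose"]
categorical = ["education", "BPMeds"]
df[continuous] = df[continuous].fillna(df[continuous].median())
for col in categorical:
    df[col] = df[col].fillna(df[col].mode()[0])

# Step 5: variance inflation factors; values above 10 would flag multicollinearity
X = df[continuous].assign(const=1.0)  # add an intercept column for statsmodels
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))

# Step 6: dummy-encode the 4-level education variable
df = pd.get_dummies(df, columns=["education"], drop_first=True)
```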

I then gave Vizly the following prompt: “Create a classification model to predict TenYearCHD using logistic regression with LASSO and random forest. Report the AUC, accuracy, F1 score, recall, precision, specificity, and MCC for both algorithms” and that produced the baseline training performance. I then gave it this prompt: “I want to address the class imbalance. Calculate the recall, F1, precision, and MCC scores using logistic regression with LASSO and random forest, but use three approaches 1. adjust algorithm weights to balance class distribution 2. Shift the threshold to determine the best threshold 3. Apply the SMOTE technique.”
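
A sketch of the same experiment in scikit-learn and imbalanced-learn is shown below (threshold shifting, the second approach, is omitted for brevity). Here X and y stand for the cleaned features and the TenYearCHD label from the steps above; the hyperparameters are assumptions, not what Vizly actually used.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, f1_score, precision_score, matthews_corrcoef
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

def report(name, model, X_fit, y_fit):
    """Fit a model and print the imbalance-sensitive metrics on the held-out test split."""
    pred = model.fit(X_fit, y_fit).predict(X_test)
    print(f"{name}: recall={recall_score(y_test, pred):.3f} "
          f"f1={f1_score(y_test, pred):.3f} "
          f"precision={precision_score(y_test, pred):.3f} "
          f"mcc={matthews_corrcoef(y_test, pred):.3f}")

# Approach 1: adjust algorithm weights to balance the class distribution
report("LASSO weighted", LogisticRegression(penalty="l1", solver="liblinear",
                                            class_weight="balanced", max_iter=1000),
       X_train, y_train)
report("RF weighted", RandomForestClassifier(class_weight="balanced", random_state=42),
       X_train, y_train)

# Approach 3: oversample the minority class with SMOTE, then fit the unweighted models
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_train, y_train)
report("LASSO + SMOTE", LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
       X_sm, y_sm)
report("RF + SMOTE", RandomForestClassifier(random_state=42), X_sm, y_sm)
```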

For brevity’s sake, I will not post the results of the classification model, but here is Vizly’s summary: “Among the three approaches, adjusting algorithm weights for LASSO Logistic Regression yielded the highest recall and F1 score. However, the random forest model did not show significant improvement with any of the techniques.”

I ran this same analysis in Orange, and it showed that the weighted random forest increased the F1 score from a baseline of 0.062 to 0.379, indicating substantial improvement and disagreeing with Vizly. It should be noted that in the presence of class imbalance, where the model learns mostly from the majority class, AUC and accuracy tend to be misleading. Other measurements, such as the F1 score (the harmonic mean of precision and recall), are more meaningful.

These model results were run on test data. I then asked Vizly to perform hyperparameter tuning on random forest which improved the F1 score to a level similar to the LASSO logistic regression.
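
The tuning step could look something like the grid search below; the parameter grid is illustrative, and the F1 scoring mirrors the emphasis on the minority class.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative hyperparameter grid; Vizly's actual search space was not reported
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # optimize F1 rather than accuracy because of the class imbalance
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, f"cross-validated F1={search.best_score_:.3f}")
```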

The next step was to evaluate model calibration (which is different from model discrimination). The calibration curve was better for random forest, and its Brier score was lower, indicating better calibration. The figure below shows that the predicted probability was closer to the actual probability for the weighted random forest.
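
A calibration comparison of this kind can be reproduced with scikit-learn’s calibration_curve and Brier score; lasso_model and rf_model are placeholder names for the two fitted classifiers from the earlier sketches.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

for name, model in [("LASSO logistic regression", lasso_model),
                    ("Weighted random forest", rf_model)]:
    prob = model.predict_proba(X_test)[:, 1]  # predicted probability of CHD
    frac_pos, mean_pred = calibration_curve(y_test, prob, n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o",
             label=f"{name} (Brier = {brier_score_loss(y_test, prob):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="Perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction with CHD")
plt.legend()
plt.show()
```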

My last request to Vizly was to summarize the analysis and Vizly concluded that the tuned random forest model was better because it predicted the 0 (no heart disease) class better than LASSO logistic regression and predicted the 1 (heart disease) class about the same.

Discussion and Conclusions

Vizly performed well on the more complex tasks, except for producing erroneous VIF results and failing to show improvement when weighting the random forest. The commands were repeated with the same results, and there was no obvious explanation for these errors.

The future of AI-enabled data analytics is very bright, and we can expect more programs on the horizon and more integrations with spreadsheets and other sources of data. Vizly does produce excellent questions, and I suspect that as a user gets more comfortable with the program there will be good synergy between human- and AI-generated questions. The goal would be to maximize the information gained from any dataset.

In my opinion, the best approach to any new dataset is to first examine it visually and completely before initiating an AI-driven inquiry. Currently, there should be a knowledgeable “human in the loop” and double-checking results with an additional program would be wise.
