Vizly: AI-Enabled Data Analytics

Part 1

We are rapidly moving towards AI-supported data analytics. Current programs, such as Vizly, can produce a variety of analyses and data visualizations. This approach will save time and augment data analytics by data scientists and others. This post will provide a general review of Vizly's features and capabilities along with the results of a dataset analysis.

Generated by DALLE 3

Introduction

In a July 2024 Medium post, “High Performance Predictive Analytics Without Programming,” I reviewed alternatives to programming, such as low-code and no-code solutions. I mentioned that most large language models (LLMs), e.g., GPT-4o, can evaluate datasets but are limited by their inability to generate plots and graphs and by frequent errors. Most LLMs will also generate code (Python, R), but to produce visualizations the code must be pasted into a programming notebook such as a Jupyter notebook. What I did not discuss in that post is that newer LLM-based programs are less error-prone and can generate data visualizations without cutting and pasting code into a notebook. The first program to accomplish this was GPT-4 by OpenAI, which combined an Advanced Data Analysis (ADA) package with the LLM. Following the OpenAI initiative, other AI-enabled data analytics programs have appeared.

I will focus on Vizly, an advanced data analytics package that has significant potential for data scientists and others. I learned about Vizly from a June Medium post that outlined many of its features but evaluated the platform from a programmer’s perspective. Vizly will display code (Python or R), but I will evaluate it from a non-programmer’s perspective. Others, such as Dr. Lyndon Walker, have already done a deep dive on Vizly, with positive results posted on YouTube. My approach will be to determine what Vizly offers those who do not program and who prefer to use natural language prompts for data analytics. This approach should appeal to faculty and students.

The goal of Part 1 will be to review the following:

1. The features of Vizly
2. Its performance with a public dataset
3. The error rate
4. The appropriateness of data visualizations
5. The validity of LLM-raised questions

Part 2 will be a deeper dive to evaluate more complex data analyses.

Vizly Background

Vizly was created in 2023 by Sami Sahnoune and Ali Shobeiri, graduates of McGill University. They have extensive expertise in data visualization and machine learning with work experience at Apple, the NASA Jet Propulsion Laboratory, Microsoft, Splunk, and Plotly.

Vizly Features

Vizly is a web-based platform that accepts the following file types: CSV, Excel, SPSS, PDF, image, and JSON. A free plan is available, but access is limited to 10 messages (queries) per month. Other options include a basic plan at $19.99/month and a premium plan of $29.99/month. For students and faculty, there is a 50% discount on the premium plan. The premium plan includes unlimited messages per month and priority support. Enterprise and custom pricing are offered.

Vizly can be integrated with Snowflake, Redshift, BigQuery, SQL Server, and Postgres for organizations using a customized connection. They also offer a REST API for custom applications.

The default LLM used is GPT-4 and other choices include GPT-4o, GPT-4o mini, Claude 3 Opus, and Claude 3.5 Sonnet. Results can be exported as PDF, .docx, HTML, and .ipynb file formats. Results can be shared via a generated web link. Vizly can generate univariate, bivariate, and multivariate visualizations.

Once a dataset is uploaded, the program will begin asking questions about what functions you would like to see. A logical starting point would be to ask for a data summary that would describe the number of rows and columns and an explanation of column headers.

Alternatively, you can request exploratory data analysis that will generate basic descriptive statistics and data visualizations. Vizly can perform data preprocessing, data visualization, supervised (classification/regression), and unsupervised learning. Based on the data, the program will generate suggestions or ask questions. Users can opt to view the code to see what commands and packages were used for the analysis. Users can rate the response as good or bad. The latter will allow the user to report any issues to the developers.

Example of AI-Enabled Data Analytics

To demonstrate Vizly in action, I selected the World Health Organization dataset for life expectancy, found on Kaggle. The first command was to summarize the data, which generated the following for each column: count, mean, std, min, 25%, 50%, 75%, and max, with an option to download this table. It reported that there are 2938 rows and 22 columns and gave a written explanation of each column header.
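For readers curious what a summary request like this might translate to in code, here is a minimal pandas sketch. It is not Vizly's actual output; the toy DataFrame and its column names are assumptions made for illustration, not the real WHO data.

```python
import pandas as pd

# Toy stand-in for the life expectancy data; values and column names are illustrative only.
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "Life expectancy": [54.8, 55.3, 72.6, 73.0],
    "Schooling": [5.5, 5.9, 10.7, 10.9],
})

# Number of rows and columns in the dataset.
n_rows, n_cols = df.shape

# describe() reports count, mean, std, min, 25%, 50%, 75%, and max for numeric columns.
summary = df.describe()

print(f"{n_rows} rows, {n_cols} columns")
print(summary)
```

Text columns such as Country are skipped by describe() by default, which is why the summary table covers only the numeric columns.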

I then asked for the number of missing values in each column and received the following result.
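A missing-value count of this kind is a one-liner in pandas. The sketch below uses a small made-up table (the column names and values are assumptions, not the WHO data) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps; values are illustrative only.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Status": ["Developing", None, "Developed", "Developing"],
    "GDP": [584.3, np.nan, np.nan, 4132.8],
    "Schooling": [10.1, 9.8, np.nan, 12.4],
})

# isna() marks each missing cell as True; sum() then counts them per column.
missing_per_column = df.isna().sum()

print(missing_per_column)
```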

The program asked whether I would like to handle the missing data in a specific manner, visualize the distribution of the missing data, or use imputation. For simplicity's sake, I asked it to impute missing numerical data with the median value and missing categorical data with the mode. It then asked if I would like to see the summary statistics again after imputation, a correlation analysis, or a distribution of life expectancy. I asked it to plot the life expectancy distribution, and the results are seen below. The most common life expectancy is in the early to mid-70s.
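The imputation rule described above (median for numeric columns, mode for categorical ones) can be sketched in pandas as follows. This is one plausible way to do it, not necessarily the code Vizly generated, and the toy table is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; values are illustrative only.
df = pd.DataFrame({
    "Status": ["Developing", None, "Developed", "Developing"],
    "GDP": [500.0, np.nan, 3000.0, 1000.0],
    "Schooling": [10.0, 9.0, np.nan, 12.0],
})

imputed = df.copy()
numeric_cols = imputed.select_dtypes(include="number").columns
categorical_cols = imputed.columns.difference(numeric_cols)

# Numeric columns: fill gaps with each column's median.
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

# Categorical columns: fill gaps with the most frequent value (the mode).
for col in categorical_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

Median imputation is less sensitive to outliers than the mean, which is one reason it is a common default for skewed variables such as GDP.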

There are several chart options: edit, download, and expand. Editing allows the user to change the chart type, orientation, and name of the x and y axes. You can zoom in or out and crop the image. The charts are interactive, meaning you can mouse over a data point and get exact information.

The next questions Vizly asked were:

1. Would you like to analyze the correlation between life expectancy and other variables?
2. Would you like to visualize the relationship between GDP and life expectancy?
3. Would you like to perform a regression analysis to predict life expectancy?

I asked Vizly to analyze the positive and negative correlations between life expectancy and other variables, and it generated the results of a Pearson correlation. It then gave a summary of findings, as noted below.
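A Pearson correlation of every variable against life expectancy, split into positive and negative groups, might be computed along these lines. The data and column names below are made up for illustration and are not the values Vizly reported.

```python
import pandas as pd

# Toy data; values are illustrative only.
df = pd.DataFrame({
    "Life expectancy": [52.0, 58.0, 63.0, 70.0, 76.0, 81.0],
    "Schooling": [5.0, 7.5, 9.0, 11.0, 13.5, 15.0],
    "Adult Mortality": [410.0, 330.0, 270.0, 190.0, 120.0, 70.0],
    "Population": [3.1, 9.8, 1.2, 7.4, 2.6, 5.0],
})

# Pearson correlation of each numeric column with the target, strongest positive first.
corr_with_target = (
    df.corr(method="pearson", numeric_only=True)["Life expectancy"]
    .drop("Life expectancy")
    .sort_values(ascending=False)
)

positive = corr_with_target[corr_with_target > 0]
negative = corr_with_target[corr_with_target < 0]

print(corr_with_target)
```

Pearson correlation measures only linear association, so a weak coefficient does not rule out a nonlinear relationship.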

It then asked if I wanted to see a visualization comparing schooling and life expectancy because of the high correlation of 0.713 and it produced the following scatterplot.

I then asked Vizly to create regression models to predict life expectancy using linear regression and random forest and to report the RMSE and R-squared values. I did not ask it to split the data into training and test sets, nor did I ask it to perform cross-validation. The result below shows that random forest produced better results than linear regression.
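To make the comparison concrete, here is a hedged scikit-learn sketch that fits both model types and reports RMSE and R-squared. It uses synthetic data and assumed feature names, and it scores the models on the same data they were fit on, mirroring a prompt with no train/test split. In-sample scores like these tend to flatter flexible models such as random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data standing in for two predictors and the target (illustrative only).
rng = np.random.default_rng(0)
n = 200
schooling = rng.uniform(2, 18, n)
adult_mortality = rng.uniform(50, 500, n)
X = np.column_stack([schooling, adult_mortality])
y = 55 + 1.5 * schooling - 0.03 * adult_mortality + rng.normal(0, 2, n)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X, y)
    pred = model.predict(X)  # in-sample predictions: no held-out test set
    rmse = float(np.sqrt(mean_squared_error(y, pred)))
    results[name] = {"rmse": rmse, "r2": float(r2_score(y, pred))}

for name, metrics in results.items():
    print(f"{name}: RMSE={metrics['rmse']:.2f}, R2={metrics['r2']:.3f}")
```

Holding out a test set with sklearn.model_selection.train_test_split, or using cross_val_score, would give a fairer estimate of how each model performs on unseen data.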

The next questions generated by Vizly were:

1. Would you like to determine feature importance for the random forest model?
2. Would you like to compare the predicted vs actual life expectancy results?
3. Would you like to perform cross-validation to validate model performance?

The feature importance chart below shows HIV/AIDS as the most important feature for the random forest model.
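Random forest feature importance is available in scikit-learn through the fitted model's feature_importances_ attribute. The sketch below uses synthetic data in which the target is deliberately built to depend most on the first column, so the ranking is an artifact of the toy setup and not a finding about the WHO data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic predictors; names mirror the article's variables but values are made up.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "HIV/AIDS": rng.uniform(0, 30, n),
    "Schooling": rng.uniform(2, 18, n),
    "Population": rng.uniform(1, 100, n),
})

# By construction: strong effect of the first column, weak effect of the second, none of the third.
y = 75 - 0.8 * X["HIV/AIDS"] + 0.3 * X["Schooling"] + rng.normal(0, 1, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort so the most important feature comes first.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can be biased toward features with many distinct values, so sklearn.inspection.permutation_importance is a useful cross-check.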

I requested a calibration plot to show how close the predicted life expectancy results were to the actual. The figure below shows the model calibration was good for linear regression as most data points are close to the 45-degree line.
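A predicted-versus-actual plot with a 45-degree reference line can be drawn with matplotlib as sketched below. The data are synthetic and the single predictor is an assumption; points lying close to the dashed line indicate predictions close to the actual values.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-predictor data (illustrative only).
rng = np.random.default_rng(0)
schooling = rng.uniform(2, 18, 150).reshape(-1, 1)
actual = 50 + 1.6 * schooling.ravel() + rng.normal(0, 2, 150)

predicted = LinearRegression().fit(schooling, actual).predict(schooling)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(actual, predicted, s=12, alpha=0.6)

# 45-degree line: where predicted equals actual.
lo = min(actual.min(), predicted.min())
hi = max(actual.max(), predicted.max())
ax.plot([lo, hi], [lo, hi], color="red", linestyle="--", label="perfect prediction")

ax.set_xlabel("Actual life expectancy")
ax.set_ylabel("Predicted life expectancy")
ax.legend()
# Call plt.show() or fig.savefig("predicted_vs_actual.png") to render the figure.
```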

Vizly then asked if I wanted to create a time series analysis of the life expectancy data. Below is a line chart that shows that average life expectancy rose from 66.5 to 71.5 during the period 2000 to 2014.
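The yearly averages behind a line chart like this come from grouping the rows by year and taking the mean across countries. A minimal pandas sketch with made-up values (not the WHO figures) follows.

```python
import pandas as pd

# Toy panel data: two countries over three years; values are illustrative only.
df = pd.DataFrame({
    "Country": ["A", "B", "A", "B", "A", "B"],
    "Year": [2000, 2000, 2001, 2001, 2002, 2002],
    "Life expectancy": [60.0, 70.0, 61.0, 71.0, 62.0, 73.0],
})

# Average life expectancy across countries for each year, in chronological order.
yearly_mean = df.groupby("Year")["Life expectancy"].mean().sort_index()

# Change in the average between the first and last year.
change = yearly_mean.iloc[-1] - yearly_mean.iloc[0]

print(yearly_mean)
print(f"Change over the period: {change:.1f} years")
```

Calling yearly_mean.plot() would draw the corresponding line chart when matplotlib is installed.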

Discussion

Vizly performed very well from the standpoint of generating reasonable questions to explore the dataset and providing appropriate visualizations. In this evaluation, no errors were encountered. Analysis speed was good using GPT-4o.

I do want to emphasize two important points. First, all datasets should be inspected in a spreadsheet (Excel, Google Sheets) to assess the quality of the data, missing data, errors, abbreviations, and so forth. This must be done before leveraging AI. Second, we seem to be moving towards AI-enabled analytics in multiple industries, so we must evaluate this new approach. Individuals do this best with at least an intermediate knowledge of data science. Without some expertise, users will not know what questions to ask, and they will be unable to judge AI-generated suggestions and results. To be clear, AI can make mistakes, so its output must be critically evaluated.

Conclusions

Programs like Vizly represent a new trend in data analytics. They will undoubtedly save time and likely improve the quality of the analysis. They will keep improving over time based on the rapid improvement of LLMs. Data scientists need to get involved in evaluating this new trend.
