High-Performance Predictive Analytics Without Programming


Robert E Hoyt and David Patrishkoff

1 Introduction

Most industries are becoming data-driven because data-driven organizations are more likely to be financially competitive and profitable. According to a McKinsey study, “intensive users of customer analytics are 23 times more likely to outperform their competitors in terms of new-customer acquisition than non-intensive users are, and nine times more likely to surpass them in customer loyalty.” [1]

This transition to data-driven business has led to the hiring of more data analysts and data scientists over the past decade. One obstacle is the shortage of such specialists. The U.S. Bureau of Labor Statistics projects a growth rate of approximately 28% in the number of data science skill-related jobs by 2026. [3] However, Deloitte noted that the USA was projected to have a shortage of 250,000 data scientists by 2024. [2] To make matters worse, in this new era of artificial intelligence, 75% of C-suite executives believe that if they cannot pursue substantial AI efforts in the next five years, their businesses may not survive. [4]

Traditionally, data scientists have relied on a programming language to explore and analyze data and create prediction models. This approach is generally taught in master's-level courses that also stress higher mathematics and statistics. Learning a programming language involves a steep learning curve and other challenges. Not being tied to a specific programming language allows students to focus on the fundamental concepts of data analysis, statistical modeling, and machine learning rather than being bogged down in the details of coding syntax and debugging. Importantly, not needing to program opens the field to a broader audience.

The challenges associated with learning a programming language led to low-code and no-code alternatives. This movement has impacted software and website development, but in this article, we focus on its impact on data science. One author defines the no-code approach as one that “uses a programming method that does not involve writing code. Instead, users work with a graphical user interface (GUI) where they can use templates, logical sequences, and drag-and-drop to achieve the desired output. The latter can be data ingestion, transformation, analysis, machine learning, etc.” [5] According to Gartner, the no-code and low-code platform market grew by an average of more than $1 billion per year, from $3.47 billion in 2019 to $8 billion in 2022. The market is forecast to grow by almost 20% in 2023, to $10 billion, and to reach $12.3 billion in 2024. [6]

Creating a broader audience for data science that does not require programming leads to the “democratization” of data science. There are several advantages to democratization:

  • Improved decision-making: By harnessing the collective insights of a data-literate workforce, organizations can make better decisions based on factual evidence rather than intuition or guesswork.
  • Increased innovation: Democratizing data empowers individuals to identify new opportunities and develop data-driven solutions, leading to increased innovation and agility.
  • Enhanced operational efficiency: Everyone understanding and utilizing data can lead to streamlined processes, improved resource allocation, and cost optimization.
  • Greater employee engagement: When employees feel involved in data-driven initiatives and observe the impact of their data analysis, their engagement and sense of ownership increase.
  • Better communication: Workers with at least basic data science skills can communicate more effectively with data scientists and other specialists. [7]

To ensure that this paradigm shift succeeds, several actions must occur:

  • Empowering employees: This involves fostering a culture where everyone feels comfortable asking questions about data and providing training and resources to help them develop basic data literacy skills.
  • Data accessibility: Making relevant data readily available and understandable, ensuring data quality and security while promoting responsible data usage.
  • Shifting mindsets: Encouraging collaboration between data experts and nontechnical users, fostering a data-driven decision-making approach throughout the organization. [7]
  • Providing user-friendly tools: This approach involves equipping people with intuitive and accessible tools for data exploration, visualization, and analysis that do not require coding knowledge. [8]

As noted previously, it is important to have a collection of intuitive and affordable tools for conducting data science and collaborating with others. There are several ways to accomplish data science with a low-code or no-code approach, as outlined in the next section. [9] In the sections that follow, we demonstrate high-performance predictive analytics using one of these tools.

2 Low-Code and No-Code Data Science Programs

This section will provide examples of low-code and no-code programs, but it is not an exhaustive list.

2.1 Low-Code Tools

Low-code AutoML automates machine learning programming steps so that more steps are performed with fewer lines of code. Examples of low-code AutoML include PyCaret, Pandas Profiling, and MLBox. [10–12]
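
As an illustration, the following is a minimal sketch of the kind of low-code pipeline PyCaret enables. It assumes PyCaret is installed and that a local CSV file (the file and target-column names here are hypothetical) holds a binary classification dataset:

    import pandas as pd
    from pycaret.classification import setup, compare_models

    # Hypothetical local dataset with a binary target column.
    df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

    # One call configures preprocessing, the train/test split, and cross-validation.
    experiment = setup(data=df, target="DEATH_EVENT", session_id=42)

    # One call trains many classifiers and ranks them by cross-validated metrics.
    best_model = compare_models()
    print(best_model)

Two function calls replace what would otherwise be dozens of lines of conventional code, which is the essence of the low-code approach.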

2.2 Graphical User Interface (GUI)-Based Tools

These tools, also known as visual programming tools, are based on a drag-and-drop interface. Connecting two operators, or widgets, produces a data science function, such as creating a histogram. Two commercial examples are KNIME, which offers both a free community version and a paid version, and Altair RapidMiner, which is commercial only. [13–14] The program we highlight in this article is Orange Data Mining; a later section is devoted to this platform and a predictive analytics use case. [15]

2.3 Other No-Code Data Science Software Packages

  • WEKA 3 is an open-source machine learning platform created by the University of Waikato, New Zealand. [16]
  • No-Code AutoML: These programs accomplish many data science tasks by automating machine learning steps without coding. Examples include Google AutoML, H2O AutoML, Microsoft Azure AutoML, Datrics, and BigML AutoML. [17–21]
  • No-Code AutoAI: IBM Watson AutoAI, Obviously AI, Clarifai, CrowdAI, DataRobot, and Dataiku. [22–27]
  • No-Code AutoNLP: Hugging Face AutoNLP and MonkeyLearn. [28–29]
  • No-code data visualization: Tableau has a free community version. [30]
  • JASP is an open-source statistical program based on R programming in the background and includes algorithms for supervised and unsupervised learning. [31]
  • BlueSkyStatistics (BSS) offers a free and commercial program covering extensive statistical functions, including machine learning algorithms. [32]

2.4 Large Language Models (LLMs)

LLMs are included because they can perform many machine learning steps with or without coding. Many LLMs can analyze and summarize an uploaded dataset and, when asked, explain each step in the machine learning pipeline. For data visualization, most LLMs require the generated Python code to be pasted into a programming notebook, such as a Jupyter Notebook, to render the image. While the results can be extremely impressive, there are issues such as missing Python packages, occasional errors, and the need to repeatedly remind and advise the LLM to obtain results. This approach is advisable for individuals with intermediate to advanced data science or programming experience. [33]
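
As a concrete example of the visualization handoff described above, an LLM asked to plot a column distribution will typically return a short snippet like the following, which the user then pastes into a Jupyter cell to see the image (the file and column names here are hypothetical):

    import pandas as pd
    import matplotlib.pyplot as plt

    # The dataset previously uploaded to the LLM, saved locally.
    df = pd.read_csv("uploaded_dataset.csv")

    # Histogram of a single numeric column, as an LLM might suggest.
    df["age"].plot(kind="hist", bins=20, edgecolor="black")
    plt.xlabel("age")
    plt.ylabel("count")
    plt.title("Distribution of age")
    plt.show()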

3 Orange Data Mining

Orange is an open-source data visualization and analysis tool that enables users to perform data science tasks without writing code. Developed by the Bioinformatics Lab at the University of Ljubljana, Slovenia, Orange is a versatile platform that caters to a wide range of users, from beginners to experts. The computational engine used is Python, which operates in the background.

One of Orange’s key features is its user-friendly interface. The platform utilizes a drag-and-drop approach, allowing users to easily create workflows by connecting different data analysis components (widgets). Orange provides a comprehensive set of widgets (200+) for data preprocessing, imputation, normalization, visualization, and analysis, which can be customized to meet the specific needs of the user. These widgets are organized into categories such as Data, Transform, Visualize, Model, Evaluate, Unsupervised, Image Analytics, Time Series, Survival Analysis, Text Mining, Bioinformatics and Geolocation.

The File widget is preloaded with six common datasets. In addition, users can upload their own data or download data from a URL, e.g., from a Google Sheet. There is also a Datasets widget that contains 90 datasets, covering a variety of data types that can be used for exploration and modeling. Orange supports CSV and Excel input formats and is available for Windows and macOS. Orange also incorporates a range of machine learning algorithms, including classification, regression, clustering, and association rule mining, and it provides extensive tools for evaluating supervised and unsupervised learning models, enabling users to assess model performance and choose the best model for their specific use case. A Python Script widget allows users to add custom functionality; for example, we were able to create SMOTE and Brier score widgets that are not included in the default Orange widget set.
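
To illustrate, here is a minimal sketch of the kind of code that can be pasted into the Python Script widget to add SMOTE oversampling. It assumes the imbalanced-learn package has been installed into Orange's Python environment; in_data and out_data are the widget's standard input and output variables:

    # Inside Orange's Python Script widget: oversample the minority class
    # with SMOTE from imbalanced-learn (an assumed extra dependency).
    from imblearn.over_sampling import SMOTE
    from Orange.data import Table

    X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(
        in_data.X, in_data.Y
    )

    # Rebuild an Orange Table with the original domain and send it downstream.
    out_data = Table.from_numpy(in_data.domain, X_resampled, y_resampled)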

Orange has a large and active community of users and developers who contribute to the platform by creating new components and providing support to other users. Extensive videos are available for Orange platform education. [15] Orange has been utilized as the data science platform in multiple published articles on predictive analytical methods. [34–36]

4 Data Analytics Use-Case Using Orange

4.1 Published Approach

In their article, Chicco et al. showed that machine learning could predict the survival of patients with heart failure from the serum creatinine concentration and ejection fraction alone; they created a classification model to predict death due to heart failure. [37] The dataset can be found in the UCI Machine Learning Repository. [38] The dataset consisted of 299 subjects (105 women, 194 men) aged 40 to 95 years. Two hundred three patients survived (68%), and 96 died (32%). The categorical predictors were anemia, hypertension, diabetes, sex, and smoking. The numerical predictors were age, creatine phosphokinase (CPK), ejection fraction (EF), platelet count, serum creatinine concentration, sodium concentration, and follow-up time in days. There were no missing data.

After a 70/30 split and 10-fold cross-validation, classification models were created with the following algorithms: random forest, decision tree, gradient boosting, artificial neural network (ANN), logistic regression, naive Bayes, support vector machine (SVM), k-nearest neighbors (KNN), and one rule. R was the computational engine. The SVM, KNN, and ANN algorithms underwent hyperparameter optimization (HPO) via grid search. The classification model performance results are summarized in Table 1. The time to follow-up was not included in the model.
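
The original analysis was coded in R. As a rough illustration of what grid-search HPO entails, the following scikit-learn sketch tunes an SVM with 10-fold cross-validation; the file and column names are assumptions based on the UCI dataset, not the original authors' code:

    import pandas as pd
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Assumed local copy of the UCI heart failure dataset.
    df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
    X = df.drop(columns=["DEATH_EVENT", "time"])  # follow-up time excluded
    y = df["DEATH_EVENT"]

    # 10-fold cross-validated grid search over common SVM hyperparameters.
    pipeline = make_pipeline(StandardScaler(), SVC())
    param_grid = {
        "svc__C": [0.1, 1, 10, 100],
        "svc__gamma": ["scale", 0.01, 0.1, 1],
    }
    search = GridSearchCV(
        pipeline,
        param_grid,
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
        scoring="f1",
    )
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))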


Table 1 Classification model performance metrics created with programming. The best performer is bolded in each column. ANN = artificial neural network, NB = naive Bayes, KNN = k-nearest neighbor

Based on the biostatistical and machine learning techniques described in the article, the strongest predictors were ejection fraction and serum creatinine.

4.2 The No-Code Approach

Essentially the same algorithms are available in Orange, except for the one-rule algorithm. Preprocessing consisted of converting categorical variables to continuous variables and normalizing the data using the Preprocess widget. There were no missing data.

A classification model was created after a 70/30 training/test split with stratification, and model performance was reported on the test set. Logistic regression can be run with no regularization, LASSO (L1) regularization, or ridge (L2) regularization; the latter two are reported. Logistic regression can also be weighted for class imbalance, as can random forest, and the reported results are weighted. ANNs have multiple HPO options, such as the number of hidden layers, the number of neurons per hidden layer, the optimizer, the activation function, the regularization level, and the number of iterations. The default settings were used: one hidden layer with 100 nodes, ReLU activation, the Adam optimizer, regularization = 0.08, and 200 iterations. Orange offers four gradient boosting methods: gradient boosting, catboost, xgboost, and extreme gradient boosting random forest. Xgboost was the default gradient boosting method used, and its regularization was adjusted to achieve the best F1 score; for this study, the gradient boosting, catboost, and xgboost algorithms were all evaluated. The Orange workflow used to create this classification model is shown in Figure 1.
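
To make the setup concrete, the following scikit-learn sketch approximates what the Orange workflow performs behind the scenes: a stratified 70/30 split, a ridge-regularized logistic regression weighted for class imbalance, and test-set evaluation. The file and column names are assumptions based on the UCI dataset:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
    X = df.drop(columns=["DEATH_EVENT", "time"])  # follow-up time excluded
    y = df["DEATH_EVENT"]

    # Stratified 70/30 train/test split, mirroring the Orange workflow.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    # Normalize, then fit ridge (L2) logistic regression weighted for imbalance.
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)

    # Report performance on the held-out test set only.
    print(classification_report(y_test, model.predict(scaler.transform(X_test))))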

Figure 1 Orange Workflow

Table 2 Classification model performance metrics created without programming on test data. The best performer is bolded in each column. ANN = artificial neural network, KNN = k-nearest neighbor.


Because of the class imbalance, the original article reported F1, MCC, and AUPRC, which are more appropriate metrics in this setting. [39] Accuracy and AUROC are less informative when class imbalance is present. There are many methods for dealing with class imbalance, such as over- or undersampling, SMOTE, threshold shifting, and weighting. [39] The ROC threshold in Orange was shifted to 0.34, which improved the TPR for logistic regression (ridge) to 0.708, although the specificity decreased to 0.477. We also used the Calibrated Learner widget, which offers sigmoid and isotonic calibration, but this did not improve the F1 score. Finally, we used the Stacking widget with xgboost and naive Bayes as learners and logistic regression (ridge) as the aggregator; the results are displayed in Table 2. Table 3 compares the leading performance results of the programming approach used in the original article and the nonprogramming (Orange Data Mining) approach.
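
As a rough illustration of the stacking and threshold-shifting steps described above, the following scikit-learn sketch stacks an xgboost learner with naive Bayes under a ridge logistic regression aggregator and then applies a 0.34 probability threshold. This is a sketch under assumed file and column names, not the exact Orange configuration:

    import pandas as pd
    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from xgboost import XGBClassifier  # assumes the xgboost package is installed

    df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
    X = df.drop(columns=["DEATH_EVENT", "time"])
    y = df["DEATH_EVENT"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    # Stack xgboost and naive Bayes; aggregate with ridge logistic regression.
    stack = StackingClassifier(
        estimators=[("xgb", XGBClassifier(eval_metric="logloss")),
                    ("nb", GaussianNB())],
        final_estimator=LogisticRegression(penalty="l2", max_iter=1000),
    )
    stack.fit(X_train, y_train)

    # Shift the decision threshold from the default 0.5 down to 0.34,
    # trading specificity for a higher true-positive rate.
    probabilities = stack.predict_proba(X_test)[:, 1]
    predictions = (probabilities >= 0.34).astype(int)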


Table 3 Seven classification model metrics comparing the programming and nonprogramming approaches. The best results are bolded. RF = random forest, ANN = artificial neural network, NB = Naive Bayes, XGB = xgboost, LR = logistic regression.

As shown in Table 3, the Orange approach produced better results for four of the seven metrics, and stacking yielded the best F1 score.

The strongest clinical predictors in our analysis were also the serum creatinine concentration and ejection fraction. Using information gain, the Rank widget rated serum creatinine as the strongest predictor at 0.119 and ejection fraction as the second strongest at 0.080 (Figure 2).
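
For readers who want a programmatic analog, feature scores similar to those of the Rank widget can be computed with scikit-learn's mutual information, a close relative of (though not identical to) information gain; the file and column names are assumptions:

    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif

    df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
    X = df.drop(columns=["DEATH_EVENT", "time"])
    y = df["DEATH_EVENT"]

    # Score each feature by its mutual information with the class label.
    scores = mutual_info_classif(X, y, random_state=42)
    ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    print(ranking)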

Figure 2 Rank Widget Results

The Feature Importance widget uses permutation to determine which features are the strongest predictors according to a chosen metric, such as the F1 score. [39] Using xgboost as the algorithm and the F1 score as the metric, the serum creatinine concentration was the strongest predictor, followed by the ejection fraction (Figure 3).
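
Permutation importance can be reproduced in scikit-learn as follows: each feature in the test set is shuffled in turn, and the resulting drop in the chosen metric measures that feature's importance. This sketch again uses assumed file and column names:

    import pandas as pd
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
    X = df.drop(columns=["DEATH_EVENT", "time"])
    y = df["DEATH_EVENT"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    model = XGBClassifier(eval_metric="logloss").fit(X_train, y_train)

    # Shuffle each feature on the test set and measure the drop in F1.
    result = permutation_importance(
        model, X_test, y_test, scoring="f1", n_repeats=10, random_state=42
    )
    importances = pd.Series(result.importances_mean, index=X.columns)
    print(importances.sort_values(ascending=False))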

Figure 3 Feature Importance Widget Results

Model discrimination is commonly reported, but model calibration is rarely reported. [39] Good model discrimination does not imply good model calibration. Calibration is important because the predicted probabilities should reflect the actual outcome frequencies. For example, if a model predicts a probability of 0.8 for a positive outcome, the outcome should actually be positive approximately 80% of the time. Figure 4 displays the calibration plots for all the algorithms and shows very poor calibration; excellent calibration would mean that the plots follow the 45-degree line. The calibration in Figure 4 is particularly poor near a predicted probability of 1, where the predicted probabilities are falsely high.
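
Calibration plots like those Orange draws can be approximated with scikit-learn's calibration_curve, which bins the predicted probabilities and compares each bin's mean prediction with the observed event rate (assumed file and column names again):

    import pandas as pd
    from sklearn.calibration import calibration_curve
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
    X = df.drop(columns=["DEATH_EVENT", "time"])
    y = df["DEATH_EVENT"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probabilities = model.predict_proba(X_test)[:, 1]

    # A well-calibrated model tracks the 45-degree line (observed ≈ predicted).
    observed, predicted = calibration_curve(y_test, probabilities, n_bins=10)
    for pred, obs in zip(predicted, observed):
        print(f"predicted {pred:.2f} -> observed {obs:.2f}")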

Figure 4 Model Calibration

Classification model discrimination and calibration were poor on this dataset, but the article by Chicco et al. provided a reasonable basis for comparing a programming approach with a nonprogramming approach.

5 Discussion

Table 3 shows that the no-code approach is comparable to the programming approach taken by the authors of the original article. Our intent was not to show that the no-code approach is better but rather that it offers similar accuracy and functionality. However, we reported more approaches for dealing with class imbalance than the original authors did, and we also reported model calibration using Orange.

Our use-case analysis demonstrated that high-performance data analysis and predictive modeling can be accomplished without programming, an approach we refer to as no-code data science (NCDS). This approach is simpler and more cost-effective. Although we are unaware of a direct comparison of standard programming with visual programming, visual programming is likely faster. With the NCDS approach, there is no steep programming learning curve and no site-license expense.

An additional advantage of Orange and the other software programs mentioned above is their educational value. In the case of Orange, several unique features that support data science education stem from its excellent visualization capabilities:

  • There is a paint widget that allows a user to create mock data in, e.g., a linear or nonlinear pattern that can be analyzed with further visualizations and modeling.
  • There is a separate educational module with its own widgets:
    • The polynomial classification and regression widgets illustrate linear and nonlinear relationships and model performance.
    • The interactive k-means clustering widget lets users manipulate the centroids and displays the results, teaching the basics of unsupervised learning.
    • The gradient descent widget shows the incremental steps of gradient descent for logistic and linear regression.
  • Many of the Orange visualizations and other widgets are interactive, meaning that one can select part of a plot to isolate those subjects for further analysis. For example, one can highlight a node in the Decision Tree Viewer widget, a bar in a Distribution widget, a cluster in the hierarchical dendrogram, data in the Silhouette widget, a data point in the Scatterplot widget, a box in the Box Plot widget, and a segment in the Sieve or Mosaic plot widgets. In addition, data in the Feature Statistics, Data Table, and CN2 Rule Viewer widgets can be highlighted, and segmented data can be sent forward for further analysis or saved as a new dataset. Moreover, when an algorithm is adjusted, e.g., through regularization, changes in performance can be observed, and the confusion matrix and ROC curve are dynamically updated.
  • The Mosaic, Sieve, and Scatterplot widgets have the option to “find informative projections”, which saves time in finding interesting relationships.
  • Most widgets have a report option that displays and archives the results of several data science tasks that can be annotated and shared.
  • Multiple algorithms can be run at the same time, so the performances of the algorithms can be quickly compared.
  • The workflow is saved as an OWS file that can be attached to an email or uploaded to Google Drive and shared with students and faculty, much as one would share a Python notebook.

There are several limitations of the no-code data science (NCDS) approach that should be noted. NCDS programs may have limited functionality when dealing with big data, artificial intelligence, and integration with certain external software programs or databases. [40] Creating a predictive model with an NCDS program is so fast and easy that users may skip extremely important early steps, such as data preprocessing, exploratory data analysis, and hyperparameter optimization. Regardless of whether someone takes a programming or an NCDS approach, they must follow the logical steps of evaluating the data and preparing it for modeling. NCDS programs include, but do not explain, basic statistical tests such as t-tests and chi-squared tests, so users must learn basic statistics in addition to the NCDS programs; a short example of such tests appears below. There are some areas where SPSS and SAS will have an edge, e.g., extremely specialized or computationally demanding analyses, and these commercial packages also include dedicated professional support that may be essential for some use cases.
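
For perspective, the kinds of basic tests that NCDS programs run without explanation take only a few lines in Python's scipy; the sample values below are made up for illustration:

    from scipy import stats

    # Two-sample t-test: compare a numeric measure between two groups
    # (made-up ejection fraction values for survivors and non-survivors).
    survived = [38, 45, 50, 55, 60, 62]
    died = [20, 25, 30, 30, 35, 38]
    t_stat, p_value = stats.ttest_ind(survived, died)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

    # Chi-squared test: association between two categorical variables,
    # given a 2x2 contingency table of counts (also made up).
    table = [[90, 30], [113, 66]]
    chi2, p, dof, expected = stats.chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")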

6 Conclusions

Open-source programs are currently available that can perform and teach data science without the need to learn a programming language, allowing more time to be devoted to the core concepts of data science. This approach is appropriate for a large percentage of workers in multiple fields and is an example of the democratization of data science. Extensive data preprocessing and exploratory data analysis can be accomplished with an NCDS approach, as can advanced modeling, hyperparameter optimization, and image and text analysis. The open-source program highlighted in this article is intuitive and was created with education in mind. There is still a learning curve, but it is not as steep as that of learning a programming language.

References

  1. Bokman A, Fiedler L, Perrey J, Pickersgill A. Five facts: How customer analytics boosts corporate performance [Internet]. McKinsey & Company; 2014 [cited 2024 Feb 7]. Available from: https://www.mckinsey.com/capabilities/growth-marketing-and-sales/our-insights/five-facts-how-customer-analytics-boosts-corporate-performance
  2. Schatsky D, Chauhan R, Muraskin C. Democratizing data science to bridge the talent gap [Internet]. Deloitte Insights; [cited 2022 Dec 20]. Available from: https://www2.deloitte.com/us/en/insights/focus/signals-for-strategists/democratization-of-data-science-talent-gap.html
  3. Rieley M. Big data adds up to opportunities in math careers. Beyond the Numbers [Internet]. U.S. Bureau of Labor Statistics; 2018 [cited 2022 Dec 20]. Available from: https://www.bls.gov/opub/btn/volume-7/big-data-adds-up.htm
  4. Accenture. Failure to scale artificial intelligence could put 75 percent of organizations out of business, Accenture study shows [Internet]. 2019 Nov 14 [cited 2022 Dec 20]. Available from: https://newsroom.accenture.com/news/failure-to-scale-artificial-intelligence-could-put-75-percent-of-organizations-out-of-business-accenture-study-shows.htm
  5. de Leyritz L. How to build a no-code data stack [Internet]. Castor; 2021 [cited 2024 Feb 8]. Available from: https://www.castordoc.com/blog/what-are-the-no-code-data-tools
  6. Boboyorova C. How Will The No-Code Market Grow In 2023? Forbes Magazine [Internet]. 2023 Apr 4 [cited 2024 Feb 8]; Available from: https://www.forbes.com/sites/forbestechcouncil/2023/04/04/how-will-the-no-code-market-grow-in-2023/
  7. Cornelissen J. The Democratization of Data Science. Harvard Business Review [Internet]. 2018 July 27 [cited 2024 Feb 8]; Available from: https://hbr.org/2018/07/the-democratization-of-data-science
  8. Samarasinghe S, Lokuge S. Exploring the Critical Success Factors for Data Democratization [Internet]. arXiv [cs.CY]. 2022. Available from: http://arxiv.org/abs/2212.03059
  9. White MJ. Does Data Science Require Coding? [Internet]. Springboard Blog. Springboard; 2023 [cited 2024 Feb 7]. Available from: https://www.springboard.com/blog/data-science/data-science-coding/
  10. PyCaret [Internet]. 2022 [cited 2024 Feb 9]. Home. Available from: https://pycaret.org/
  11. PyPI [Internet]. [cited 2024 Feb 9]. pandas-profiling. Available from: https://pypi.org/project/pandas-profiling/
  12. Home — MLBox Documentation [Internet]. [cited 2024 Feb 9]. Available from: https://mlbox.readthedocs.io/en/latest/
  13. KNIME [Internet]. [cited 2024 Feb 9]. Available from: https://www.knime.com/
  14. Altair RapidMiner [Internet]. [cited 2024 Feb 9]. Available from: https://altair.com/altair-rapidminer
  15. Bioinformatics Laboratory, University of Ljubljana. Orange Data Mining. [cited 2023 Oct 28]. Orange data mining. Available from: http://orangedatamining.com
  16. Weka 3 — data mining with open source machine learning software in java [Internet]. [cited 2024 Feb 9]. Available from: https://www.cs.waikato.ac.nz/ml/weka/
  17. Google Cloud [Internet]. [cited 2024 Feb 9]. AutoML. Available from: https://cloud.google.com/automl
  18. H2O AutoML: Automatic Machine Learning — H2O 3.44.0.3 documentation [Internet]. [cited 2024 Feb 9]. Available from: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
  19. Microsoft Azure AutoML [Internet]. [cited 2024 Feb 9]. Available from: https://azure.microsoft.com/en-us/free/machine-learning/search
  20. Datrics AI-Powered Solutions [Internet]. [cited 2024 Feb 9]. Available from: https://www.datrics.ai/
  21. BigML.com [Internet]. [cited 2024 Feb 9]. Available from: https://bigml.com/
  22. AutoML and AutoAI — IBM Watson Studio [Internet]. [cited 2024 Feb 9]. Available from: https://www.ibm.com/products/watson-studio/autoai
  23. Obviously AI. [Internet]. [cited 2024 Feb 9]. Available from: https://www.obviously.ai/
  24. Clarifai. [Internet]. [cited 2024 Feb 9]. Available from: https://www.clarifai.com/
  25. CrowdAI [Internet]. [cited 2024 Feb 9]. Available from: https://www.crowdai.com/
  26. DataRobot AI Platform [Internet]. [cited 2024 Feb 9]. Available from: https://www.datarobot.com/
  27. Dataiku [Internet]. 2019 [cited 2024 Feb 9]. Available from: https://www.dataiku.com/
  28. Hugging Face AutoNLP and Prodigy [Internet]. [cited 2024 Feb 9]. Available from: https://huggingface.co/blog/autonlp-prodigy
  29. MonkeyLearn text analytics [Internet]. [cited 2024 Feb 9]. Available from: https://monkeylearn.com/
  30. Tableau [Internet]. [cited 2024 Feb 9]. Available from: https://tableau.com
  31. JASP. [Internet]. 2021 [cited 2024 Feb 9]. Available from: https://jasp-stats.org/
  32. BlueSkyStatistics [Internet]. [cited 2024 Feb 9]. Available from: https://www.blueskystatistics.com/
  33. Lingo R. The Role of ChatGPT in Democratizing Data Science: An Exploration of AI-facilitated Data Analysis in Telematics [Internet]. arXiv [cs.CY]. 2023. Available from: http://arxiv.org/abs/2308.02045
  34. Perwitasari DA, Zulkarnain Z, Rokhman MR. Models for predicting the quality of life domains on the general population through the orange data mining approach. Pharmaciana. 2022. doi: 10.12928/pharmaciana.v12i1.20827
  35. Thange U, Shukla VK, Punhani R, Grobbelaar W. Analyzing COVID-19 Dataset through Data Mining Tool “Orange.” In: 2021 2nd International Conference on Computation, Automation and Knowledge Management (ICCAKM). IEEE; 2021. p. 198–203.
  36. Mohapatra S, Swarnkar T. Comparative study of different orange data mining tool-based AI techniques in image classification. In: Lecture Notes in Networks and Systems. Singapore: Springer Singapore; 2021. p. 611–20.
  37. Chicco D, Jurman G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak. 2020 Feb 3;20(1):16.
  38. UCI machine learning repository [Internet]. [cited 2024 May 13]. Available from: https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records
  39. Patrishkoff D, Hoyt R. Chapter 4: Machine Learning Models. In: Patrishkoff D, Hoyt R, editors. No-Code Data Science. Morrisville, NC: Lulu; 2023. p. 162–72.
  40. Shala D. What is No-Code Data Science and its Influence on the World [Internet]. 2023 [cited 2024 Feb 8]. Available from: https://www.no-code-software.com/no-code/what-is-no-code-data-science-and-its-influence-on-the-world/amp/
