Beyond Static Models: Boost Your Results with Dynamic Optimization in Orange
Introduction
Creating a machine learning model involves multiple time-consuming steps. Approximately seventy-five percent of a data scientist’s modeling time is reported to go to data pre-processing and exploratory data analysis. [1] What may not be emphasized enough is the need to optimize models by adjusting algorithm settings, known as hyperparameter optimization (HPO), which is also time-consuming. HPO is critical to maximizing the performance of a classification or regression model in any domain, including Kaggle competitions. [2] When a programming language is used, HPO is commonly accomplished by grid search or random search. [3]
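For readers who do work in code, the two search strategies can be sketched in a few lines of Python. This is a minimal, dependency-free illustration rather than how any particular library implements them: the data are made up, the model is a one-feature ridge regression solved in closed form, and the names fit_ridge, val_mse, grid_search, and random_search are our own placeholders.

```python
import random

# Made-up data: y is roughly 3*x plus noise (illustrative only).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]

# Hold out the last 30% of rows for validation.
cut = len(xs) * 7 // 10
x_train, y_train = xs[:cut], ys[:cut]
x_val, y_val = xs[cut:], ys[cut:]


def fit_ridge(x, y, alpha):
    """Closed-form slope of a no-intercept ridge regression with penalty alpha."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def val_mse(alpha):
    """Fit on the training rows, return mean squared error on the validation rows."""
    w = fit_ridge(x_train, y_train, alpha)
    return sum((w * a - b) ** 2 for a, b in zip(x_val, y_val)) / len(x_val)


def grid_search(candidates):
    """Try every candidate value and keep the one with the lowest validation error."""
    return min(candidates, key=val_mse)


def random_search(n_iter, seed=0):
    """Sample alpha log-uniformly from [1e-3, 1e3] and keep the best draw."""
    sampler = random.Random(seed)
    draws = [10 ** sampler.uniform(-3, 3) for _ in range(n_iter)]
    return min(draws, key=val_mse)


best_grid = grid_search([0.001, 0.01, 0.1, 1, 10, 100])
best_random = random_search(n_iter=20)
print(f"grid search:   alpha={best_grid:g}, MSE={val_mse(best_grid):.4f}")
print(f"random search: alpha={best_random:.4g}, MSE={val_mse(best_random):.4f}")
```

Grid search evaluates a fixed list of candidate values, while random search draws candidates from a range. Either way, each candidate requires refitting and rescoring the model, which is where the time goes.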
What is an efficient way to perform HPO when the analyst is not using a programming language such as R or Python? Several software programs, such as Orange, KNIME, and RapidMiner, use visual programming instead. Orange is a free, open-source data mining program that can handle all processes in the machine learning pipeline. [4] Orange has multiple educational features that facilitate student learning, and HPO is no exception: Orange can dynamically display changes in model performance as hyperparameters are adjusted. We call this “dynamic model optimization” or DMO. While it still involves some trial and error, you see the effect on model performance immediately. Instead of changing a setting such as regularization and then going back and forth between windows, you can tweak the algorithm with all windows displayed and watch performance change in real time.
Regression Models
Below is a screenshot of the workflow in Orange to create regression models using the Linear Regression, Random Forest, and XGBoost widgets to predict housing prices. The dataset used is the Kaggle King County housing dataset. [5] The Data Sampler widget splits the data 70/30 into train/test. Model performance on the test data is displayed in the Test and Score widget. We felt that videos capture the interactive nature of DMO better than screenshots. The first video shows how regression model performance changes with no regularization, LASSO, Ridge, and Elastic Net regularization in the Linear Regression widget.
Video #1 https://drive.google.com/file/d/1D8eD4aGA5nuYvGy08WvKmemsuk75OOYZ/view?usp=drive_link
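To show what the four regularization options are doing numerically, here is a minimal sketch in plain Python. It is not Orange's implementation: the data are made up, the model has no intercept, the penalty strength of 0.1 is an arbitrary choice, and fit_elastic_net is our own small coordinate-descent routine using one common form of the Elastic Net objective.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha, l1_ratio, n_sweeps=200):
    """Coordinate descent for a no-intercept linear model.

    Minimizes (1/2n)*sum(residual^2) + alpha*l1_ratio*sum(|w|)
              + 0.5*alpha*(1 - l1_ratio)*sum(w^2).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlate feature j with the residual that ignores feature j.
            rho = 0.0
            for row, target in zip(X, y):
                pred_without_j = sum(w[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * (target - pred_without_j)
            rho /= n
            z = sum(row[j] ** 2 for row in X) / n
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
    return w


def rmse(X, y, w):
    """Root mean squared error of the linear predictions against y."""
    errors = [sum(wk * xk for wk, xk in zip(w, row)) - t for row, t in zip(X, y)]
    return (sum(e * e for e in errors) / len(errors)) ** 0.5


# Made-up data: three features, the third has no true effect on the target.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(100)]
y = [2.0 * r[0] - 1.0 * r[1] + rng.gauss(0, 0.5) for r in X]

# 70/30 train/test split, mirroring the Data Sampler setting described above.
X_train, y_train, X_test, y_test = X[:70], y[:70], X[70:], y[70:]

settings = {
    "No regularization": (0.0, 0.0),   # (alpha, l1_ratio)
    "LASSO (L1)": (0.1, 1.0),
    "Ridge (L2)": (0.1, 0.0),
    "Elastic Net": (0.1, 0.5),
}
coefs = {}
for name, (alpha, l1_ratio) in settings.items():
    coefs[name] = fit_elastic_net(X_train, y_train, alpha, l1_ratio)
    shown = ", ".join(f"{c:+.3f}" for c in coefs[name])
    print(f"{name:18s} test RMSE={rmse(X_test, y_test, coefs[name]):.3f}  w=[{shown}]")
```

Each change of setting means a refit and a rescore, which is the loop that Orange performs for you and redraws on screen as the controls are moved.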
We can also visualize how a plot of actual versus predicted values changes as hyperparameters are adjusted in the three algorithms. The goal is to achieve the highest correlation between the actual result and the prediction, indicating “goodness of fit.” [6] This is accomplished by connecting a Scatter Plot widget to the Test and Score widget and setting Price for the y-axis and Gradient Boosting for the x-axis. We are comparing the actual value (Price) with the value predicted by the gradient boosting model (Gradient Boosting). The Scatter Plot reports an r-value that also updates dynamically. The video below demonstrates that regularization can be adjusted to maximize the r-value.
Video #2 https://drive.google.com/file/d/1mIeIPlRr-M0HbPfxJISfHQQ0CuBUVcBA/view?usp=drive_link
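The r-value in the scatter plot is a Pearson correlation between actual and predicted values. As a small sketch of that calculation, with made-up prices and predictions (in thousands of dollars) and a helper name, pearson_r, of our own choosing:

```python
import math


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical actual prices and predictions from two differently tuned models.
actual = [200, 300, 400, 500, 600]
pred_close = [210, 290, 410, 480, 610]
pred_noisy = [300, 250, 450, 350, 500]

print(f"r (close predictions): {pearson_r(actual, pred_close):.3f}")
print(f"r (noisy predictions): {pearson_r(actual, pred_noisy):.3f}")

# Predictions that are systematically off can still correlate perfectly.
print(f"r (2x + 5 predictions): {pearson_r(actual, [2 * a + 5 for a in actual]):.3f}")
```

The last line is a reminder that r measures linear association only, so it is worth reading alongside an error metric such as RMSE in Test and Score.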
Classification Models
Below is a screenshot of the workflow in Orange to create a classification model to predict hepatitis mortality using Logistic Regression, Random Forest, and XGBoost. [7] The data are split 70/30 into train/test, but the models are tested on the training data because the test set contains too few data points. This dataset also has a significant class imbalance.
The video demonstrates how the model performance, confusion matrix, calibration plot, and ROC curve dynamically shift with DMO.
Video #3 https://drive.google.com/file/d/1_U1TdVK7L8MooJA4aHrQQeZmgqCDAQOT/view?usp=drive_link
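The displays in the video are all computed from the same ingredients: the true classes and each model's predicted probabilities. As a minimal sketch with made-up data (the labels, scores, and function names below are illustrative assumptions, not the hepatitis results), here is how a confusion matrix and the area under the ROC curve can be computed by hand.

```python
def confusion_matrix(labels, scores, threshold=0.5):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced data: 8 negatives, 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.20, 0.25, 0.30, 0.35, 0.60, 0.40, 0.90]

for threshold in (0.5, 0.3):
    tp, fp, fn, tn = confusion_matrix(labels, scores, threshold)
    accuracy = (tp + tn) / len(labels)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"accuracy={accuracy:.2f} recall={recall:.2f}")

print(f"ROC AUC = {roc_auc(labels, scores):.4f}")
```

With few positive cases, accuracy can look acceptable while half of the positives are missed, which is why we watch the confusion matrix and ROC curve together rather than a single score.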
Other Uses for DMO
Other interactive widgets also respond dynamically to HPO: the Precision Recall Curve (PRC) widget, the Nomogram widget, the Rank widget, and the Correlation widget. With the class imbalance observed in the hepatitis dataset, maximizing the area under the precision-recall curve (AUPRC) is important. Changing regularization options in Logistic Regression (e.g., Ridge and LASSO) dynamically changes the scores in the Rank widget and the weight of predictors in the Nomogram widget. Changing regularization in the regression model scenario could change correlations, as shown in the Correlation widget.
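As a rough illustration of what the AUPRC summarizes, the sketch below computes average precision, one common step-wise estimate of the area under the precision-recall curve. The labels and scores are made up, the function name is ours, and tied scores are not handled specially.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve for binary labels (1 = positive)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Precision among the top `rank` cases, recorded at each new positive.
            total += tp / rank
    return total / n_pos


# Hypothetical imbalanced data: 8 negatives, 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.20, 0.25, 0.30, 0.35, 0.60, 0.40, 0.90]

prevalence = sum(labels) / len(labels)
print(f"average precision = {average_precision(labels, scores):.3f}")
print(f"proportion of positives = {prevalence:.2f}")
```

For reference, a model with no ranking ability would be expected to score near the proportion of positive cases, so the AUPRC is best judged against that baseline rather than against 0.5.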
Conclusions
Dynamic model optimization is not only a method to maximize model performance; it is also an interactive tool that helps students better understand HPO and other data science processes. Using Orange, multiple algorithms can be tweaked and the results displayed interactively. Many software programs process only one algorithm at a time, so Orange can save time compared with standard programming and other visual programming platforms.
[1] Press G. Cleaning big data: Most time-consuming, least enjoyable data science task, survey says. Forbes, March 23, 2016. Available from: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=3ce865b36f63
[2] Bhutani S. Top 25 Quotes from ML Heroes Interviews [Internet]. 2019 [cited 2024 Feb 19]. Available from: https://hackernoon.com/chai-time-data-science-show-announcement-sd8p384t
[3] Selvaraj N. Hyperparameter tuning using grid search and random search in Python [Internet]. KDnuggets; [cited 2024 Mar 8]. Available from: https://www.kdnuggets.com/2022/10/hyperparameter-tuning-grid-search-random-search-python.html
[4] Bioinformatics Laboratory, University of Ljubljana. Orange Data Mining [Internet]. [cited 2023 Oct 28]. Available from: http://orangedatamining.com
[5] House Sales in King County, USA [Internet]. 2016 [cited 2024 Mar 11]. Available from: https://www.kaggle.com/datasets/harlfoxem/housesalesprediction
[6] Work WB at. Interpreting residual plots to improve your regression [Internet]. 2017 [cited 2024 Mar 11]. Available from: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
[7] Hepatitis mortality prediction [Internet]. Kaggle; 2020 [cited 2024 Mar 11]. Available from: https://www.kaggle.com/code/sm261998/hepatitis-mortality-prediction