Maximizing Orange for Data Science Education — Part 2

In Part 1 of this series, I provided an overview of Orange, a data mining platform focused on data science education. In Part 2 I will cover some unique educational features of Orange that users should find helpful. Most of these features are not discussed on the Orange website or in the YouTube videos.

1. Expand the model performance metrics. For classification, the Test and Score widget provides AUC, classification accuracy (CA), F1 score, recall (sensitivity), and precision. Right-click “Model” and you can add the training and test times, specificity, and LogLoss (how close the predicted probabilities are to the actual results).
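Orange computes these metrics for you, but the formulas behind the two added ones are simple enough to sketch in plain Python. The counts and probabilities below are made up for illustration, not taken from any Orange dataset:

```python
import math

def specificity(tn, fp):
    """Specificity = true negatives / all actual negatives."""
    return tn / (tn + fp)

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels:
    low when predicted probabilities track the actual outcomes."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical results: 90 true negatives, 10 false positives.
print(specificity(tn=90, fp=10))  # 0.9

# Confident, mostly correct probabilities give a low log loss.
print(round(log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]), 3))
```

A model that predicted the same labels but with probabilities near 0.5 would have the same accuracy yet a much higher log loss, which is why LogLoss is worth adding to the table.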

2. Sharing information

a. Your Orange workflow can be saved as an .ows file to Documents for future use or to share with other students or colleagues. Go to File >> Save.

b. The Save Data widget saves your data after you have made changes, giving you a new CSV file to work with. For example, perhaps you have filtered out all male patients and want to keep a file of just the female patients.

c. To save plots, go to the lower left of the open widget and select “save image”. The plot can be saved as a .png, .svg, or PDF file. In the same area, you can select “Report” to save the plot with notes, and you can also print it. See the following screenshot showing the report option.

3. Maximize the use of the Data Table widget. This widget acts like a simple spreadsheet, so you can see the data in tabular format. It tells you the number of columns and rows, whether any data are missing, and whether you have target values. It’s important to add this widget any time you modify or transform your data, to confirm the change took place. You can sort data by clicking on a column header. In the image below I clicked on age, which sorted the column in ascending order. I then highlighted the first row and, using the shift key, selected all patients under age 40. You can now connect this data table with its 15 highlighted patients to another data table or to visualization widgets for further analysis. The blue below indicates that the first 15 rows have been selected.

I connected this Data Table widget to a Box Plot widget (which also generates bar plots), and you can see that 11 of these patients had a normal thallium test, whereas 4 had a reversible defect.

4. Select and connect. In the previous section, I pointed out that you could highlight data in the data table for a further focused review. You can do the same thing with multiple other widgets — highlight a section or datapoint and connect this to a data table with only the data you have selected. You can select a node in a decision tree, any category in a confusion matrix, a data point in a scatter plot, a category in a box plot or histogram, or a section of the mosaic plot. In the screenshot below I highlighted the 23 false negative patients in the confusion matrix so I can analyze them separately. I will connect this to a data table that will only display the 23 patients.

5. Select Rows widget. Selecting certain rows of a dataset is, in effect, filtering the data. In the example below I have selected females who had a reversible defect on thallium stress testing but a normal cardiac catheterization. Only 2 patients were identified; when the gender was changed, 23 males were identified. The Select Rows widget can now be connected to a data table or visualization widget for further analysis. Additionally, you can connect this widget to two data tables and right-click the connectors to be sure one says matching data and the other says unmatched data. One data table will then hold the two women, and the other will include everyone else.
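The matching/unmatched split that the Select Rows widget performs can be sketched in a few lines of plain Python. The field names and values here are assumptions chosen to echo the example, not the actual dataset columns:

```python
# Illustrative records (hypothetical values, not the real dataset).
patients = [
    {"gender": "female", "thal": "reversible defect", "cath": "normal"},
    {"gender": "male",   "thal": "reversible defect", "cath": "normal"},
    {"gender": "female", "thal": "normal",            "cath": "abnormal"},
]

def matches(row):
    """The filter condition: female, reversible defect, normal cath."""
    return (row["gender"] == "female"
            and row["thal"] == "reversible defect"
            and row["cath"] == "normal")

# One output gets the matching rows, the other gets everyone else.
matching  = [row for row in patients if matches(row)]
unmatched = [row for row in patients if not matches(row)]
print(len(matching), len(unmatched))  # 1 2
```

Every row lands in exactly one of the two lists, which is why the two data tables together always reproduce the full dataset.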

6. Extensive algorithm library. In Part 1 I displayed a table with the algorithms available in Orange. Most are self-explanatory, but let me highlight two that are less well known. The Constant widget is a baseline algorithm that bases its prediction solely on frequency. Let’s say we have 5% of patients with cancer and 95% with no cancer. The Constant widget will show an accuracy of 95% simply by always predicting the majority class. This is an example of class imbalance, where there are many more patients without cancer than with cancer. Imbalance creates multiple problems that we won’t cover in this series, but suffice it to say that any algorithm you test will have to beat this baseline result of 95% — otherwise the model has effectively learned nothing beyond the majority class of “no cancer” patients.
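The majority-class baseline that the Constant widget reports is easy to reproduce by hand. A minimal sketch with the hypothetical 5%/95% cancer cohort:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical screening cohort: 95 "no cancer", 5 "cancer".
labels = ["no cancer"] * 95 + ["cancer"] * 5
print(majority_baseline_accuracy(labels))  # 0.95
```

Note that this baseline scores 95% accuracy while catching zero cancer patients, which is exactly why accuracy alone is misleading on imbalanced data.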

The stacking algorithm combines multiple algorithms to improve performance, and this creates an ensemble. Other examples of ensembles are random forest and gradient boosting trees where multiple trees are combined to improve performance. In the screenshot below the stacking widget is used to combine the benefits of logistic regression, random forest, and Naive Bayes to improve performance. In fact, the AUC and F1 scores did improve compared to logistic regression alone.
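The simplest flavor of this idea is averaging the base models’ predicted probabilities; Orange’s Stacking widget goes a step further and trains a meta-learner on the base models’ predictions. A minimal averaging sketch, with made-up probabilities for three hypothetical patients:

```python
# Predicted probabilities of heart disease for three patients,
# from three hypothetical base models (values are invented).
log_reg     = [0.80, 0.30, 0.55]
rand_forest = [0.70, 0.20, 0.65]
naive_bayes = [0.90, 0.40, 0.45]

# Averaging the base models' probabilities is the simplest ensemble;
# true stacking would instead feed these columns into a meta-learner.
ensemble = [sum(probs) / 3 for probs in zip(log_reg, rand_forest, naive_bayes)]
print([round(p, 2) for p in ensemble])  # [0.8, 0.3, 0.55]
```

Even this simple averaging tends to smooth out the idiosyncratic errors of individual models, which is the intuition behind the AUC and F1 improvements the Stacking widget can deliver.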

7. Add weights to the minority class. As indicated in the previous section, imbalanced datasets are common and present a challenge when the minority class of interest, such as fraud, is greatly outnumbered by the majority class of no fraud. One simple solution in Orange is to give more weight to the minority class. This can be done in the Logistic Regression and Random Forest widgets by opening the widget and checking the box “balance class distribution.” This typically improves the model’s classification performance on the minority class.
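One common weighting scheme (the source doesn’t specify which formula Orange’s checkbox uses, so this is an assumption) weights each class inversely to its frequency, as n_samples / (n_classes × class_count):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).
    Rare classes get large weights, common classes small ones."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

# Hypothetical fraud data: 95 legitimate transactions, 5 fraudulent.
labels = ["no fraud"] * 95 + ["fraud"] * 5
print(balanced_class_weights(labels))
# {'no fraud': 0.526..., 'fraud': 10.0}
```

With these weights, each misclassified fraud case costs the learner about 19 times as much as a misclassified legitimate one, pushing the model to pay attention to the minority class.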

8. Let Orange help you visualize data. The Scatter Plot widget has a “Find Informative Projections” button, so instead of your having to manually try multiple combinations for the x and y axes, this function will pull up significant associations. Below is an example where the search turned up a scatter plot of age vs. systolic blood pressure. Note the rise with aging and the associated correlation r value.
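The r value shown on the plot is the Pearson correlation coefficient, which can be computed directly. The ages and blood pressures below are invented to show a positive trend, not the dataset’s actual values:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product
    of the two standard deviations; ranges from -1 to 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ages and systolic blood pressures with a rising trend.
age = [35, 45, 55, 65, 75]
sbp = [118, 124, 131, 140, 152]
print(round(pearson_r(age, sbp), 2))  # 0.99
```

An r this close to 1 indicates a strong positive linear association, which is the kind of pairing Find Informative Projections surfaces automatically.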

9. Add a nomogram. Orange provides a customizable, interactive nomogram that you can use with logistic regression and Naive Bayes to see how the prediction probabilities change as you change the predictors. Connect the File widget to Logistic Regression and connect the latter to the Nomogram widget. Select the target = 1. The default baseline probability is 40% for vessels colored (calcified coronary arteries), but when you slide the blue icon to the right to 3 (indicating 3 calcified arteries) the probability rises to 91%. The most important predictor comes first, and all are customizable.
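Under the hood, the nomogram for logistic regression is just applying the sigmoid to the model’s linear combination. The coefficients below are invented purely so the single-predictor sketch reproduces probabilities close to the article’s 40% and 91% example; they are not the model’s actual coefficients:

```python
import math

def predicted_probability(intercept, coefs, values):
    """Logistic regression prediction: sigmoid of the linear score."""
    z = intercept + sum(c * v for c, v in zip(coefs, values))
    return 1 / (1 + math.exp(-z))

# Hypothetical intercept and coefficient for one predictor:
# the number of calcified (colored) vessels.
intercept, coef = -0.405, 0.906

print(round(predicted_probability(intercept, [coef], [0]), 2))  # ≈ 0.40 at 0 vessels
print(round(predicted_probability(intercept, [coef], [3]), 2))  # ≈ 0.91 at 3 vessels
```

Sliding the nomogram’s marker is equivalent to changing a predictor’s value in this calculation and reading off the new probability.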

10. Educational plots.

a. The receiver operating characteristic (ROC) curve. When you mouse over the data points, Orange tells you the threshold for that point. Remember that the ROC curve is created by plotting the true positive rate against the false positive rate at multiple thresholds. The default threshold is 0.5, but it can be raised or lowered if you need to increase specificity or sensitivity. The screenshot below shows the results for different thresholds. At a higher threshold, recall (sensitivity) and false positives decrease, and specificity increases. The opposite is true at lower thresholds.
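Each point on the ROC curve is just the (false positive rate, true positive rate) pair at one threshold. A minimal sketch with hypothetical labels and predicted probabilities:

```python
def rates_at_threshold(y_true, y_prob, threshold):
    """True positive rate (recall) and false positive rate at a cutoff:
    predict positive whenever the probability meets the threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels and predicted probabilities (invented values).
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

print(rates_at_threshold(y_true, y_prob, 0.5))  # lower cutoff: higher TPR and FPR
print(rates_at_threshold(y_true, y_prob, 0.8))  # higher cutoff: fewer false positives
```

Raising the threshold from 0.5 to 0.8 here drops the false positive rate to zero at the cost of recall, which is exactly the trade-off you see when mousing along Orange’s ROC curve.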

b. Within the Calibration Plot widget are other plot choices, and arguably the two most important are those that compare sensitivity with specificity and precision with recall (sensitivity). The vertical slider can be moved to see what happens at different thresholds. The screenshots below display sensitivity vs. specificity and, below that, precision vs. recall. Moving the slider changes the result, making this an excellent interactive tool.

Conclusions

This is a brief listing of some of the Orange bells and whistles that will make your data science journey more productive and enjoyable. In the future, I will use Orange to solve specific problems without the need for programming or higher math.
