Maximizing Orange for Data Science Education — Part 1

What is Orange?

Orange is a free, educational data science platform created by computational biologists at the University of Ljubljana, Slovenia. Its computation engine is Python, but users neither see nor need any programming to run analyses. It is fast, intuitive, and comprehensive. It is client-based and runs on Windows, macOS, and Linux. Like KNIME and RapidMiner, Orange is an example of visual programming: you connect widgets into a workflow, and each widget performs a function. The following image displays the expandable menus on the left and the connected widgets on the right in the workflow area.

Orange Features

Orange is excellent for exploratory data analysis consisting of basic descriptive statistics and visualization. It also excels in supervised and unsupervised learning. Additionally, Orange has multiple widgets for image analytics and text mining. Other add-ons (options) include spectroscopy, geo-location, time series, association rules, cellular biology, bioinformatics, and survival analysis.

There are 38 educational YouTube videos associated with Orange, along with a widget catalog that describes each widget's input, output, and optimization and gives an example of how to use it. Several articles on Medium.com cover Orange, and here is a link to a 2-part series by Manthan Bhikadiya that is very helpful.

Orange is used by hundreds of universities worldwide for data science education, but it remains unknown to many. While some may view Orange as an introductory data science platform, it has been used as the computation engine in multiple published articles.

Users can upload their own data files using the File widget, upload data from a URL, download data from a connection to a database using the SQL widget, or use the 65 datasets already available in the datasets widget.

My focus is on teaching data science to healthcare workers, but the content included in this article is also relevant for those not in healthcare. The goal is not to make users data scientists but instead to provide them with the tools and knowledge, so they are conversant and knowledgeable in the more important areas of data science, such as exploratory data analysis, machine learning, and artificial intelligence.

The following sections will discuss the basic Orange functions, and part 2 will cover some of the less well-known functions that make Orange excel compared to other data science programs.

Exploratory Data Analysis

The usual approach to a new dataset is to first prepare and clean the data. This includes identifying missing data, outliers, incorrect labels, etc. In addition, it is often necessary to transform data, and Orange offers a wide variety of widgets for transformation, as seen in the table below. The pivot table function is much easier to use than the one in Excel, and the Merge widget combines datasets (CSV files) much faster and more easily than a SQL join command.
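For readers curious about what a merge does underneath, here is a minimal sketch in plain Python of an inner join on a shared key. It is not Orange's implementation. The function name, the column names, and the values are made up for illustration, and the sketch assumes each key appears only once in the second table.

```python
def merge_on_key(left_rows, right_rows, key):
    """Inner join: keep only rows whose key appears in both tables."""
    # Index the right table by key (assumes keys are unique there).
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged

# Hypothetical tables, as if loaded from two CSV files.
patients = [{"id": 1, "age": 54}, {"id": 2, "age": 61}, {"id": 3, "age": 47}]
labs = [{"id": 1, "chol": 233}, {"id": 3, "chol": 250}]

merged = merge_on_key(patients, labs, "id")
print(merged)
```

Patient 2 has no lab row, so it drops out of the result, which is the same behavior as an inner join in SQL.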

The feature statistics widget displays the mean, median, dispersion, min, max, and missing data for each feature, as shown below. This analysis is based on the heart disease prediction dataset, where red means the presence of heart disease and blue means the absence.
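The same summary can be sketched in a few lines of plain Python for a single numeric column. This is only an illustration of the statistics listed above, not the widget's code. The readings are made up, and I am assuming that dispersion means the coefficient of variation (standard deviation divided by the mean).

```python
import statistics

def feature_summary(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    return {
        "mean": mean,
        "median": statistics.median(present),
        # Assumed definition: coefficient of variation.
        "dispersion": statistics.stdev(present) / mean,
        "min": min(present),
        "max": max(present),
        "missing": len(values) - len(present),
    }

# Made-up cholesterol readings (mg/dL), not from the heart disease dataset.
cholesterol = [200, 240, None, 180, 260, 220]
summary = feature_summary(cholesterol)
print(summary)
```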

Data Visualization

To visualize continuous data, box plots and histograms are available; for categorical data, there are bar plots and pie charts. There are also violin plots, mosaic plots, sieve plots, distributions, and scatter plots. Below are examples of box plots and bar plots. Note that the box plot provides t-test results for statistical significance and the bar plot provides chi-square results.
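To show where a chi-square result comes from, here is a small sketch that computes the Pearson chi-square statistic and its p-value for a 2x2 table of counts. The counts are invented, and I am assuming no continuity correction; I have not checked whether Orange applies one. The p-value formula used here is valid only for one degree of freedom, which is the 2x2 case.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up counts: rows = two patient groups, columns = disease absent / present.
counts = [[30, 10], [20, 40]]
stat, p_value = chi_square_2x2(counts)
print(f"chi-square = {stat:.2f}, p = {p_value:.5f}")
```

A small p-value suggests that the two categorical variables are not independent.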

The distribution widget displays histograms and kernel density estimates of numerical data, as seen below. This is a histogram of the serum total cholesterol with a bin width of 50. Note the outliers to the right.
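Binning with a fixed width is simple to sketch in plain Python. The values below are made up, and the choice to align bins at multiples of the width is my assumption; the widget may place its bin edges differently. The kernel density curve is not covered here.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin; each bin is keyed by its left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up serum cholesterol values (mg/dL), including one high outlier.
cholesterol = [180, 210, 245, 199, 260, 310, 564, 230]
bins = histogram(cholesterol, 50)

for left_edge, count in bins.items():
    print(f"{left_edge}-{left_edge + 50}: {'#' * count}")
```

The lone value in the 550-600 bin sits far to the right of the others, which is how an outlier shows up in a histogram.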

Supervised learning

Supervised learning is divided into classification, where the output is categorical (e.g., heart disease/no heart disease), and regression, where the output you are trying to predict is continuous/numerical (price, miles per gallon, etc.). Orange is excellent for creating classification and regression models. It has an extensive algorithm library, as noted in the table below.

Creating a classification model with multiple algorithms for training and testing data is straightforward, as seen below. Data in the file widget is fed into the data sampler widget, where training data = 70% and test data = 30%. Three learners were used: logistic regression, random forest, and Naive Bayes. The results of Test and Score are displayed in the second image; note that the widget is connected to confusion matrix, ROC curve, and calibration widgets. Observe how fast the training and test times are and the variety of metrics available to evaluate the model. Lastly, in the upper left, cross-validation of the training data is checked (k = 10). To compare with test results, just check the box in the lower left, “test on test data.” This makes it easy to toggle between training and test results. Most data science platforms are not this user-friendly for modeling.
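For those who want to see the sampling logic spelled out, here is a rough sketch in plain Python of a 70/30 split followed by 10-fold cross-validation on the training portion. It only produces row indices and does not train any model. The function names and the 100 stand-in records are mine, and details such as stratification, which Orange may apply, are left out.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        validation = indices[fold::k]  # every k-th row, offset by the fold number
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation

records = list(range(100))  # stand-in for 100 patient records
train_rows, test_rows = train_test_split(records)
folds = list(k_fold_indices(len(train_rows), k=10))
print(len(train_rows), len(test_rows), len(folds))
```

Each training row is used for validation exactly once across the 10 folds, while the 30% test portion stays untouched until the final comparison.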

Regression is also straightforward. We will use the same heart disease prediction dataset but change our outcome to the millimeters of ST depression found on the ECGs during a stress test. We connect the file widget to the test and score widget, then connect the linear regression and random forest widgets to test and score. The workflow is below, followed by a screenshot of regression performance.

We can see that linear regression has a slightly higher R² and a lower RMSE, suggesting that it performed better than random forest.
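Both metrics are easy to compute by hand, which helps when explaining them to students. The sketch below uses the standard textbook formulas. The actual and predicted values are made up and are not the results from the screenshot.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def r_squared(actual, predicted):
    """Share of the variance in the outcome explained by the model."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up ST depression values (mm) and predictions from two hypothetical models.
actual = [0.0, 1.0, 2.0, 3.0]
model_a = [0.5, 1.0, 2.0, 2.5]
model_b = [1.0, 1.0, 2.0, 2.0]

for name, predicted in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(r_squared(actual, predicted), 3), round(rmse(actual, predicted), 3))
```

Here model_a has the higher R² and the lower RMSE, so it would be preferred by the same reasoning used above.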

Unsupervised learning

Unsupervised learning is used to find clusters or new groups in a dataset. In other words, you don’t have an outcome you are trying to predict, as you do in supervised learning. This approach has been particularly important in genomics. Orange provides several algorithms such as k-means clustering, hierarchical clustering, DBSCAN, and t-SNE. Below is an example of hierarchical clustering on a diabetes-related dataset. Three groups are identified, and overall they are intuitive.
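The idea behind hierarchical clustering can be shown with a toy example. The sketch below is a bare-bones agglomerative version that works on a single numeric feature and uses single linkage (distance between the two closest members); both choices are my simplifications, and Orange offers other linkage options. The glucose values are invented.

```python
def hierarchical_clusters(points, n_clusters):
    """Agglomerative clustering with single linkage on 1-D values."""
    clusters = [[p] for p in points]  # start with every point on its own
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

# Made-up fasting glucose values (mg/dL).
glucose = [85, 90, 88, 140, 150, 145, 250, 260]
groups = sorted(hierarchical_clusters(glucose, 3))
print(groups)
```

No outcome label is used anywhere; the groups emerge only from how close the values are to one another.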

Image Analytics

Orange offers several convolutional neural networks (CNNs), with the workhorse being Google Inception v3. The primary function of image analytics is image classification. In other words, how well can Orange classify chest X-rays, dermatological images, ECGs, and so forth? For example, how accurate is it in classifying a normal chest X-ray vs. one showing pneumonia? The image below shows the workflow to analyze images of malignant melanoma and another common skin cancer, basal cell carcinoma/epithelioma. The results were impressive in classifying the two skin cancers accurately.

Text Mining

Orange can perform several text mining functions, such as creating a word cloud, classifying text, and sentiment analysis. It has widgets that connect to PubMed, Twitter, The Guardian, The New York Times, and Wikipedia. Below is a word cloud for monkeypox created from analyzing PubMed abstracts, followed by the workflow to mine Twitter feeds for sentiment about vaccines.
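A word cloud is essentially a picture of word counts, with more frequent words drawn larger. Here is a minimal sketch of the counting step in plain Python. The two sentences are invented stand-ins for abstracts, and the short stop-word list is my own; real text mining tools use much longer lists and extra preprocessing steps.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Orange's list).
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "to", "was"}

def word_frequencies(documents):
    """Count lowercase word tokens across documents, skipping stop words."""
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# Invented sentences, not real PubMed abstracts.
abstracts = [
    "Monkeypox virus infection was reported in the cohort.",
    "The monkeypox outbreak and the vaccine response.",
]
freqs = word_frequencies(abstracts)
print(freqs.most_common(3))
```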

Conclusions

The Orange data mining platform is one of the most intuitive programs available for teaching data science, and it does not require programming or higher math. While it would be ideal for everyone to learn Python or R as well as linear algebra and calculus, that is not realistic for the masses. Orange is an example of the “democratization of data science,” and it helps more people become “citizen data scientists.” These individuals can assist the data scientist by having a better understanding of the data (data literacy) and by conducting preliminary analyses they can share with the data scientist.

Stay tuned for part 2 where we present a plethora of unique Orange features to make data science fast, intuitive, interesting, and productive.
