RapidMiner Go — Machine Learning On The Go
Machine learning has grown in popularity in the US, as evidenced by the Google Trends trendline over the last five years.
The field of machine learning has blossomed in large part due to a greater variety of algorithms leading to more modeling choices, coupled with a “tsunami of data” and the increased interest in data science by all industries.
The most common application of machine learning is predictive analytics, where categorical (classification) and numerical (regression) outcomes are predicted.
While machine learning is frequently done in the programming languages R and Python, multiple machine learning software programs are now available. These programs range from free and open source to expensive commercial solutions. In addition, they run the gamut from basic functionality to AutoML, where the goal is to automate the entire machine learning pipeline, from data preparation to modeling, validation, and performance evaluation. A variety of AutoML programs are available today, such as Google Cloud AutoML, DataRobot, dotData, H2O Driverless AI, RapidMiner Studio, and many others.
RapidMiner Studio is a modern AutoML platform that combines two important automated steps: TurboPrep for data preparation, exploration, and visualization, and AutoModel for data modeling. In early 2020, RapidMiner released RapidMiner Go, which is essentially AutoModel without any data preparation or visualization. Unlike the client-based RapidMiner Studio, Go is browser-based. Also unlike RapidMiner Studio, Go does not include unsupervised learning or outlier detection. There is a free 30-day trial period, after which the charge is $10 monthly with no minimum subscription.
The goal of Go is to quickly run a set of algorithms on data that has already been cleaned and explored. Either .csv or .xlsx files can be uploaded. The finished model can be downloaded and deployed.
Models available: For classification, these are Naive Bayes, Generalized Linear Model, Fast Large Margin, Deep Learning, Logistic Regression, Decision Tree, Random Forest, Gradient Boosted Trees, and Support Vector Machine (SVM). For regression, these are Generalized Linear Model, Deep Learning, Decision Tree, Random Forest, Gradient Boosted Trees, and Support Vector Machine (SVM).
Performance measures: For classification algorithms, the measures are: 1. Accuracy 2. Classification Error 3. Recall 4. Precision 5. AUC 6. Gains and Costs 7. Model Building Time 8. Confusion Matrix 9. Input Weights 10. ROC Curve. For regression, the measures are: 1. R Squared 2. Root Mean Squared Error 3. Average Absolute Error 4. Average Relative Error 5. Model Building Time.
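To make several of these measures concrete, here is a minimal Python sketch of how they are commonly defined. It is not RapidMiner Go's own code: the function names and the sample numbers are made up for illustration, and the tool may compute some measures differently (for example, R squared is taken here as the coefficient of determination, and average relative error as the mean of absolute error divided by the absolute actual value).

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Basic measures from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "classification_error": 1.0 - accuracy,
        "recall": tp / (tp + fn),      # share of actual positives that were found
        "precision": tp / (tp + fp),   # share of predicted positives that were right
    }


def regression_metrics(actual, predicted):
    """Common regression measures (assumed definitions, see the text above)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return {
        "r_squared": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "average_absolute_error": sum(abs_errors) / n,
        # Assumes no actual value is zero.
        "average_relative_error": sum(e / abs(a) for e, a in zip(abs_errors, actual)) / n,
    }


# Made-up counts and values, purely for illustration.
clf = classification_metrics(tp=40, fp=10, fn=5, tn=45)
reg = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(clf)
print(reg)
```

Working from confusion-matrix counts keeps the sketch free of third-party libraries. In practice a library such as scikit-learn provides equivalent functions.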
The metrics for all algorithms are reported together, making a comparison of performance easier. Global weights are also presented, listing the attributes that correlate most with the outcome variable. When each algorithm is explored individually, its model-specific importance weights are presented along with the performance measures above. A simulator is provided for each model that lets a user modify the predictors to see how that affects the outcome. Some models, such as logistic regression, display beta coefficients and p-values, while others show outcomes with density graphs and visually displayed decision tree output.
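The idea behind the simulator can be sketched in a few lines of Python. This is only an illustration of a "what-if" analysis, not how RapidMiner Go implements it: the logistic model, its coefficients, the attribute names, and the patient values below are all hypothetical.

```python
import math

# Hypothetical coefficients for illustration only; not taken from any fitted model.
INTERCEPT = -6.0
COEFFICIENTS = {"age": 0.05, "cholesterol": 0.01, "max_heart_rate": -0.01}


def predict_probability(patient):
    """Logistic model: probability of the positive class for one record."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


def what_if(patient, **changes):
    """Return the predicted probability before and after changing some predictors."""
    modified = {**patient, **changes}  # copy, so the original record is untouched
    return predict_probability(patient), predict_probability(modified)


baseline = {"age": 55, "cholesterol": 240, "max_heart_rate": 150}
before, after = what_if(baseline, cholesterol=300)
print(f"baseline: {before:.3f}, with cholesterol=300: {after:.3f}")
```

Because the hypothetical cholesterol coefficient is positive, raising that predictor raises the predicted probability, which is the kind of effect a simulator lets the user see interactively.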
Data Tests: I chose to analyze a well-known heart disease prediction dataset (classification). After uploading, the user first picks the column that is the outcome (heart disease presence or absence) and then chooses the binary outcome of interest, here the presence of heart disease. The user then reviews a quality analysis to see whether any columns should be excluded, for example because of high correlation between input variables. If, for example, weight and BMI are highly correlated, one of them should be excluded, making the model simpler and less prone to overfitting. The next step is to choose the family of algorithms that best fits your analysis:
1. Easily Interpretable uses linear models and decision trees that are fast and easy to explain.
2. Quick Prototyping uses generic models.
3. Higher Accuracy provides more algorithms but has longer run times.
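The input-to-input correlation check mentioned above can be approximated outside the tool with a short Python sketch. The column names, the sample values, and the 0.9 threshold are assumptions chosen for illustration. They are not taken from the heart disease dataset or from RapidMiner Go's quality analysis.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """List every pair of input columns whose |correlation| meets the threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Made-up values for five hypothetical patients.
columns = {
    "weight_kg": [60.0, 72.0, 85.0, 95.0, 110.0],
    "bmi": [21.0, 24.5, 28.0, 31.0, 36.5],
    "age": [63.0, 41.0, 57.0, 35.0, 49.0],
}
flagged = highly_correlated_pairs(columns)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} are highly correlated (r = {r:.3f}); consider dropping one")
```

With these sample values only the weight and BMI pair is flagged, so one of the two would be a candidate for exclusion before modeling.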
I ran models using all three families of algorithms, and here is what I found:
1. Easily Interpretable: Generalized Linear Model, Fast Large Margin, and Decision Tree algorithms were run, with the performance output measures for classification noted above. Model building times were 1 to 5 seconds.
2. Quick Prototyping: The algorithms run were Generalized Linear Model, Logistic Regression, and Naive Bayes, with the same performance output measures for classification as previously noted. Model building times varied from 1.4 to 8.5 seconds.
3. Higher Accuracy: All nine available classification algorithms were run. All default outcomes were reported, and model building times varied from 1.2 to 20 seconds.
Below is a screenshot of AUC comparisons for all nine algorithms. Each algorithm and its performance can be accessed individually in the left menu.
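For readers who want to reproduce this kind of side-by-side AUC comparison outside the tool, here is a minimal Python sketch. It computes AUC as the fraction of positive/negative pairs that a model ranks correctly, which is one standard way to define it. The labels, scores, and model names are invented for illustration and are not results from the runs described above.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented labels and predicted scores for two hypothetical models.
labels = [1, 0, 1, 1, 0, 0]
model_scores = {
    "model_a": [0.9, 0.2, 0.8, 0.6, 0.4, 0.1],
    "model_b": [0.7, 0.6, 0.4, 0.8, 0.5, 0.3],
}

# Rank the models from highest to lowest AUC.
ranking = sorted(
    ((name, auc(labels, scores)) for name, scores in model_scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, value in ranking:
    print(f"{name}: AUC = {value:.3f}")
```

The pairwise loop is quadratic in the number of records, which is fine for a small dataset. A library routine would be the better choice for larger data.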
With smaller datasets, it makes sense to always select the Higher Accuracy option. Unlike RapidMiner Studio, there is no option to tweak algorithms (hyperparameter optimization). RapidMiner Go is all about quick modeling, where you can compare the performance of multiple algorithms simultaneously. This is not AutoML, where multiple steps in the data pipeline are automated. If you know how to clean, explore, and prepare data for modeling and are looking for an affordable option to perform classification and regression, this is a reasonable choice.