No-Code Data Science: Part 2
Our Story
Robert Hoyt MD FACP FAMIA
David Patrishkoff MS LSSMBB
Background
In part 1 of this series, we discussed how low-code and no-code data science have become alternatives to programming. We also discussed visual programming as one of the most common categories of no-code data science.
We believe that no-code data science is a legitimate movement, provided that effective software is available to take the place of programming. We have focused on Orange Data Mining because its computational engine is Python and the platform is highly intuitive and comprehensive. We adopted a second free, open-source statistics program, JASP, to supplement Orange in a few areas that needed additional functions.
We have used Orange and JASP to teach monthly data science clinics in the US, primarily for clinicians. Over a period of eighteen months, we also wrote a textbook, No-Code Data Science, that captures the many capabilities of these two programs. As pointed out in Part 1, building an analysis in Orange is much faster than programming the same functionality in R or Python. Both programs are free and regularly updated, an important advantage over expensive commercial programs such as Minitab and JMP Pro.
Textbook Goals
The overall objective of the textbook is to teach data science skills without the need to learn a programming language. The book is designed to help readers do the following:
Engage in Hands-On Learning: Undertake data exercises using both medical and non-medical datasets featured in the book, with additional datasets available on the book's website.
Utilize Open-Source Software: Seamlessly integrate free tools like Orange, JASP, BlueSky Statistics, and more to bring data science concepts to life with a zero software budget.
Master Data Preprocessing: Become skilled at handling missing data, detecting outliers, managing duplicates, n-1 dummy encoding, navigating the complexities of highly correlated predictors, and more.
Enhance Your Exploratory Data Analysis (EDA) Skills: Elevate your exploratory data analysis capabilities through descriptive statistics, hypothesis testing, and advanced visualization techniques.
Create, Optimize, and Evaluate Predictive Models: Learn how to construct, optimize, and evaluate the performance of classification and regression machine learning models. Familiarize yourself with over twenty algorithmic approaches and options to build the best predictive models.
Conduct Image and Text Analysis: Conduct predictive analytics on images and text documents. Harness sentiment analysis, navigate word clouds, and decipher text groupings through clustering to maximize your understanding of large text documents.
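Orange and JASP handle preprocessing through menus and widgets, so no code is required. For readers curious about what one of the preprocessing steps listed above does under the hood, here is a minimal Python sketch of n-1 dummy encoding. The function name `dummy_encode`, the column prefix, and the example values are hypothetical illustrations, not taken from the textbook or from Orange.

```python
def dummy_encode(values, prefix):
    """n-1 dummy encoding for one categorical column.

    Creates a 0/1 indicator for every category except the first one
    (in sorted order), which becomes the implicit reference category.
    """
    categories = sorted(set(values))
    reference, kept = categories[0], categories[1:]
    # One dict per row; a row of all zeros means "reference category".
    rows = [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]
    return reference, rows


# Hypothetical example column (not from the textbook's datasets).
smoking = ["never", "former", "current", "never"]
reference, encoded = dummy_encode(smoking, "smoking")

print("reference category:", reference)
for original, row in zip(smoking, encoded):
    print(original, row)
```

Dropping one category avoids a redundant column, because its value can always be inferred from the others. In pandas, the same idea is available as `pandas.get_dummies(..., drop_first=True)`.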
The textbook is available in color as a hardcover, softcover, and EPUB on Lulu.
Textbook Chapters
The following is a brief overview of the topics covered in this textbook. Importantly, data exercises with medical and non-medical data are offered in Chapters 1–11. The datasets are available to students on our File-Share site on the textbook website. QR codes are included in most chapters that link to popular data science videos. Each chapter has learning objectives, a summary, and key principles. Chapters also include maturity levels to give readers an indication of different levels of data science maturity. The Appendices include a Data Visualization table, a Machine Learning Algorithm table, and a Glossary.
Chapter 1 Introduction to Data Science — an overview of the field, the role of data scientists, and our choice of open-source tools and statistics programs.
Chapter 2 Data Preparation and Wrangling — discusses the importance of data preparation and exploratory data analysis, which set the stage for subsequent analytic processes.
Chapter 3 Data Visualization — explains how the different plots and graphs integral to data science can enhance understanding and provide clarity in decision-making.
Chapter 4 Machine Learning Models — discusses how to tackle modeling challenges and how to create high-performance machine learning models. Thirteen algorithms are discussed, including stacking.
Chapter 5 Model Evaluation — examines the array of metrics and methods essential to evaluating the performance of machine learning models.
Chapter 6 Supervised Learning — discusses and provides examples of classification and regression models.
Chapter 7 Unsupervised Learning — explores the importance of unsupervised learning and the pivotal role of k-means and hierarchical clustering. Principal component analysis (PCA) and Association Rules are also discussed.
Chapter 8 Time Series Forecasting and Survival Analysis — discusses how to create a time series model using ARIMA and understand survival outcomes with Kaplan-Meier analyses.
Chapter 9 Geolocation — dives into geo-mapping techniques and the art of crafting choropleth visuals.
Chapter 10 Image Analytics — discusses computer vision with deep neural networks. Seven convolutional neural networks are available.
Chapter 11 Text Mining — delves into natural language processing, highlighting the role of small and large language models.
Chapter 12 Integrating Continuous Improvement and Data Science into Industry 4.0 — reinforces the imperative of embedding machine learning and artificial intelligence within the latest industry paradigms.
Textbook Website
We created a textbook website that describes the book and also offers additional functionality.
Workshops tab: We have recorded our workshop sessions on a YouTube channel and they are archived under the Workshop tab.
Certification tab: We will offer certification courses at three levels beginning in January 2024. Details, including FAQs, are available on this webpage.
File-Share tab: Hosts the datasets we used in each chapter, plus a few additional ones. These are publicly available to download.
Resources tab: Includes (1) data science cheat sheets, (2) a table of the metrics that should be used for imbalanced data, (3) a downloadable master Orange workflow that includes 23 algorithms, and (4) a link to a PDF that includes multiple useful datasets not included in the File-Share option.
Calculators tab: This Omni calculator is based on the confusion matrix. Input the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) results, and 10 performance metrics are generated.
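To illustrate the kind of arithmetic a confusion-matrix calculator performs, here is a minimal Python sketch. It is not the calculator on the website. The site's ten metrics are not listed in this article, so the ten computed below are an assumed set of commonly used ones, and the function name and example counts are hypothetical.

```python
import math


def confusion_matrix_metrics(tp, tn, fp, fn):
    """Compute ten common metrics from confusion-matrix counts.

    The choice of metrics is an assumption made for illustration.
    """
    def ratio(num, den):
        # Return None instead of raising when a denominator is zero.
        return num / den if den else None

    sensitivity = ratio(tp, tp + fn)  # also called recall
    specificity = ratio(tn, tn + fp)
    if sensitivity is None or specificity is None:
        balanced_accuracy = None
    else:
        balanced_accuracy = (sensitivity + specificity) / 2
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": ratio(tp, tp + fp),  # positive predictive value
        "negative_predictive_value": ratio(tn, tn + fn),
        "f1_score": ratio(2 * tp, 2 * tp + fp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "balanced_accuracy": balanced_accuracy,
        "matthews_corrcoef": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for a 100-case test set.
metrics = confusion_matrix_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Metrics such as balanced accuracy, F1 score, and the Matthews correlation coefficient are often preferred over plain accuracy when the classes are imbalanced.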
Future Directions
In 2024, we will begin our data science certification courses using this textbook. The three levels are Fundamentals, Practitioner, and Advanced Practitioner. The higher-level courses will be longer, with more assignments and longer exams.
We will continue to integrate AI into our teaching, as large language models have been shown to be both good coding assistants and good data science tutors. Because this field changes so rapidly, we will discuss the newest and most innovative models.
Conclusions
No-Code Data Science is a unique textbook because it is the first to teach comprehensive, yet practical, data science without the need to program. The open-source software we selected is more intuitive than programming, faster, and free. These advantages should encourage adoption by more faculty and help “democratize” data science for more people in more fields.
Our certification courses will teach data science using this textbook and will document that students have learned the course material.