You mention missing data, but I didn't see any code to find missing values, just null values. This dataset is riddled with missing data. An insulin level of zero or a triceps thickness of zero means the test wasn't done, not that the actual value was zero. By the same token, if zero is listed for pregnancies, does that mean the question was not asked or that the woman has never been pregnant? Nobody understands the diabetic pedigree column. Personally, as a physician data scientist, I don't use this dataset for those reasons.
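A minimal sketch of the kind of check I mean, using only the Python standard library. The rows are made up for illustration (not real patient data), and the column names `Insulin`, `SkinThickness`, and `Pregnancies` as well as both helper functions are assumptions, not taken from the article being discussed. The idea is to count zeros in columns where zero cannot be a real measurement and then mark them as missing, while leaving ambiguous columns such as pregnancies alone.

```python
# Illustrative rows only (not real patient data); column names are assumed.
rows = [
    {"Pregnancies": 0, "Insulin": 0, "SkinThickness": 35},
    {"Pregnancies": 2, "Insulin": 94, "SkinThickness": 0},
    {"Pregnancies": 1, "Insulin": 0, "SkinThickness": 0},
    {"Pregnancies": 3, "Insulin": 168, "SkinThickness": 23},
]

# Columns where a zero is treated as "not measured" (an assumption).
# Pregnancies is deliberately left out: zero there is ambiguous.
ZERO_MEANS_MISSING = ["Insulin", "SkinThickness"]


def count_hidden_missing(rows, columns):
    """Count zeros per column that likely stand in for 'not measured'."""
    return {col: sum(1 for r in rows if r[col] == 0) for col in columns}


def mark_missing(rows, columns):
    """Return copies of the rows with zeros in the given columns set to None."""
    return [
        {k: (None if k in columns and v == 0 else v) for k, v in r.items()}
        for r in rows
    ]


hidden = count_hidden_missing(rows, ZERO_MEANS_MISSING)
cleaned = mark_missing(rows, ZERO_MEANS_MISSING)
print(hidden)
```

A plain null check would report no missing values for these rows, which is exactly the problem.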

In the past few years, artificial intelligence (AI) has become synonymous with deep learning, which is based on artificial neural networks (ANNs) with multiple layers. In healthcare, the most common use of AI is image recognition, particularly in the fields of cardiology, pathology, radiology, and ophthalmology. The most common algorithmic approach for image recognition is the convolutional neural network (CNN).

Until now, creating a CNN has required a data scientist with advanced mathematical and programming skills and expertise in frameworks such as TensorFlow and PyTorch.

It is difficult to find patient-level data of sufficient size for research, modeling, or software development. This is largely due to HIPAA concerns and the overall lack of interoperability in the US healthcare system.

Synthetic data has potential in those areas, but much of the generated data is non-medical. For example, the R and Python programming languages can generate non-medical datasets for supervised and unsupervised learning. Generative adversarial networks (GANs) can also generate synthetic non-medical data.
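As a rough illustration of how a programming language can generate a non-medical synthetic dataset, the sketch below draws labeled 2-D points from Gaussian clusters using only the Python standard library. The function name `make_synthetic_clusters` and all of its parameters are hypothetical choices for this example and do not come from any particular package.

```python
import random


def make_synthetic_clusters(n_per_class=50,
                            centers=((0.0, 0.0), (3.0, 3.0)),
                            spread=1.0,
                            seed=42):
    """Generate labeled 2-D points, one Gaussian cluster per class."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    data = []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            point = (rng.gauss(cx, spread), rng.gauss(cy, spread))
            data.append((point, label))
    rng.shuffle(data)
    return data


dataset = make_synthetic_clusters()
features = [point for point, _ in dataset]  # features alone suit unsupervised learning
labels = [label for _, label in dataset]    # features plus labels suit supervised learning
```

Nothing in such a dataset carries clinical meaning, which is the limitation described above.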

In 2017, the MITRE Corporation developed the SyntheticMass project, which emulated the healthcare data of the residents of Massachusetts. There are more than 1 million…


While there are multiple excellent commercial data science platforms available (Dataiku, Databricks, DataRobot, etc.), they are expensive and not open to public collaboration. Only a few platforms are free or low-cost and align with the Open Data and Open Science movements. The examples that come to mind are Harvard Dataverse, the Open Science Framework (OSF), and Data World. This article will discuss Data World and its many unique features.

Data World is a public benefit corporation located in Austin, Texas, that launched in 2016. Data World is an online platform where participants can find data or…

In part 1 of this series, I addressed the problem of reported machine learning (ML) and artificial intelligence (AI) studies not adhering to existing evidence-based guidelines and, as a result, often being judged to be of low quality. This is a significant issue given the proliferation of predictive analytical studies. For example, almost 800 predictive studies have been reported on cardiovascular disease alone. I mentioned several guidelines, such as TRIPOD and CHARMS, that outline how these studies should be conducted and reported. There are also “risk of bias” guidelines, such as the Prediction model Risk Of Bias ASsessment Tool (PROBAST).

Most of us in medicine are familiar with the concept of evidence-based medicine (EBM), which is based on standards developed by several international organizations. According to Johns Hopkins Medicine:

“Evidence-based medicine is the integration of best research evidence with clinical expertise and patient values. Evidence-based medicine is an interdisciplinary approach which uses techniques from science, engineering, biostatistics and epidemiology, such as meta-analysis, decision analysis, risk-benefit analysis, and randomized controlled trials.”

There is a hierarchy of evidence such that we know, for example, that correlation does not equal causation and observational studies are a less robust form of evidence…

Machine learning has increased in popularity in the US, as evidenced by the Google Trends trendline over the last five years.

In the data-centric world we live in, we need lots of tools in our data science tool kit. The kit should include expertise in spreadsheets, statistics, databases, and machine learning. Math and statistics are the backbone of data analysis, machine learning, and artificial intelligence, yet they are difficult to learn and even more difficult to retain. Other obstacles to learning and using statistical packages include the cost of commercial packages, such as SPSS, SAS, and Stata, and the fact that many stats courses demand long-hand calculation of statistical methods.

Enter jamovi (written lower-case). This is a free open-source statistical package…

Creating a new textbook is a complex process, requiring collaboration and commitment by everyone involved. It is clearly different from writing a fictional work where you often don’t have co-authors and you don’t require citations and references. It is also quite common to have multiple tables and images in every textbook chapter, along with a table of contents, foreword, preface, and an index. For many authors, the only consideration is finding a commercial publisher to review, edit, print, and market the textbook. However, there are other options out there.

For my first textbook, Health Informatics Practical Guide, I opted to…

Google Dataset Search was launched in September 2018 with the goal of creating a searchable public data repository. The search engine searches data repositories on the Web based on their metadata and, to date, includes millions of datasets from a variety of sources. The search engine is based on schema.org, an open standard that organizes the metadata. Anyone can contribute datasets to this engine, but they must follow the guidelines. Further details regarding contributing data can be found here.
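To make the metadata requirement concrete, here is a hedged sketch of what dataset markup might look like, assuming the open standard in question is schema.org and its `Dataset` type expressed as JSON-LD. The dataset name, description, URL, and creator are invented placeholders, and the contribution guidelines remain the authority on which fields are required.

```python
import json

# Hypothetical dataset description; every value here is a placeholder.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example clinic visit counts",
    "description": "Hypothetical monthly visit counts, used only to "
                   "illustrate the shape of dataset metadata.",
    "url": "https://example.org/datasets/clinic-visits",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Research Group"},
}

# JSON-LD is usually embedded in the dataset's web page inside a script tag.
json_ld = json.dumps(dataset_metadata, indent=2)
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(script_tag)
```

A crawler that understands this markup can index the dataset from the page without needing access to the data files themselves.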

The diagram below shows how the dataset search engine works. Using standards…


