It’s about time a web database system was designed for usability

Finding a relational database for business or education sounds easy but, in many instances, it is not. Creating a new database can raise connection, configuration, or installation challenges. Charges for use may be difficult to understand. …

I enjoyed reading your series of articles on Orange. My question is how to reduce data leakage. It is clear how to split the data into train/test with Data Sampler, but what if you want to impute, normalize the data, or pick the top 5 relevant predictors? I get the impression that you connect the Preprocess widget directly to Test and Score, which also receives the train and test data from the Data Sampler widget. Do you believe this is the correct workflow to minimize data leakage? Thanks.
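To illustrate the concern outside Orange, here is a minimal Python sketch. It assumes a single numeric feature, and the helper names (`split`, `fit_scaler`, `apply_scaler`) are hypothetical, not part of Orange. The point is only that the scaling statistics are computed from the training split and then reused unchanged on the test split.

```python
import random
import statistics


def split(values, n_train, seed=0):
    """Shuffle a copy of the values and return (train, test)."""
    rng = random.Random(seed)
    shuffled = values[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def fit_scaler(train):
    """Learn the mean and standard deviation from the training split only."""
    return statistics.mean(train), statistics.pstdev(train)


def apply_scaler(values, mean, std):
    """Standardize values with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]


# Toy feature: 20 made-up values, 14 for training and 6 for testing.
values = [float(v) for v in range(1, 21)]
train, test = split(values, n_train=14)

# Fit on train, then apply the same parameters to both splits.
mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)
```

Computing the mean and standard deviation on all 20 values before splitting would let information from the test rows shape the training features, which is the leakage the question is about. The same reasoning applies to imputation and to picking the top predictors.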

You mention missing data, but I didn't see any code to find missing values, only null values. This dataset is riddled with missing data. An insulin level of zero or a triceps thickness of zero means the measurement wasn't done, not that the actual value was zero. By the same token, if zero is listed for pregnancies, does that mean the question was not asked or that the woman has never been pregnant? Nobody understands the diabetic pedigree column. Personally, as a physician data scientist, I don't use this dataset for those reasons.
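A minimal sketch of the kind of check I mean, in plain Python. The rows are made up and the column names (`Pregnancies`, `Insulin`, `SkinThickness`) are assumptions about how the file is labeled. Zeros in the measurement columns are recoded as missing before counting; zeros in the pregnancy column are left alone because their meaning is ambiguous.

```python
# Hypothetical rows for illustration only, not real patient records.
rows = [
    {"Pregnancies": 0, "Insulin": 0, "SkinThickness": 35},
    {"Pregnancies": 1, "Insulin": 94, "SkinThickness": 0},
    {"Pregnancies": 3, "Insulin": 0, "SkinThickness": 0},
]

# Columns where a zero is physiologically implausible (assumed labels).
ZERO_MEANS_MISSING = ("Insulin", "SkinThickness")


def zeros_to_missing(rows, columns=ZERO_MEANS_MISSING):
    """Return new rows with zeros in the given columns replaced by None."""
    return [
        {k: (None if k in columns and v == 0 else v) for k, v in row.items()}
        for row in rows
    ]


def missing_counts(rows):
    """Count None entries per column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (value is None)
    return counts


cleaned = zeros_to_missing(rows)
print(missing_counts(cleaned))
```

A plain null check on the original rows would report no missing values at all, which is why the zeros have to be recoded first.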

It is difficult to find patient-level data of sufficient size for research, modeling, or software development. This is largely due to HIPAA concerns and the overall lack of interoperability in the US healthcare system.

Synthetic data has potential in those areas but much of the generated data is non-medical. For…


Dr. Hoyt is a physician data scientist who is also an author and editor of several books. His most recent textbook is Data Preparation and Exploration.
