I enjoyed reading your series of articles on Orange. My question is how to reduce data leakage. It is clear how to split the data into train/test with Data Sampler, but what if you want to impute, normalize the data, or pick the top 5 relevant predictors? I get the impression that you connect the Preprocess widget directly to Test and Score, which also receives the train and test data from the Data Sampler widget. Do you believe this is the correct workflow to minimize data leakage? Thanks.
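
To make the concern concrete outside of Orange, here is a minimal Python sketch of the leakage-safe pattern: imputation means, scaling parameters, and the top-5 predictors are learned from the training rows only and then applied unchanged to the test rows. The synthetic data, the helper names `fit_preprocessor` and `transform`, and the correlation-based ranking are all assumptions made for illustration; this is not how Orange implements its widgets.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def fit_preprocessor(train_rows, train_labels, k=5):
    """Learn means, standard deviations, and top-k columns from TRAIN rows only."""
    n_cols = len(train_rows[0])
    means, stds, scores = [], [], []
    for j in range(n_cols):
        observed = [r[j] for r in train_rows if r[j] is not None]
        mean = statistics.fmean(observed)
        means.append(mean)
        stds.append(statistics.pstdev(observed) or 1.0)  # avoid dividing by zero
        imputed = [r[j] if r[j] is not None else mean for r in train_rows]
        scores.append(abs(pearson(imputed, train_labels)))
    # Keep the k columns most correlated with the label (ranking is an assumption).
    keep = sorted(sorted(range(n_cols), key=lambda j: scores[j], reverse=True)[:k])
    return means, stds, keep


def transform(rows, params):
    """Apply the already-fitted parameters; nothing is re-estimated here."""
    means, stds, keep = params
    return [
        [((r[j] if r[j] is not None else means[j]) - means[j]) / stds[j] for j in keep]
        for r in rows
    ]


# Synthetic data (made up): 8 predictors, only the first two drive the label.
random.seed(0)
rows = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
labels = [1 if r[0] + r[1] > 0 else 0 for r in rows]
for r in rows:
    for j in range(8):
        if random.random() < 0.1:  # knock out roughly 10% of values
            r[j] = None

train_rows, test_rows = rows[:150], rows[150:]
train_labels = labels[:150]

params = fit_preprocessor(train_rows, train_labels, k=5)  # test rows never seen
train_ready = transform(train_rows, params)
test_ready = transform(test_rows, params)
```

The leaky alternative would be to call `fit_preprocessor` on all 200 rows before splitting, which lets the test rows influence the imputed values, the scaling, and which predictors are selected. Libraries such as scikit-learn package the same idea as a `Pipeline` that is refit inside each cross-validation fold.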

--

You mention missing data, but I didn't see any code to find missing values, just null values. This dataset is riddled with missing data. An insulin level of zero or a triceps thickness of zero means the measurement wasn't done, as opposed to an actual value of zero. By the same token, if zero is listed for pregnancies, does that mean the question was not asked or that the woman has never been pregnant? Nobody understands the diabetic pedigree column. Personally, as a physician data scientist, I don't use this dataset for those reasons.
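
A check for this kind of hidden missingness could look something like the sketch below, which counts zeros in columns where zero is physiologically implausible and recodes them as missing. The column names `Insulin`, `SkinThickness`, and `Pregnancies` are assumed to match the commonly distributed CSV, and the four rows are made up for illustration, not taken from the dataset.

```python
# Columns where a zero most likely means "not measured" (an assumption
# based on the comment above; Pregnancies is left alone because a zero
# there is ambiguous rather than implausible).
ZERO_MEANS_MISSING = ["Insulin", "SkinThickness"]

# Made-up example rows, not real records.
rows = [
    {"Pregnancies": 0, "SkinThickness": 35, "Insulin": 168},
    {"Pregnancies": 2, "SkinThickness": 0, "Insulin": 0},
    {"Pregnancies": 1, "SkinThickness": 23, "Insulin": 94},
    {"Pregnancies": 5, "SkinThickness": 29, "Insulin": 0},
]


def count_zeros(rows, columns):
    """Count how many rows hold a zero in each of the given columns."""
    return {c: sum(1 for r in rows if r[c] == 0) for c in columns}


def zeros_to_missing(rows, columns):
    """Return copies of the rows with zeros in the given columns set to None."""
    return [
        {k: (None if k in columns and v == 0 else v) for k, v in r.items()}
        for r in rows
    ]


zero_counts = count_zeros(rows, ZERO_MEANS_MISSING)
cleaned = zeros_to_missing(rows, ZERO_MEANS_MISSING)
```

A null-value check alone would report nothing for these rows, since the zeros are stored as ordinary numbers; only the recoded version exposes them to an imputer.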

--

Robert Hoyt MD FACP ABPM-CI FAMIA

Dr. Hoyt is a physician data scientist and the author and editor of several textbooks, which can be found at https://www.informaticseducation.org.