I enjoyed reading your series of articles on Orange.

Sep 2, 2021

I enjoyed reading your series of articles on Orange. My question is how to reduce data leakage. It is clear how to split the data into train/test with Data Sampler, but what if you want to impute, normalize the data, or pick the top 5 relevant predictors? I get the impression that you connect the Preprocess widget directly to Test and Score, which also receives the train and test data from the Data Sampler widget. Do you believe this is the correct workflow to minimize data leakage? Thanks.
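
To make the concern concrete, here is a minimal sketch in plain Python (not Orange itself; the function names `fit_scaler` and `apply_scaler` and the toy values are hypothetical, for illustration only). The idea, as I understand it, is that normalization parameters should be learned from the training rows alone and then reused unchanged on the test rows; computing them from all rows would let information from the test set leak into preprocessing.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn normalization parameters from the given values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # fall back to 1.0 if variance is zero
    return mu, sigma

def apply_scaler(values, params):
    """Standardize values using previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Toy data, made up for illustration.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leakage-free: parameters come from the training split only.
params = fit_scaler(train)
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)

# Leaky variant for comparison: the test rows influence the parameters.
leaky_params = fit_scaler(train + test)
```

The same reasoning would apply to imputation values and to ranking predictors: each would be derived from the training split and then applied to the test split.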

Written by Robert (Bob) Hoyt MD FACP ABPM-CI FAMIA