Publication record
Publication date
November 2015
Journal
BMC Medical Research Methodology
Authors
Identified members of Cancéropôle Est:
Ms. TRUNTZER Caroline
All authors:
Hornung R, Bernau C, Truntzer C, Wilson R, Stadler T, Boulesteix AL
PubMed link
Abstract
In applications of supervised statistical learning in the biomedical field, it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset in its entirety before training/test set based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper, we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance, and imputation of missing values.
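The distinction between incomplete and full CV drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the synthetic data, the nearest-centroid classifier, and all function names (`fit_standardizer`, `apply_standardizer`, `nearest_centroid_predict`, `cv_error`) are assumptions chosen for illustration, and simple column standardization stands in for the normalization step. With `incomplete=True` the preparation step is fitted once on the whole dataset; with `incomplete=False` it is refitted on each training fold and only applied to the held-out fold.

```python
import random
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Estimate per-column mean and standard deviation from the given rows."""
    cols = list(zip(*rows))
    # Fall back to 1.0 for constant columns to avoid division by zero.
    return [(mean(c), pstdev(c) or 1.0) for c in cols]


def apply_standardizer(rows, params):
    """Centre and scale rows using previously estimated parameters."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]


def nearest_centroid_predict(train_X, train_y, test_X):
    """Toy classifier (hypothetical stand-in): pick the class with the closest centroid."""
    centroids = {}
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]

    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in test_X]


def cv_error(X, y, k=5, incomplete=False, seed=0):
    """Misclassification rate estimated by k-fold CV.

    incomplete=True  -> standardizer fitted once on ALL observations (incomplete CV).
    incomplete=False -> standardizer refitted on each training fold (full CV).
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    global_params = fit_standardizer(X) if incomplete else None

    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        test_X = [X[i] for i in fold]
        params = global_params if incomplete else fit_standardizer(train_X)
        preds = nearest_centroid_predict(
            apply_standardizer(train_X, params),
            [y[i] for i in train],
            apply_standardizer(test_X, params),
        )
        errors += sum(p != y[i] for p, i in zip(preds, fold))
    return errors / len(X)


# Synthetic two-class data (an assumption for illustration only).
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(label, 1.0) for _ in range(5)] for label in y]

full_cv = cv_error(X, y, incomplete=False)
incomplete_cv = cv_error(X, y, incomplete=True)
print("full CV error:", full_cv)
print("incomplete CV error:", incomplete_cv)
```

Comparing the two estimates over many datasets or repetitions is one way to gauge the bias attributable to a given preparation step. A single run on toy data such as this says nothing about its size or direction.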
Keywords
Algorithms; Data Interpretation, Statistical; Humans; Oligonucleotide Array Sequence Analysis; Principal Component Analysis; Regression Analysis; Selection Bias
Reference
BMC Med Res Methodol. 2015 Nov;15:95