CROSS-VALIDATION ADJUSTMENT FOR MODEL SELECTION WITH CORRELATED DATA
Abstract
In the context of general linear models, standard techniques often rely on an
independence assumption that rarely holds in real data. Real data tend to exhibit
correlated errors, and these correlations can take a variety of structures encoded
in a covariance or correlation matrix. This research focuses primarily on blocked
correlation structures and on structures induced by phylogenetic trees; such
correlation matrices arise in hierarchical models and in phylogenetic modelling
of trait evolution. We propose an adjustment to cross-validation for model
selection with correlated data. We generate a variety of candidate models and
test how well our techniques select the true model from that set. Cross-validation
resamples the data into training and testing sets over K folds. Existing methods
typically account for dependence by transforming the data after the split into
training and testing sets; our approach instead transforms the data with the
inverse square-root covariance matrix (V^−1/2) before the folds are sampled. We
compute the Expected Predictive Log Density (EPLD) to measure predictive accuracy
across the folds, and this loss is applied to each candidate model. We show the
relationship between EPLD and squared error loss and argue that the sum of squared
errors (SSE) can be used as the selection criterion for blocked models.
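
As a rough illustration of the procedure summarized above, the sketch below whitens the response and design with V^−1/2 before forming the folds, fits each training split by ordinary least squares on the transformed data, and scores the held-out fold with both SSE and a Gaussian log predictive density (an EPLD-style score). The function names, the plug-in variance, and the OLS fitting step are illustrative assumptions, not the exact estimators used in the research.

```python
import numpy as np
from scipy.linalg import sqrtm


def whiten(y, X, V):
    """Apply the V^{-1/2} transformation to the response and design.

    Assumes V is a known, positive-definite covariance/correlation matrix.
    """
    V_inv_sqrt = np.linalg.inv(sqrtm(V)).real
    return V_inv_sqrt @ y, V_inv_sqrt @ X


def kfold_scores(y, X, V, K=5, seed=0):
    """Whiten first, then split into K folds and score each held-out fold.

    Returns total SSE and a Gaussian log predictive density summed over folds
    (an illustrative stand-in for EPLD).
    """
    y_t, X_t = whiten(y, X, V)  # transformation applied BEFORE the sampling
    n = len(y_t)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)

    sse, lpd = 0.0, 0.0
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])

        # OLS fit on the whitened training data (illustrative assumption)
        beta, *_ = np.linalg.lstsq(X_t[train], y_t[train], rcond=None)
        resid = y_t[test] - X_t[test] @ beta

        # Plug-in residual variance from the training split (assumption)
        sigma2 = np.mean((y_t[train] - X_t[train] @ beta) ** 2)

        sse += np.sum(resid ** 2)
        lpd += np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                      - resid ** 2 / (2 * sigma2))
    return sse, lpd
```

In this sketch, each candidate design matrix would be passed through kfold_scores and the model with the smallest SSE (or largest log predictive density) selected. Under a Gaussian working model with a common plug-in variance, the fold-wise log predictive density is a constant minus SSE/(2σ²), which is the kind of EPLD–SSE relationship the abstract refers to; the exact argument for blocked models is developed in the body of the work.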