LATENT STRUCTURE IDENTIFICATION AND PERSONALIZED VARIABLE SELECTION

Abstract

Identifying latent structures and selecting personalized variables are critical to enhancing model interpretability, predictive performance, and decision-making efficiency in complex data environments. This thesis has three parts.

In the first part, we develop a method for identifying latent spatial patterns of a response variable in spatial data. The method uses a supervised random forest model to calculate similarity scores between locations, which effectively captures the spatial dependencies of the response variable. The similarity score for two locations is the proportion of trees in which they fall in the same terminal node when the other predictors take the same values. The resulting similarity matrix is then used to derive eigen-scores and spatial clusters, which significantly improve the performance of models such as XGBoost, GWR, and random forest in both simulations and real datasets.

In the second part, we develop an effective neural network pruning method based on backward LASSO selection that simultaneously selects features and structure. We show that the LASSO shrinkage problem in neural networks can be rewritten as a standard weighted regression or classification problem with a LASSO penalty. Our proposed method starts from a dense neural network that contains all structures without feedback and prunes links to select the optimal sparse neural network structure. The results of this structure selection highlight the inadequacy of commonly used feedforward structures. By examining the selected structure, we gain insight into the linear or nonlinear properties of the estimated function and can thus better interpret the underlying function.

In the third part, we introduce personalized variable selection, a novel topic that addresses an important problem: in many real-world applications, some variables are costly or difficult to obtain. For example, in healthcare, ordering excessive medical tests can lead to unnecessary expenses, long waiting times, and patient discomfort. In the personalized variable selection paradigm, we use a fitted model to make predictions for a new observation for which not all of these costly variables have yet been measured. We assess the predictive value of each potentially useful predictor for this new observation in order to decide which predictors are worth measuring. We introduce a novel metric, the Expected Loss Improvement Estimate (ELIE), which quantifies the expected gain in predictive accuracy from measuring a missing variable. The core idea is that a large ELIE indicates greater variability in predictions, so collecting the true values of the missing variables is highly valuable for those data points. This approach helps determine when imputation is sufficient and when additional data collection is necessary to maximize model performance.
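
As an illustration of the similarity score from the first part, the following is a minimal sketch of a tree-based proximity computation, not the thesis's implementation. It assumes the terminal-node assignments are already available as a table in which `leaves[i][t]` is the terminal node of location `i` in tree `t`, evaluated with the other predictors held at common values. The function name `proximity_matrix` and the toy data are hypothetical.

```python
from typing import List, Sequence


def proximity_matrix(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Proportion of trees in which each pair of locations shares a terminal node.

    leaves[i][t] is the terminal-node id of location i in tree t
    (assumed input; how it is obtained is outside this sketch).
    """
    n = len(leaves)
    if n == 0:
        return []
    n_trees = len(leaves[0])
    if n_trees == 0:
        raise ValueError("each location needs at least one tree assignment")

    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Count the trees where locations i and j land in the same leaf.
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            sim[i][j] = sim[j][i] = shared / n_trees
    return sim


# Hypothetical toy input: 3 locations, 4 trees.
toy_leaves = [
    [1, 5, 2, 7],
    [1, 5, 3, 7],
    [4, 6, 3, 8],
]
similarity = proximity_matrix(toy_leaves)
```

With a fitted scikit-learn forest, one way to obtain such a table would be the `apply` method, which returns the leaf index of each sample in each tree. The resulting symmetric matrix could then be passed to an eigendecomposition or a clustering routine, in the spirit of the eigen-scores and spatial clusters described in the abstract.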

Keywords

Spatial Clustering, Similarity Matrix, Random Forest, Neural Network, Neural Network Architecture, Lasso, Personalized Variable Selection, Expected Loss Improvement Estimate, Multivariate Random Forest
