dc.contributor.author: LIU, LIHUI
dc.date.accessioned: 2023-12-11T15:34:53Z
dc.date.available: 2023-12-11T15:34:53Z
dc.date.issued: 2023-12-07
dc.identifier.uri: http://hdl.handle.net/10222/83207
dc.description.abstract: Traditional statistical methods face many challenges in model fitting, variable selection, and model diagnosis when analysing high-dimensional data. LASSO is one of the most popular regularised approaches for high-dimensional data such as gene expression in microbiome research. However, it often selects a large number of noise variables, and it does not provide a direct quantitative assessment of the significance of each selected variable. We present a new variable selection method, Subsampling Ranking Forward selection (SuRF), based on penalised regression, subsampling, and forward selection. We apply our method to classification problems from microbiome data, using a novel agglomeration approach to deal with the special tree-like correlation structure of the variables. Existing methods choose a taxonomic level arbitrarily before performing the analysis, whereas by combining SuRF with these aggregated variables, we are able to identify the key biomarkers at the appropriate taxonomic level, as suggested by the data. The default standardisation used in LASSO regression is effective for normally distributed predictors, but not for predictors from heavy-tailed distributions. We present large-scale simulations showing that heavy-tailed predictors have a large impact on variable selection and prediction in Binomial and Poisson regression, and a less pronounced effect in Gaussian regression. This can cause the model to underselect true predictors from heavy-tailed distributions such as the log-normal and Pareto distributions, and to overselect those variables in Poisson regression. SuRF is less influenced by the distribution of the predictors. A Box-Cox transformation generally improves the selection rate of heavy-tailed predictors for both SuRF and Stability Selection in Binomial regression, but it can have an adverse effect in Poisson regression.
Generalised additive models (GAMs), a type of non-parametric additive model, are a natural choice for extending SuRF to select predictors with a non-linear relationship to the response. This requires replacing GLMs with GAMs in both the ranking and the forward-selection steps of SuRF. The resulting method, SuRFgam, demonstrates superior performance in both non-linear variable selection and prediction accuracy. It is particularly effective at reducing the number of noise variables selected, making it a better choice in various modelling scenarios. [en_US]
dc.language.iso: en [en_US]
dc.subject: Variable selection [en_US]
dc.subject: Lasso [en_US]
dc.subject: Forward selection [en_US]
dc.subject: GLM models [en_US]
dc.subject: Variable ranking [en_US]
dc.subject: Microbiome [en_US]
dc.title: VARIABLE SELECTION BY SUBSAMPLING RANKING FORWARD SELECTION (SURF) [en_US]
dc.date.defence: 2023-11-22
dc.contributor.department: Department of Mathematics & Statistics - Statistics Division [en_US]
dc.contributor.degree: Doctor of Philosophy [en_US]
dc.contributor.external-examiner: Dr. Zeny Feng [en_US]
dc.contributor.thesis-reader: Dr. Bruce Smith [en_US]
dc.contributor.thesis-reader: Dr. Lam Ho [en_US]
dc.contributor.thesis-supervisor: Dr. Hong Gu [en_US]
dc.contributor.thesis-supervisor: Dr. Toby Kenney [en_US]
dc.contributor.ethics-approval: Not Applicable [en_US]
dc.contributor.manuscripts: Not Applicable [en_US]
dc.contributor.copyright-release: Not Applicable [en_US]