Generalizing the Linear Step-up Procedure for False Discovery Rate Control with Applications to Setwise and High Dimensional Variable Selection

Organ, Sarah

Files

SarahOrgan2025.pdf (1.35 MB)

Date

2025-12-15

Authors

Organ, Sarah

Abstract

This thesis presents a unified framework for false discovery rate (FDR) controlled variable selection that addresses three major challenges: (1) the inability of traditional FDR procedures to accommodate structured, non-independent hypothesis testing, (2) the failure of standard FDR control procedures for variable selection in the presence of strong dependence among predictor variables, and (3) the lack of methodology for obtaining the valid p-values needed to apply FDR control in high-dimensional data, where the number of variables exceeds the number of samples (m > n). To address the first challenge, we generalize the linear step-up procedure by applying a sizing function that accounts for dependence structures among hypotheses. This is a theoretical extension of the linear step-up framework to structured collections of dependent hypotheses, and we provide theoretical guarantees that the generalization preserves FDR control under dependence in structured hypothesis settings. These generalized procedures form the foundation for both the SHRED and HVS methods described later. To address the second challenge, we propose a family of methods for setwise variable selection that extend the classical FDR framework to non-independent hypothesis spaces. We first introduce the Setwise Hierarchical Rate of Erroneous Discovery (SHRED) methods, which perform FDR-controlled variable selection over hierarchical trees of hypotheses. For these methods, we broaden the notion of variable selection to allow the selection of sets of highly correlated surrogate variables, under the assumption that at least one variable in each selected set is a true variable. We then introduce the Hypergraph Variable Selection (HVS) method, which enables setwise variable selection over a hypergraph of hypotheses representing complex dependencies among variables.
In simulation studies we show that both the SHRED and HVS methods have significant advantages over current FDR control methods for variable selection when the predictor variables are correlated. Finally, to overcome the third challenge, the lack of methodology for obtaining valid p-values in high-dimensional settings, we introduce a multi-step p-value estimation procedure that enables the application of linear step-up methods for FDR control when m > n. This procedure begins with a data reduction step that identifies a subset of variables believed to contain all true signals, followed by a conservative re-fitting step that yields valid p-values for inference. We provide theoretical guarantees showing that this approach preserves FDR control, as well as simulation studies highlighting its advantages over classical high-dimensional variable selection methods for FDR control. Collectively, these contributions advance the theory and practice of FDR-controlled variable selection, offering a flexible and principled framework that accommodates structure, dependence, and high dimensionality, three fundamental challenges of variable selection with FDR control.
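
For orientation, the classical linear step-up procedure that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the standard Benjamini-Hochberg form only, not the generalized procedure, the sizing function, SHRED, or HVS developed in the thesis. The function name `linear_step_up` and the example p-values are assumptions made for illustration.

```python
from typing import List, Sequence


def linear_step_up(p_values: Sequence[float], alpha: float = 0.05) -> List[int]:
    """Classical linear step-up (Benjamini-Hochberg) procedure.

    Returns the indices of the rejected hypotheses, in ascending order.
    Hypothetical helper for illustration; not code from the thesis.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= k * alpha / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank

    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])


if __name__ == "__main__":
    # Illustrative p-values (made up for this sketch).
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(linear_step_up(example, alpha=0.05))
```

The "step-up" character shows in the loop: the search keeps the largest rank whose p-value falls under the linear threshold, so a hypothesis can be rejected even when smaller p-values individually miss their own thresholds.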

Keywords

FDR Control, Variable Selection, High dimensional variable selection, High correlation, Linear step-up procedures

URI

https://hdl.handle.net/10222/85568

Collections

Faculty of Graduate Studies Online Theses
