Biologically Informed Feature Selection in Large Scale Genomics
Predictive genetics is a promising field of research, particularly in medical science, where the ability to identify disease or treatment response could provide novel methods of mitigating their negative effects. Machine learning is the most obvious tool for this purpose; however, genetic data typically contain far more features than samples, an imbalance that is difficult for machine learning and indicates the need for feature selection. The dataset we used was collected from multiple international centres and includes subjects with bipolar disorder, some of whom respond to the drug lithium and some of whom do not. We first select the features that were measured by every data collection centre and show that above-chance classification is possible with these data, despite significant overfitting, which indicates the need for further reduction of the feature space. We then introduce a novel method that reduces the number of features further, so that it is bounded by the number of subjects. This method uses the hierarchical structure of genetic data to select feature subsets and evaluates the fitness of each subset individually before including the best ones in the final feature set. We show that our method improves on the first method while maintaining biological interpretability.
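
The general idea of hierarchy-guided subset selection can be illustrated with a minimal sketch. It is not the method developed in this work, whose details the abstract does not give. The sketch assumes that features (for example, variants) are grouped by a higher-level unit such as a gene, that the fitness of a group is the leave-one-out accuracy of a simple nearest-centroid classifier, and that groups are added in order of fitness until the feature count would exceed the number of subjects. The function names, the `min_score` threshold, and the toy data are all hypothetical.

```python
from collections import defaultdict


def nearest_centroid_loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed fitness measure)."""
    n = len(X)
    correct = 0
    for i in range(n):
        # Build class centroids from every subject except the held-out one.
        sums, counts = {}, defaultdict(int)
        for j in range(n):
            if j == i:
                continue
            s = sums.setdefault(y[j], [0.0] * len(X[j]))
            for k, v in enumerate(X[j]):
                s[k] += v
            counts[y[j]] += 1
        # Predict the class whose centroid is closest in squared Euclidean distance.
        best_label, best_dist = None, float("inf")
        for label, s in sums.items():
            dist = sum((X[i][k] - s[k] / counts[label]) ** 2 for k in range(len(s)))
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == y[i]
    return correct / n


def select_features(X, y, groups, max_features=None, min_score=0.5):
    """Score each feature group on its own, then keep the best groups.

    groups maps a group name (hypothetically, a gene) to a list of column indices.
    max_features defaults to the number of subjects, so the final feature
    count is bounded by the sample size.
    """
    if max_features is None:
        max_features = len(X)
    scored = []
    for name, cols in groups.items():
        sub = [[row[c] for c in cols] for row in X]
        scored.append((nearest_centroid_loo_accuracy(sub, y), name))
    # Best score first; ties are broken by group name for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    chosen, selected = [], []
    for score, name in scored:
        cols = groups[name]
        # Skip groups at or below the threshold or that would exceed the budget.
        if score <= min_score or len(selected) + len(cols) > max_features:
            continue
        chosen.append(name)
        selected.extend(cols)
    return chosen, selected


# Toy data: 6 subjects, 8 features in three hypothetical gene-level groups.
X = [
    [2, 2, 0, 1, 2, 1, 1, 0],
    [2, 1, 1, 0, 1, 0, 1, 1],
    [1, 2, 2, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 2, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 1],
    [1, 0, 2, 1, 0, 1, 0, 1],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = responder, 0 = non-responder (illustrative labels)
groups = {"geneA": [0, 1], "geneB": [2, 3, 4], "geneC": [5, 6, 7]}

chosen, cols = select_features(X, y, groups)
print(chosen, cols)
```

Because every retained feature belongs to a named group, the selected set can be read back in terms of those groups, which is one way a hierarchy-based selection can stay interpretable. In practice the fitness estimate would need to be computed inside an outer cross-validation loop to avoid optimistic bias.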