A NEW METHOD FOR MULTI-CLASS CLASSIFICATION WITH MULTIPLE DATA SOURCES, WITH APPLICATION TO ABDOMINAL PAIN DIAGNOSIS
Date
2022-07-21T14:17:37Z
Authors
Ling, Shen
Abstract
In this thesis, we deal with two extremely challenging issues that arise in a medical diagnosis problem: multi-class classification and the integration of data from multiple sources. Both issues arise in a wide variety of data analysis problems. We present simple but effective methods for dealing with them that significantly improve performance in an abdominal pain emergency diagnosis problem and are widely applicable wherever these issues arise.
For integrating data from multiple sources, such as various medical tests that might be ordered for a patient, our method involves fitting separate predictors on the different sources of data, then performing a linear combination of these predictors. We show that in common cases, this method performs asymptotically better than analysing a single source of data. We also show that the method performs well compared to the popular multiple imputation approach. This very straightforward approach is applicable to a wide range of problems.
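The two-stage idea described above can be sketched as follows. This is a minimal illustration only, not the thesis's implementation: the simulated data, the use of ordinary least squares for every predictor, the three-way row split, and all names (`x_a`, `x_b`, `fit_ols`, `stacked`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600

# Two hypothetical data sources describing the same cases (simulated, not real data).
x_a = rng.normal(size=(n, 2))
x_b = rng.normal(size=(n, 3))
y = (x_a @ np.array([1.0, -0.5])
     + x_b @ np.array([0.8, 0.0, 0.3])
     + rng.normal(scale=0.5, size=n))

# Assumed split: fit per-source predictors, fit the combination, then evaluate.
fit_rows = slice(0, 300)
combine_rows = slice(300, 450)
test_rows = slice(450, n)


def fit_ols(x, target):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_ols(coef, x):
    """Predictions from coefficients returned by fit_ols."""
    return np.column_stack([np.ones(len(x)), x]) @ coef


# Stage 1: one predictor per data source.
coef_a = fit_ols(x_a[fit_rows], y[fit_rows])
coef_b = fit_ols(x_b[fit_rows], y[fit_rows])


def stacked(rows):
    """Per-source predictions placed side by side, one column per source."""
    return np.column_stack([predict_ols(coef_a, x_a[rows]),
                            predict_ols(coef_b, x_b[rows])])


# Stage 2: a linear combination of the per-source predictions.
weights = fit_ols(stacked(combine_rows), y[combine_rows])


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


mse_a = mse(predict_ols(coef_a, x_a[test_rows]), y[test_rows])
mse_b = mse(predict_ols(coef_b, x_b[test_rows]), y[test_rows])
mse_combined = mse(predict_ols(weights, stacked(test_rows)), y[test_rows])
print(mse_a, mse_b, mse_combined)
```

In this simulation each source carries signal the other lacks, so the combined predictor has lower held-out error than either single-source predictor. Whether that holds on real data depends on how the sources relate to the outcome.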
For the multi-class classification, we develop a hierarchical tree clustering of the diagnoses, thus reducing the multi-class classification to a series of binary classifications. The hierarchical tree is created using a mixture of expert knowledge and data-driven methods based on posterior predictive probability. We use a statistical learning method to combine the outputs of the binary classifications into an overall output. We find that this works better than multiplying the probabilities from the binary classifiers, which can be misled by conditional classifiers whose conditions are not met.
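To make the baseline concrete, the sketch below shows the probability-multiplication rule on a hypothetical two-level tree with four generic classes. The tree shape, the class labels `A` to `D`, and the function name `leaf_probabilities` are assumptions for illustration; the thesis's actual tree and its learned combiner are not reproduced here.

```python
def leaf_probabilities(p_left, p_a_given_left, p_c_given_right):
    """Multiply binary-classifier outputs along each root-to-leaf path.

    Hypothetical tree: the root splits {A, B} from {C, D}; the left node
    splits A from B; the right node splits C from D.
    """
    return {
        "A": p_left * p_a_given_left,
        "B": p_left * (1.0 - p_a_given_left),
        "C": (1.0 - p_left) * p_c_given_right,
        "D": (1.0 - p_left) * (1.0 - p_c_given_right),
    }


# Example outputs from the three binary classifiers for one case (made-up values).
probs = leaf_probabilities(p_left=0.9, p_a_given_left=0.8, p_c_given_right=0.5)
predicted = max(probs, key=probs.get)
print(probs, predicted)
```

The right-hand node classifier is only meaningful for cases that truly belong to the right branch, yet its output still enters the product for every case. An alternative in the spirit of the abstract is to pass the three binary outputs as features to a second-stage learner instead of multiplying them.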
Description
The thesis is concerned with classification problems involving multiple modalities of data and their application to medical diagnosis. The main challenge is how to handle block-missing data, which is inherent in medical data and occurs commonly in many other real-world problems.
In this thesis, the author introduces a novel approach to block-missing data: a linear combination of the full model and the partial model. The author also derives the theoretical combination factor, along with some very interesting results, in linear and logistic regression settings, and employs cross-validation in more general settings.
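One way to read the cross-validated combination is sketched below. Everything here is an assumption made for illustration: the "full" predictor is taken to be a least-squares fit on complete cases using both blocks, the "partial" predictor a fit on all cases using only the always-observed block, and the combination factor `w` is picked from a grid by K-fold cross-validation over the complete cases. The names `cv_weight`, `x1`, `x2`, and `observed` are hypothetical.

```python
import numpy as np


def fit_ols(x, target):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_ols(coef, x):
    return np.column_stack([np.ones(len(x)), x]) @ coef


def cv_weight(x1, x2, y, observed, grid, n_folds=5, seed=0):
    """Pick w in grid minimising CV error of w * full + (1 - w) * partial.

    x1 is observed for every case; rows of x2 are valid only where
    observed is True (the missing block).
    """
    rng = np.random.default_rng(seed)
    complete = np.flatnonzero(observed)
    folds = np.array_split(rng.permutation(complete), n_folds)
    errors = np.zeros(len(grid))
    for held_out in folds:
        keep = np.ones(len(y), dtype=bool)
        keep[held_out] = False
        full_rows = keep & observed
        # Full model: both blocks, complete cases only.
        full = fit_ols(np.column_stack([x1[full_rows], x2[full_rows]]), y[full_rows])
        # Partial model: always-observed block, all remaining cases.
        partial = fit_ols(x1[keep], y[keep])
        pred_full = predict_ols(full, np.column_stack([x1[held_out], x2[held_out]]))
        pred_partial = predict_ols(partial, x1[held_out])
        for i, w in enumerate(grid):
            combined = w * pred_full + (1.0 - w) * pred_partial
            errors[i] += np.sum((y[held_out] - combined) ** 2)
    return float(grid[int(np.argmin(errors))])


# Simulated example: the second block is missing for most cases.
rng = np.random.default_rng(1)
n = 400
x1 = rng.normal(size=(n, 2))
x2 = rng.normal(size=(n, 2))
y = x1 @ np.array([1.0, 0.5]) + x2 @ np.array([1.5, -1.0]) + rng.normal(size=n)
observed = np.zeros(n, dtype=bool)
observed[:120] = True
x2[~observed] = np.nan  # block-missing rows are never read by cv_weight

grid = np.linspace(0.0, 1.0, 11)
best_w = cv_weight(x1, x2, y, observed, grid)
print(best_w)
```

A grid search keeps the sketch simple; in the linear and logistic regression settings mentioned above, the thesis derives the combination factor theoretically, and that derivation is not reproduced here.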
The author also proposes a hierarchical tree structure to deal with multi-class classification problems. The author shows that the proposed approach does indeed improve upon the method based solely on the full model. Extensive simulation studies show the advantages of adopting the combination approach over the traditional imputation method. The author has applied the method to real medical data to good effect.
Overall, the thesis studies an important problem that occurs commonly in real-life data and comes up with a novel and effective combination approach to solve it. The results are of theoretical interest and of practical significance. I recommend the thesis proceed to defence.
Keywords
block missing, the model combination method, hierarchical tree structure