EXPLORATION OF MULTIVARIATE CHEMICAL DATA IN NOISY ENVIRONMENTS: NEW ALGORITHMS AND SIMULATION METHOD
Date
2019-12-13T13:33:17Z
Authors
Driscoll, Stephen
Abstract
With high-dimensional measurements becoming increasingly common in chemistry, the efficient extraction of meaningful information from chemical data has never been more important. Chemometrics, a sub-discipline of analytical chemistry, emerged from the need for more advanced multivariate data analysis methods capable of solving more complex chemical problems. The goal of chemometrics can be stated simply as differentiating chemical variance from variance due to measurement error. All analytical measurements are subject to errors, sometimes called noise, that contribute uncertainty to any type of analysis. The current literature lacks both realistic noise simulation in the evaluation of new algorithms and approachable methods for performing such simulation. Chapters 2, 3, and 4 of this thesis address these shortcomings. Chapters 2 and 3 describe a simple method for simulating realistic analytical measurement errors, while Chapter 4 describes a method for accommodating different error structures in the analysis of fused multivariate data, an advance that circumvents the need for complicated preprocessing of these increasingly common data structures. Although many advances have been made in developing new algorithms that provide meaningful results when exploring modern chemical data sets, variance-based methods, such as principal component analysis (PCA), still dominate the field. A promising alternative that is not based on variance is projection pursuit analysis (PPA). However, the ordinary PPA algorithm requires PCA as a preliminary step when the number of response variables is large relative to the number of samples, which is the case in most multivariate chemical data sets. Chapters 5 and 6 address this issue by proposing a sparse PPA algorithm that is independent of PCA and is shown to reveal meaningful results where PCA and ordinary PPA cannot.
Another issue with ordinary PPA is that it performs poorly when applied to unbalanced data sets or to data sets in which the number of classes is not a power of 2. Chapter 7 addresses this issue by implementing an augmentation strategy that allows for the analysis of unbalanced data and the sequential extraction of clusters with projection pursuit.
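As a rough illustration of what "realistic" measurement noise can mean in contrast to the usual independent, constant-variance assumption, the sketch below combines three common error contributions: independent noise, noise proportional to the signal, and a baseline offset shared by all channels of a replicate. This is not the simulation method of Chapters 2 and 3, which the abstract does not specify; the function name `simulate_noisy_replicates`, its parameters, and the Gaussian test signal are assumptions made only for this example.

```python
import numpy as np


def simulate_noisy_replicates(signal, n_rep, sd_iid=0.01, rel_sd=0.02,
                              sd_offset=0.05, seed=0):
    """Return n_rep noisy copies of a 1-D signal (hypothetical helper).

    Three illustrative error sources are summed:
      - independent noise with constant standard deviation sd_iid,
      - proportional noise whose standard deviation is rel_sd * signal,
      - one random baseline offset per replicate, shared by every channel,
        which makes the errors correlated across channels.
    """
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    n_var = signal.size

    iid = rng.normal(0.0, sd_iid, size=(n_rep, n_var))
    proportional = rng.normal(0.0, rel_sd, size=(n_rep, n_var)) * signal
    offset = rng.normal(0.0, sd_offset, size=(n_rep, 1))  # broadcast over channels

    return signal + iid + proportional + offset


# Assumed test signal: a single Gaussian peak on 50 channels.
x = np.linspace(0.0, 1.0, 50)
signal = np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)

replicates = simulate_noisy_replicates(signal, n_rep=2000)

# Empirical error covariance matrix (channels x channels).
error_cov = np.cov(replicates - signal, rowvar=False)
```

In this sketch the shared offset produces clearly non-zero off-diagonal elements in `error_cov`, so the errors are correlated between channels, which a simulation using only independent noise would not reproduce.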
Keywords
chemometrics, multivariate statistics, analytical chemistry