Improved Projection Methods for Exploratory Data Analysis in Chemistry
Date
2012-08-24
Authors
Hou, Siyuan
Abstract
With the rapid development of modern instruments, chemical data have become more complex in both volume and structure, which places greater demands on data analysis tools. As a highly interdisciplinary field, chemometrics plays an important role in extracting information from chemical data. One application of chemometrics is exploratory data analysis, which aims to reveal structure present in the data prior to, or in place of, the formal testing of a hypothesis.
Among the different methods for exploratory data analysis, principal component analysis (PCA) may be the one most widely used in chemistry. When PCA is viewed as a subspace modeling technique from the perspective of maximum likelihood, it essentially assumes homoscedastic measurement errors. However, heteroscedastic errors are common in multivariate chemical data, so PCA often fails to extract useful information when the errors are significantly heteroscedastic. Maximum likelihood principal component analysis (MLPCA) has been developed to address heteroscedastic errors in multivariate data, but its application in exploratory data analysis has not been examined. Chapter 2 of this thesis describes strategies for exploratory data analysis in situations with highly heteroscedastic errors, including the application of MLPCA. A partial transparency projection (PTP) technique is also introduced to improve visualization by using the measurement error information. Following from the work in Chapter 2, Chapter 3 proposes a new optimization algorithm for MLPCA models with non-zero intercepts.
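To make the contrast between the two error assumptions concrete, the objectives can be sketched as follows. This is an illustration under simplifying assumptions, not the formulation used in the thesis: the errors are taken to be independent with known standard deviations, and the symbols (data matrix elements x_ij, rank-p estimates \hat{x}_ij, error standard deviations \sigma_ij) are introduced here for illustration only. The general MLPCA formulation also allows correlated errors through error covariance matrices.

```latex
% PCA: rank-p least-squares fit, implicitly treating all errors as equal in size
\min_{\hat{X},\ \operatorname{rank}(\hat{X}) = p} \;
  \sum_{i=1}^{m} \sum_{j=1}^{n} \left( x_{ij} - \hat{x}_{ij} \right)^{2}

% MLPCA (independent errors): each residual is weighted by its error variance
\min_{\hat{X},\ \operatorname{rank}(\hat{X}) = p} \;
  \sum_{i=1}^{m} \sum_{j=1}^{n}
  \frac{\left( x_{ij} - \hat{x}_{ij} \right)^{2}}{\sigma_{ij}^{2}}
```

When every \sigma_ij is the same, the second objective is a constant multiple of the first, so the two fits coincide. That is the homoscedastic case in which PCA is the maximum likelihood solution.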
Projection pursuit (PP) is another important method for exploratory data analysis. PP is less widely used than PCA, but it is more powerful than PCA in many cases. One major reason for the limited application of PP is the difficulty of implementing it efficiently. Chapter 4 describes new algorithms, referred to as quasi-power methods, for the optimization of kurtosis, which is used as the objective function for projection pursuit. As an extension of the work in Chapter 4, regularized projection pursuit (RPP), designed to deal with data that have a small sample-to-variable ratio, is proposed in Chapter 5. This method is particularly relevant in chemical applications because chemical data typically have few samples but many variables.
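The role of kurtosis as a projection index can be illustrated with a minimal sketch. It is not the quasi-power method of Chapter 4: it only evaluates the kurtosis of two fixed projection directions on simulated data and compares the result with the direction PCA would pick. The function name `projection_kurtosis`, the simulated data, and all parameter values are assumptions made for this example.

```python
import numpy as np

def projection_kurtosis(X, v):
    """Kurtosis (fourth moment over squared second moment) of X projected onto v."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)           # use a unit-length direction
    z = (X - X.mean(axis=0)) @ v        # mean-centered scores along v
    return np.mean(z**4) / np.mean(z**2) ** 2

# Simulated data (hypothetical): two clusters separated along variable 1,
# plus a high-variance Gaussian variable 2 that carries no class structure.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
X = np.column_stack([
    np.where(labels == 0, -2.0, 2.0) + 0.5 * rng.standard_normal(n),
    5.0 * rng.standard_normal(n),
])

k_cluster = projection_kurtosis(X, [1.0, 0.0])  # direction separating the clusters
k_noise = projection_kurtosis(X, [0.0, 1.0])    # direction of pure Gaussian noise

# First principal component: leading right singular vector of the centered data
_, _, vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
pc1 = vt[0]

print("kurtosis along cluster direction:", k_cluster)
print("kurtosis along noise direction:  ", k_noise)
print("first principal component:       ", pc1)
```

In this simulation the bimodal direction has a much lower kurtosis than the Gaussian direction (whose kurtosis is close to 3), so a search that minimizes kurtosis would favor the direction that separates the clusters. The first principal component instead aligns with the high-variance noise variable, which is one way PP can reveal structure that PCA misses.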