Ensemble Learning for Visual Recognition
Abstract
Due to the widespread availability of affordable mobile cameras and popularity of social networking, a large variety of visual-based applications are emerging. These diverse applications have attracted many researchers and high-profile companies around the world, and led to the development of a wide range of approaches for visual content analysis. In the last two decades, a large number of methodologies have been proposed for enhanced classification of images and videos. However, the potential improvement in visual data classification through classifier fusion by ensemble-based methods has remained relatively unstudied.
In this dissertation, we present new ensemble classification models and propose the employment of such models for some challenging visual recognition tasks, including both image classification and human action recognition. First, we present a unifying framework for multiple classifier systems, which unites most classification methods by an ensemble of classifiers. Following this perspective, we propose a new general approach to ensemble classification, named generic subclass ensemble.
Then, we focus on the subclass approach, specifically on Error Correcting Output Codes (ECOC). We propose a subspace approach to ECOC by defining a 3D-ECOC matrix, where the third dimension corresponds to the feature space. Also, Genetic Algorithm is employed for a highly discriminative design of an application-dependent subspace ECOC.
Then, the ensemble approach is utilized for action recognition using depth data; and present three new methods that improve the classification of depth-based action videos. First, we address two action recognition problems using skeleton data; and propose the use of an ensemble framework based on the Dempster-Shafer fusion of classifiers. In the second method, a new class of SVM that is applicable to trajectory classification, such as action recognition, is developed by incorporating two efficient time-series distances measures into the kernel function. The third method is based on the well-known Bag of Visual Words (BoVW) framework: we propose a new coding algorithm by jointly encoding the set of local descriptors of each sample and considering the locality structure of descriptors. In this model, two classifiers are trained using both skeleton and depth feature sets, and then combined. Experimental comparison of the proposed methodologies shows the superiority of our methods compared to the state-of-the-art.