FAST CLUSTERING WITH NOISE REMOVAL FOR LARGE DATASETS
The availability of large temporal data sets, enabled by improved collection tools and storage devices, has posed a new set of challenges in data mining, especially in clustering data into groups according to their basic attributes. Existing clustering algorithms, such as K-means, tend to suffer from slow processing speed. In addition, most of them lack the ability to eliminate outliers and anomalies. In this thesis, we present three fast clustering algorithms with noise removal capability: KD, KDS, and KDSD. The proposed algorithms draw on features of three existing data mining methods: K-means, DBSCAN, and K-Nearest Neighbor (KNN). K-means is an effective clustering algorithm, but the clusters it produces are likely to include many outliers, and it does not scale well with cluster size. To tackle the outlier problem, we propose KD, a novel clustering algorithm with noise removal capability that is based on K-means and DBSCAN. Essentially, DBSCAN is employed to remove the outliers from the clusters produced by K-means. To solve the scaling problem of K-means, we propose KDS, a fast clustering algorithm that scales well. Finally, we propose KDSD, a fast clustering algorithm with noise removal capability that achieves both excellent scalability and noise removal. The performance of the proposed algorithms is investigated through extensive experiments on a large power consumption data set. Our experimental results indicate that KDS runs much faster than K-means. Specifically, K-means takes 7.56 seconds to cluster the whole data set under investigation, whereas KDS takes 0.363 seconds and 0.513 seconds with 1% and 5% training samples, respectively. In addition, although KDSD is not as fast as KDS because of its final anomaly removal operation, it outperforms KD.
In our experiments, KD takes 268.62 seconds to complete the clustering process, while KDSD takes 237.836 seconds in the worst case.
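
The KD idea stated above, K-means followed by DBSCAN-based outlier removal inside each resulting cluster, can be illustrated with a minimal sketch. This is not the implementation evaluated in the thesis. The function names (kmeans, dbscan_noise, kd), the parameters eps and min_pts, the choice of the first k points as initial centroids, and the toy data are all assumptions made for illustration. Noise is taken here in the usual DBSCAN sense: a point that is neither a core point nor within eps of a core point.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means. Seeding with the first k points is an assumption."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def dbscan_noise(points, eps, min_pts):
    """Indices DBSCAN would mark as noise: not core, and no core point within eps."""
    n = len(points)
    # Each neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    return {i for i in range(n)
            if i not in core and not any(j in core for j in neighbors[i])}

def kd(points, k, eps, min_pts):
    """K-means first, then a DBSCAN-style noise filter inside each cluster."""
    labels = kmeans(points, k)
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        for local in dbscan_noise([points[i] for i in idx], eps, min_pts):
            labels[idx[local]] = -1  # -1 marks a removed outlier
    return labels

# Toy data (hypothetical): two tight groups plus one stray point at (4, 4).
pts = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (1, 1), (11, 11), (4, 4)]
print(kd(pts, k=2, eps=1.5, min_pts=3))  # [0, 1, 0, 1, 0, 1, 0, 1, -1]
```

On this toy input K-means alone assigns the stray point to the first cluster, and the per-cluster density check then relabels it as noise. The quadratic neighbor search is chosen for brevity only and would not be suitable for a data set of the size discussed above.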