FAST CLUSTERING WITH NOISE REMOVAL FOR LARGE DATASETS

Odebode, Afees

dc.contributor.author	Odebode, Afees
dc.contributor.copyright-release	Not Applicable	en_US
dc.contributor.degree	Master of Computer Science	en_US
dc.contributor.department	Faculty of Computer Science	en_US
dc.contributor.ethics-approval	Not Applicable	en_US
dc.contributor.external-examiner	n/a	en_US
dc.contributor.graduate-coordinator	Malcolm Heywood	en_US
dc.contributor.manuscripts	Not Applicable	en_US
dc.contributor.thesis-reader	Dr. Vlado Keselj	en_US
dc.contributor.thesis-reader	Dr. Qigang Gao	en_US
dc.contributor.thesis-supervisor	Dr. Srinivas Sampalli	en_US
dc.contributor.thesis-supervisor	Dr. Qiang Ye	en_US
dc.date.accessioned	2017-09-01T18:13:54Z
dc.date.available	2017-09-01T18:13:54Z
dc.date.defence	2017-08-11
dc.date.issued	2017-09-01T18:13:54Z
dc.description	Thesis submission	en_US
dc.description.abstract	The availability of large temporal data sets, enabled by improved collection tools and storage devices, has posed a new set of challenges in data mining, especially in clustering data into groups according to their basic attributes. Existing clustering algorithms, such as K-means, tend to suffer from slow processing speed. In addition, most of them lack the ability to eliminate outliers and anomalies. In this thesis, we present three fast clustering algorithms with noise removal capability: KD, KDS, and KDSD. The proposed algorithms build on three existing data mining methods: K-means, DBSCAN, and K-Nearest Neighbor (KNN). K-means is an effective clustering algorithm, but the clusters it produces are likely to include many outliers. In addition, K-means does not scale well with cluster size. To tackle the outlier problem, we propose KD, a novel clustering algorithm with noise removal capability that is based on K-means and DBSCAN. Essentially, DBSCAN is employed to remove the outliers from the clusters produced by K-means. To solve the scaling problem of K-means, we propose KDS, a fast clustering algorithm that scales well. Finally, we propose KDSD, a fast clustering algorithm with noise removal capability that achieves both scalability and noise removal. The performance of the proposed algorithms is investigated through extensive experiments with a large power consumption data set. Our experimental results indicate that KDS runs much faster than K-means. Specifically, K-means takes 7.56 seconds to cluster the whole data set under investigation, whereas KDS takes 0.363 seconds and 0.513 seconds with 1% and 5% training samples, respectively. In addition, although KDSD is not as fast as KDS because of the final anomaly removal operation, it outperforms KD. In our experiments, KD takes 268.62 seconds to complete the clustering process, while KDSD takes 237.836 seconds in the worst case.	en_US
dc.identifier.uri	http://hdl.handle.net/10222/73283
dc.language.iso	en	en_US
dc.subject	Outliers	en_US
dc.subject	Clustering	en_US
dc.subject	Smart Meters	en_US
dc.title	FAST CLUSTERING WITH NOISE REMOVAL FOR LARGE DATASETS	en_US
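
The abstract describes KD as K-means followed by DBSCAN-based outlier removal inside each resulting cluster. The following is a minimal sketch of that idea only, not the thesis implementation, which this record does not show. The function names (kmeans, dbscan_noise, kd_cluster), the parameters eps and min_pts, the toy points, and the convention of labelling removed points -1 are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def dbscan_noise(points, eps, min_pts):
    """Indices that DBSCAN would mark as noise: points that are neither
    core points nor within eps of a core point."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # A neighborhood includes the point itself.
    core = [len(nb) >= min_pts for nb in neighbors]
    return {i for i in range(n)
            if not core[i] and not any(core[j] for j in neighbors[i])}


def kd_cluster(points, k, eps, min_pts):
    """Assumed KD sketch: K-means first, then density-based noise removal
    within each K-means cluster. Removed points get the label -1."""
    labels = kmeans(points, k)
    for c in range(k):
        idx = [i for i, l in enumerate(labels) if l == c]
        for local in dbscan_noise([points[i] for i in idx], eps, min_pts):
            labels[idx[local]] = -1
    return labels


# Toy data (hypothetical): two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (2.0, 9.0)]
labels = kd_cluster(points, k=2, eps=0.5, min_pts=3)
```

On this toy input, the isolated point at (2.0, 9.0) has no neighbors within eps, so it is marked as noise, while the two dense groups keep their K-means labels. The pairwise distance computation is quadratic in cluster size, which is consistent with the abstract's observation that the noise removal step dominates the running time of KD.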

Files

Original bundle

Name:: Odebode-Afees-MCS-CSCI-August-2017.pdf
Size:: 3.64 MB
Format:: Adobe Portable Document Format
Description:: Final Thesis Submission

License bundle

Name:: license.txt
Size:: 1.71 KB
Format:: Item-specific license agreed upon to submission
Description:

Collections

Faculty of Graduate Studies Online Theses