Fast K-Means Clustering Via K-D Trees, Sampling, and Parallelism

Crowell, Thomas

Files

Crowell-Thomas-MCS-CSCI-August-2019.pdf (854.08 KB)

Date

2019-08-29T14:19:08Z

Authors

Crowell, Thomas

Abstract

K-means is commonly used for clustering in applications that require fast response times because the algorithm itself is fast. As datasets grow large (millions of data points), the classical implementation may no longer achieve the performance these applications need. By combining the filtering algorithm using k-d trees, aggressive sampling, and parallelism with dynamic load balancing, we implement a version of k-means that outperforms the standard algorithms used for these applications. We find that aggressive sampling at 1% of the dataset, combined with the filtering algorithm, provides significant speed-up without sacrificing accuracy. Overheads in implementing parallel methods prevent significant speed-up on smaller datasets, especially when the data has already been sampled, but our experiments show that the parallel speed-up improves as the dataset grows.
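
The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it substitutes plain Lloyd iterations for the k-d tree filtering algorithm, runs serially, and assumes uniform random sampling. The function names (`lloyd_kmeans`, `sampled_kmeans`) and the parameters `fraction`, `iterations`, and `seed` are hypothetical, chosen only for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)),
               key=lambda j: squared_distance(point, centers[j]))


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations (a stand-in for the filtering algorithm)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct points drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in points:
            j = nearest(p, centers)
            counts[j] += 1
            for d, value in enumerate(p):
                sums[j][d] += value
        for j in range(k):
            if counts[j]:  # an empty cluster keeps its previous center
                centers[j] = [s / counts[j] for s in sums[j]]
    return centers


def sampled_kmeans(points, k, fraction=0.01, seed=0):
    """Fit centers on a uniform sample, then label every point."""
    rng = random.Random(seed)
    sample_size = max(k, int(len(points) * fraction))
    sample = rng.sample(points, sample_size)
    centers = lloyd_kmeans(sample, k, seed=seed)
    labels = [nearest(p, centers) for p in points]
    return centers, labels


if __name__ == "__main__":
    rng = random.Random(1)
    # Two synthetic, well-separated blobs (illustrative data only).
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]
    data += [(rng.gauss(100, 1), rng.gauss(100, 1)) for _ in range(5000)]
    centers, labels = sampled_kmeans(data, k=2)
    print(centers)
```

In this sketch the expensive iterations touch only the sample, so their cost scales with the sample size rather than the full dataset; the single labelling pass at the end is the only step that visits every point.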

Keywords

K-means, Parallel computing, K-D trees, Clustering

URI

http://hdl.handle.net/10222/76345

Collections

Faculty of Graduate Studies Online Theses
