Repository logo
 

Fast K-Means Clustering Via K-D Trees, Sampling, and Parallelism

Date

2019-08-29T14:19:08Z

Authors

Crowell, Thomas

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

K-means is a commonly used method for clustering in applications that require fast response time due to its speed. As data becomes large (millions of data points), the classical implementation may not achieve the performance necessary for these applications. By combining the filtering algorithm using k-d trees, aggressive sampling, and parallelism with dynamic load balancing, we implement a version of k-means that outperforms the standard algorithms used for these applications. We find that aggressive sampling at 1% of the dataset combined with the filtering algorithm provides significant speed-up without sacrificing accuracy. Overheads in implementing parallel methods prevent significant speed-up on smaller datasets, especially when the data has already been sampled, but our experiments show that this improves as the dataset grows.

Description

Keywords

K-means, Parallel computing, K-D trees, Clustering

Citation