Fast K-Means Clustering Via K-D Trees, Sampling, and Parallelism
MetadataShow full item record
K-means is a commonly used method for clustering in applications that require fast response time due to its speed. As data becomes large (millions of data points), the classical implementation may not achieve the performance necessary for these applications. By combining the filtering algorithm using k-d trees, aggressive sampling, and parallelism with dynamic load balancing, we implement a version of k-means that outperforms the standard algorithms used for these applications. We find that aggressive sampling at 1% of the dataset combined with the filtering algorithm provides significant speed-up without sacrificing accuracy. Overheads in implementing parallel methods prevent significant speed-up on smaller datasets, especially when the data has already been sampled, but our experiments show that this improves as the dataset grows.