NC-Link A New Linkage Method for Efficient Hierarchical Clustering of large scale Data
In various disciplines, hierarchical clustering (HC) has been an effective tool for data analysis due to its ability to summarize hierarchical structures of data in an intuitive and interpretable manner. A run of HC requires multiple iterations, each of which needs to compute and update the pairwise distances between all intermediate clusters. This makes the exact algorithm for HC inevitably suffer from quadratic time and space complexities. To address large-scale data, various approximate/parallel algorithms have been proposed to reduce the computational cost of HC. However, such algorithms still rely on conventional linkage methods (such as single, centroid, average, complete, or Ward's) for dening pairwise distances, mostly focusing on the approximation/parallelization of linkage computations. Given that the choice of linkage profoundly affects not only the quality but also the efciency of HC, we propose a newlinkage method named NC-link and design an exact algorithm for NC-link-based HC. To guarantee the exactness, the proposed algorithm maintains the quadratic nature in time complexity but exhibits only linear space complexity, thereby allowing us to address million-object data on a personal computer. To underpin the extensibility of our approach, we showthat the algorithmic nature of NC-link enables single instruction multiple data (SIMD) parallelization and subquadratic-time approximation of HC. To verify our proposal, we thoroughly tested it with a number of large-scale real and synthetic data sets. In terms of efciency, NC-link allowed us to perform HC substantially more space efciently or faster than conventional methods: compared with average and complete linkages, using NC-link incurred only 0.7Percent1.75Percent of the memory usage, and the NC-linkbased implementation delivered speedups of approximately 3.5 times over the centroid andWard's linkages. With regard to clustering quality, the proposed method was able to retrieve hierarchical structures from input data as faithfully as in the popular average and centroid linkage methods. We anticipate that the existing approximation/parallel algorithms will be able to benet from adopting NC-link as their linkage method for obtaining better clustering results and reduced time and space demands.
Clustering algorithm, data mining, machine learning.