Minimum spanning tree-based cluster analysis: A new algorithm for determining inconsistent edges


Şaar F., Topcu A. E.

Concurrency and Computation: Practice and Experience, 2021 (Journal Indexed in SCI Expanded) identifier

  • Publication Type: Article / Article
  • Publication Date: 2021
  • Doi Number: 10.1002/cpe.6717
  • Title of Journal : Concurrency and Computation: Practice and Experience
  • Keywords: Euclidean distance, graph-based clustering algorithm, inconsistent edges, MST

Abstract

© 2021 John Wiley & Sons, Ltd.In recent years, graph-based data clustering algorithms have become popular as they perform connectivity-based rather than centroid-based partitioning. Methods related to minimum spanning tree (MST)-based data clustering are types of graph-based algorithms that can recognize arbitrary shapes of clusters by eliminating inconsistent edges from MST graphs. In all MST-based data clustering algorithms, definition of inconsistent edges is the main problem that needs to be addressed. The longest edges in MST graphs are considered as inconsistent edges under ideal conditions. Nevertheless, outliers often exist in real-world tasks, which makes the longest edges inaccurate cluster separation indicators. In this paper, we propose a new data clustering algorithm using MST and a critical distance method. The proposed algorithm solves the main issue of MST-based data clustering, namely identifying and removing inconsistent edges to obtain clusters even in the event that the dataset contains some outliers. It begins by constructing the MST over a given weighted graph based on Euclidean distance and then splits up the graph into clusters by eliminating inconsistent edges using critical distance as a threshold. Integration of the advantages of both MST and critical distance methodology to obtain optimal clusters is the main contribution of this work. The conducted experimental analysis and results using different datasets prove that our proposed clustering algorithm yields better overall performance compared with the most common data clustering algorithms. Taking the Liver and Tumor datasets as an example, the proposed algorithm outperforms all other clustering algorithms with clustering accuracy equal to 0.579 and 0.660, respectively.