Efficient density and cluster based incremental outlier detection in data streams


DEĞİRMENCİ A., KARAL Ö.

Information Sciences, vol.607, pp.901-920, 2022 (SCI-Expanded)

  • Publication Type: Article / Article
  • Volume: 607
  • Publication Date: 2022
  • Doi Number: 10.1016/j.ins.2022.06.013
  • Journal Name: Information Sciences
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Aerospace Database, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Computer & Applied Sciences, INSPEC, Library, Information Science & Technology Abstracts (LISTA), Metadex, MLA - Modern Language Association Database, zbMATH, Civil Engineering Abstracts
  • Page Numbers: pp.901-920
  • Keywords: Core KNN, Data stream, DBSCAN, Incremental learning, LOF, Outlier detection
  • Ankara Yıldırım Beyazıt University Affiliated: Yes

Abstract

© 2022 Elsevier Inc. This paper presents a novel, parameter-free, incremental local density and cluster-based outlier factor (iLDCBOF) method that unifies incremental versions of the local outlier factor (LOF) and density-based spatial clustering of applications with noise (DBSCAN) to detect outliers efficiently in data streams. iLDCBOF has the following advantages over previously reported iLOF-based studies: (1) it is based on a newly developed core k-nearest neighbor (CkNN) concept that detects outliers in data streams reliably and scalably and prevents outliers from being clustered; (2) it uses a newly developed algorithm that automatically adjusts the value of k (the number of neighbors) for different real-time applications; and (3) it uses the Mahalanobis distance metric, so its performance is not affected even for large amounts of data. The iLDCBOF method is well suited to different data stream applications because it makes no distribution assumptions, its parameters are determined automatically, and it is easy to implement. ROC-AUC and statistical test results from extensive experiments on 16 real-world datasets showed that iLDCBOF significantly outperformed the benchmark methods.
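
For readers unfamiliar with the building blocks named in the abstract, the following is a minimal sketch of the classic batch LOF score computed with a Mahalanobis distance. It is not the authors' iLDCBOF: it is not incremental, it does not implement the CkNN concept or the automatic selection of k, and it has no DBSCAN step. The function names, the restriction to 2-D points, the fixed k, and the covariance estimated once from the whole batch are all assumptions made for illustration.

```python
import math


def make_mahalanobis(points):
    """Return a Mahalanobis distance function for 2-D points.

    Assumption: the covariance is estimated once from the whole batch
    and must be non-singular (the points are not all collinear).
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    # Inverse of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det

    def dist(a, b):
        dx, dy = a[0] - b[0], a[1] - b[1]
        return math.sqrt(ixx * dx * dx + 2 * ixy * dx * dy + iyy * dy * dy)

    return dist


def lof_scores(points, k):
    """Classic batch LOF score for every point (hypothetical helper).

    Assumes distinct points and k < len(points). Scores near 1 suggest
    an inlier; clearly larger scores suggest an outlier.
    """
    dist = make_mahalanobis(points)
    n = len(points)
    # k nearest neighbours of each point as (distance, index), self excluded.
    neigh = []
    for i in range(n):
        ranked = sorted(
            (dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neigh.append(ranked[:k])
    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neigh]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    # A 3x3 grid of inliers plus one distant point.
    data = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
    for point, score in zip(data, lof_scores(data, k=3)):
        print(point, round(score, 2))
```

In this toy data the distant point receives the largest score. A streaming method such as the one described in the abstract would additionally have to update neighbourhoods and densities as each new point arrives instead of recomputing them from scratch.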