Evolutionary feature selection for machine learning based malware classification


Kale G., BOSTANCI G. E., ÇELEBİ F. V.

Engineering Science and Technology, an International Journal, vol.56, 2024 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 56
  • Publication Date: 2024
  • DOI Number: 10.1016/j.jestch.2024.101762
  • Journal Name: Engineering Science and Technology, an International Journal
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, INSPEC, Directory of Open Access Journals
  • Keywords: Feature selection, Machine learning, Malware analysis, Malware classification, Multi-objective genetic algorithms
  • Ankara Yıldırım Beyazıt University Affiliated: Yes

Abstract

Thorough research, analysis, and detection of malware using well-chosen parameters are crucial for safeguarding a country's security and economy. Increasingly sophisticated cyber-attacks directly affect individual welfare, social dynamics, and political stability. Because malware continuously evolves to evade detection, it is all the more essential to select effective and decisive parameters while accounting for the interactions among malware features. As malware adopts new technologies and techniques, signature-based detection systems are becoming inadequate. Rather than relying on these widely used but insufficient systems, this study establishes a new system that focuses on malware behavior and on the relationships between the malware features that result from that behavior. In this system, multi-objective genetic algorithms (MOGAs), rather than a uniform approach, are employed to select critical and decisive features for malware detection. The selected features are then used by machine learning (ML) algorithms within the implemented hybrid framework to detect and classify malware. The aim of this paper is to identify the feature selection and classification methods that yield the highest accuracy within the Cuckoo Sandbox environment. Specifically, the J48 Decision Tree (J48), Reduced Error Pruning Tree (REP Tree), Adaptive Boosting Model 1 (AdaBoostM1), Multilayer Perceptron (MLP), and Naive Bayes (NB) classifiers were assessed. In our analysis, the feature set was reduced from 335 to 200 features, taking inter-feature relationships into account, which yielded a peak accuracy of 93.33% and a corresponding 40% performance enhancement attributable to the smaller number of features. The obtained metrics were compared and evaluated with respect to the employed algorithms and methodologies.
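
As an illustration of the multi-objective selection idea described above, the following is a minimal sketch, not the authors' implementation. The abstract does not state which objectives or which MOGA variant were used; the sketch assumes two objectives (maximize classifier accuracy, minimize the number of selected features) and shows only the Pareto-dominance filtering that multi-objective genetic algorithms typically rely on. The names `Candidate`, `dominates`, and `pareto_front` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """One feature subset (hypothetical structure, for illustration only)."""
    mask: Tuple[int, ...]  # one 0/1 entry per feature; 1 means the feature is kept
    accuracy: float        # accuracy measured with this subset (assumed given)


def n_features(c: Candidate) -> int:
    """Number of features kept by this candidate."""
    return sum(c.mask)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one.

    Assumed objectives: higher accuracy is better, fewer features is better.
    """
    no_worse = a.accuracy >= b.accuracy and n_features(a) <= n_features(b)
    strictly_better = a.accuracy > b.accuracy or n_features(a) < n_features(b)
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


if __name__ == "__main__":
    # Toy population with made-up accuracies, purely to exercise the functions.
    population = [
        Candidate((1, 1, 1, 1), 0.90),
        Candidate((1, 1, 0, 0), 0.90),
        Candidate((1, 0, 0, 0), 0.80),
        Candidate((1, 1, 1, 0), 0.85),
    ]
    for c in pareto_front(population):
        print(n_features(c), c.accuracy)
```

In a full genetic algorithm, this filter would be combined with crossover and mutation of the feature masks across generations; those steps are omitted here because the abstract gives no details about them.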
Additionally, McNemar's test was used to evaluate the different malware detection classifiers by comparing their correct and incorrect classifications. Notably, McNemar's test revealed significant improvements.
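
The comparison of correct and incorrect classifications can be sketched as follows. This is a minimal example of the exact (binomial) form of McNemar's test, assuming two classifiers evaluated on the same samples. The abstract does not say which variant of the test was used, and the function names `discordant_counts` and `mcnemar_exact` are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(y_true: Sequence, pred_a: Sequence,
                      pred_b: Sequence) -> Tuple[int, int]:
    """Count the samples on which exactly one of the two classifiers is correct.

    Returns (b, c): b = A correct and B wrong, c = A wrong and B correct.
    Samples where both are right or both are wrong do not enter the test.
    """
    b = c = 0
    for truth, a, p in zip(y_true, pred_a, pred_b):
        if a == truth and p != truth:
            b += 1
        elif a != truth and p == truth:
            c += 1
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis the two classifiers have the same error rate,
    so each discordant sample falls in b or c with probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreement, hence no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy labels and predictions, made up purely to exercise the functions.
    y_true = [1, 1, 0, 0]
    pred_a = [1, 0, 0, 1]
    pred_b = [1, 1, 1, 1]
    b, c = discordant_counts(y_true, pred_a, pred_b)
    print(b, c, mcnemar_exact(b, c))
```

A small p-value suggests that the two classifiers differ in error rate on the shared test set; with few discordant samples the exact form is generally preferred over the chi-squared approximation.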