Performance and Explainability of Machine Learning Models in Phishing Detection Using SHAP

Dhurgham Kareem Gharkan

doi:10.23851/mjs.v36i4.1707

Authors

Dhurgham Kareem Gharkan, Department of Cybersecurity Techniques, Technical Institute Kut, Middle Technical University, Baghdad, Iraq. ORCID: https://orcid.org/0000-0002-0862-2977

DOI:

https://doi.org/10.23851/mjs.v36i4.1707

Keywords:

Phishing attack, Machine learning, SHAP, Feature selection, Cyber-attack detection

Abstract

Background: Phishing is a common cybercrime, also regarded as a social crime, that has persisted for more than two decades. It aims to trick users into revealing private information such as banking details, passwords, and account credentials. Phishing remains a real threat and usually occurs via instant messages, email, or phone calls. Many phishing detection methods have been used recently, but they do not provide a complete understanding of how different features affect predictions. Objective: To apply SHAP analysis to the detection system in order to uncover the features that contribute most to phishing detection and to identify a set of key features. Methods: Several machine learning techniques were combined with SHAP (SHapley Additive exPlanations), which enhanced the classification model. This paper proposes a fast model based on a set of contemporary machine learning techniques. Results: Experiments showed that the proposed model achieved a maximum accuracy of 99.1% with K-NN and 98.5% with XGBoost on the Phishing_Legitimate_full dataset. K-NN demonstrated superior performance and interpretability, which is essential for security-critical applications. Conclusions: The results highlight the balance between predictive performance and interpretability. The explanations provide valuable transparency into the decision-making process, making the proposed approach a more practical choice for real-world phishing detection systems, where reliability and interpretability are critical.
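The idea behind SHAP mentioned in the Methods can be illustrated with a minimal sketch that computes exact Shapley values for a tiny K-NN classifier using only the Python standard library. This is not the author's pipeline and does not use the Phishing_Legitimate_full dataset: the three binary features (has_ip_address, has_at_symbol, uses_https), the toy training rows, and all function names are hypothetical, and the missing features of a coalition are filled in from a background set, which is one common convention assumed here.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy data, NOT the paper's dataset.
# Columns: [has_ip_address, has_at_symbol, uses_https]; label 1 = phishing.
X_train = [
    [1, 1, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1],
    [0, 0, 1], [0, 0, 0], [0, 1, 1], [0, 0, 1],
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]


def knn_predict_proba(x, k=3):
    """Fraction of the k nearest training rows that are labeled phishing."""
    # Squared Euclidean distance; ties are broken by training-row index.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, row)), i)
        for i, row in enumerate(X_train)
    )
    return sum(y_train[i] for _, i in dists[:k]) / k


def coalition_value(x, subset, background):
    """Mean model output when features in `subset` come from x
    and the remaining features come from each background row."""
    total = 0.0
    for b in background:
        mixed = [x[j] if j in subset else b[j] for j in range(len(x))]
        total += knn_predict_proba(mixed)
    return total / len(background)


def shapley_values(x, background):
    """Exact Shapley values by enumerating every feature coalition."""
    n = len(x)
    phi = [0.0] * n
    for j in range(n):
        others = [i for i in range(n) if i != j]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = (coalition_value(x, s | {j}, background)
                        - coalition_value(x, s, background))
                phi[j] += weight * gain
    return phi


x = [1, 1, 0]                     # instance to explain
background = X_train              # assumption: training set as background
prediction = knn_predict_proba(x)
base_value = coalition_value(x, set(), background)
phi = shapley_values(x, background)

for name, value in zip(["has_ip_address", "has_at_symbol", "uses_https"], phi):
    print(f"{name}: {value:+.4f}")
print(f"base value + sum of contributions = {base_value + sum(phi):.4f}")
print(f"model output for x              = {prediction:.4f}")
```

The last two printed lines should agree: the per-feature contributions add up to the difference between the model output for the instance and the average output over the background set. Exact enumeration grows exponentially with the number of features, so with a realistic feature set an approximation such as those provided by the shap library would typically be used instead.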
