Enhancing URL-based phishing detection using hybrid data balancing and advanced machine learning classifiers

Authors

  • D.T. Chinyio, Department of Computer Science, Nigerian Defence Academy, Kaduna, Nigeria
  • Olalekan J. Awujoola, Department of Computer Science, Nigerian Defence Academy, Kaduna, Nigeria, https://orcid.org/0000-0002-1842-021X
  • Olayinka R. Adelegan, Department of Computer Science, Nigerian Defence Academy, Kaduna, Nigeria
  • Chat D. Chinyio, Department of Computer Science, Nigerian Defence Academy, Kaduna, Nigeria
  • Halima Aminu, Department of Computer Science, Aliko Dangote University of Science and Technology, Wudil, Nigeria

DOI:

https://doi.org/10.64497/jssci.109

Keywords:

SMOTE, Phishing, Cybersecurity, Machine Learning, Resampling

Abstract

Phishing remains a significant cybersecurity threat that demands accurate and efficient detection systems. This study evaluates six advanced machine learning classifiers for URL-based phishing detection: Random Forest, J48 Decision Tree, K-Nearest Neighbors (KNN), Naive Bayes, XGBoost, and Artificial Neural Networks (ANN). The focus is on overcoming the challenge of imbalanced datasets. To address class imbalance, a hybrid data balancing technique that combines SMOTE (Synthetic Minority Over-sampling Technique) with resampling is applied to improve model generalization and detection performance. The classifiers are assessed on both the original and the balanced datasets using accuracy, Kappa statistic, precision, recall, F1-score, and computational efficiency (training and testing times). Results show that data balancing improves model performance, particularly in detecting the minority (phishing) class. Random Forest, XGBoost, and J48 emerged as the top performers. On the balanced dataset, Random Forest achieved 98.20% accuracy (Kappa = 96.40%) with shorter training (1.30 s) and testing (0.03 s) times, and maintained precision, recall, and F1-scores of 0.98 for both classes. XGBoost achieved the highest accuracy (98.55%) and Kappa (97.10%), with substantially reduced computational times after balancing. J48 achieved 96.70% accuracy with minimal processing overhead. The findings indicate that combining robust classifiers with hybrid balancing improves both the accuracy and the efficiency of phishing detection. The approach offers a reliable, low-latency solution suitable for real-time deployment and contributes empirical insights into model selection and data preprocessing for intelligent anti-phishing systems.
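
The abstract does not say how SMOTE and resampling are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a binary problem in which the minority (phishing) class is first oversampled with SMOTE-style interpolation and the majority class is then randomly resampled down to the same size. The helper names `smote`, `hybrid_balance`, and `cohen_kappa`, the toy data, and the target size are all hypothetical.

```python
import math
import random
from collections import Counter


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def hybrid_balance(X, y, minority_label, target_per_class, k=5, seed=0):
    """Assumed hybrid scheme: SMOTE the minority class up to target_per_class,
    then randomly sample the majority class down to the same size."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    majority = [(x, label) for x, label in zip(X, y) if label != minority_label]
    n_new = max(0, target_per_class - len(minority))
    minority = minority + smote(minority, n_new, k, rng)
    majority = rng.sample(majority, min(target_per_class, len(majority)))
    rows = [(x, minority_label) for x in minority] + majority
    rng.shuffle(rows)
    return [x for x, _ in rows], [label for _, label in rows]


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Toy data (not from the study): 90 legitimate URLs (label 0) and
# 10 phishing URLs (label 1), each described by two made-up features.
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(90)]
X += [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(10)]
y = [0] * 90 + [1] * 10

X_bal, y_bal = hybrid_balance(X, y, minority_label=1, target_per_class=50)
print(sorted(Counter(y_bal).items()))  # [(0, 50), (1, 50)]
```

With two equally sized classes and errors split evenly between them, chance agreement is 0.5, so Kappa reduces to 2 × accuracy − 1. That relationship is consistent with the reported pairs (98.20% accuracy with Kappa 96.40%, and 98.55% with 97.10%), although the actual confusion matrices are not given here.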

Published

2025-09-21

How to Cite

Chinyio, D., Awujoola, O. J., Adelegan, O. R., Chinyio, C. D., & Aminu, H. (2025). Enhancing URL-based phishing detection using hybrid data balancing and advanced machine learning classifiers. Journal of Statistical Sciences and Computational Intelligence, 1(3), 248–261. https://doi.org/10.64497/jssci.109