Enhancing URL-based phishing detection using hybrid data balancing and advanced machine learning classifiers
DOI:
https://doi.org/10.64497/jssci.109
Keywords:
SMOTE, Phishing, Cybersecurity, Machine Learning, Resampling
Abstract
Phishing remains a significant cybersecurity threat that demands accurate and efficient detection systems. This study evaluates six machine learning classifiers for URL-based phishing detection: Random Forest, J48 Decision Tree, K-Nearest Neighbors (KNN), Naive Bayes, XGBoost, and Artificial Neural Networks (ANN), with a focus on overcoming the challenge of imbalanced datasets. To address class imbalance, a hybrid data balancing technique combining SMOTE (Synthetic Minority Over-sampling Technique) and resampling is applied to improve model generalization and detection performance. The classifiers are assessed on both the original and the balanced datasets using accuracy, Kappa statistic, precision, recall, F1-score, and computational efficiency (training and testing times). Results show that data balancing significantly enhances model performance, particularly in detecting the minority (phishing) class. Random Forest, XGBoost, and J48 emerged as the top performers. On the balanced dataset, Random Forest achieved 98.20% accuracy (Kappa = 96.40%), with improved training (1.30 s) and testing (0.03 s) times, and maintained precision, recall, and F1-scores of 0.98 for both classes. XGBoost showed the highest accuracy (98.55%) and Kappa (97.10%), with substantially reduced computational times after balancing. J48 achieved 96.70% accuracy with minimal processing overhead. The findings indicate that combining robust classifiers with hybrid balancing techniques improves both the accuracy and the efficiency of phishing detection. The approach offers a reliable, low-latency solution suitable for real-time deployment and provides empirical guidance on model selection and data preprocessing for anti-phishing systems.
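As a rough illustration of the hybrid balancing idea described in the abstract, the following minimal sketch oversamples the minority class by SMOTE-style interpolation and then randomly undersamples the majority class. It is not the authors' implementation. The abstract does not say what "resampling" consists of, so random undersampling to the midpoint of the two class sizes is an assumption made here, and the function names, the toy two-dimensional features, and the class sizes are all hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def hybrid_balance(majority, minority, seed=0):
    """Assumed hybrid scheme (not taken from the paper): oversample the
    minority class up to the midpoint of the two class sizes, then randomly
    undersample the majority class down to the same size."""
    target = (len(majority) + len(minority)) // 2
    new_points = smote_like_oversample(minority, target - len(minority), seed=seed)
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, target)
    return kept_majority, minority + new_points


# Toy data with hypothetical two-dimensional URL features.
legit = [(float(i), float(i % 7)) for i in range(90)]   # majority class
phish = [(100.0 + i, 50.0 + i) for i in range(10)]      # minority class

balanced_legit, balanced_phish = hybrid_balance(legit, phish)
print(len(balanced_legit), len(balanced_phish))  # prints: 50 50
```

Because each synthetic point lies on a segment between two existing minority samples, the oversampled phishing class stays inside the region already covered by real phishing examples. In practice a maintained library implementation would normally be preferred over a hand-written one.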
Copyright (c) 2025 D.T. Chinyio, Olalekan J. Awujoola, Olayinka R. Adelegan, Chat D. Chinyio, Halima Aminu

This work is licensed under a Creative Commons Attribution 4.0 International License.

