Box-K-Means symbolic data analysis: A k-means algorithm for boxplot variables with application to global climate variability

Authors

  • Babangida Ibrahim BABURA, Department of Applied Mathematics, Federal University of Technology Babura, Nigeria, https://orcid.org/0000-0001-8007-8657
  • Lawan Mohammad Bulama, Department of Mathematics, Federal University Dutse, Jigawa State, Nigeria
  • Dau Abba Umar, Department of Environmental Sciences, Federal University Dutse, Jigawa State, Nigeria

DOI:

https://doi.org/10.64497/jssci.12

Keywords:

Symbolic Data Analysis, Clustering, K-Means, Boxplot Variables, Climate Science, Pattern Recognition, Unsupervised Learning

Abstract

A foundational methodology for the hierarchical clustering of boxplot-valued symbolic data was established in 2006, filling a gap in the Symbolic Data Analysis (SDA) literature. A robust partitioning-based clustering method for this data type, however, has remained an open area of research. We develop a novel k-means-style algorithm designed specifically for boxplot variables, termed "Box-K-Means." The proposed algorithm defines a "mean boxplot" as the cluster prototype and uses the robust, weighted Ichino-Yaguchi distance to measure dissimilarity between objects and prototypes. To validate the method, we apply it to a large-scale, contemporary climate dataset derived from the Berkeley Earth Surface Temperature Study, in which multi-decade temperature records from thousands of global weather stations are summarized as symbolic boxplot variables. The performance of Box-K-Means is evaluated against the original hierarchical method using internal validation metrics (Silhouette Score, Davies-Bouldin Index) and, externally, using the Adjusted Rand Index (ARI) against an established climate classification. The results demonstrate that Box-K-Means provides a computationally efficient alternative for large symbolic datasets and identifies coherent clusters that agree with established climate zones. This work adds a new tool to the SDA toolkit, expanding the potential for partitioning-based analysis of complex, distribution-based data in fields such as climatology and finance.
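The loop described in the abstract (assign each boxplot to its nearest prototype, then recompute each prototype as a mean boxplot) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each object is a single boxplot variable stored as a five-number summary (min, Q1, median, Q3, max), that the mean boxplot is the component-wise mean of those five values, and that the boxplot distance is a weighted sum of Ichino-Yaguchi interval dissimilarities over the four consecutive intervals. The equal weights, the value of `gamma`, and all function names are hypothetical choices made for illustration.

```python
import random

def iy_interval(a, b, gamma=0.5):
    """Ichino-Yaguchi dissimilarity between closed intervals a=(lo, hi) and b=(lo, hi)."""
    join = max(a[1], b[1]) - min(a[0], b[0])              # length of the hull
    meet = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))    # length of the overlap
    len_a, len_b = a[1] - a[0], b[1] - b[0]
    return join - meet + gamma * (2 * meet - len_a - len_b)

def box_distance(x, y, weights=(0.25, 0.25, 0.25, 0.25), gamma=0.5):
    """Assumed boxplot distance: weighted sum over the four intervals
    [min, Q1], [Q1, median], [median, Q3], [Q3, max]."""
    return sum(w * iy_interval((x[i], x[i + 1]), (y[i], y[i + 1]), gamma)
               for i, w in enumerate(weights))

def mean_boxplot(boxes):
    """Assumed prototype: component-wise mean of the five summary values."""
    n = len(boxes)
    return tuple(sum(b[i] for b in boxes) / n for i in range(5))

def box_k_means(boxes, k, n_iter=50, seed=0):
    """k-means-style loop: assign to nearest prototype, then update prototypes."""
    rng = random.Random(seed)
    prototypes = rng.sample(boxes, k)      # initialise from k distinct objects
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: box_distance(b, prototypes[c]))
                      for b in boxes]
        if new_labels == labels:           # assignments stable: stop
            break
        labels = new_labels
        for c in range(k):
            members = [b for b, lab in zip(boxes, labels) if lab == c]
            if members:                    # keep the old prototype if a cluster empties
                prototypes[c] = mean_boxplot(members)
    return labels, prototypes

# Toy data (invented for illustration): three "cold" and three "warm" stations.
boxes = [(-10, -5, -2, 1, 5), (-12, -6, -3, 0, 4), (-11, -5, -2, 2, 6),
         (15, 20, 24, 27, 32), (14, 19, 23, 26, 30), (16, 21, 25, 28, 33)]
labels, prototypes = box_k_means(boxes, k=2)
print(labels)
```

On this toy data the cold and warm stations end up in separate clusters. A real application would carry several boxplot variables per station (summing the distance over variables) and would typically rerun the loop from several random initialisations, since k-means-style procedures only reach a local optimum.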

Published

2025-09-03

How to Cite

BABURA, B. I., Bulama, L. M., & Umar, D. A. (2025). Box-K-Means symbolic data analysis: A k-means algorithm for boxplot variables with application to global climate variability. Journal of Statistical Sciences and Computational Intelligence, 1(3), 157–165. https://doi.org/10.64497/jssci.12