Box-K-Means symbolic data analysis: A k-means algorithm for boxplot variables with application to global climate variability
DOI:
https://doi.org/10.64497/jssci.12
Keywords:
Symbolic Data Analysis, Clustering, K-Means, Boxplot Variables, Climate Science, Pattern Recognition, Unsupervised Learning
Abstract
A foundational methodology for the hierarchical clustering of boxplot-valued symbolic data was established in 2006, filling a gap in the Symbolic Data Analysis (SDA) literature. A robust partitioning-based clustering method for this data type, however, has remained an open problem. We develop a k-means-style algorithm designed specifically for boxplot variables, termed "Box-K-Means." The algorithm defines a "mean boxplot" as the cluster prototype and uses the robust, weighted Ichino-Yaguchi distance to measure dissimilarity between objects and prototypes. To validate the method, we apply it to a large-scale, contemporary climate dataset derived from the Berkeley Earth Surface Temperature Study, in which multi-decade temperature records from thousands of global weather stations are summarized as symbolic boxplot variables. Box-K-Means is compared with the original hierarchical method using internal validation metrics (Silhouette Score, Davies-Bouldin Index) and, externally, using the Adjusted Rand Index (ARI) against an established climate classification. The results indicate that Box-K-Means is a computationally efficient alternative for large symbolic datasets and identifies coherent clusters that agree with established climate zones. This work adds a new tool to the SDA toolkit, expanding the potential for partitioning-based analysis of complex, distribution-based data in fields such as climatology and finance.
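The assign-and-update loop described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: each boxplot is stored as a five-number summary (min, Q1, median, Q3, max), the "mean boxplot" is the component-wise mean of those summaries, and the dissimilarity is a plain weighted L1 distance between summaries, used here only as a simplified stand-in for the weighted Ichino-Yaguchi distance. The function names, the equal default weights, and the synthetic "station" data are all hypothetical.

```python
import numpy as np

def boxplot_summary(values):
    """Five-number summary (min, Q1, median, Q3, max) of a 1-D sample."""
    return np.percentile(values, [0, 25, 50, 75, 100])

def boxplot_distance(X, prototypes, weights):
    """Weighted L1 distance between boxplots (stand-in, NOT Ichino-Yaguchi).

    X: (n, p, 5) objects, prototypes: (k, p, 5), weights: (5,).
    Returns an (n, k) matrix of object-to-prototype distances.
    """
    diff = np.abs(X[:, None, :, :] - prototypes[None, :, :, :])
    return (diff * weights).sum(axis=(2, 3))

def box_k_means(X, k, weights=None, n_iter=100, seed=0):
    """K-means-style partitioning of boxplot-valued data (illustrative only)."""
    rng = np.random.default_rng(seed)
    weights = np.full(5, 0.2) if weights is None else np.asarray(weights, float)
    # Initialise prototypes with k distinct objects chosen at random.
    prototypes = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each object joins its nearest prototype.
        new_labels = boxplot_distance(X, prototypes, weights).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # partition is stable
        labels = new_labels
        # Update step: the prototype becomes the "mean boxplot" of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave the prototype unchanged if the cluster is empty
                prototypes[j] = members.mean(axis=0)
    return labels, prototypes

# Synthetic example: 20 "cold" and 20 "warm" stations, one boxplot variable each.
rng = np.random.default_rng(1)
cold = [boxplot_summary(rng.normal(0, 3, 200)) for _ in range(20)]
warm = [boxplot_summary(rng.normal(25, 3, 200)) for _ in range(20)]
X = np.array(cold + warm)[:, None, :]  # shape (40, 1, 5)
labels, prototypes = box_k_means(X, k=2)
```

Because each summary is sorted, the component-wise mean of several summaries is also sorted, so every prototype in this sketch remains a valid five-number summary. Swapping in a different dissimilarity only requires replacing `boxplot_distance`.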
References
[1] L. Billard and E. Diday, Symbolic Data Analysis: Conceptual Statistics and Data Mining. Hoboken, NJ, USA: John Wiley & Sons, 2006.
[2] E. Diday and M. Noirhomme-Fraiture, Eds., Symbolic Data Analysis and the SODAS Software. Hoboken, NJ, USA: John Wiley & Sons, 2018.
[3] J. Arroyo, C. Maté, and A. Muñoz-San Roque, “Hierarchical clustering for boxplot variables,” in Data Science and Classification, V. Batagelj et al., Eds. Berlin, Germany: Springer, 2006, pp. 119–126.
[4] C. Maté, “Symbolic data analysis: An overview and some applications,” Poiesis Prax, vol. 11, no. 1–2, pp. 49–69, 2015.
[5] M. Ichino, “The general model of symbolic data and the SODAS software,” Eur. J. Oper. Res., vol. 184, no. 3, pp. 936–953, 2008.
[6] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognit. Lett., vol. 31, no. 8, pp. 651–666, 2010.
[7] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications. Philadelphia, PA, USA: SIAM, 2007.
[8] S. Arlot, A. Celisse, and Z. Harchaoui, “A survey of cross-validation procedures for model selection,” Stat. Surv., vol. 4, pp. 40–79, 2010.
[9] M. Brun, C. Szafnicki, and A. Ben-Dor, “Validation of clustering,” in Computational Systems Biology, 2007, pp. 269–291.
[10] R. A. Rohde, “Berkeley Earth Temperature Averaging Process,” Berkeley Earth, Tech. Rep., 2013.
[11] R. Rohde, R. A. Muller, and E. A. Muller, “A validated, daily, half-degree, land-surface air temperature record for 1979–2019,” Earth Syst. Sci. Data, vol. 13, no. 1, pp. 227–243, 2021.
[12] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” J. Comput. Appl. Math., vol. 20, pp. 53–65, 1987.
[13] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Trans. Pattern Anal. Mach. Intell., no. 1, pp. 224–227, 1979.
[14] M. Kottek, J. Grieser, C. Beck, B. Rudolf, and F. Rubel, “World map of the Köppen-Geiger climate classification updated,” Meteorol. Z., vol. 15, no. 3, pp. 259–263, 2006.
[15] L. Hubert and P. Arabie, “Comparing partitions,” J. Classif., vol. 2, no. 1, pp. 193–218, 1985.
[16] F. Rubel, K. Brugger, K. Haslinger, and I. Auer, “The climate of the European Alps: Shift of very high resolution Köppen-Geiger climate zones 1981–2010,” Meteorol. Z., vol. 26, no. 2, pp. 115–125, 2017.
[17] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” J. Mach. Learn. Res., vol. 11, pp. 2837–2854, 2010.
[18] R. M. de Souza and F. de A. T. de Carvalho, “Clustering symbolic data,” in Symbolic Data Analysis and the SODAS Software, 2019, pp. 259–324.
[19] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. 5th Berkeley Symp. Math. Stat. Probab., vol. 1, 1967, pp. 281–297.
License
Copyright (c) 2025 Babangida Ibrahim BABURA, Lawan Muhammad Bulama, Dau Abba Umar

This work is licensed under a Creative Commons Attribution 4.0 International License.

