Robust clustering in (moderately) high dimensional cases
Departamento de Estadística e I.O. and IMUVA, University of Valladolid [lagarcia@uva.es]
Keywords: Robust clustering – Trimming – Model-based clustering – Cellwise contamination
Abstract
Outliers can negatively impact Cluster Analysis. One might view outliers as separate clusters, leading to the idea that simply increasing the number of clusters, k, could be a natural way to manage them. However, this approach is often not the best strategy and can even be completely impractical. Consequently, several robust clustering techniques have been developed to address this issue. These techniques are also valuable for identifying potentially meaningful anomalies in data, particularly when working with datasets that naturally include different subpopulations. In this talk, we will focus exclusively on robust clustering methods based on trimming [see García-Escudero and Mayo-Iscar, 2024, for a recent review], with particular emphasis on TCLUST [García-Escudero et al., 2008], an extension of the well-known MCD method [Rousseeuw, 1985] to Cluster Analysis.
TCLUST, along with the algorithms and packages used for its implementation, is highly reliable for low-dimensional data. However, in Statistics, problems involving high-dimensional data are becoming increasingly common, and TCLUST’s performance is known to deteriorate significantly as dimensionality increases. This presentation will address this challenge and explore some promising initial solutions, particularly for moderately high-dimensional cases.
The main difficulty in applying TCLUST to high-dimensional data lies in the large number of parameters required to handle the scatter matrices of the fitted components. A reasonable approach to “regularizing” the TCLUST objective function is to constrain the maximum ratio between the eigenvalues of these scatter matrices to be close to 1. However, this regularization limits the detectable clusters to spherical shapes with equal dispersion, which can be overly restrictive.
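To make the eigenvalue-ratio idea concrete, the following is a minimal sketch in Python (not the TCLUST implementation, which enforces the constraint through an optimal truncation): eigenvalues of the fitted scatter matrices are simply clipped so that the largest-to-smallest ratio across all clusters does not exceed a constant c. Setting c close to 1 forces the nearly spherical, equally dispersed clusters mentioned above.

```python
import numpy as np

def constrain_scatter_matrices(scatters, c=10.0):
    """Clip eigenvalues so that the ratio between the largest and
    smallest eigenvalue across all cluster scatter matrices is at
    most c (simplified scheme for illustration only; TCLUST uses
    an optimal truncation of the eigenvalues instead)."""
    eigvals, eigvecs = [], []
    for S in scatters:
        w, V = np.linalg.eigh(S)   # spectral decomposition S = V diag(w) V'
        eigvals.append(w)
        eigvecs.append(V)
    all_w = np.concatenate(eigvals)
    lo = all_w.max() / c           # smallest admissible eigenvalue
    constrained = []
    for w, V in zip(eigvals, eigvecs):
        w_clip = np.clip(w, lo, None)
        constrained.append(V @ np.diag(w_clip) @ V.T)
    return constrained
```

With c = 1 every scatter matrix collapses to a common multiple of the identity, which illustrates why this regularization alone can be overly restrictive.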
An alternative approach, as dimensionality increases, is to assume that the different clusters are grouped around subspaces of dimension lower than that of the ambient space. This approach is employed in the Robust Linear Grouping method [García-Escudero et al., 2009], which can be viewed as a simultaneous clustering and dimensionality reduction technique. However, Robust Linear Grouping disregards the information carried by the specific “coordinates” of the projections of the observations onto the approximating subspaces, since only the orthogonal errors enter the objective function.
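The role of the orthogonal errors can be sketched as follows (a hypothetical helper, assuming each cluster's subspace is represented by a center and an orthonormal basis): each observation is split into its within-subspace coordinates and an orthogonal residual, and only the latter is used for assignment and trimming.

```python
import numpy as np

def orthogonal_errors(X, center, basis):
    """Squared orthogonal distance from each row of X to the affine
    subspace spanned by the columns of `basis` (d x q, orthonormal)
    through `center`. Robust Linear Grouping assigns and trims
    observations using these errors only; the within-subspace
    coordinates `proj` are computed but never used."""
    Z = X - center
    proj = Z @ basis               # coordinates inside the subspace
    residual = Z - proj @ basis.T  # orthogonal part of each observation
    return (residual ** 2).sum(axis=1)
```

A robust extension of HDDC, by contrast, also models the within-subspace variability, which is what allows a compromise between TCLUST and Robust Linear Grouping.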
To find a compromise between TCLUST and Robust Linear Grouping, a robust extension of the HDDC method [Bouveyron et al., 2007] is proposed through trimming and appropriate constraints. An algorithm for implementing this methodology will be introduced, and its application will be illustrated with examples.
When addressing increasing dimensionality in robust clustering based on trimming, additional non-trivial aspects must be considered. One key issue is the proper initialization of the concentration steps typically applied at the algorithmic level. While random initialization is feasible in principle, ensuring a reliable starting point would require many random initializations, highlighting the need for improved schemes. Another important aspect is incorporating cellwise trimming rather than just casewise trimming, as discarding entire rows can remove too much valuable information. Proposals to address these two issues will be presented. Finally, note that extremely high-dimensional cases are not considered. Handling very high dimensions is challenging, even in the absence of contamination, and often requires sparsity assumptions. Some interesting approaches in this direction, such as those by Kondo et al. [2016] and Brodinová et al. [2019], will be briefly discussed.
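A minimal sketch of one casewise-trimmed concentration step of the kind discussed above (illustrative only, using spherical distances to fixed centers rather than the full model-based criterion): observations are assigned to their closest center and the fraction alpha with the largest distances is discarded. Cellwise trimming would instead flag individual entries of a row rather than the entire row.

```python
import numpy as np

def trimmed_assignment(X, centers, alpha=0.1):
    """One concentration step with casewise trimming: assign each
    observation to its closest center, then discard (label -1) the
    ceil(alpha * n) observations with the largest distances."""
    # pairwise distances between the n observations and the K centers
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    best = d.min(axis=1)
    n_trim = int(np.ceil(alpha * len(X)))
    if n_trim > 0:
        labels[np.argsort(best)[-n_trim:]] = -1  # trimmed observations
    return labels
```

Because many such steps start from random initial solutions, a poor start can absorb outliers into the clusters, which is why improved initialization schemes matter.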
References
- Bouveyron et al. (2007). High-dimensional data clustering. Computational Statistics and Data Analysis 52, pp. 502–519.
- Brodinová et al. (2019). Robust and sparse k-means clustering for high-dimensional data. Advances in Data Analysis and Classification 13, pp. 905–932.
- García-Escudero et al. (2008). A general trimming approach to robust cluster analysis. Annals of Statistics 36, pp. 1324–1345.
- García-Escudero et al. (2009). Robust linear clustering. Journal of the Royal Statistical Society, Series B: Statistical Methodology 71, pp. 301–318.
- García-Escudero and Mayo-Iscar (2024). Robust clustering based on trimming. Wiley Interdisciplinary Reviews: Computational Statistics 16, e1658.
- Kondo et al. (2016). RSKC: an R package for a robust and sparse k-means clustering algorithm. Journal of Statistical Software 72, pp. 1–26.
- Rousseeuw (1985). Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications, W. Grossmann, G. Pflug, I. Vincze, and W. Wertz (Eds.), Vol. B, pp. 283–297.