Regression for Interval-Valued Variables: from Ordinary Least Squares to Robust Alternatives

M.R. Oliveira¹ and C. Amado¹

¹ CEMAT & Department of Mathematics, Instituto Superior Técnico, Lisbon, Portugal
[rosario.oliveira@tecnico.ulisboa.pt, conceicao.amado@tecnico.ulisboa.pt]

Keywords: Symbolic Data Analysis – Mallows’ Distance – Inertia Decomposition – Moore’s Algebraic Structure

Abstract

The rapid growth in data availability has given rise to new data structures and to a demand for more advanced statistical methodologies. In response, Symbolic Data Analysis (SDA) has emerged as a promising field within Statistics, aimed at handling the complexities of some of these data structures.

SDA focuses on data characterized by internal variability, with interval-valued data and histogram-valued data being key representations. At its core, it builds on statistical methods (exploratory and inferential) to learn patterns from individual observations, known as microdata, based on aggregate observations, known as macrodata. Data aggregation can arise for various reasons, including sample size constraints, privacy concerns, specific research objectives, or as a natural consequence of the data collection process.

The increasing presence of new data types has introduced new theoretical and methodological challenges. Among these, classical and robust regression methods for symbolic data remain an open area of research, requiring novel approaches to model relationships effectively in the presence of internal variation. Addressing these challenges is essential for advancing statistical learning techniques and improving the modelling of complex datasets in modern data science.

In this work, we examine the interval data model, introduced in Oliveira et al. [2024], which establishes the relationship between macrodata and microdata and leads to a general formalization of Mallows’ distance. By employing Moore’s definition of linear combination (see Girão Serrão et al. [2023] for further details) and utilizing the inertia decomposition of Mallows’ distance for interval data, we explicitly derive the ordinary least-squares estimators of the regression coefficients.
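To make the ingredients above concrete, here is a minimal sketch in Python. It rests on assumptions that the abstract does not state: microdata are taken to be uniformly distributed within each interval, in which case the squared Mallows distance between two intervals reduces to the squared difference of centers plus one third of the squared difference of half-ranges. The general formalization in Oliveira et al. [2024] may use a different constant. The function names (`mallows_sq`, `moore_linear_combination`, `ls_objective`) are hypothetical, and the objective is only one natural reading of a least-squares criterion, not the authors' derivation or their explicit estimators.

```python
import numpy as np

# Weight of the half-range term when microdata are uniform within each
# interval (an assumption; other microdata models change this constant).
DELTA_UNIFORM = 1.0 / 3.0


def mallows_sq(c1, r1, c2, r2, delta=DELTA_UNIFORM):
    """Squared Mallows distance between [c1 - r1, c1 + r1] and
    [c2 - r2, c2 + r2], given centers c and half-ranges r."""
    return (c1 - c2) ** 2 + delta * (r1 - r2) ** 2


def moore_linear_combination(beta0, beta, centers, half_ranges):
    """Interval beta0 + sum_j beta_j * X_j under Moore's interval arithmetic.

    centers, half_ranges: arrays of shape (n, p).
    Returns the centers and half-ranges of the n resulting intervals.
    Multiplying an interval by beta_j scales its center by beta_j and its
    half-range by |beta_j|; adding the scalar beta0 only shifts the center.
    """
    beta = np.asarray(beta, dtype=float)
    return beta0 + centers @ beta, half_ranges @ np.abs(beta)


def ls_objective(beta0, beta, cx, rx, cy, ry, delta=DELTA_UNIFORM):
    """Sum of squared Mallows distances between observed and fitted intervals."""
    c_hat, r_hat = moore_linear_combination(beta0, beta, cx, rx)
    return float(np.sum(mallows_sq(cy, ry, c_hat, r_hat, delta)))


# Intervals [0, 2] and [1, 5]: centers 1 and 3, half-ranges 1 and 2,
# so the squared distance is 4 + 1/3.
print(mallows_sq(1.0, 1.0, 3.0, 2.0))
```

Because the fitted half-range depends on the absolute values of the coefficients, the objective is piecewise quadratic in the coefficients, one piece per sign pattern.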

As in the conventional ordinary least squares framework, the estimators derived in this study are sensitive to outliers. To address this limitation, we propose robust M-estimators for interval-valued regression. The performance of these robust estimators is compared with alternatives proposed by Fagundes et al. [2013] and Lima Neto and de Carvalho [2018].
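The abstract does not spell out the form of the proposed M-estimators. Purely as a generic illustration of the M-estimation idea, the sketch below fits a classical Huber regression by iteratively reweighted least squares to scalar responses (for instance, interval centers). It is not the estimator proposed here, nor those of Fagundes et al. [2013] or Lima Neto and de Carvalho [2018]. The function names, the tuning constant 1.345, and the MAD-based scale are conventional choices assumed for the example.

```python
import numpy as np


def huber_weights(residuals, c=1.345):
    """IRLS weights for Huber's loss, using a MAD-based scale estimate."""
    residuals = np.asarray(residuals, dtype=float)
    mad = np.median(np.abs(residuals - np.median(residuals)))
    scale = mad / 0.6745  # makes the MAD consistent under normal errors
    if scale == 0.0:
        return np.ones_like(residuals)  # degenerate case: equal weights
    u = np.abs(residuals) / scale
    # Weight 1 for small standardized residuals, c / u for large ones.
    return np.minimum(1.0, c / np.maximum(u, 1e-12))


def huber_irls(x, y, c=1.345, n_iter=100, tol=1e-8):
    """Huber M-estimate of (intercept, slopes) by reweighted least squares."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), x])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]  # OLS starting point
    for _ in range(n_iter):
        sw = np.sqrt(huber_weights(y - design @ beta, c))
        beta_new = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Hypothetical data: a line with small deterministic noise and one gross outlier.
x = np.arange(20.0)
y = 1.0 + 2.0 * x + 0.1 * np.sin(x)
y[10] += 150.0
print(huber_irls(x, y))
```

On data like these, the outlier receives a weight close to zero after a few iterations, so the fit stays near the line followed by the remaining points, whereas ordinary least squares shifts the intercept noticeably.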

Acknowledgments

We thank FCT - Fundação para a Ciência e Tecnologia, Portugal, for support through project UIDB/04621/2020 (DOI: 10.54499/UIDB/04621/2020).

References

  • Fagundes et al. [2013] Roberta A.A. Fagundes, Renata M.C.R. de Souza, and Francisco José A. Cysneiros. Robust regression with application to symbolic interval data. Engineering Applications of Artificial Intelligence, 26(1):564–573, 2013. ISSN 0952-1976. doi: 10.1016/j.engappai.2012.05.004.
  • Girão Serrão et al. [2023] Rodrigo Girão Serrão, M.R. Oliveira, and Lina Oliveira. Theoretical derivation of interval principal component analysis. Information Sciences, 621:227–247, 2023. ISSN 0020-0255. doi: 10.1016/j.ins.2022.11.093.
  • Lima Neto and de Carvalho [2018] Eufrásio A. Lima Neto and Francisco A.T. de Carvalho. An exponential-type kernel robust regression model for interval-valued variables. Information Sciences, 454-455:419–442, 2018. ISSN 0020-0255. doi: 10.1016/j.ins.2018.05.008.
  • Oliveira et al. [2024] M.R. Oliveira, D. Pinheiro, and L. Oliveira. Location and association measures for interval data based on Mallows’ distance. 2024. doi: 10.48550/arXiv.2407.05105.