Clusterwise methods, past and present
Instead of fitting a single global model (regression, PCA, etc.) to a set of observations, clusterwise methods simultaneously seek a partition into k clusters and k local models optimizing some criterion. There are two main approaches: (1) the least squares approach introduced by E. Diday in the 1970s, derived from k-means, and (2) mixture models using maximum likelihood; only the first easily enables prediction. After a survey of classical methods, we present recent extensions to functional, symbolic and multiblock data.
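The least squares clusterwise approach can be sketched as a k-means-style alternating procedure: assign each observation to the cluster whose local model fits it best, then refit each local model by least squares. The following is a minimal illustration with linear local models; the function name and the random-restart scheme are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def clusterwise_regression(X, y, k, n_iter=50, n_restarts=10, seed=0):
    """Alternating least squares clusterwise linear regression:
    reassign points to the best-fitting local model, then refit each model."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xb = np.column_stack([np.ones(n), np.asarray(X).reshape(n, -1)])  # intercept
    best = None
    for _ in range(n_restarts):
        labels = rng.integers(0, k, size=n)          # random initial partition
        coefs = np.zeros((k, Xb.shape[1]))
        for _ in range(n_iter):
            for j in range(k):                       # refit each local model
                mask = labels == j
                if mask.sum() >= Xb.shape[1]:
                    coefs[j], *_ = np.linalg.lstsq(Xb[mask], y[mask], rcond=None)
            resid = (Xb @ coefs.T - y[:, None]) ** 2  # squared residual per model
            new_labels = resid.argmin(axis=1)         # reassign to best model
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        sse = resid[np.arange(n), labels].sum()
        if best is None or sse < best[0]:             # keep best restart
            best = (sse, labels.copy(), coefs.copy())
    return best[1], best[2]
```

Random restarts matter here: the criterion has local optima, so a single run from a random partition can converge to a poor solution.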
Error Bounds for Piecewise Smooth and Switching Regression
The paper deals with regression problems in which the nonsmooth target is
assumed to switch between different operating modes. Specifically, piecewise
smooth (PWS) regression considers target functions switching deterministically
via a partition of the input space, while switching regression considers
arbitrary switching laws. The paper derives generalization error bounds in
these two settings by following the approach based on Rademacher complexities.
For PWS regression, our derivation involves a chaining argument and a
decomposition of the covering numbers of PWS classes in terms of the ones of
their component functions and the capacity of the classifier partitioning the
input space. This yields error bounds with a radical dependency on the number
of modes. For switching regression, the decomposition can be performed directly
at the level of the Rademacher complexities, which yields bounds with a linear
dependency on the number of modes. By using once more chaining and a
decomposition at the level of covering numbers, we show how to recover a
radical dependency. Examples of applications are given, in particular for PWS
and switching regression with linear and kernel-based component functions.
Nonparametric modeling and forecasting electricity demand: an empirical study
This paper uses half-hourly electricity demand data from South Australia as an empirical study of nonparametric modeling and forecasting methods for prediction horizons from half an hour to one year ahead. A notable feature of the univariate time series of electricity demand is the presence of both intraweek and intraday seasonalities. An intraday seasonal cycle is apparent from the similarity of demand from one day to the next, and an intraweek seasonal cycle is evident from comparing demand on the corresponding days of adjacent weeks. There is strong appeal in using forecasting methods that are able to capture both seasonalities. In this paper, the forecasting methods slice a seasonal univariate time series into a time series of curves. The forecasting methods reduce the dimensionality by applying functional principal component analysis to the observed data, and then utilize a univariate time series forecasting method and functional principal component regression techniques. When data points in the most recent curve are sequentially observed, updating methods can improve the point and interval forecast accuracy. We also revisit a nonparametric approach to constructing prediction intervals of updated forecasts, and evaluate the interval forecast accuracy.
Keywords: functional principal component analysis; functional time series; multivariate time series; ordinary least squares; penalized least squares; ridge regression; seasonal time series
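The slice-then-reduce idea can be illustrated with a toy sketch: reshape the series into one curve per seasonal cycle, extract principal component scores, forecast each score series with a simple univariate model, and reconstruct the next curve. This is only a schematic stand-in for the paper's FPCA pipeline; the function and the AR(1) score forecast are illustrative assumptions.

```python
import numpy as np

def forecast_next_curve(series, period, n_components=2):
    """Slice a seasonal series into curves (one per period), reduce with
    principal components, forecast each score series with a least squares
    AR(1) step, and reconstruct the next curve."""
    curves = np.asarray(series).reshape(-1, period)   # rows = one seasonal cycle
    mean = curves.mean(axis=0)
    centered = curves - mean
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)  # PCA via SVD
    scores = centered @ Vt[:n_components].T           # one score series per component
    next_scores = np.empty(n_components)
    for j in range(n_components):
        z = scores[:, j]
        # fit AR(1) by least squares: z_t ~= phi * z_{t-1}
        phi = (z[:-1] @ z[1:]) / (z[:-1] @ z[:-1] + 1e-12)
        next_scores[j] = phi * z[-1]
    return mean + next_scores @ Vt[:n_components]     # reconstruct the forecast curve
```

The paper's updating methods then refine such a forecast as the most recent curve is partially observed; that step is omitted here.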
Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement
We describe many vantage points on the Baire metric and its use in clustering
data, or its use in preprocessing and structuring data in order to support
search and retrieval operations. In some cases, we proceed directly to clusters
and do not directly determine the distances. We show how a hierarchical
clustering can be read directly from one pass through the data. We offer
insights also on practical implications of precision of data measurement. As a
mechanism for treating multidimensional data, including very high dimensional
data, we use random projections.
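The Baire metric on digit expansions makes the one-pass hierarchy concrete: the distance between two values is base^(-k), where k is the length of their longest common digit prefix, so grouping values by prefixes at increasing depths yields a hierarchical clustering directly. A minimal sketch (note that floating-point digit extraction can drift; string-based expansions are safer in practice):

```python
def baire_distance(x, y, precision=8, base=10):
    """Baire (longest-common-prefix) distance between two scalars in [0, 1):
    base**(-k) where k is the length of the shared prefix of their base-`base`
    digit expansions; 0 if all `precision` digits agree."""
    k = 0
    for _ in range(precision):
        dx, dy = int(x * base), int(y * base)   # next digit of each expansion
        if dx != dy:
            return base ** (-k)
        k += 1
        x, y = x * base - dx, y * base - dy     # peel off the matched digit
    return 0.0

def baire_clusters(values, depth, base=10):
    """One pass over the data: group values by their first `depth` digits.
    Nested prefixes at depths 1..n give the hierarchy directly."""
    clusters = {}
    for v in values:
        key = tuple(int(v * base ** (i + 1)) % base for i in range(depth))
        clusters.setdefault(key, []).append(v)
    return clusters
```

Reading the hierarchy off prefixes is what makes the method linear time: no pairwise distance matrix is ever formed.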
CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS
The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting was organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization whose aim is to further classification research.
Advances in robust clustering methods with applications
Robust methods in statistics are mainly concerned with deviations from model assumptions. As already pointed out in Huber (1981) and in Huber & Ronchetti (2009), "these assumptions are not exactly true since they are just a mathematically convenient rationalization of an often fuzzy knowledge or belief". For that reason, "a minor error in the mathematical model should cause only a small error in the final conclusions". Nevertheless, it is well known that many classical statistical procedures are "excessively sensitive to seemingly minor deviations from the assumptions". All statistical methods based on the minimization of the average square loss may suffer from a lack of robustness. Illustrative examples of how outliers' influence may completely alter the final results in the regression analysis and linear model context are provided in Atkinson & Riani (2012). A presentation of robust counterparts of classical multivariate tools is provided in Farcomeni & Greco (2015).
The whole dissertation is focused on robust clustering models and the outline of the
thesis is as follows.
Chapter 1 is focused on robust methods. Robust methods are aimed at increasing the efficiency when contamination appears in the sample. Thus a general definition of such a (quite general) concept is required. To do so, we give a brief account of some kinds of contamination that can be encountered in real data applications. Secondly, we introduce the "spurious outliers model" (Gallegos & Ritter 2009a), which is the cornerstone of robust model-based clustering. Such a model is aimed at formalizing clustering problems when one has to deal with contaminated samples. The assumption behind the "spurious outliers model" is that two different random mechanisms generate the data: one is assumed to generate the "clean" part, while the other generates the contamination. This idea is actually very common within robust models like the "Tukey-Huber model", which is introduced in Subsection 1.2.2. Outlier recognition, especially in the multivariate case, plays a key role and is not straightforward as the dimensionality of the data increases. An overview of the most widely used (robust) methods for outlier detection is provided in Section 1.3. Finally, in Section 1.4, we provide a non-technical review of the classical tools introduced in the robust statistics literature for evaluating the robustness properties of a methodology.
Chapter 2 is focused on model-based clustering methods and their robustness properties. Cluster analysis, "the art of finding groups in the data" (Kaufman & Rousseeuw 1990), is one of the most widely used tools within the unsupervised learning context. A very popular method is the k-means algorithm (MacQueen et al. 1967), which is based on minimizing the Euclidean distance of each observation from the estimated cluster centroids and is therefore affected by a lack of robustness. Indeed, even a single outlying observation may completely alter the centroid estimation and simultaneously provoke a bias in the standard error estimation. Cluster contours may be inflated and the "real" underlying clusterwise structure may be completely hidden. A first attempt at robustifying the k-means algorithm appeared in Cuesta-Albertos et al. (1997), where a trimming step is inserted in the algorithm in order to avoid the outliers' exceeding influence.
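The trimming idea can be sketched in a few lines, assuming Euclidean k-means with a fixed trimming proportion alpha; the function name and the restart scheme below are illustrative, not the authors' implementation.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=50, n_restarts=5, seed=0):
    """Sketch of trimmed k-means in the spirit of Cuesta-Albertos et al. (1997):
    each iteration discards the ceil(alpha*n) points farthest from their nearest
    centroid and recomputes centroids from the retained points only. The best
    of several random restarts (lowest trimmed SSE) is returned."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    n_keep = n - int(np.ceil(alpha * n))
    best = None
    for _ in range(n_restarts):
        centers = X[rng.choice(n, k, replace=False)].copy()
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d2.argmin(1)
            nearest = d2[np.arange(n), labels]
            keep = np.argsort(nearest)[:n_keep]      # retain the closest points
            new_centers = np.array([
                X[keep][labels[keep] == j].mean(0) if np.any(labels[keep] == j)
                else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        sse = nearest[keep].sum()                    # trimmed sum of squares
        if best is None or sse < best[0]:
            best = (sse, centers, labels, np.setdiff1d(np.arange(n), keep))
    return best[1], best[2], best[3]
```

Because gross outliers are excluded before the centroids are recomputed, a single extreme observation can no longer drag a centroid away from its cluster.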
It should be noticed that the k-means algorithm is efficient for detecting spherical homoscedastic clusters. Whenever more flexible shapes are desired, the procedure becomes inefficient. In order to overcome this problem, Gaussian model-based clustering methods should be adopted instead of the k-means algorithm. An example, among the other proposals described in Chapter 2, is the TCLUST methodology (García-Escudero et al. 2008), which is the cornerstone of the thesis. Such a methodology is based on two main characteristics: trimming a fixed proportion of observations and imposing a constraint on the estimates of the scatter matrices. As will be explained in Chapter 2, trimming is used to protect the results from the outliers' influence, while the constraint is needed because spurious maximizers may completely spoil the solution.
Chapters 3 and 4 are mainly focused on extending the TCLUST methodology. In particular, in Chapter 3, we introduce a new contribution (compare Dotto et al. 2015 and Dotto et al. 2016b), based on the TCLUST approach, called reweighted TCLUST, or RTCLUST for the sake of brevity. The idea behind such a method is based on reweighting the observations initially flagged as outlying. This is helpful both to gain efficiency in the parameter estimation process and to provide a reliable estimate of the true contamination level. Indeed, as TCLUST is based on trimming a fixed proportion of observations, a proper choice of the trimming level is required. Such a choice, especially in applications, can be cumbersome. As will be clarified later on, the RTCLUST methodology allows the user to overcome this problem. Indeed, in the RTCLUST approach the user is only required to impose a high preventive trimming level. The procedure, by iterating through a sequence of decreasing trimming levels, is aimed at reinserting the discarded observations at each step, and provides more precise estimates of the parameters and a final estimate of the true contamination level.
The theoretical properties of the methodology are studied in Section 3.6 and proved in Appendix A.1, while Section 3.7 contains a simulation study aimed at evaluating the properties of the methodology and its advantages with respect to some other robust (reweighted and single-step) procedures.
Chapter 4 contains an extension of the TCLUST method to fuzzy linear clustering (Dotto et al. 2016a). Such a contribution can be viewed as the extension of Fritz et al. (2013a) to linear clustering problems or, equivalently, as the extension of García-Escudero, Gordaliza, Mayo-Iscar & San Martín (2010) to the fuzzy clustering framework. Fuzzy clustering is also useful for dealing with contamination. Fuzziness is introduced to deal with overlap between clusters and the presence of bridge points, to be defined in Section 1.1. Indeed, bridge points may arise in the case of overlap between clusters and may completely alter the estimated cluster parameters (i.e. the coefficients of the linear model in each cluster). By introducing fuzziness, such observations are suitably down-weighted and the clusterwise structure can be correctly detected. On the other hand, robustness against gross outliers, as in the TCLUST methodology, is guaranteed by trimming a fixed proportion of observations. Additionally, a simulation study aimed at comparing the proposed methodology with other proposals (both robust and non-robust) is provided in Section 4.4.
Chapter 5 is entirely dedicated to real data applications of the proposed contributions. In particular, the RTCLUST method is applied to two different datasets. The first one is the "Swiss Bank Note" dataset, a well-known benchmark dataset for clustering models; the second is a dataset collected by the Gallup Organization, which is, to our knowledge, an original dataset on which no other existing proposals have been applied yet. Section 5.3 contains an application of our fuzzy linear clustering proposal to allometry data. In our opinion, such a dataset, already considered in the robust linear clustering proposal of García-Escudero, Gordaliza, Mayo-Iscar & San Martín (2010), is particularly useful for showing the advantages of our proposed methodology. Indeed, allometric quantities are often linked by a linear relationship but, at the same time, there may be overlap between different groups, and outliers may often appear due to errors in data registration.
Finally, Chapter 6 contains the concluding remarks and further directions of research. In particular, we wish to mention an ongoing work (Dotto & Farcomeni, in preparation) in which we consider the possibility of implementing robust parsimonious Gaussian clustering models. Within the chapter, the algorithm is briefly described and some illustrative examples are provided. The potential advantages of such proposals are the following. First of all, by considering the parsimonious models introduced in Celeux & Govaert (1995), the user is able to impose the shape of the detected clusters, which often, in applications, plays a key role. Secondly, by constraining the shape of the detected clusters, the constraint on the eigenvalue ratio can be avoided. This leads to the removal of a tuning parameter of the procedure and, at the same time, allows the user to obtain affine equivariant estimators. Finally, since trimming a fixed proportion of observations is still allowed, the procedure is also formally robust.
Nonparametric time series forecasting with dynamic updating
We present a nonparametric method to forecast a seasonal univariate time series, and propose four dynamic updating methods to improve point forecast accuracy. Our methods treat a seasonal univariate time series as a functional time series. We propose first to reduce the dimensionality by applying functional principal component analysis to the historical observations, and then to use univariate time series forecasting and functional principal component regression techniques. When data in the most recent year are partially observed, we improve point forecast accuracy using dynamic updating methods. We also introduce a nonparametric approach to constructing prediction intervals of updated forecasts, and compare the empirical coverage probability with that of an existing parametric method. Our approaches are data-driven and computationally fast, and hence feasible to apply in real-time, high-frequency dynamic updating. The methods are demonstrated using monthly sea surface temperatures from 1950 to 2008.
Keywords: functional time series; functional principal component analysis; ordinary least squares; penalized least squares; ridge regression; sea surface temperatures; seasonal time series
Fuzzy C-ordered medoids clustering of interval-valued data
Fuzzy clustering for interval-valued data helps us find natural vague boundaries in such data. The Fuzzy c-Medoids Clustering (FcMdC) method is one of the most popular clustering methods based on a partitioning-around-medoids approach. However, one of its greatest disadvantages is its sensitivity to the presence of outliers in the data. This paper introduces a new robust fuzzy clustering method named Fuzzy c-Ordered-Medoids clustering for interval-valued data (FcOMdC-ID). Huber's M-estimators and Yager's Ordered Weighted Averaging (OWA) operators are used in the proposed method to make it robust to outliers. The described algorithm is compared with the fuzzy c-medoids method in experiments performed on synthetic data with different types of outliers. A real application of FcOMdC-ID is also provided.
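For reference, the non-robust baseline the paper starts from, fuzzy c-medoids, can be sketched on a precomputed distance matrix. For interval-valued data the distances could, for example, combine squared differences of the interval bounds; the deterministic initialization shown here is an illustrative choice, not the paper's algorithm.

```python
import numpy as np

def fuzzy_c_medoids(D, c, m=2.0, n_iter=100):
    """Sketch of standard fuzzy c-medoids (FcMdC) on a distance matrix D.
    Alternates membership updates u_ij proportional to d_ij^(-2/(m-1))
    with medoid updates minimizing the membership-weighted cost."""
    # farthest-point initialization: most central point first, then
    # repeatedly the point farthest from the current medoids
    medoids = [int(D.sum(axis=1).argmin())]
    while len(medoids) < c:
        medoids.append(int(D[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        d = D[:, medoids] + 1e-12                # distances to current medoids
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)        # memberships sum to 1 per point
        # each medoid minimizes the membership-weighted distance to all points
        new_medoids = np.array([(u[:, j] ** m @ D).argmin() for j in range(c)])
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, u
```

The paper's robust variant replaces the plain membership-weighted sums in the medoid update with M-estimator and OWA-based aggregations, which down-weight outlying distances; that refinement is not shown here.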
LEAST SQUARES SOLUTION OF A FUZZY REGRESSION MODEL WITH A SYMMETRIC FUZZY DEPENDENT VARIABLE
ABSTRACT: It is known that the classical regression model is used to analyze crisp data. When there is uncertainty (imprecision, vagueness) in the values of the observed variables, a fuzzy regression model is required. This paper discusses a fuzzy regression model with a symmetric fuzzy dependent variable and crisp independent variables, using the least squares approach. The method used to find the linear model is to minimize a distance function between the observed fuzzy variables and the estimated dependent variable, i.e. the corresponding theoretical value. It is shown that the solution of this model is a generalization of the classical regression model. The properties of the solution are discussed further.
Keywords: center model, spread model, fuzzy distance, iterative solution
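Under one common choice of squared distance between symmetric fuzzy numbers (squared center error plus squared spread error, as in Diamond-style fuzzy least squares), the criterion separates, so the center and spread models can each be fit by ordinary least squares. A minimal sketch under that assumption, which need not be the exact distance used in the paper:

```python
import numpy as np

def fuzzy_least_squares(X, y_center, y_spread):
    """Least squares fuzzy regression with crisp inputs and a symmetric fuzzy
    response represented as (center, spread). With the separable squared
    distance d^2 = (center error)^2 + (spread error)^2, minimizing the total
    distance reduces to two independent ordinary least squares fits."""
    n = len(y_center)
    Xb = np.column_stack([np.ones(n), np.asarray(X).reshape(n, -1)])  # intercept
    beta_c, *_ = np.linalg.lstsq(Xb, y_center, rcond=None)   # center model
    beta_s, *_ = np.linalg.lstsq(Xb, y_spread, rcond=None)   # spread model
    return beta_c, beta_s
```

When the spreads are constant, the spread model degenerates to an intercept and the center model is exactly the classical regression, illustrating the generalization the abstract mentions.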