Basic statistics for probabilistic symbolic variables: a novel metric-based approach
In data mining, it is usual to describe a set of individuals by summaries
(means, standard deviations, histograms, confidence intervals) that
generalize individual descriptions into a typology description. In this case,
data can be described by several values. In this paper, we propose an approach
for computing basic statistics for such data and, in particular, for data
described by numerical multi-valued variables (intervals, histograms, discrete
multi-valued descriptions). We propose to treat all numerical multi-valued
variables as distributional data, i.e. as individuals described by
distributions. To obtain new basic statistics for measuring the variability of
such variables and the association between them, we extend the classic measure
of inertia, calculated with the Euclidean distance, using the squared
Wasserstein distance defined between probability measures. The distance is a
generalization of the Wasserstein distance, which is a distance between the
quantile functions of two distributions. Some properties of this distance are
shown; among them, we prove the Huygens theorem of decomposition of the
inertia. We illustrate the use of the Wasserstein distance and of the basic
statistics by presenting a k-means-like clustering algorithm for a set of data
described by modal numerical variables (distributional variables), applied to
a real data set. Keywords: Wasserstein distance, inertia, dependence,
distributional data, modal variables.
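A minimal sketch of the metric described above (assuming equally weighted
empirical samples and NumPy as the only dependency; the function name
squared_wasserstein is illustrative), approximating the squared Wasserstein
distance through the quantile functions of two distributions:

import numpy as np

def squared_wasserstein(x, y, n_quantiles=100):
    # Squared 2-Wasserstein distance between two empirical samples,
    # approximated as the mean squared difference of their quantile functions.
    p = (np.arange(n_quantiles) + 0.5) / n_quantiles  # evaluation points in (0, 1)
    qx = np.quantile(x, p)  # quantile function of the first sample
    qy = np.quantile(y, p)  # quantile function of the second sample
    return np.mean((qx - qy) ** 2)

# Toy usage: two samples differing in location and scale.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=1000)
b = rng.normal(loc=1.0, scale=2.0, size=1000)
print(squared_wasserstein(a, b))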
Multiple factor analysis of distributional data
In the framework of Symbolic Data Analysis (SDA), distribution-variables are
a particular case of multi-valued variables: each unit is represented by a set
of distributions (e.g. histograms, density functions or quantile functions),
one for each variable. Factor analysis (FA) methods are primary exploratory
tools for dimension reduction and visualization. In the present work, we use
the Multiple Factor Analysis (MFA) approach for the analysis of data described
by distributional variables. Each distributional variable induces a set of new
numeric variables related to the quantiles of each distribution. We call these
new variables quantile variables, and the set of quantile variables related to
a distributional variable forms a block in the MFA approach. Thus, MFA is
performed on juxtaposed tables of quantile variables. We show that the
criterion decomposed in the analysis is an approximation of the variability
based on a suitable metric between distributions: the squared Wasserstein
distance. Applications on simulated and real distributional data corroborate
the method. The interpretation of the results on the factorial planes is
performed by means of new interpretative tools related to several
characteristics of the distributions (location, scale and shape).
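By way of illustration (a sketch assuming each unit's distribution is
available as a raw sample; the helper name quantile_block is hypothetical),
a distributional variable can be unfolded into the block of quantile
variables on which MFA is then run:

import numpy as np

def quantile_block(samples, levels=(0.1, 0.25, 0.5, 0.75, 0.9)):
    # One distributional variable (a list of samples, one per unit) becomes a
    # block of quantile variables: rows are units, columns are quantile levels.
    return np.array([np.quantile(s, levels) for s in samples])

# Toy data: three units described by one distributional variable.
rng = np.random.default_rng(0)
units = [rng.normal(0, 1, 500), rng.normal(2, 1, 500), rng.normal(0, 3, 500)]
block = quantile_block(units)  # shape (3, 5); juxtapose one block per variable for MFA
print(block.round(2))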
Analysis of the Distribution of Participation in Wikis Using the Gini Coefficient, the Frequency Distribution and the Lorenz Curve
Depto. de Ingeniería de Software e Inteligencia Artificial (ISIA), Fac. de Informática
Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances
This paper deals with clustering methods based on adaptive distances for
histogram data using a dynamic clustering algorithm. Histogram data describes
individuals in terms of empirical distributions. This kind of data can be
considered a complex description of phenomena observed on complex objects:
images, groups of individuals, spatially or temporally varying data, results
of queries, environmental data, and so on. The Wasserstein distance is used to
compare two histograms; it is made up of two components, the first based on
the means and the second on the internal dispersions (standard deviation,
skewness, kurtosis, and so on) of the histograms. To cluster sets of histogram
data, we propose a Dynamic Clustering Algorithm based on adaptive squared
Wasserstein distances, a k-means-like algorithm for partitioning a set of
individuals into an a priori fixed number of classes.
The main aim of this research is to provide a tool for clustering histograms,
emphasizing the different contributions of the histogram variables, and their
components, to the definition of the clusters. We demonstrate that this can be
achieved using adaptive distances. Two kinds of adaptive distances are
considered: the first takes into account the variability of each component of
each descriptor for the whole set of individuals; the second takes into account
the variability of each component of each descriptor in each cluster. We
furnish interpretative tools of the obtained partition based on an extension of
the classical measures (indexes) to the use of adaptive distances in the
clustering criterion function. Applications on synthetic and real-world data
corroborate the proposed procedure.
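For orientation, a minimal non-adaptive sketch of the k-means-like step
(NumPy only; the adaptive component weights described above are omitted, and
the function name is illustrative), working on quantile-function
representations so that the Euclidean distance between rows approximates the
squared Wasserstein distance:

import numpy as np

def wasserstein_kmeans(Q, k, n_iter=50, seed=0):
    # Q: array (n_individuals, n_quantile_levels) of quantile-function values.
    # Prototypes are means of quantile functions, i.e. Wasserstein barycenters.
    rng = np.random.default_rng(seed)
    centers = Q[rng.choice(Q.shape[0], size=k, replace=False)].astype(float)
    labels = np.zeros(Q.shape[0], dtype=int)
    for _ in range(n_iter):
        dist = ((Q[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances
        labels = dist.argmin(axis=1)                                     # assignment step
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Q[labels == j].mean(axis=0)                 # representation step
    return labels, centers

The adaptive variants described above would additionally learn one weight per
histogram component (means and dispersions), either over the whole set of
individuals or separately within each cluster.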
Linear regression for numeric symbolic variables: an ordinary least squares approach based on Wasserstein Distance
In this paper we present a linear regression model for modal symbolic data.
The observed variables are histogram variables according to the definition
given in the framework of Symbolic Data Analysis and the parameters of the
model are estimated using the classic Least Squares method. An appropriate
metric is introduced in order to measure the error between the observed and the
predicted distributions. In particular, the Wasserstein distance is proposed.
Some properties of such a metric are exploited to predict the response
variable as a direct linear combination of the other independent histogram
variables. Measures of goodness of fit are discussed. An application on real
data corroborates the proposed method.
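As a rough sketch of the underlying least-squares problem (quantile-function
representations assumed available; no intercept term and none of the
constraints often needed to keep predicted quantile functions monotone; the
function name and array layout are illustrative):

import numpy as np

def ols_on_quantiles(X_q, y_q):
    # X_q: array (n_units, n_predictors, n_levels) of predictor quantile functions.
    # y_q: array (n_units, n_levels) of response quantile functions.
    # Returns one coefficient per predictor, minimising the summed squared
    # differences between observed and predicted quantile functions.
    n_units, n_pred, n_levels = X_q.shape
    A = X_q.transpose(0, 2, 1).reshape(n_units * n_levels, n_pred)  # one row per (unit, level)
    b = y_q.reshape(n_units * n_levels)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef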
A new approach for measuring and analysing residential segregation
This work proposes a new approach for residential segregation analysis, contributing to the methodological debate on the measurement of the phenomenon and its comparability across different urban contexts. The strategy of analysis involves the use of areal interpolation methods to create high-resolution population grids, a compositional data approach, and the implementation of factorial analysis to define a socio-economic class composition index based on categorical data, which is a common data type in social research. The latter, in combination with spatial autocorrelation tools and the adoption of a criterion based on temporal distances to define spatial relations between grid cells, enables the identification and mapping of segregated areas. To test our method, we rely on the latest UK census data (2021) for the metropolitan areas of Liverpool, Manchester, and Newcastle upon Tyne, employing social groups defined according to the National Statistics Socio-economic Classification provided by the Office for National Statistics as population data. Finally, the validity of the proposed methodology is demonstrated through case studies, and the results are interpreted within the broader theoretical framework on the topic.
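Purely as an illustration of the spatial autocorrelation step (the weights
matrix here is a generic nonnegative matrix, not the temporal-distance-based
weights described above, and the names are hypothetical), a global Moran's I
over grid-cell index values can be computed as:

import numpy as np

def morans_i(values, weights):
    # Global Moran's I for one attribute observed on grid cells.
    # values:  1-D array, e.g. a class-composition index per cell.
    # weights: (n, n) nonnegative spatial weights matrix between cells.
    z = values - values.mean()
    num = (weights * np.outer(z, z)).sum()
    den = (z ** 2).sum()
    return len(values) / weights.sum() * num / den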