Basic statistics for probabilistic symbolic variables: a novel metric-based approach
In data mining, it is usual to describe a set of individuals using some
summaries (means, standard deviations, histograms, confidence intervals) that
generalize individual descriptions into a typology description. In this case,
data can be described by several values. In this paper, we propose an approach
for computing basic statistics for such data, and, in particular, for data
described by numerical multi-valued variables (intervals, histograms, discrete
multi-valued descriptions). We propose to treat all numerical multi-valued
variables as distributional data, i.e. as individuals described by
distributions. To obtain new basic statistics for measuring the variability and
the association between such variables, we extend the classic measure of
inertia, calculated with the Euclidean distance, using the squared Wasserstein
distance defined between probability measures. The distance is a generalization
of the Wasserstein distance, which is a distance between the quantile functions of
two distributions. Some properties of such a distance are shown. Among them, we
prove Huygens' theorem on the decomposition of inertia. We illustrate the use of
the Wasserstein distance and of the basic statistics by presenting a k-means-like
clustering algorithm for data described by modal numerical variables
(distributional variables), applied to a real data set. Keywords:
Wasserstein distance, inertia, dependence, distributional data, modal
variables.
Comment: 19 pages, 3 figures
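To make the quantile-function view of the distance concrete, here is a minimal sketch, not the authors' implementation. It assumes two one-dimensional empirical samples of equal size, in which case the integral of the squared difference between the two quantile functions reduces to a mean over sorted values. The function name `squared_wasserstein` is hypothetical.

```python
def squared_wasserstein(x, y):
    """Squared L2 Wasserstein distance between two equal-size 1-D samples.

    Assumption: both inputs are empirical samples with the same number of
    points, so their quantile functions are step functions with matching
    steps and the integral becomes a mean over sorted pairs.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes samples of equal size")
    if not x:
        raise ValueError("samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    # Pair the i-th smallest value of x with the i-th smallest value of y.
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)


# A sample and a shifted, reordered copy of it: every quantile moves by 2.
d = squared_wasserstein([1.0, 2.0, 4.0], [6.0, 3.0, 4.0])
print(d)  # 4.0
```

For histograms or samples of unequal size, the quantile functions would have to be evaluated on a common grid of probability levels instead; that case is not covered by this sketch.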
Linear regression for numeric symbolic variables: an ordinary least squares approach based on Wasserstein Distance
In this paper we present a linear regression model for modal symbolic data.
The observed variables are histogram variables according to the definition
given in the framework of Symbolic Data Analysis and the parameters of the
model are estimated using the classic Least Squares method. An appropriate
metric is introduced in order to measure the error between the observed and the
predicted distributions. In particular, the Wasserstein distance is proposed.
Some properties of this metric are exploited to predict the response variable
as a direct linear combination of other independent histogram variables. Measures
of goodness of fit are discussed. An application on real data corroborates the
proposed method.
Cultural Values and Cross-cultural Video Consumption on YouTube
Video-sharing social media like YouTube provide access to diverse cultural
products from all over the world, making it possible to test theories that the
Web facilitates global cultural convergence. Drawing on a daily listing of
YouTube's most popular videos across 58 countries, we investigate the
consumption of popular videos in countries that differ in cultural values,
language, gross domestic product, and Internet penetration rate. Although
online social media facilitate global access to cultural products, we find this
technological capability does not result in universal cultural convergence.
Instead, consumption of popular videos in culturally different countries
appears to be constrained by cultural values. Cross-cultural convergence is
more advanced in cosmopolitan countries with cultural values that favor
individualism and power inequality.
Neural activity classification with machine learning models trained on interspike interval series data
The flow of information through the brain is reflected by the activity
patterns of neural cells. Indeed, these firing patterns are widely used as
input data to predictive models that relate stimuli and animal behavior to the
activity of a population of neurons. However, relatively little attention has been
paid to single-neuron spike trains as predictors of cell or network properties
in the brain. In this work, we introduce an approach to neuronal spike train
data mining which enables effective classification and clustering of neuron
types and network activity states based on single-cell spiking patterns. This
approach is centered around applying state-of-the-art time series
classification/clustering methods to sequences of interspike intervals recorded
from single neurons. We demonstrate good performance of these methods in tasks
involving classification of neuron type (e.g. excitatory vs. inhibitory cells)
and/or neural circuit activity state (e.g. awake vs. REM sleep vs. non-REM sleep
states) on an open-access cortical spiking activity dataset.
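As an illustration of the input such methods work on, the following minimal sketch turns a list of spike times from one neuron into its interspike interval series. It is an assumption-laden example, not the authors' pipeline: the function name `interspike_intervals` is hypothetical, and spike times are assumed to be plain numbers in a single time unit.

```python
def interspike_intervals(spike_times):
    """Return the gaps between consecutive spike times, in the input's units."""
    # Sort defensively in case the recording is not already time-ordered.
    ordered = sorted(spike_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Three spikes give two intervals.
isi = interspike_intervals([0.0, 0.5, 2.0])
print(isi)  # [0.5, 1.5]
```

The resulting sequence is what a time series classifier or clustering method would then take as its input for each neuron.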
Measures of Analysis of Time Series (MATS): A MATLAB Toolkit for Computation of Multiple Measures on Time Series Data Bases
In many applications, such as physiology and finance, large time series
databases must be analyzed, requiring the computation of linear, nonlinear and
other measures. Such measures have been developed and implemented in commercial
and freeware software rather selectively and independently. The Measures of
Analysis of Time Series (MATS) MATLAB toolkit is designed to handle
an arbitrarily large set of scalar time series and compute a large variety of
measures on them, allowing for the specification of varying measure parameters
as well. The variety of options, with added facilities for visualizing the
results, supports different settings of time series analysis, such as the
detection of changes in dynamics in long data records, resampling (surrogate or
bootstrap) tests for independence and linearity with various test statistics,
and assessing the discrimination power of different measures for different
combinations of their parameters. The basic features of MATS are presented and the
implemented measures are briefly described. The usefulness of MATS is
illustrated on some empirical examples along with screenshots.
Comment: 25 pages, 9 figures, two tables; the software can be downloaded at
http://eeganalysis.web.auth.gr/indexen.ht