16,562 research outputs found
One-Step or Two-Step Optimization and the Overfitting Phenomenon: A Case Study on Time Series Classification
For the last few decades, optimization has been developing at a fast rate.
Bio-inspired optimization algorithms are metaheuristics inspired by nature.
These algorithms have been applied to solve different problems in engineering,
economics, and other domains. Bio-inspired algorithms have also been applied in
different branches of information technology such as networking and software
engineering. Time series data mining is a field of information technology that
has its share of these applications too. In previous works we showed how
bio-inspired algorithms such as the genetic algorithms and differential
evolution can be used to find the locations of the breakpoints used in the
symbolic aggregate approximation of time series representation, and in another
work we showed how we can utilize the particle swarm optimization, one of the
famous bio-inspired algorithms, to set weights to the different segments in the
symbolic aggregate approximation representation. In this paper we present, in
two different approaches, a new meta optimization process that produces optimal
locations of the breakpoints in addition to optimal weights of the segments.
The experiments of time series classification task that we conducted show an
interesting example of how the overfitting phenomenon, a frequently encountered
problem in data mining which happens when the model overfits the training set,
can interfere in the optimization process and hide the superior performance of
an optimization algorithm
Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances
This paper deals with clustering methods based on adaptive distances for
histogram data using a dynamic clustering algorithm. Histogram data describes
individuals in terms of empirical distributions. These kind of data can be
considered as complex descriptions of phenomena observed on complex objects:
images, groups of individuals, spatial or temporal variant data, results of
queries, environmental data, and so on. The Wasserstein distance is used to
compare two histograms. The Wasserstein distance between histograms is
constituted by two components: the first based on the means, and the second, to
internal dispersions (standard deviation, skewness, kurtosis, and so on) of the
histograms. To cluster sets of histogram data, we propose to use Dynamic
Clustering Algorithm, (based on adaptive squared Wasserstein distances) that is
a k-means-like algorithm for clustering a set of individuals into classes
that are apriori fixed.
The main aim of this research is to provide a tool for clustering histograms,
emphasizing the different contributions of the histogram variables, and their
components, to the definition of the clusters. We demonstrate that this can be
achieved using adaptive distances. Two kind of adaptive distances are
considered: the first takes into account the variability of each component of
each descriptor for the whole set of individuals; the second takes into account
the variability of each component of each descriptor in each cluster. We
furnish interpretative tools of the obtained partition based on an extension of
the classical measures (indexes) to the use of adaptive distances in the
clustering criterion function. Applications on synthetic and real-world data
corroborate the proposed procedure
A Short Survey on Data Clustering Algorithms
With rapidly increasing data, clustering algorithms are important tools for
data analytics in modern research. They have been successfully applied to a
wide range of domains; for instance, bioinformatics, speech recognition, and
financial analysis. Formally speaking, given a set of data instances, a
clustering algorithm is expected to divide the set of data instances into the
subsets which maximize the intra-subset similarity and inter-subset
dissimilarity, where a similarity measure is defined beforehand. In this work,
the state-of-the-arts clustering algorithms are reviewed from design concept to
methodology; Different clustering paradigms are discussed. Advanced clustering
algorithms are also discussed. After that, the existing clustering evaluation
metrics are reviewed. A summary with future insights is provided at the end
- …