4,022 research outputs found
Cluster validation by measurement of clustering characteristics relevant to the user
There are many cluster analysis methods that can produce quite different
clusterings on the same dataset. Cluster validation is the evaluation of the
quality of a clustering; "relative cluster validation" is about using such
criteria to compare clusterings. This can be used to select one of a set of
clusterings from different methods, or from the same method run with different
parameters such as different numbers of clusters.
There are many cluster validation indexes in the literature. Most of them
attempt to measure the overall quality of a clustering by a single number, but
this can be inappropriate. Various characteristics of a clustering can be
relevant in practice, depending on the aim of clustering, such as low
within-cluster distances and high between-cluster separation.
In this paper, a number of validation criteria will be introduced that refer
to different desirable characteristics of a clustering, and that characterise a
clustering in a multidimensional way. In specific applications the user may be
interested in some of these criteria rather than others. A focus of the paper
is on methodology to standardise the different characteristics so that users
can aggregate them in a suitable way specifying weights for the various
criteria that are relevant in the clustering application at hand.
Comment: 20 pages, 2 figures
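The idea in the abstract above can be sketched in code: compute several quality criteria for a clustering, standardise each to a common scale, and aggregate them with user-chosen weights. This is a minimal illustration of the general approach, not the paper's own indexes; all function names, the transforms used for standardisation, and the weights are invented for the example.

```python
# Minimal sketch of multi-criteria cluster validation with user weights.
# The criteria, standardisation transforms, and weights are illustrative
# assumptions, not the specific indexes proposed in the paper.
import numpy as np

def within_cluster_dispersion(X, labels):
    """Mean distance of points to their own cluster centroid (lower is better)."""
    return np.mean([
        np.linalg.norm(X[labels == k] - X[labels == k].mean(axis=0), axis=1).mean()
        for k in np.unique(labels)
    ])

def between_cluster_separation(X, labels):
    """Minimum distance between any two cluster centroids (higher is better)."""
    centroids = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    return min(np.linalg.norm(a - b)
               for i, a in enumerate(centroids) for b in centroids[i + 1:])

def aggregate(criteria, weights):
    """Weighted mean of criteria already standardised so that higher = better."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(criteria, w) / w.sum())

# Toy data: two well-separated blobs with known labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)

disp = within_cluster_dispersion(X, labels)   # small for tight clusters
sep = between_cluster_separation(X, labels)   # large for separated clusters
# Standardise each criterion into [0, 1] (higher = better) and weight equally.
score = aggregate([1 / (1 + disp), sep / (1 + sep)], weights=[0.5, 0.5])
```

A user who cares mostly about separation would simply shift the weights, e.g. `weights=[0.2, 0.8]`, which is the kind of application-dependent aggregation the paper's methodology targets.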
Habitat filtering determines spatial variation of macroinvertebrate community traits in northern headwater streams
Although our knowledge of the spatial distribution of stream organisms has been increasing rapidly in recent decades, there is still little consensus about trait-based variability of macroinvertebrate communities within and between catchments in near-pristine systems. Our aim was to examine the taxonomic and trait-based stability vs. variability of stream macroinvertebrates in three high-latitude catchments in Finland. The collected taxa were assigned to unique trait combinations (UTCs) using biological traits. We found that only a single or a highly limited number of taxa formed each UTC, suggesting a low degree of redundancy. Our analyses revealed significant differences in the environmental conditions of the streams among the three catchments. Linear models, rarefaction curves and beta-diversity measures showed that the catchments differed in both alpha and beta diversity. Taxon- and trait-based multivariate analyses also indicated that the three catchments were significantly different in terms of macroinvertebrate communities. All these findings suggest that habitat filtering, i.e., environmental differences among catchments, determines the variability of macroinvertebrate communities, thereby contributing to the significant biological differences among the catchments. The main implication of our study is that the sensitivity of trait-based analyses to natural environmental variation should be carefully incorporated into the assessment of environmental degradation, and that further studies are needed for a deeper understanding of trait-based community patterns across near-pristine streams.
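The unique-trait-combination (UTC) grouping described in the abstract can be sketched simply: taxa sharing the same trait profile form one UTC, and redundancy is low when most UTCs contain a single taxon. The taxa, trait names, and values below are invented for illustration; the study's actual trait database is not reproduced here.

```python
# Illustrative sketch of grouping taxa into unique trait combinations (UTCs).
# Taxa and trait profiles are invented examples, not the study's data.
from collections import defaultdict

taxa_traits = {
    "Baetis":      ("swimmer", "scraper", "small"),
    "Hydropsyche": ("clinger", "filterer", "medium"),
    "Simulium":    ("clinger", "filterer", "small"),
    "Ephemerella": ("swimmer", "scraper", "small"),  # shares a UTC with Baetis
}

# Group taxa by identical trait profile: each key is one UTC.
utcs = defaultdict(list)
for taxon, traits in taxa_traits.items():
    utcs[traits].append(taxon)

# Redundancy: fraction of UTCs represented by more than one taxon.
# A low value means losing one taxon often loses a whole trait combination.
redundant = sum(1 for members in utcs.values() if len(members) > 1)
redundancy = redundant / len(utcs)
```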
Data granulation by the principles of uncertainty
Research in granular modeling has produced a variety of mathematical models,
such as intervals, (higher-order) fuzzy sets, rough sets, and shadowed sets,
which are all suitable to characterize the so-called information granules.
Modeling of the input data uncertainty is recognized as a crucial aspect in
information granulation. Moreover, the uncertainty is a well-studied concept in
many mathematical settings, such as those of probability theory, fuzzy set
theory, and possibility theory. This fact suggests that an appropriate
quantification of the uncertainty expressed by the information granule model
could be used to define an invariant property, to be exploited in practical
situations of information granulation. In this perspective, a procedure of
information granulation is effective if the uncertainty conveyed by the
synthesized information granule is in a monotonically increasing relation with
the uncertainty of the input data. In this paper, we present a data granulation
framework that elaborates over the principles of uncertainty introduced by
Klir. Since uncertainty is a mesoscopic descriptor of systems and data, such
principles can be applied regardless of the input data type and the specific
mathematical setting adopted for the information granules. The
proposed framework is conceived (i) to offer a guideline for the synthesis of
information granules and (ii) to build a groundwork on which different data
granulation procedures can be compared and quantitatively judged. To provide a
suitable case study, we introduce a new data granulation technique based on the
minimum sum of distances, which is designed to generate type-2 fuzzy sets. We
analyze the procedure by performing different experiments on two distinct data
types: feature vectors and labeled graphs. Results show that the uncertainty of
the input data is suitably conveyed by the generated type-2 fuzzy set models.
Comment: 16 pages, 9 figures, 52 references
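The monotonicity requirement stated in the abstract can be illustrated with a much simpler granule than the paper's type-2 fuzzy sets: granulate 1-D data into an interval, take the interval width as the granule's uncertainty, and check that more uncertain input yields a more uncertain granule. All names and the choice of interval granules are illustrative assumptions, not the paper's technique.

```python
# Sketch of the monotonicity principle: a granulation procedure is effective
# if the uncertainty of the synthesized granule increases with the
# uncertainty of the input data. Interval granules stand in for the paper's
# type-2 fuzzy sets purely for illustration.
import statistics

def interval_granule(data):
    """Granulate a 1-D sample into the enclosing interval [min, max]."""
    return (min(data), max(data))

def granule_uncertainty(interval):
    """Interval width as a simple uncertainty measure of the granule."""
    lo, hi = interval
    return hi - lo

low_spread = [4.9, 5.0, 5.1, 5.0, 4.95]    # low input uncertainty
high_spread = [1.0, 5.0, 9.0, 3.0, 7.0]    # high input uncertainty

# Input uncertainty measured by the sample standard deviation.
u_in_low = statistics.stdev(low_spread)
u_in_high = statistics.stdev(high_spread)
# Output uncertainty measured on the synthesized granule.
u_out_low = granule_uncertainty(interval_granule(low_spread))
u_out_high = granule_uncertainty(interval_granule(high_spread))
# Effective granulation: the input ordering is preserved by the granules.
```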
Earnings efficiency and poverty dominance analysis: a spatial approach
The paper estimates an earnings frontier by the method of Corrected Ordinary Least Squares (COLS) and categorizes households as efficient or inefficient based on a benchmark efficiency score and the estimated frontier. The spatial distribution of poor and non-poor households is then explored by constructing a poverty segregation curve across efficiency zones. Robust poverty comparisons across the efficient and inefficient groups reveal that poverty is in fact higher for the efficient group than for the inefficient one. The paper thus indirectly supports the "poor but efficient" hypothesis.
Earnings Frontier, Poverty, Stochastic Dominance, Treatment Effect.
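The COLS step in the abstract has a standard mechanical form that can be sketched: fit OLS, then shift the intercept up by the largest residual so every observation lies on or below the frontier. The data, the single education regressor, and the 0.8 benchmark cutoff below are invented for illustration; the paper's actual specification is not reproduced.

```python
# Hedged sketch of Corrected Ordinary Least Squares (COLS): fit OLS to log
# earnings, then shift the intercept by the maximum residual so the fitted
# line becomes an upper frontier. Data and the benchmark are illustrative.
import numpy as np

rng = np.random.default_rng(1)
education = rng.uniform(5, 15, 50)                          # years of schooling
log_earnings = 1.0 + 0.1 * education - rng.exponential(0.3, 50)

# Step 1: ordinary least squares, log_earnings = a + b * education + residual.
A = np.column_stack([np.ones_like(education), education])
(a, b), *_ = np.linalg.lstsq(A, log_earnings, rcond=None)
residuals = log_earnings - (a + b * education)

# Step 2: COLS correction, shift the intercept by the maximum residual so
# every household lies on or below the estimated frontier.
a_cols = a + residuals.max()
frontier = a_cols + b * education

# Efficiency score in (0, 1]; 1 means the household sits on the frontier.
efficiency = np.exp(log_earnings - frontier)
efficient = efficiency >= 0.8    # benchmark cutoff (illustrative assumption)
```

The household with the largest OLS residual defines the frontier and gets a score of exactly 1; the benchmark then splits the sample into the efficient and inefficient groups compared in the paper's poverty analysis.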
One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This situation constrains the learning of efficient classifiers, since the
class boundary must be defined using only knowledge of the positive class. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems, based on the availability of training data, the algorithms
used, and the application domains addressed. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and present
our vision for future research.
Comment: 24 pages + 11 pages of references, 8 figures
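The one-class setting the survey describes can be made concrete with one of the simplest possible OCC rules: learn a centroid and a distance threshold from positive examples only, and flag distant test points as outliers. This centroid-plus-threshold classifier is a generic illustration of the problem setting, not a method from the survey; all names and the 0.95 quantile are assumptions.

```python
# Minimal one-class classifier: trained only on positive examples, it
# accepts points within a distance threshold of the positive-class centroid
# and rejects the rest as outliers/novelties. Illustrative sketch only.
import math

class CentroidOCC:
    def __init__(self, quantile=0.95):
        self.quantile = quantile   # fraction of training points to enclose

    def fit(self, X):
        n, d = len(X), len(X[0])
        self.centroid = [sum(x[j] for x in X) / n for j in range(d)]
        dists = sorted(self._dist(x) for x in X)
        # Threshold at the chosen quantile of training distances, so a few
        # extreme positives can be treated as noise if quantile < 1.
        self.threshold = dists[min(int(self.quantile * n), n - 1)]
        return self

    def _dist(self, x):
        return math.dist(x, self.centroid)

    def predict(self, x):
        """True = accepted as positive class, False = outlier/novelty."""
        return self._dist(x) <= self.threshold

# Training data contains positives only; no negative class is ever seen.
positives = [(0.0, 0.0), (0.1, 0.2), (-0.1, 0.1), (0.2, -0.1), (0.0, 0.1)]
occ = CentroidOCC().fit(positives)
```

Under the survey's taxonomy, this rule would fall in the category where only positive training data is available, which is exactly what makes the negative class "absent, poorly sampled or not well defined".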