Clustering DNA words through distance distributions
Functional data appear in several domains of science, for example, in biomedical,
meteorological or engineering studies. A functional observation can exhibit an atypical
behaviour during a short or a large part of the domain and this may be due to
magnitude or to shape features. Over the last ten years many outlier detection
methods have been proposed. In this work we use the functional data framework to
investigate the existence of DNA words with outlying distance distribution, which
may be related to biological motifs.
A DNA word is a sequence defined over the genome alphabet {A, C, G, T}. Distances between successive occurrences of the same word define the inter-word distance
distribution, which can be interpreted as a discrete function. Each word length k is associated
with a functional dataset formed by 4^k distance distributions. As the word length
increases, the diversity of patterns observed in the functional dataset grows, and so does
the number of distributions displaying strong frequency peaks. We propose a two-step procedure to detect words with an outlying pattern of distances: first, the functions are clustered according to their global trend; then, an
outlier detection method is applied within each cluster. Each distribution trend is
obtained by data smoothing, which avoids some distributions' peaks, and similarities
between smoothed data are explored through hierarchical complete linkage clustering. The dissimilarity between functions is evaluated using the Euclidean distance
or the Generalized Minimum distance [1], which considers the dependence between
domain points. The resulting dendrograms are then cut, leading to a partition of the
distance distributions. For the second step we use the Directional Outlyingness measure, which assigns a robust measure of outlyingness to each domain point and is the
building block of a graphical tool for visualization of the centrality of the curves [2].
We focus on the human genome and words of length ≤ 7. Results are compared
with those obtained by applying only the second step of the procedure [3].
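As a rough illustration of the inter-word distance distribution described in this abstract, the following Python sketch counts distances between successive occurrences of a word and normalises them into relative frequencies. The function names are hypothetical, and counting overlapping occurrences is an assumption; the abstract does not say how overlaps are handled.

```python
from collections import Counter


def inter_word_distances(sequence, word):
    """Distances between successive occurrences of `word` in `sequence`.

    Assumption: overlapping occurrences are counted.
    """
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        # Advance by one position so overlapping matches are found too.
        start = sequence.find(word, start + 1)
    return [b - a for a, b in zip(positions, positions[1:])]


def distance_distribution(sequence, word):
    """Relative frequency of each observed inter-word distance."""
    distances = inter_word_distances(sequence, word)
    if not distances:
        return {}
    counts = Counter(distances)
    total = len(distances)
    return {d: c / total for d, c in sorted(counts.items())}


if __name__ == "__main__":
    toy_sequence = "ACGTACGTTTACG"  # toy input, not real genome data
    print(distance_distribution(toy_sequence, "ACG"))
```

Applying `distance_distribution` to every word of a given length would give one discrete function per word, which is the kind of functional dataset the two-step procedure operates on.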
On preprocessing of speech signals
Preprocessing of speech signals is considered a crucial step in the development of a robust and efficient speech or speaker recognition system. In this paper, we present some popular strategies based on statistical outlier detection to separate the silence/unvoiced part of the speech signal from the voiced portion. The proposed methods are based on the 3σ edit rule and the Hampel identifier, which are compared with the conventional techniques: (i) short-time energy (STE) based methods, and (ii) distribution based methods. The results obtained after applying the proposed strategies to some test voice signals are encouraging.
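The two outlier rules named in this abstract can be sketched in a few lines of Python. This is a generic illustration of the 3σ edit rule and the Hampel identifier on a list of numbers, not the paper's speech pipeline. The function names, the threshold `k = 3`, and the toy data are assumptions; 1.4826 is the usual factor that makes the median absolute deviation consistent with the standard deviation under normality.

```python
import statistics


def three_sigma_outliers(values, k=3.0):
    """Flag values more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [abs(v - mean) > k * std for v in values]


def hampel_outliers(values, k=3.0):
    """Flag values more than k scaled MADs from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # MAD rescaled to estimate the standard deviation
    return [abs(v - med) > k * scale for v in values]


if __name__ == "__main__":
    # Toy data: seven similar values and one extreme value.
    samples = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 50.0]
    print(three_sigma_outliers(samples))
    print(hampel_outliers(samples))
```

On this toy input the extreme value inflates the mean and standard deviation enough that the 3σ rule flags nothing, while the median-based Hampel identifier flags only the last value.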
Trust-Based Fusion of Untrustworthy Information in Crowdsourcing Applications
In this paper, we address the problem of fusing untrustworthy reports provided by a crowd of observers, while simultaneously learning the trustworthiness of individuals. To achieve this, we construct a likelihood model of the user's trustworthiness by scaling the uncertainty of its multiple estimates with trustworthiness parameters. We incorporate our trust model into a fusion method that merges estimates based on the trust parameters, and we provide an inference algorithm that jointly computes the fused output and the individual trustworthiness of the users based on the maximum likelihood framework. We apply our algorithm to cell tower localisation using real-world data from the OpenSignal project and show that it outperforms the state-of-the-art methods in both the accuracy of its predictions, by up to 21%, and their consistency, by up to 50%. Copyright © 2013, International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.
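The idea of scaling each report's uncertainty by a trust parameter can be illustrated with a simple precision-weighted average. The sketch below is not the paper's algorithm: it assumes one-dimensional Gaussian reports, takes the trust values as given instead of learning them by maximum likelihood, and uses a hypothetical function name. Each report's variance is inflated to `variance / trust`, with trust in (0, 1].

```python
def fuse_estimates(means, variances, trusts):
    """Fuse Gaussian reports, weighting each by trust / variance.

    Assumption: a report with trust t behaves like a Gaussian with
    variance `variance / t`, so low trust means low weight.
    """
    precisions = [t / v for v, t in zip(variances, trusts)]
    total_precision = sum(precisions)
    fused_mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance


if __name__ == "__main__":
    reports = [10.0, 12.0, 30.0]   # toy estimates of the same quantity
    variances = [1.0, 1.0, 1.0]    # self-reported uncertainty
    print(fuse_estimates(reports, variances, [1.0, 1.0, 1.0]))
    print(fuse_estimates(reports, variances, [1.0, 1.0, 0.01]))
```

With equal trust the fused mean is the plain average of the reports. Lowering the third user's trust pulls the fused mean toward the two reports that agree.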
Recent advances in directional statistics
Mainstream statistical methodology is generally applicable to data observed
in Euclidean space. There are, however, numerous contexts of considerable
scientific interest in which the natural supports for the data under
consideration are Riemannian manifolds like the unit circle, torus, sphere and
their extensions. Typically, such data can be represented using one or more
directions, and directional statistics is the branch of statistics that deals
with their analysis. In this paper we provide a review of the many recent
developments in the field since the publication of Mardia and Jupp (1999),
still the most comprehensive text on directional statistics. Many of those
developments have been stimulated by interesting applications in fields as
diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics,
image analysis, text mining, environmetrics, and machine learning. We begin by
considering developments for the exploratory analysis of directional data
before progressing to distributional models, general approaches to inference,
hypothesis testing, regression, nonparametric curve estimation, methods for
dimension reduction, classification and clustering, and the modelling of time
series, spatial and spatio-temporal data. An overview of currently available
software for analysing directional data is also provided, and potential future
developments discussed. Comment: 61 pages
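To make concrete why data on the circle need their own methods, here is a minimal Python sketch of two standard exploratory summaries for circular data, the mean direction and the mean resultant length. The function name and the toy angles are illustrative assumptions, not taken from the review.

```python
import math


def circular_mean(angles):
    """Mean direction and mean resultant length of angles in radians."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    mean_direction = math.atan2(s, c)                   # in (-pi, pi]
    resultant_length = math.hypot(s, c) / len(angles)   # in [0, 1]
    return mean_direction, resultant_length


if __name__ == "__main__":
    # Two directions just either side of 0 degrees.
    angles = [math.radians(350.0), math.radians(10.0)]
    direction, length = circular_mean(angles)
    print(math.degrees(direction), length)
```

The arithmetic mean of 350° and 10° is 180°, which points the opposite way. The circular mean direction is approximately 0°, and the mean resultant length of about 0.985 indicates tightly concentrated directions.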