817 research outputs found
Detection of recombination in DNA multiple alignments with hidden Markov models
Conventional phylogenetic tree estimation methods assume that all sites in a DNA multiple alignment have the same evolutionary history. This assumption is violated in data sets from certain bacteria and viruses due to recombination, a process that leads to the creation of mosaic sequences from different strains and, if undetected, causes systematic errors in phylogenetic tree estimation. In the current work, a hidden Markov model (HMM) is employed to detect recombination events in multiple alignments of DNA sequences. The emission probabilities in a given state are determined by the branching order (topology) and the branch lengths of the respective phylogenetic tree, while the transition probabilities depend on the global recombination probability. The present study improves on an earlier heuristic parameter optimization scheme and shows how the branch lengths and the recombination probability can be optimized in a maximum likelihood sense by applying the expectation maximization (EM) algorithm. The novel algorithm is tested on a synthetic benchmark problem and is found to clearly outperform the earlier heuristic approach. The paper concludes with an application of this scheme to a DNA sequence alignment of the argF gene from four Neisseria strains, where a likely recombination event is clearly detected.
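As a rough illustration of how an HMM of this kind can assign a tree topology to each alignment column, the sketch below runs Viterbi decoding over topology states. It is not the paper's EM procedure: the per-column emission log-probabilities are assumed to be precomputed, and the transition rule (keep the current topology with probability `nu`, otherwise switch uniformly to another) is an assumption made for this example. The names `viterbi_topologies`, `emission_logp` and `nu` are hypothetical.

```python
import math


def viterbi_topologies(emission_logp, nu):
    """Return the most probable topology index for each alignment column.

    emission_logp[t][k] is the log-probability of column t under topology k
    (assumed to be precomputed from the tree; not derived here).
    nu is the assumed probability of keeping the same topology between
    adjacent columns, so 1 - nu plays the role of a recombination probability.
    Requires at least two topologies and 0 < nu < 1.
    """
    n_states = len(emission_logp[0])
    log_stay = math.log(nu)
    log_switch = math.log((1.0 - nu) / (n_states - 1))

    # Uniform prior over topologies for the first column.
    score = [math.log(1.0 / n_states) + e for e in emission_logp[0]]
    back = []
    for column in emission_logp[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            # Best predecessor state for topology k at this column.
            best_prev = max(
                range(n_states),
                key=lambda j: score[j] + (log_stay if j == k else log_switch),
            )
            trans = log_stay if best_prev == k else log_switch
            new_score.append(score[best_prev] + trans + column[k])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

A change of state along the returned path marks a candidate recombination breakpoint. In the approach described in the abstract, the branch lengths and the recombination probability would be estimated by EM rather than fixed by hand as they are here.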
Designing a Belief Function-Based Accessibility Indicator to Improve Web Browsing for Disabled People
The purpose of this study is to provide an accessibility measure of
web-pages, in order to draw disabled users to the pages that have been designed
to be accessible to them. Our approach is based on the theory of belief
functions, using data which are supplied by reports produced by automatic web
content assessors that test the validity of criteria defined by the WCAG 2.0
guidelines proposed by the World Wide Web Consortium (W3C) organization. These
tools detect errors with gradual degrees of certainty and their results do not
always converge. For these reasons, to fuse information coming from the
reports, we choose to use an information fusion framework which can take into
account the uncertainty and imprecision of information as well as divergences
between sources. Our accessibility indicator covers four categories of
deficiencies. To validate the theoretical approach in this context, we propose
an evaluation carried out on a corpus of the 100 most visited French news
websites with 2 evaluation tools. The results obtained illustrate the value of
our accessibility indicator.
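To make the fusion idea concrete, here is a minimal sketch of Dempster's rule of combination, one standard way to merge two mass functions in belief function theory. The abstract does not state which combination rule the authors use, so this is an assumption for illustration only. The function name, the two-element frame and the example masses are all hypothetical.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions given as {frozenset: mass} dictionaries.

    Mass assigned to pairs of focal sets with an empty intersection is
    treated as conflict and removed by renormalisation (Dempster's rule).
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Hypothetical example: two assessors reporting on one criterion of a page.
ACCESSIBLE = frozenset({"accessible"})
INACCESSIBLE = frozenset({"inaccessible"})
UNKNOWN = ACCESSIBLE | INACCESSIBLE  # the whole frame expresses ignorance

tool_1 = {INACCESSIBLE: 0.6, UNKNOWN: 0.4}
tool_2 = {ACCESSIBLE: 0.3, UNKNOWN: 0.7}
fused = dempster_combine(tool_1, tool_2)
```

Mass left on the whole frame represents what neither tool could decide, which is how this representation keeps uncertainty separate from disagreement between sources.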
Electrostatic Field Classifier for Deficient Data
This paper investigates the suitability of recently developed models based on the physical
field phenomena for classification problems with incomplete datasets. An original approach
to exploiting incomplete training data with missing features and labels, involving extensive use
of electrostatic charge analogy, has been proposed. Classification of incomplete patterns has been
investigated using a local dimensionality reduction technique, which aims at exploiting all available
information rather than trying to estimate the missing values. The performance of all proposed
methods has been tested on a number of benchmark datasets for a wide range of missing data scenarios
and compared to the performance of some standard techniques. Several modifications of the
original electrostatic field classifier aiming at improving speed and robustness in higher dimensional
spaces are also discussed.
Evidential-EM Algorithm Applied to Progressively Censored Observations
The Evidential-EM (E2M) algorithm is an effective approach for computing maximum
likelihood estimates under finite mixture models, especially when there is
uncertain information about data. In this paper we present an extension of the
E2M method in a particular case of incomplete data, where the loss of
information is due to both mixture models and censored observations. The prior
uncertain information is expressed by belief functions, while the
pseudo-likelihood function is derived based on imprecise observations and prior
knowledge. The E2M method is then invoked to maximize the generalized likelihood
function to obtain the optimal estimation of parameters. Numerical examples
show that the proposed method could effectively integrate the uncertain prior
information with the current imprecise knowledge conveyed by the observed
data.
Diagonal and Low-Rank Matrix Decompositions, Correlation Matrices, and Ellipsoid Fitting
In this paper we establish links between, and new results for, three problems
that are not usually considered together. The first is a matrix decomposition
problem that arises in areas such as statistical modeling and signal
processing: given a matrix formed as the sum of an unknown diagonal matrix
and an unknown low rank positive semidefinite matrix, decompose into these
constituents. The second problem we consider is to determine the facial
structure of the set of correlation matrices, a convex set also known as the
elliptope. This convex body, and particularly its facial structure, plays a
role in applications from combinatorial optimization to mathematical finance.
The third problem is a basic geometric question: given points
v_1, v_2, ..., v_n in R^k (where n > k) determine whether there is a centered
ellipsoid passing exactly through all of the points.
We show that in a precise sense these three problems are equivalent.
Furthermore we establish a simple sufficient condition on a subspace U that
ensures any positive semidefinite matrix L with column space U can be
recovered from D + L for any diagonal matrix D using a convex
optimization-based heuristic known as minimum trace factor analysis. This
result leads to a new understanding of the structure of rank-deficient
correlation matrices and a simple condition on a set of points that ensures
there is a centered ellipsoid passing through them. Comment: 20 pages
SACOC: A spectral-based ACO clustering algorithm
The application of ACO-based algorithms in data mining has grown over the last few years, and several supervised and unsupervised learning algorithms have been developed using this bio-inspired approach. Most recent work on unsupervised learning has focused on clustering, where ACO-based techniques have shown great potential. At the same time, new clustering techniques that seek the continuity of data, especially spectral-based approaches as opposed to classical centroid-based approaches, have attracted increasing research interest, an area still being explored with ACO clustering techniques. This work presents a hybrid spectral-based ACO clustering algorithm inspired by the ACO Clustering (ACOC) algorithm. The proposed approach combines ACOC with the spectral Laplacian to generate a new search space for the algorithm in order to obtain more promising solutions. The new algorithm, called SACOC, has been compared against well-known algorithms (K-means and Spectral Clustering) and with ACOC. The experiments measure the accuracy of the algorithm for both synthetic datasets and real-world datasets extracted from the UCI Machine Learning Repository.
Classification of Message Spreading in a Heterogeneous Social Network
Nowadays, social networks such as Twitter, Facebook and LinkedIn have become
increasingly popular. They have introduced new habits and new ways of
communication, and every day they collect a great deal of information from
different sources. Most existing research works focus on the analysis of
homogeneous social networks, i.e. networks with a single type of node and link.
However, in the real world, social networks offer several types of
nodes and links. Hence, with a view to preserving as much information as
possible, it is important to consider social networks as heterogeneous and
uncertain. The goal of our paper is to classify a social message based on its
spreading in the network and the theory of belief functions. The proposed
classifier interprets the spread of messages on the network, crossed paths and
types of links. We tested our classifier on a real-world network that we
collected from Twitter, and our experiments show the performance of our belief
classifier.
Dynamic Bayesian Combination of Multiple Imperfect Classifiers
Classifier combination methods need to make best use of the outputs of
multiple, imperfect classifiers to enable higher accuracy classifications. In
many situations, such as when human decisions need to be combined, the base
decisions can vary enormously in reliability. A Bayesian approach to such
uncertain combination allows us to infer the differences in performance between
individuals and to incorporate any available prior knowledge about their
abilities when training data is sparse. In this paper we explore Bayesian
classifier combination, using the computationally efficient framework of
variational Bayesian inference. We apply the approach to real data from a large
citizen science project, Galaxy Zoo Supernovae, and show that our method far
outperforms other established approaches to imperfect decision combination. We
go on to analyse the putative community structure of the decision makers, based
on their inferred decision making strategies, and show that natural groupings
are formed. Finally we present a dynamic Bayesian classifier combination
approach and investigate the changes in base classifier performance over time. Comment: 35 pages, 12 figures
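The core combination step behind this kind of approach can be sketched as a plain Bayes update, assuming each base classifier's confusion matrix is already known and that the classifiers are conditionally independent given the true class. The paper instead infers those reliabilities with variational Bayesian inference, which is not shown here. The function name and the example numbers are hypothetical.

```python
def combine_decisions(prior, confusion, decisions):
    """Posterior over the true class given several imperfect decisions.

    prior[c]            : prior probability of true class c
    confusion[k][c][d]  : assumed P(classifier k outputs d | true class c)
    decisions[k]        : the label actually output by classifier k
    Classifiers are assumed conditionally independent given the true class.
    """
    unnormalised = []
    for c, p in enumerate(prior):
        for k, d in enumerate(decisions):
            p *= confusion[k][c][d]
        unnormalised.append(p)
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hypothetical example: a reliable and a weak decision maker disagree.
reliable = [[0.9, 0.1], [0.1, 0.9]]
weak = [[0.6, 0.4], [0.4, 0.6]]
posterior = combine_decisions([0.5, 0.5], [reliable, weak], [1, 0])
```

The reliable decision maker dominates the result, which is the behaviour the abstract motivates when base decisions vary widely in reliability.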