Fast NJ-like algorithms to deal with incomplete distance matrices
Abstract Background Distance-based phylogeny inference methods first estimate evolutionary distances between every pair of taxa, then build a tree from the resulting distance matrix. These methods are fast and fairly accurate. However, they cannot easily deal with incomplete distance matrices. Such matrices are frequent in recent multi-gene studies, arising when two species share no gene in the analyzed data. The few existing algorithms that infer trees with satisfying accuracy from incomplete distance matrices have time complexity in O(n^4) or more, where n is the number of taxa, which precludes large-scale studies. Agglomerative distance algorithms (e.g. NJ [1,2]) are much faster, with time complexity in O(n^3), which allows huge datasets and heavy bootstrap analyses to be dealt with. These algorithms proceed in three steps: (a) search for the taxon pair to be agglomerated, (b) estimate the lengths of the two so-created branches, (c) reduce the distance matrix and return to (a) until the tree is fully resolved. But the available agglomerative algorithms cannot deal with incomplete matrices. Results We propose an adaptation to incomplete matrices of three agglomerative algorithms, namely NJ, BIONJ [3] and MVR [4]. Our adaptation generalizes to incomplete matrices the taxon-pair selection criterion of NJ (also used by BIONJ and MVR), and combines this generalized criterion with that of ADDTREE [5]. Steps (b) and (c) are also modified, but the O(n^3) time complexity is kept.
The performance of these new algorithms is studied with large-scale simulations, which mimic multi-gene phylogenomic datasets. Our new algorithms – named NJ*, BIONJ* and MVR* – infer phylogenetic trees that are at least as accurate as those inferred by other available methods, but with much faster running times. MVR* shows the best overall performance. This algorithm accounts for the variance of the pairwise evolutionary distance estimates, and is well suited to multi-gene studies where some distances are accurately estimated from numerous genes, whereas others are poorly estimated (or not estimated at all) due to the low number (or absence) of sequenced genes shared by both species. Conclusion Our distance-based agglomerative algorithms NJ*, BIONJ* and MVR* are fast and accurate, and should be quite useful for large-scale phylogenomic studies. When combined with the SDM method [6] to estimate a distance matrix from multiple genes, they offer a relevant alternative to the usual supertree techniques [7]. Binaries and all simulated data are downloadable from [8]. Published version
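The three steps (a)-(c) above can be made concrete with classical neighbor-joining on a complete matrix. The sketch below is a minimal single-iteration NJ step; it shows the standard Q selection criterion that NJ* generalizes to matrices with missing entries, but it does not itself handle missing values (that generalization is the paper's contribution and is not reproduced here).

```python
# One agglomeration step of classical neighbor-joining (NJ) on a *complete*
# distance matrix.  NJ* generalizes the Q criterion below to incomplete
# matrices; this sketch assumes all entries are present.

def nj_step(d):
    """Return the selected pair, its two branch lengths, and the reduced
    distances from the new internal node to the remaining taxa."""
    n = len(d)
    r = [sum(row) for row in d]                     # row sums
    # (a) pair selection: minimize Q(i,j) = (n-2)*d_ij - r_i - r_j
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if q < best_q:
                best, best_q = (i, j), q
    i, j = best
    # (b) branch lengths of the two newly created edges
    li = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    lj = d[i][j] - li
    # (c) reduced distances from the new node u to every other taxon k
    du = [(d[i][k] + d[j][k] - d[i][j]) / 2
          for k in range(n) if k not in (i, j)]
    return (i, j), (li, lj), du

d = [[0, 5, 9, 9],
     [5, 0, 10, 10],
     [9, 10, 0, 8],
     [9, 10, 8, 0]]
pair, (li, lj), du = nj_step(d)   # taxa 0 and 1 are joined first
```

Iterating this step n-3 times (feeding the reduced matrix back into the selection) resolves the whole unrooted tree, which is what gives the O(n^3) total cost quoted in the abstract.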
Low-rank semidefinite programming for the MAX2SAT problem
This paper proposes a new algorithm for solving MAX2SAT problems based on
combining search methods with semidefinite programming approaches. Semidefinite
programming techniques are well-known as a theoretical tool for approximating
maximum satisfiability problems, but their application has traditionally been
very limited by their speed and randomized nature. Our approach overcomes this
difficulty by using a recent approach to low-rank semidefinite programming,
specialized to work in an incremental fashion suitable for use in an exact
search algorithm. The method can be used within both complete and incomplete
solvers, and we demonstrate it on a variety of problems from recent competitions.
Our experiments show that the approach is faster (sometimes by orders of
magnitude) than existing state-of-the-art complete and incomplete solvers,
representing a substantial advance in search methods specialized for MAX2SAT
problems. Comment: Accepted at AAAI'19. The code can be found at
https://github.com/locuslab/mixsa
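To fix the problem being solved, here is a brute-force reference for MAX2SAT on a tiny instance. This is only an illustration of the objective; the paper's method instead solves a low-rank semidefinite relaxation and rounds it, since exhaustive search is exponential in the number of variables.

```python
from itertools import product

# Brute-force MAX2SAT reference (illustration only; viable for tiny
# instances).  A clause is a pair of literals; literal +v means variable v
# is true, -v means it is false.  A clause is satisfied when at least one
# of its two literals holds.

def max2sat_bruteforce(n_vars, clauses):
    """Return the maximum number of simultaneously satisfiable clauses."""
    best = 0
    for bits in product([False, True], repeat=n_vars):
        sat = sum(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        best = max(best, sat)
    return best

# All four 2-clauses over x1, x2: every assignment falsifies exactly one
# clause, so at most 3 of the 4 can be satisfied.
clauses = [(1, 2), (-1, 2), (1, -2), (-1, -2)]
best = max2sat_bruteforce(2, clauses)
```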
Gaming security by obscurity
Shannon sought security against the attacker with unlimited computational
powers: *if an information source conveys some information, then Shannon's
attacker will surely extract that information*. Diffie and Hellman refined
Shannon's attacker model by taking into account the fact that the real
attackers are computationally limited. This idea became one of the greatest new
paradigms in computer science, and led to modern cryptography.
Shannon also sought security against the attacker with unlimited logical and
observational powers, expressed through the maxim that "the enemy knows the
system". This view is still endorsed in cryptography. The popular formulation,
going back to Kerckhoffs, is that "there is no security by obscurity", meaning
that the algorithms cannot be kept obscured from the attacker, and that
security should only rely upon the secret keys. In fact, modern cryptography
goes even further than Shannon or Kerckhoffs in tacitly assuming that *if there
is an algorithm that can break the system, then the attacker will surely find
that algorithm*. The attacker is not viewed as an omnipotent computer any more,
but he is still construed as an omnipotent programmer.
So the Diffie-Hellman step from unlimited to limited computational powers has
not been extended into a step from unlimited to limited logical or programming
powers. Is the assumption that all feasible algorithms will eventually be
discovered and implemented really different from the assumption that everything
that is computable will eventually be computed? The present paper explores some
ways to refine the current models of the attacker, and of the defender, by
taking into account their limited logical and programming powers. If the
adaptive attacker actively queries the system to seek out its vulnerabilities,
can the system gain some security by actively learning the attacker's methods, and
adapting to them? Comment: 15 pages, 9 figures, 2 tables; final version appeared in the
Proceedings of New Security Paradigms Workshop 2011 (ACM 2011); typos
corrected
A multi-class approach for ranking graph nodes: models and experiments with incomplete data
After the phenomenal success of the PageRank algorithm, many researchers have
extended the PageRank approach to rank graphs with structure richer than the
simple linkage structure. In some scenarios we have to deal with
multi-parameter data, where each node has additional features and there are
relationships between such features.
This paper stems from the need for a systematic approach when dealing with
multi-parameter data. We propose models and ranking algorithms which can be
used with little adjustment for a large variety of networks (bibliographic
data, patent data, Twitter and social data, healthcare data). In this paper we
focus on several aspects which have not been addressed in the literature: (1)
we propose different models for ranking multi-parameter data and a class of
numerical algorithms for efficiently computing the ranking score of such
models, (2) by analyzing the stability and convergence properties of the
numerical schemes we tune a fast and stable technique for the ranking problem,
(3) we consider the issue of the robustness of our models when data are
incomplete. The comparison of the ranking on the incomplete data with the
ranking on the full structure shows that our models compute consistent
rankings, whose correlation is up to 60% when just 10% of the attribute
links are maintained, suggesting the suitability of our model also when
the data are incomplete.
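The starting point for such models is the classical single-relation PageRank computed by power iteration, sketched below. This is only the baseline that the paper extends to multi-parameter data; the damping factor and tolerance are conventional choices, not values taken from the paper.

```python
# Plain single-relation PageRank by power iteration -- the baseline ranking
# model that multi-parameter extensions build on.  damping=0.85 and the
# tolerance are conventional defaults, not values from the paper.

def pagerank(n, edges, damping=0.85, tol=1e-10):
    """Rank n nodes from a list of directed (src, dst) edges."""
    out = [[] for _ in range(n)]
    for s, t in edges:
        out[s].append(t)
    p = [1.0 / n] * n
    while True:
        nxt = [(1.0 - damping) / n] * n      # teleportation term
        for s in range(n):
            if out[s]:                        # spread rank along out-links
                share = damping * p[s] / len(out[s])
                for t in out[s]:
                    nxt[t] += share
            else:                             # dangling node: spread uniformly
                for t in range(n):
                    nxt[t] += damping * p[s] / n
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt

# Node 2 receives links from both 0 and 1, so it should rank highest.
scores = pagerank(3, [(0, 2), (1, 2), (2, 0)])
```

Robustness to incomplete data can then be probed exactly as the abstract describes: delete a fraction of the edges, recompute the scores, and measure the rank correlation against the full-structure ranking.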
Reconstructing (super)trees from data sets with missing distances: Not all is lost
The wealth of phylogenetic information accumulated over many decades of biological research, coupled with recent technological advances in molecular sequence generation, presents significant opportunities for researchers to investigate relationships across and within the kingdoms of life. However, to make the best use of this data wealth, several problems must first be overcome. One key problem is finding effective strategies to deal with missing data. Here, we introduce Lasso, a novel heuristic approach for reconstructing rooted phylogenetic trees from distance matrices with missing values, for datasets where a molecular clock may be assumed. Contrary to other phylogenetic methods on partial datasets, Lasso possesses desirable properties such as its reconstructed trees being both unique and edge-weighted. These properties are achieved by Lasso restricting its leaf set to a large subset of all possible taxa, which in many practical situations is the entire taxa set. Furthermore, the Lasso approach is distance-based, rendering it very fast to run and suitable for datasets of all sizes, including large datasets such as those generated by modern Next Generation Sequencing technologies. To better understand the performance of Lasso, we assessed it by means of artificial and real biological datasets, showing its effectiveness in the presence of missing data. Furthermore, by formulating the supermatrix problem as a particular case of the missing data problem, we assessed Lasso's ability to reconstruct supertrees. We demonstrate that, although not specifically designed for such a purpose, Lasso performs better than or comparably with five leading supertree algorithms on a challenging biological dataset. Finally, we make freely available a software implementation of Lasso so that researchers may, for the first time, perform both rooted tree and supertree reconstruction with branch lengths on their own partial datasets.
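The molecular-clock setting Lasso assumes can be illustrated with classical UPGMA, the standard clock-based agglomeration: under a clock, each merged pair sits at height d(i,j)/2. The sketch below performs a single UPGMA merge on a complete matrix; it is not Lasso itself, whose contribution is precisely to cope with matrices where some of these entries are missing, which plain UPGMA cannot.

```python
# One merge step of classical UPGMA on a *complete* distance matrix.
# Under a molecular clock both children of a node are equidistant from it,
# so the merged pair is placed at height d(i,j)/2.  Lasso addresses the
# case where some d entries are missing, which this sketch does not handle.

def upgma_step(d, sizes):
    """Merge the closest pair; return the pair, the node height, and the
    average-linkage distances from the new cluster to the remaining ones."""
    n = len(d)
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda ab: d[ab[0]][ab[1]])
    height = d[i][j] / 2                  # clock: children equidistant
    merged = [
        (sizes[i] * d[i][k] + sizes[j] * d[j][k]) / (sizes[i] + sizes[j])
        for k in range(n) if k not in (i, j)
    ]
    return (i, j), height, merged

# Taxa A, B, C with d(A,B)=2, d(A,C)=d(B,C)=4: A and B merge at height 1.
d = [[0, 2, 4],
     [2, 0, 4],
     [4, 4, 0]]
pair, height, merged = upgma_step(d, [1, 1, 1])
```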
Novel Algorithms and Methodology to Help Unravel Secrets that Next Generation Sequencing Data Can Tell
The genome of an organism is its complete set of DNA nucleotides, spanning
all of its genes and also of its non-coding regions. It contains most of
the information necessary to build and maintain an organism. It is therefore
no surprise that sequencing the genome provides an invaluable tool for
the scientific study of an organism. Via the inference of an evolutionary
(phylogenetic) tree, DNA sequences can be used to reconstruct the evolutionary
history of a set of species. DNA sequences, or genotype data, have
also proven useful for predicting an organism's phenotype (i.e., observed
traits) from its genotype. This is the objective of association studies.
While methods for finding the DNA sequence of an organism have existed
for decades, the recent advent of Next Generation Sequencing (NGS) has
meant that the availability of such data has increased to such an extent
that the computational challenges that now form an integral part of biological
studies can no longer be ignored. By focusing on phylogenetics
and Genome-Wide Association Studies (GWAS), this thesis aims to help
address some of these challenges. As a consequence, this thesis is in two
parts, with the first one centring on phylogenetics and the second one on
GWAS.
In the first part, we present theoretical insights for reconstructing phylogenetic
trees from incomplete distances. This problem is important in the
context of NGS data as incomplete pairwise distances between organisms
occur frequently with such input and ignoring taxa for which information
is missing can introduce undesirable bias. In the second part we focus on
the problem of inferring population stratification between individuals in a
dataset due to reproductive isolation. While powerful methods for doing
this have been proposed in the literature, they tend to struggle when faced
with the sheer volume of data that comes with NGS. To help address this
problem we introduce the novel PSIKO software and show that it scales
very well when dealing with large NGS datasets.