41 research outputs found

BIGMAC : breaking inaccurate genomes and merging assembled contigs for long read metagenomic assembly.

Author: Clum Alicia
Hall Richard
Lam Ka-Kit
Rao Satish
Publication venue: eScholarship, University of California
Publication date: 01/10/2016

Background: The problem of de-novo assembly for metagenomes using only long reads is gaining attention. We study whether post-processing metagenomic assemblies with the original input long reads can result in quality improvement. Previous approaches have focused on pre-processing reads and optimizing assemblers. BIGMAC takes an alternative perspective to focus on the post-processing step. Results: Using both the assembled contigs and original long reads as input, BIGMAC first breaks the contigs at potentially mis-assembled locations and subsequently scaffolds contigs. Our experiments on metagenomes assembled from long reads show that BIGMAC can improve assembly quality by reducing the number of mis-assemblies while maintaining or increasing N50 and N75. Moreover, BIGMAC shows the largest ratio of N75 to number of mis-assemblies on all tested datasets when compared to other post-processing tools. Conclusions: BIGMAC demonstrates the effectiveness of the post-processing approach in improving the quality of metagenomic assemblies.
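
The N50 and N75 statistics mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual definition (the contig length at which the running total of contigs, sorted longest first, reaches the given fraction of all assembled bases). The function name `nxx` and the contig lengths are hypothetical, and nothing here is taken from the BIGMAC code.

```python
def nxx(lengths, fraction):
    """Return the Nxx statistic for a list of contig lengths.

    Contigs are visited from longest to shortest; the result is the length
    of the contig at which the running total first reaches `fraction` of
    the total assembled bases (fraction=0.5 gives N50, 0.75 gives N75).
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Hypothetical contig lengths, for illustration only.
contigs = [100, 80, 60, 40, 20]
n50 = nxx(contigs, 0.50)
n75 = nxx(contigs, 0.75)
print(n50, n75)
```

With these made-up lengths the total is 300 bases, so N50 is reached at the second contig and N75 at the third. The ratio quoted in the abstract would then be N75 divided by a mis-assembly count obtained from a separate evaluation tool.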

PubMed Central

eScholarship - University of California

Alignment-free phylogenetic reconstruction: Sample complexity via a branching process analysis

Author: Daskalakis Constantinos
Roch Sebastien
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/02/2012

We present an efficient phylogenetic reconstruction algorithm allowing insertions and deletions which provably achieves a sequence-length requirement (or sample complexity) growing polynomially in the number of taxa. Our algorithm is distance-based, that is, it relies on pairwise sequence comparisons. More importantly, our approach largely bypasses the difficult problem of multiple sequence alignment. Comment: Published at http://dx.doi.org/10.1214/12-AAP852 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)

arXiv.org e-Print Archive

DSpace@MIT

Crossref

Global Alignment of Molecular Sequences via Ancestral State Reconstruction

Author: Andoni Alexandr
Daskalakis Constantinos
Hassidim Avinatan
Roch Sebastien
Publication date: 01/01/2009

Molecular phylogenetic techniques do not generally account for such common evolutionary events as site insertions and deletions (known as indels). Instead, tree-building algorithms and ancestral state inference procedures typically rely on substitution-only models of sequence evolution. In practice, these methods are extended beyond this simplified setting with the use of heuristics that produce global alignments of the input sequences, an important problem which has no rigorous model-based solution. In this paper we consider a new version of the multiple sequence alignment problem in the context of stochastic indel models. More precisely, we introduce the following trace reconstruction problem on a tree (TRPT): a binary sequence is broadcast through a tree channel where we allow substitutions, deletions, and insertions; we seek to reconstruct the original sequence from the sequences received at the leaves of the tree. We give a recursive procedure for this problem with strong reconstruction guarantees at low mutation rates, providing also an alignment of the sequences at the leaves of the tree. The TRPT problem without indels has been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a bootstrapping step towards obtaining optimal phylogenetic reconstruction methods. The present work sets up a framework for extending these works to evolutionary models with indels.

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

SDT: a virus classification tool based on pairwise sequence alignment and identity calculation

Author: Martin Darren Patrick
Muhire Brejnev Muhizi
Varsani Arvind
Publication venue: Department of Clinical Laboratory Sciences
Publication date: 12/08/2016

The perpetually increasing rate at which viral full-genome sequences are being determined is creating a pressing demand for computational tools that will aid the objective classification of these genome sequences. Taxonomic classification approaches that are based on pairwise genetic identity measures are potentially highly automatable and are progressively gaining favour with the International Committee on Taxonomy of Viruses (ICTV). There are, however, various issues with the calculation of such measures that could potentially undermine the accuracy and consistency with which they can be applied to virus classification. Firstly, pairwise sequence identities computed based on multiple sequence alignments rather than on multiple independent pairwise alignments can lead to the deflation of identity scores with increasing dataset sizes. Secondly, when gap characters need to be introduced during sequence alignments to account for insertions and deletions, methodological variations in the way that these characters are introduced and handled during pairwise genetic identity calculations can cause high degrees of inconsistency in the way that different methods classify the same sets of sequences. Here we present Sequence Demarcation Tool (SDT), a free, user-friendly computer program that aims to provide a robust and highly reproducible means of objectively using pairwise genetic identity calculations to classify any set of nucleotide or amino acid sequences. SDT can produce publication-quality pairwise identity plots and colour-coded distance matrices to further aid the classification of sequences according to ICTV-approved taxonomic demarcation criteria. Besides a graphical interface version of the program for Windows computers, command-line versions of the program are available for a variety of different operating systems (including a parallel version for cluster computing platforms).
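
The sensitivity to gap handling described in this abstract can be seen in a small sketch. It computes pairwise identity for two already-aligned sequences under two conventions: skipping columns that contain a gap, or counting them as compared but non-matching. The function `pairwise_identity`, its `count_gaps` flag, and the example sequences are all hypothetical. This is not SDT's implementation, and the abstract does not state which convention SDT uses.

```python
def pairwise_identity(aligned_a, aligned_b, count_gaps=False):
    """Fraction of identical positions between two aligned sequences.

    Columns where both sequences have a gap are always skipped.
    Columns where exactly one sequence has a gap are skipped when
    count_gaps is False, and counted as non-matching when it is True.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            if count_gaps:
                compared += 1
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return matches / compared


# Hypothetical pairwise alignment, for illustration only.
a = "ACGT-ACGT"
b = "ACGTTAC-A"
print(pairwise_identity(a, b))                   # gap columns ignored
print(pairwise_identity(a, b, count_gaps=True))  # gap columns count as mismatches
```

For this toy pair the two conventions give 6/7 and 6/9 respectively, which is the kind of discrepancy that could move a sequence pair across a fixed demarcation threshold.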

CiteSeerX

Cape Town University OpenUCT

ThIEF: Finding Genome-wide Trajectories of Epigenetics Marks

Author: Bunnik Evelien M.
Hasan Md. Abid
Le Roch Karine
Lonardi Stefano
Pan Weihua
Polishko Anton
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)
Publication date: 01/01/2017

We address the problem of comparing multiple genome-wide maps representing nucleosome positions or specific histone marks. These maps can originate from the comparative analysis of ChIP-Seq/MNase-Seq/FAIRE-Seq data for different cell types/tissues or multiple time points. The input to the problem is a set of maps, each of which is a list of genomic locations for nucleosomes or histone marks. The output is an alignment of nucleosomes/histone marks across time points (that we call trajectories), allowing small movements and gaps in some of the maps. We present a tool called ThIEF (TrackIng of Epigenetic Features) that can efficiently compute these trajectories. ThIEF comes in two "flavors": ThIEF:Iterative finds the trajectories progressively using bipartite matching, while ThIEF:LP solves a k-partite matching problem on a hypergraph using linear programming. ThIEF:LP is guaranteed to find the optimal solution, but it is slower than ThIEF:Iterative. We demonstrate the utility of ThIEF by providing an example application to the analysis of temporal nucleosome maps for the human malaria parasite. Remarkably, we show that the output of ThIEF can be used to produce a supervised classifier that can accurately predict the position of stable nucleosomes (i.e., nucleosomes present in all time points) and unstable nucleosomes (i.e., present in at most half of the time points) from the primary DNA sequence. To the best of our knowledge, this is the first result on the prediction of the dynamics of nucleosomes solely based on their DNA binding preference. Software is available at https://github.com/ucrbioinfo/ThIEF
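
The idea of aligning features between two maps while allowing small movements and gaps can be sketched as a minimum-cost, order-preserving matching between two sorted position lists. This is an illustrative simplification and not ThIEF's algorithm. The function `match_positions` and the parameters `max_shift` and `gap_cost` are hypothetical, and the positions are made up.

```python
def match_positions(map_a, map_b, max_shift=50, gap_cost=50):
    """Pair up positions from two maps by dynamic programming.

    A pair costs the absolute shift between the two positions and is only
    allowed when the shift is at most max_shift. Leaving a position
    unmatched costs gap_cost. Pairs never cross each other.
    """
    a, b = sorted(map_a), sorted(map_b)
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost for the first i positions of a and first j of b
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1]) + gap_cost
            shift = abs(a[i - 1] - b[j - 1])
            if shift <= max_shift:
                best = min(best, cost[i - 1][j - 1] + shift)
            cost[i][j] = best
    # Trace back through the table to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        shift = abs(a[i - 1] - b[j - 1])
        if shift <= max_shift and cost[i][j] == cost[i - 1][j - 1] + shift:
            pairs.append((a[i - 1], b[j - 1]))
            i -= 1
            j -= 1
        elif cost[i][j] == cost[i - 1][j] + gap_cost:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical nucleosome centre positions at two time points.
time_point_1 = [100, 300, 520, 900]
time_point_2 = [110, 310, 700, 905]
print(match_positions(time_point_1, time_point_2))
```

In this toy input three features shift by a few bases and are paired, while the features at 520 and 700 are too far apart and are left as gaps. Chaining such pairwise matchings across successive time points is one plausible way to build trajectories progressively.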

Dagstuhl Research Online Publication Server

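A Multi-Objective Approach to the Optimization of Multiple Sequence Alignment (MSA) [original title: Un enfoque Multi-Objetivo a la optimización del Alineamiento Múltiple de Secuencias (MSA)]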

Author: Aguirre-Pérez Ricardo
Cárdenas-Zea Miriam
Zambrano-Vega Cristian
Publication venue: 'Escuela Politecnica Nacional'
Publication date: 20/05/2016

Multiple Sequence Alignment (MSA), one of the main topics in bioinformatics, consists of finding an optimal alignment of three or more biological sequences with the maximum number of conserved zones or totally aligned columns. Different scores to assess the quality of the alignments have been proposed, so the problem can be formulated and solved as a Multi-Objective Optimization Problem (MOP). For this reason, we have carried out a performance study solving the MSA problem under a multi-objective approach, considering two popular metrics as objectives to be optimized, the weighted Sum-Of-Pairs with affine gap penalties (wSOP) and the Totally Aligned Columns (TC), with three state-of-the-art multi-objective optimization algorithms: NSGA-II, SPEA2 and MOCell. Our experiments reveal that the classic metaheuristic NSGA-II provides the best overall performance in solving some problems from the BAliBASE (v3.0) benchmark, under a multi-objective and biological approach.
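
The two objectives named in this abstract can be approximated with a short sketch. It is deliberately simplified: the sum-of-pairs score below is unweighted and uses a flat per-column gap penalty instead of affine penalties, and a totally aligned column is taken here to mean a column in which every sequence carries the same residue and no gap. The function names, the scoring values, and the toy alignment are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Unweighted sum-of-pairs score with a flat (non-affine) gap penalty."""
    if len(set(len(row) for row in alignment)) > 1:
        raise ValueError("all aligned sequences must have equal length")
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # gap against gap contributes nothing
            if x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


def totally_aligned_columns(alignment):
    """Count columns with no gap and a single residue shared by all rows."""
    return sum(
        1
        for column in zip(*alignment)
        if "-" not in column and len(set(column)) == 1
    )


# Hypothetical three-sequence alignment, for illustration only.
msa = ["AC-GT", "ACAGT", "AC-CT"]
print(sum_of_pairs(msa), totally_aligned_columns(msa))
```

A multi-objective optimizer such as NSGA-II would treat these two numbers as a vector to be maximized jointly, keeping alignments that are not dominated on both scores.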

Latin American Journal of Computing

An exact mathematical programming approach to multiple RNA sequence-structure alignment

Author: Bauer Markus
Klau Gunnar W.
Reinert Knut
Publication date: 01/01/2007

One of the main tasks in computational biology is the computation of alignments of genomic sequences to reveal their commonalities. In the case of DNA or protein sequences, sequence information alone is usually sufficient to compute reliable alignments. RNA molecules, however, build spatial conformations (the secondary structure) that are more conserved than the actual sequence. Hence, computing reliable alignments of RNA molecules has to take into account the secondary structure. We present a novel framework for the computation of exact multiple sequence-structure alignments: we give a graph-theoretic representation of the sequence-structure alignment problem and phrase it as an integer linear program. We identify a class of constraints that make the problem easier to solve and relax the original integer linear program in a Lagrangian manner. Experiments on a recently published benchmark show that our algorithm performs comparably to more costly dynamic programming algorithms, and outperforms all other approaches in terms of solution quality with an increasing number of input sequences.

University of New Brunswick: Centre for Digital Scholarship Journals

Institutional Repository of the Freie Universität Berlin

CiteSeerX

CWI's Institutional Repository

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

Looking for the Last Universal Common Ancestor (LUCA)

Author: Annila Arto
Koskela Minna Maria
Publication date: 01/01/2012

Genomic sequences across diverse species seem to align towards a common ancestry, eventually implying that eons ago some universal antecedent organism would have lived on the face of Earth. However, when evolution is understood not only as a biological process but as a general thermodynamic process, it becomes apparent that the quest for the last universal common ancestor is unattainable. Ambiguities in alignments are unavoidable because the driving forces and paths of evolution cannot be separated from each other. Thus tracking down life’s origin is by its nature a non-computable task. The thermodynamic tenet clarifies that evolution is a path-dependent process of least-time consumption of free energy. The natural process is without a demarcation line between animate and inanimate. Peer reviewed

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

Helsingin yliopiston digitaalinen arkisto