GAM-NGS: genomic assemblies merger for next generation sequencing

Author: Lars Arvestad
Policriti Alberto
Scalabrin Simone
Vezzi Francesco
Vicedomini Riccardo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

Background: In recent years more than 20 assemblers have been proposed to tackle the hard task of assembling NGS data. A common heuristic when assembling a genome is to use several assemblers and then select the best assembly according to some criteria. However, recent results clearly show that some assemblers lead to better statistics than others on specific regions but are outperformed on other regions or on different evaluation measures. To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through reads' alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph, which allows an optimal resolution of local problematic regions. Results: GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show that GAM-NGS is able to output an improved, reliable set of sequences. GAM-NGS is also a very efficient tool, able to merge assemblies using substantially less computational resources than comparable tools. In order to achieve such goals, GAM-NGS avoids global alignment between contigs, making its strategy unique among assembly reconciliation tools. Conclusions: The difficulty of obtaining correct and reliable assemblies using a single assembler is forcing the introduction of new algorithms able to enhance de novo assemblies. GAM-NGS is a tool able to merge two or more assemblies in order to improve contiguity and correctness. It can be used on all NGS-based assembly projects and it shows its full potential with multi-library Illumina-based projects.
With more than 20 available assemblers it is hard to select the best tool. In this context we propose a tool that improves assemblies (and, as a by-product, perhaps even assemblers) by merging them and selecting the generating that is most likely to be correct

Archivio istituzionale della ricerca - Università degli Studi di Udine

PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions

Author: Arvestad
Blanchette
Brent
Butler
Clark
Goldman
Guttman
Holmes
I. Jungreis
Kellis
Lin
M. F. Lin
M. Kellis
Ota
Ozsolak
Stark
Whelan
Yang
Publication date: 17/08/2010

As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein-coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multi-species nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. We show that PhyloCSF's classification performance in 12-species _Drosophila_ genome alignments exceeds that of all other methods we compared in a previous study, and we provide a software implementation for use by the community. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues, and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE.

primetv: a viewer for reconciled trees

Author: Arvestad Lars
Berglund Sonnhammer Ann-Charlotte
Lagergren Jens
Schreil Eva
Sennblad Bengt
Publication venue: BioMed Central
Publication date: 01/01/2007

GenPhyloData: realistic simulation of gene family evolution

Author: Bengt Sennblad
Jens Lagergren
Joel Sjöstrand
Lars Arvestad
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

arXiv.org e-Print Archive

Back-translation for discovering distant protein homologies

Author: A. Pedersen
B. Oostra
C. Kosiol
J. Leluk
J. Raes
K. Okamura
L. Arvestad
L. Delaye
M. Clamp
M. Pellegrini
P. Harrison
P. Lio
R. Blake
S. Altschul
Y. Hahn
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, homology detection becomes difficult even at the DNA level. To cope with this situation, we propose a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences that have the best-scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. This allows us to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples. Comment: The 9th International Workshop on Algorithms in Bioinformatics (WABI), Philadelphia, United States (2009)

CiteSeerX

HAL - Lille 3

INRIA a CCSD electronic archive server

Maximum likelihood models and algorithms for gene tree evolution with duplications and losses

Author: AP Martin
B Ma
Gordon J Burleigh
J Ruan
JA Cotton
JB Slowinski
JH Degnan
JP Demuth
JP Doyon
JS Taylor
L Arvestad
L Liu
L Zhang
M Goodman
M Lynch
MA Bender
MJ Sanderson
MR Garey
MR McGowen
O Akerborg
Oliver Eulenstein
P Górecki
Pawel Górecki
R Redon
RD Page
RDM Page
S Ohno
SB Hedges
W Maddison
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The abundance of new genomic data provides the opportunity to map the location of gene duplication and loss events on a species phylogeny. The first methods for mapping gene duplications and losses were based on a parsimony criterion, finding the mapping that minimizes the number of duplication and loss events. Probabilistic modeling of gene duplication and loss is relatively new and has largely focused on birth-death processes. Results: We introduce a new maximum likelihood model that estimates the speciation and gene duplication and loss events in a gene tree within a species tree with branch lengths. We also provide an algorithm, efficient in practice, that computes optimal evolutionary scenarios for this model. We implemented the algorithm in the program DrML and verified its performance with empirical and simulated data. Conclusions: In test data sets, DrML finds optimal gene duplication and loss scenarios within minutes, even when the gene trees contain sequences from several hundred species. In many cases, these optimal scenarios differ from the lca-mapping that results from a parsimony gene tree reconciliation. Thus, DrML provides a new, practical statistical framework on which to study gene duplication.
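For readers unfamiliar with the lca-mapping that this abstract uses as its parsimony baseline, the following is a minimal Python sketch of it. It is not DrML and not the maximum likelihood model. The function name `lca_reconcile`, the nested-tuple gene tree, and the parent-dictionary species tree are all hypothetical choices made for illustration. Each gene tree node is mapped to the lowest common ancestor of its children's species, and a node is counted as a duplication when it maps to the same species node as one of its children.

```python
def lca_reconcile(gene_tree, species_parent, leaf_species):
    """Parsimony-style lca-mapping (illustrative sketch, not DrML).

    gene_tree: nested 2-tuples whose leaves are gene names (strings).
    species_parent: dict mapping each species node to its parent (root -> None).
    leaf_species: dict mapping each gene name to its species.
    Returns (species node the gene tree root maps to, number of duplications).
    """
    def ancestors(s):
        # Path from s up to the species tree root, s included.
        path = []
        while s is not None:
            path.append(s)
            s = species_parent[s]
        return path

    def lca(a, b):
        seen = set(ancestors(a))
        for s in ancestors(b):
            if s in seen:
                return s
        raise ValueError("species are not in the same tree")

    dups = 0

    def visit(node):
        nonlocal dups
        if isinstance(node, str):
            return leaf_species[node]
        left, right = (visit(child) for child in node)
        mapped = lca(left, right)
        # A node mapping to the same species node as a child is a duplication.
        if mapped == left or mapped == right:
            dups += 1
        return mapped

    return visit(gene_tree), dups


# Species tree ((A,B)AB,C)ABC and a gene tree with one inferred duplication.
species_parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC", "ABC": None}
leaf_species = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
result = lca_reconcile((("a1", "b1"), ("a2", "c1")), species_parent, leaf_species)
```

The ancestor walk is quadratic in the worst case; it is kept simple here because the point is the mapping rule, not performance.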

Directory of Open Access Journals

The Francis Crick Institute

Kajian Bentuk Dan Sensitivitas Rumus Indeks Pi, Storet, Ccme Untuk Penentuan Status Mutu Perairan Sungai Tropis Di Indonesia (Assessment of the Forms and Sensitivity of the Index Formula Pi, Storet, Ccme for the Determination of Water Quality Status)

Author: Joel Sjöstrand (3741718)
Jorge Miró (3741727)
Lars Arvestad (28085)
Mikael Bark (3741721)
Raja Abbas (3741724)
Raja Ali (3575492)
Sayyed Muhammad (3575495)
Syed Zubair (3741730)
Publication venue: Gadjah Mada University
Publication date: 01/07/2014

The Pollution Index method (USA), the Storet method (USA), and the CCME method (Canada) are water quality index (WQI) methods for determining water quality status. The first two are widely used by environmental practitioners in Indonesia because they are referenced in Decree of the Minister of Environment No. 115/2013. All three can compute the WQI using the local water quality standards of the river under study. Because the countries that developed these methods have different environmental conditions, and each method has its own specific factors for computing the WQI, the suitability of each method for tropical Indonesian rivers needs to be assessed. For each method, the form of the equation and its sensitivity are assessed, both with many water quality parameters and with a specific number of water quality parameters, following WQI methods developed in other tropical countries. The study uses “Prokasih” monitoring data from the Gadjah Wong River, Yogyakarta, for 1996/1997 - 2011/2012. This research was conducted in order to develop a WQI method for tropical Indonesian rivers in general and the Gadjah Wong River in particular, together with a water quality management program for controlling river water pollution, with the target of conserving river water for multifunctional or overall/general use (meeting raw water health criteria, aesthetic criteria, and ecological criteria, i.e., safe for aquatic life). The results show that, compared with the other two methods, the CCME method is judged the most objective (statistically) in computing the WQI of the Gadjah Wong River. CCME responds most sensitively to the dynamics of the water quality index at each monitoring location and is more universally applicable outside the country that developed it. However, to be applied to the Gadjah Wong River, the CCME method needs to be adapted in several respects, namely the number and type of water quality parameters considered significant, and the number and classes of water quality.
The adaptation takes into account the water pollution control program and an ecological and sustainable operational/management strategy for the river flow. The threshold scores and the meaning of each water quality class in the WQI must be verified against other environmental data, such as biotilik or bioassay results, so that the water quality index status does not contradict the biological condition of the river. The inclusion of bacteriological water quality parameters (Escherichia coli and total coliform) and electrical conductivity (EC) as significant water quality parameters in the WQI method still needs further study for the development of a WQI method specific to river waters in tropical Indonesia.

Neliti

Fast computation of distance estimators

Author: A Rambaut
D Swofford
F Barker
H Kishino
I Elias
Isaac Elias
J Felsenstein
Jens Lagergren
K Tamura
K Tuplin
L Arvestad
M Kimura
N Saitou
T Jukes
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Some distance methods are among the most commonly used methods for reconstructing phylogenetic trees from sequence data. The input to a distance method is a distance matrix, containing estimated pairwise distances between all pairs of taxa. Distance methods themselves are often fast, e.g., the famous and popular Neighbor Joining (NJ) algorithm reconstructs a phylogeny of n taxa in time O(n^3). Unfortunately, the fastest practical algorithms known for computing the distance matrix from n sequences of length l take time proportional to l·n^2. Since the sequence length typically is much larger than the number of taxa, the distance estimation is the bottleneck in phylogeny reconstruction. This bottleneck is especially apparent in reconstruction of large phylogenies or in applications where many trees have to be reconstructed, e.g., bootstrapping and genome-wide applications. RESULTS: We give an advanced algorithm for computing the number of mutational events between DNA sequences which is significantly faster than both Phylip and Paup. Moreover, we give a new method for estimating pairwise distances between sequences which contain ambiguity symbols. This new method is shown to be more accurate as well as faster than earlier methods. CONCLUSION: Our novel algorithm for computing distance estimators provides a valuable tool in phylogeny reconstruction. Since the running time of our distance estimation algorithm is comparable to that of most distance methods, the previous bottleneck is removed. All distance methods, such as NJ, require a distance matrix as input and, hence, our novel algorithm significantly improves the overall running time of all distance methods. In particular, we show for real-world biological applications how the running time of phylogeny reconstruction using NJ is improved from a matter of hours to a matter of seconds.
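To make the bottleneck described in this abstract concrete, here is a minimal Python sketch of the straightforward distance matrix computation, whose cost is proportional to l·n^2 for n sequences of length l. It is the naive baseline, not the paper's faster algorithm. The Jukes-Cantor correction is an assumption chosen for illustration, and the function names are hypothetical. Sites with gaps or ambiguity symbols are simply skipped, which is also not the paper's method for handling them.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of comparable aligned sites at which two sequences differ."""
    valid = "ACGT"
    compared = diffs = 0
    for x, y in zip(a, b):
        # Skip gaps and ambiguity symbols (a simplification for this sketch).
        if x in valid and y in valid:
            compared += 1
            diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def jc69(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs):
    """Naive all-pairs estimate: n(n-1)/2 comparisons, each scanning l sites."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = jc69(p_distance(seqs[i], seqs[j]))
    return d


d = distance_matrix(["ACGTACGT", "ACGTACGA", "ACGTACGT"])
```

The resulting symmetric matrix is the kind of input that a distance method such as NJ expects.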

Directory of Open Access Journals