
13 research outputs found

Towards realistic benchmarks for multiple alignments of non-coding sequences

Author: A Loytynoja
A Prakash
A Siepel
AB Diallo
AG Clark
AP Dempster
AR Subramanian
AW Dress
B Paten
BG Hall
C Notredame
CM Bergman
D Karolchik
D Tian
DA Pollard
G Bejerano
G Landan
G Lunter
I Van Walle
J Felsenstein
J Kim
J Stoye
Jaebum Kim
JD Thompson
K Katoh
K Mizuguchi
L Chindelevitch
M Blanchette
M Brudno
MA Larkin
MS Rosenberg
N Bray
RA Cartwright
RC Edgar
RK Bradley
S Sinha
S Snir
Saurabh Sinha
TH Ogden
V Simossis
W Fletcher
W Huang
W Pirovano
X He
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks. Results: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments. Conclusion: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.
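
The contrast the abstract draws between simulating from parameter averages and simulating from genome-wide distributions can be illustrated with a toy sketch. This is not the authors' simulator: the gamma distribution, its parameters, the locus length, and the per-site substitution model are all assumptions chosen only for illustration, and the function name `simulate_divergence` is hypothetical.

```python
import random
import statistics

def simulate_divergence(rates, length=500, seed=0):
    """Toy model: each of `length` sites at a locus changes with
    probability `rate`; returns the fraction of changed sites per locus."""
    rng = random.Random(seed)
    return [sum(rng.random() < r for _ in range(length)) / length for r in rates]

rng = random.Random(42)
n_loci = 200
mean_rate = 0.2  # assumed average substitution probability

# Conventional approach: every locus uses the same averaged rate.
fixed_rates = [mean_rate] * n_loci

# Distribution-based approach: each locus draws its own rate.
# Assumed gamma(shape=2, scale=0.1), which has mean 0.2; capped at 1.
sampled_rates = [min(rng.gammavariate(2.0, 0.1), 1.0) for _ in range(n_loci)]

# Spread of per-locus divergence under each approach.
spread_fixed = statistics.pstdev(simulate_divergence(fixed_rates))
spread_sampled = statistics.pstdev(simulate_divergence(sampled_rates))

print(f"spread with averaged rate:    {spread_fixed:.3f}")
print(f"spread with per-locus rates:  {spread_sampled:.3f}")
```

Under these assumptions the per-locus draw yields a much wider range of divergence levels across loci than the single averaged rate, which is the kind of variation in conservation the abstract says the conventional approach fails to reproduce.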

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

Author: A Löytynoja
B Sipos
BG Hall
BP Blackburne
C Chothia
C Dessimoz
C Kemena
C Notredame
CB Do
CL Strope
DA Dalquen
DA Morrison
DH Mathews
ER Mardis
G Blackshields
G Jordan
G Landan
GP Raghava
I Walle Van
J Kim
J Stoye
JD Thompson
JH Havgaard
JP Huelsenbeck
K Mizuguchi
LA Stebbings
M Anisimova
M Pop
MR Aniba
P Gardner
RA Cartwright
RB Russell
RC Edgar
SA Berger
SF Altschul
T Golubchik
T Koestler
T Lassmann
W Fletcher
Publication venue: Springer Science and Business Media LLC
Publication date: 09/11/2012

Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies--based on simulation, consistency, protein structure, and phylogeny--and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should base their choice of benchmark on the context of application--with a keen awareness of the assumptions underlying each benchmarking strategy.

arXiv.org e-Print Archive

Crossref

UCL Discovery

Evaluation of methods for detecting conversion events in gene clusters

Author: A Siepel
C Hsu
C Spencer
C Strope
Cathy Riemer
Chih-Hao Hsu
D Husmeier
D Martin
D Posada
E Holmes
G Hellenthal
Giltae Song
J Archer
J Archibald
J Chen
J Hein
J Huelsenbeck
J Kim
J Smith
J Stoye
K Lole
L Excoffier
L Liang
M Arenas
M Boni
M Gibbs
M Hasegawa
M Rosenberg
M Suchard
N Grassly
O Westesson
P Marjoram
R Cartwright
R Harris
R Hudson
S Pond
S Sawyer
S Schaffner
T Mailund
V Minin
W Miller
Webb Miller
Y Zhang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Gene clusters are genetically important, but their analysis poses significant computational challenges. One of the major reasons for these difficulties is gene conversion among the duplicated regions of the cluster, which can obscure their true relationships. Many computational methods for detecting gene conversion events have been released, but their performance has not been assessed for wide deployment in evolutionary history studies due to a lack of accurate evaluation methods. Results: We designed a new method that simulates gene cluster evolution, including large-scale events of duplication, deletion, and conversion as well as small mutations. We used this simulation data to evaluate several different programs for detecting gene conversion events. Conclusions: Our evaluation identifies strengths and weaknesses of several methods for detecting gene conversion, which can contribute to more accurate analysis of gene cluster evolution.

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Lessons Learned: Recommendations for Establishing Critical Periodic Scientific Benchmarking

Author: Capella-Gutierrez Salvador
de la Iglesia Diana
Dessimoz Christophe
Fernandez José M.
Gelpí Josep Lluís
Haas Juergen
Lourenco Analia
Notredame Cedric
Repchevsky Dmitry
Schwede Torsten
Valencia Alfonso
Publication date: 01/01/2017

The dependence of life scientists on software has steadily grown in recent years. For many tasks, researchers have to decide which of the available bioinformatics software are more suitable for their specific needs. Additionally, researchers should be able to objectively select the software that provides the highest accuracy, the best efficiency and the highest level of reproducibility when integrated in their research projects. Critical benchmarking of bioinformatics methods, tools and web services is therefore an essential community service, as well as a critical component of reproducibility efforts. Unbiased and objective evaluations are challenging to set up and can only be effective when built and implemented around community driven efforts, as demonstrated by the many ongoing community challenges in bioinformatics that followed the success of CASP. Community challenges bring the combined benefits of intense collaboration, transparency and standard harmonization. Only open systems for the continuous evaluation of methods offer a perfect complement to community challenges, offering to larger communities of users, which could extend far beyond the community of developers, a window to the development status that they can use for their specific projects. We understand continuous evaluation systems as those services which are always available and periodically update their data and/or metrics according to a predefined schedule, keeping in mind that the performance has to be always seen in terms of each research domain. We argue here that technology is now mature to bring community driven benchmarking efforts to a higher level that should allow effective interoperability of benchmarks across related methods. New technological developments allow overcoming the limitations of the first experiences on online benchmarking, e.g. EVA. We therefore describe OpenEBench, a novel infrastructure designed to establish a continuous automated benchmarking system for bioinformatics methods, tools and web services. OpenEBench is being developed so as to cater for the needs of the bioinformatics community, especially software developers who need an objective and quantitative way to inform their decisions as well as the larger community of end-users, in their search for unbiased and up-to-date evaluation of bioinformatics methods. As such, OpenEBench should soon become a central place for bioinformatics software developers, community-driven benchmarking initiatives, researchers using bioinformatics methods, and funders interested in the result of methods evaluation.

UPCommons. Portal del coneixement obert de la UPC

Use of ChIP-Seq data for the design of a multiple promoter-alignment method

Author: Altenhoff
Althammer
Aniba
Bais
Berezikov
Blanco
Bulyk
Bussotti
Carroll
Chenna
Cédric Notredame
Do
Edgar
Eduardo Eyras
Enrique Blanco
Farnham
Flicek
Giovanni Bussotti
Hallikas
He
Henikoff
Hertz
Huang
Ionas Erb
Juan R. González-Vallinas
Katoh
Keightley
Kellis
Kemena
Kim
Kumar
Loots
Lu
Majoros
Matys
Moses
Notredame
Otto
Ovcharenko
Parker
Pollard
Portales-Casamar
Prohaska
Schmidt
Siddharthan
Sinha
Su
Thompson
Xie
Zhang
Publication venue: Oxford University Press
Publication date: 01/01/2012

We address the challenge of regulatory sequence alignment with a new method, Pro-Coffee, a multiple aligner specifically designed for homologous promoter regions. Pro-Coffee uses a dinucleotide substitution matrix estimated on alignments of functional binding sites from TRANSFAC. We designed a validation framework using several thousand families of orthologous promoters. This dataset was used to evaluate the accuracy for predicting true human orthologs among their paralogs. We found that whereas other methods achieve on average 73.5% accuracy, and 77.6% when trained on that same dataset, the figure goes up to 80.4% for Pro-Coffee. We then applied a novel validation procedure based on multi-species ChIP-seq data. Trained and untrained methods were tested for their capacity to correctly align experimentally detected binding sites. Whereas the average number of correctly aligned sites for two transcription factors is 284 for default methods and 316 for trained methods, Pro-Coffee achieves 331, 16.5% above the default average. We find a high correlation between a method's performance when classifying orthologs and its ability to correctly align proven binding sites. Not only does this have interesting biological consequences, it also allows us to conclude that any method that is trained on the ortholog data set will result in functionally more informative alignments.
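
The relative improvement quoted in this abstract can be checked directly from the reported counts. The short Python sketch below does only that arithmetic; the helper name `percent_gain` is ours and does not come from the paper.

```python
def percent_gain(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Average numbers of correctly aligned binding sites, as reported in the abstract.
default_avg = 284   # default (untrained) methods
trained_avg = 316   # methods trained on the ortholog data set
pro_coffee = 331    # Pro-Coffee

print(f"Pro-Coffee vs default methods: {percent_gain(pro_coffee, default_avg):.1f}%")  # 16.5%
print(f"Pro-Coffee vs trained methods: {percent_gain(pro_coffee, trained_avg):.1f}%")  # 4.7%
print(f"Trained vs default methods:    {percent_gain(trained_avg, default_avg):.1f}%")  # 11.3%
```

The first figure reproduces the 16.5% stated in the abstract; the other two are derived here from the same counts and are not quoted by the authors.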

Crossref

PubMed Central

UPF Digital Repository

Sigma-2: Multiple sequence alignment of non-coding DNA via an evolutionary model

Author: A Subramanian
B Morgenstern
B Paten
C Notredame
C Peng
DN Cooper
E Segal
F Rodríguez
G Baele
Gayathri Jayaraman
GD Stormo
GR Reeck
GZ Hertz
J Felsenstein
J Kim
J Pei
J Thorne
J Zhu
JD Thompson
JL Thorne
K Tamura
M Brudno
M Hasegawa
M Kimura
M Larkin
M Steel
N Bray
PF Arndt
R Siddharthan
Rahul Siddharthan
RC Edgar
RK Bradley
S Padmanabhan
S Sinha
S Tavaré
T Jukes
T Lassmann
T Uzzell
TF Smith
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: While most multiple sequence alignment programs expect that all or most of their input is known to be homologous, and penalise insertions and deletions, this is not a reasonable assumption for non-coding DNA, which is much less strongly conserved than protein-coding genes. Arguing that the goal of sequence alignment should be the detection of homology and not similarity, we incorporate an evolutionary model into a previously published multiple sequence alignment program for non-coding DNA, Sigma, as a sensitive likelihood-based way to assess the significance of alignments. Version 1 of Sigma was successful in eliminating spurious alignments but exhibited relatively poor sensitivity on synthetic data. Sigma 1 used a p-value (the probability under the "null hypothesis" of non-homology) to assess the significance of alignments, and, optionally, a background model that captured short-range genomic correlations. Sigma version 2, described here, retains these features, but calculates the p-value using a sophisticated evolutionary model that we describe here, and also allows for a transition matrix for different substitution rates from and to different nucleotides. Our evolutionary model takes separate account of mutation and fixation, and can be extended to allow for locally differing functional constraints on sequence. Results: We demonstrate that, on real and synthetic data, Sigma-2 significantly outperforms other programs in specificity to genuine homology (that is, it minimises alignment of spuriously similar regions that do not have a common ancestry) while it is now as sensitive as the best current programs. Conclusions: Comparing these results with an extrapolation of the best results from other available programs, we suggest that conservation rates in intergenic DNA are often significantly over-estimated. It is increasingly important to align non-coding DNA correctly, in regulatory genomics and in the context of whole-genome alignment, and Sigma-2 is an important step in that direction.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Database: The Journal of Biological Databases and Curation

Author: Amode R
Beal K
Brent S
Fitzgerald S
Flicek P
Gordon L
Herrero J
Kulesha E
Muffato M
Pignatelli M
Searle SM
Spooner W
Vilella AJ
Yates A
Publication date: 01/01/2016

Evolution provides the unifying framework with which to understand biology. The coherent investigation of genic and genomic data often requires comparative genomics analyses based on whole-genome alignments, sets of homologous genes and other relevant datasets in order to evaluate and answer evolutionary-related questions. However, the complexity and computational requirements of producing such data are substantial: this has led to only a small number of reference resources that are used for most comparative analyses. The Ensembl comparative genomics resources are one such reference set that facilitates comprehensive and reproducible analysis of chordate genome data. Ensembl computes pairwise and multiple whole-genome alignments from which large-scale synteny, per-base conservation scores and constrained elements are obtained. Gene alignments are used to define Ensembl Protein Families, GeneTrees and homologies for both protein-coding and non-coding RNA genes. These resources are updated frequently and have a consistent informatics infrastructure and data presentation across all supported species. Specialized web-based visualizations are also available including synteny displays, collapsible gene tree plots, a gene family locator and different alignment views. The Ensembl comparative genomics infrastructure is extensively reused for the analysis of non-vertebrate species by other projects including Ensembl Genomes and Gramene and much of the information here is relevant to these projects. The consistency of the annotation across species and the focus on vertebrates makes Ensembl an ideal system to perform and support vertebrate comparative genomic analyses. We use robust software and pipelines to produce reference comparative data and make it freely available. Database URL: http://www.ensembl.org

UCL Discovery

Universal mitochondrial multi-locus sequence analysis (mtMLSA) to characterise populations of unanticipated plant pest biosecurity detections

Author: Armstrong Karen
Hiszczynska-Sawicka E
Li D
Publication venue: MDPI AG
Publication date: 01/04/2022

Biosecurity responses to post-border exotic pest detections are more effective with knowledge of where the species may have originated from or if recurrent detections are connected. Population genetic markers for this are typically species-specific and not available in advance for any but the highest risk species, leaving other less anticipated species difficult to assess at the time. Here, new degenerate PCR primer sets are designed for use within the Lepidoptera and Diptera for the 3′ COI, ND3, ND6, and 3′ plus 5′ 16S gene regions. These are shown to be universal at the ordinal level amongst species of 14 and 15 families across 10 and 11 dipteran and lepidopteran superfamilies, respectively. Sequencing the ND3 amplicons as an example of all the loci confirmed detection of population-level variation. This supported finding multiple population haplotypes from the publicly available sequences. Concatenation of the sequences also confirmed that higher population resolution is achieved than for the individual genes. Although as-yet untested in a biosecurity situation, this method is a relatively simple, off-the-shelf means to characterise populations. This makes a proactive contribution to the toolbox of quarantine agencies at the time of detection without the need for unprepared species-specific research and development.

Directory of Open Access Journals

Lincoln University Research Archive

PubMed Central

Context-specific methods for sequence homology searching and alignment

Author: Biegert Andreas
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/01/2010

Digitale Hochschulschriften der LMU

MPG.PuRe