19 research outputs found

Sequence embedding for fast construction of guide trees for multiple sequence alignment

Author: Blackshields Gordon
Higgins Desmond G
Shi Weifeng
Sievers Fabian
Wilm Andreas
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments. Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
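
The embedding idea in this abstract can be illustrated with a minimal sketch: each sequence is represented by its vector of distances to a small set of reference sequences, so only N × k expensive distances are computed rather than all N(N-1)/2 pairs. This is not the code distributed at clustal.org. The k-mer Jaccard distance, the choice of the first two sequences as references, and every function name below are assumptions made for illustration.

```python
from math import sqrt


def kmer_set(seq, k=2):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Illustrative stand-in for an expensive pairwise distance:
    1 minus the Jaccard similarity of the two k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def embed(sequences, references, k=2):
    """Map each sequence to its vector of distances to the references.
    Costs len(sequences) * len(references) distance calls."""
    return [[kmer_distance(s, r, k) for r in references] for s in sequences]


def euclidean(u, v):
    """Cheap distance between two embedded vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


sequences = ["ACGTACGT", "ACGTACGA", "TTGGCCAA", "TTGGCCAT"]
references = sequences[:2]  # assumption: the first two sequences act as references
vectors = embed(sequences, references)
```

Once the vectors exist, any standard clustering method can run on them with `euclidean` as the cheap distance. How the references are chosen and how many are used would largely determine how well the embedded distances approximate the original ones.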

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Multidimensional Scaling Reveals the Main Evolutionary Pathways of Class A G-Protein-Coupled Receptors

Author: A Gogos
A Rokas
B Wu
C Kuiken
C Tuffley
David Thybert
DE Gloriam
DG Higgins
DJ Smith
DK Vassilatis
E Susko
F Lu
F Murtagh
G Blackshields
G Casari
H Abdi
H Rompler
HC Wang
Hervé Abdi
I Domazet
I Kass
IG Choi
J Devillé
J Hou
J Tzeng
JC Gower
JH Park
JS Surgand
Julien Pelé
JW DeLano
K Palczewski
K Ye
KB Nicholas
KJ Woolley
Kolakowski LF Jr
M Anctil
M Greenacre
M Nei
MA Larkin
Marie Chabbert
Matthieu Moreau
MC Peeters
MW Trosset
P Rousseeuw
P Scheerer
R Fredriksson
RA Studer
RP Metpally
S Yohannan
SC Sealfon
SG Rasmussen
T Haitina
U Gether
V Cherezov
Vladimir N. Uversky
VN Grishin
W Shi
WM Fitch
WS Torgerson
WS Valdar
Y Takane
Publication venue: Public Library of Science
Publication date: 01/01/2011

Class A G-protein-coupled receptors (GPCRs) constitute the largest family of transmembrane receptors in the human genome. Understanding the mechanisms that drove the evolution of such a large family would help explain the specificity of each GPCR sub-family, with applications to drug design. To gain evolutionary information on class A GPCRs, we explored their sequence space by metric multidimensional scaling (MDS) analysis. Three-dimensional mapping of human sequences shows a non-uniform distribution of GPCRs, organized in clusters that lie along four privileged directions. To interpret these directions, we projected supplementary sequences from different species onto the human space used as a reference. With this technique, we can easily monitor the evolutionary drift of several GPCR sub-families from cnidarians to humans. Results support a model of radiative evolution of class A GPCRs from a central node formed by peptide receptors. The privileged directions obtained from the MDS analysis are interpretable in terms of three main evolutionary pathways related to specific sequence determinants. The first pathway was initiated by a deletion in transmembrane helix 2 (TM2) and led to three sub-families by divergent evolution. The second pathway corresponds to the differentiation of the amine receptors. The third pathway corresponds to parallel evolution of several sub-families in relation to a covarion process involving proline residues in TM2 and TM5. As exemplified with GPCRs, the MDS projection technique is an important tool to compare orthologous sequence sets and to help decipher the mutational events that drove the evolution of protein families.
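
For readers unfamiliar with metric MDS, the following sketch shows the classical (Torgerson) variant: the squared distance matrix is double-centred and the leading eigenvectors, scaled by the square roots of their eigenvalues, give the coordinates. It is a sketch under stated assumptions, not the authors' pipeline. The power-iteration eigen-solver, the function names, and the toy 3-4-5 distance matrix are all assumptions, and the projection of supplementary sequences described in the abstract is not covered.

```python
from math import dist, sqrt


def double_center(d):
    """Turn a symmetric distance matrix into the Gram matrix B = -1/2 * J * D^2 * J."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]  # row means (equal to column means by symmetry)
    total = sum(row) / n            # grand mean
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
            for i in range(n)]


def top_eigenpair(b, iterations=500):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    n = len(b)
    v = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    value = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return value, v


def classical_mds(d, dims=2):
    """Return one coordinate list per object, with `dims` entries each."""
    b = double_center(d)
    n = len(b)
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        value, v = top_eigenpair(b)
        scale = sqrt(value) if value > 0 else 0.0
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflate: remove the component just extracted before the next pass.
        b = [[b[i][j] - value * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Toy input: pairwise distances of a 3-4-5 right triangle.
distances = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
coords = classical_mds(distances, dims=2)
recovered = [[dist(p, q) for q in coords] for p in coords]
```

Because the toy distances are exactly Euclidean in two dimensions, `recovered` reproduces the input matrix up to rounding. Distances derived from sequences are generally not exactly Euclidean, so the low-dimensional map is only an approximation, and a library eigen-solver would be the more robust choice in practice.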

Public Library of Science (PLOS)

CiteSeerX

Crossref

Directory of Open Access Journals

A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

Author: A Lempel
A Puglisi
Andrew K Benson
CG Nevill-Manning
David J Russell
DR Bastola
E Ukkonen
EK Costello
EM McCreight
HH Otu
J Ziv
JD Parsons
JD Thompson
Khalid Sayood
L Holm
M Charikar
M Halkidi
P Weiner
RC Edgar
Samuel F Way
SF Altschul
W Li
WJ Wilbur
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created. Results: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a CPU execution time comparable to that of CD-HIT-EST, which is much slower than UCLUST, and has successfully generated clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
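
The representative-based procedure described in this abstract (compare each new sequence with the cluster representatives, and open a new cluster when none is close enough) can be sketched as follows. The grammar-based distance itself is not reproduced here: `stand_in_distance` swaps in Python's `difflib.SequenceMatcher` ratio purely as a placeholder, and the threshold value, function names, and toy reads are likewise assumptions.

```python
from difflib import SequenceMatcher


def stand_in_distance(a, b):
    """Placeholder distance in [0, 1]; NOT the grammar-based metric from the paper."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def greedy_cluster(sequences, threshold, distance=stand_in_distance):
    """Assign each sequence to the first cluster whose representative lies
    within `threshold`; otherwise start a new cluster with the sequence
    as its representative. Returns lists of input indices."""
    representatives = []
    clusters = []
    for index, seq in enumerate(sequences):
        for cluster_id, rep in enumerate(representatives):
            if distance(seq, rep) <= threshold:
                clusters[cluster_id].append(index)
                break
        else:
            # No representative was close enough: open a new cluster.
            representatives.append(seq)
            clusters.append([index])
    return clusters


reads = ["ACGTACGT", "ACGTACGA", "TTGGCCAA", "TTGGCCAT"]  # toy data
clusters = greedy_cluster(reads, threshold=0.3)
```

In this sketch the number of distance calls grows with the number of clusters rather than with the number of sequence pairs. The result depends on input order, since the first sequence to open a cluster stays its representative.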

Crossref

DigitalCommons@University of Nebraska

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Multiobjective characteristic-based framework for very-large multiple sequence alignment

Author: Castelli Mauro
Rubio-Largo Álvaro
Vanneschi Leonardo
Vega-Rodríguez Miguel A.
Publication venue: Elsevier BV
Publication date: 01/01/2018

Rubio-Largo, Á., Vanneschi, L., Castelli, M., & Vega-Rodríguez, M. A. (2018). Multiobjective characteristic-based framework for very-large multiple sequence alignment. Applied Soft Computing Journal, 69, 719-736. [Advanced online publication on 27 June 2017] DOI: 10.1016/j.asoc.2017.06.022

In the literature, we can find several heuristics for solving the multiple sequence alignment problem. The vast majority of them make use of flags in order to modify certain alignment parameters; however, if no flags are used, the aligner will run with the default parameter configuration, which is often not the optimal one. In this work, we propose a framework that, depending on the biological characteristics of the input dataset, runs the aligner with the best parameter configuration found for another dataset that has similar biological characteristics, improving the accuracy and conservation of the obtained alignment. To train the framework, we use three well-known multiobjective evolutionary algorithms: NSGA-II, IBEA, and MOEA/D. Then, we perform a comparative study between several aligners proposed in the literature and the characteristic-based versions of Kalign, MAFFT, and MUSCLE, when solving widely-used benchmarks (PREFAB v4.0 and SABmark v1.65) and very-large benchmarks with thousands of unaligned sequences (HomFam).

Crossref

Repositório da Universidade Nova de Lisboa

Recommended from our members

The Short- and Long-Range RNA-RNA Interactome of SARS-CoV-2.

Author: Goodfellow Ian
Kamenova Tsveta
Miska Eric A
Price Jonathan
Shalamova Lyudmila
Weber Friedemann
Ziv Omer
Publication venue: Mol Cell
Publication date: 17/12/2020

The Coronaviridae is a family of positive-strand RNA viruses that includes SARS-CoV-2, the etiologic agent of the COVID-19 pandemic. Bearing the largest single-stranded RNA genomes in nature, coronaviruses are critically dependent on long-distance RNA-RNA interactions to regulate the viral transcription and replication pathways. Here we experimentally mapped the in vivo RNA-RNA interactome of the full-length SARS-CoV-2 genome and subgenomic mRNAs. We uncovered a network of RNA-RNA interactions spanning tens of thousands of nucleotides. These interactions reveal that the viral genome and subgenomes adopt alternative topologies inside cells and engage in different interactions with host RNAs. Notably, we discovered a long-range RNA-RNA interaction, the FSE-arch, that encircles the programmed ribosomal frameshifting element. The FSE-arch is conserved in the related MERS-CoV and is under purifying selection. Our findings illuminate RNA structure-based mechanisms governing replication, discontinuous transcription, and translation of coronaviruses and will aid future efforts to develop antiviral strategies.

This work was supported by Cancer Research UK grants (C13474/A18583, C6946/A14492) to E.A.M.; Wellcome grants (104640/Z/14/Z, 092096/Z/10/Z) to E.A.M.

Apollo (Cambridge)

Constitutive overexpression of the TaNF-YB4 gene in transgenic wheat significantly improves grain yield

Author: Bazanova N.
Borisjuk N.
Chirkova L.
Hrmova M.
Ismagul A.
Kovalchuk N.
Langridge P.
Lopato S.
Parent B.
Shavrukov Y.
Yadav D.
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2015

First published online: July 27, 2015

Heterotrimeric nuclear factors Y (NF-Ys) are involved in regulation of various vital functions in all eukaryotic organisms. Although a number of NF-Y subunits have been characterized in model plants, only a few have been functionally evaluated in crops. In this work, a number of genes encoding NF-YB and NF-YC subunits were isolated from drought-tolerant wheat (Triticum aestivum L. cv. RAC875), and the impact of the overexpression of TaNF-YB4 in the Australian wheat cultivar Gladius was investigated. TaNF-YB4 was isolated as a result of two consecutive yeast two-hybrid (Y2H) screens, where ZmNF-YB2a was used as a starting bait. A new NF-YC subunit, designated TaNF-YC15, was isolated in the first Y2H screen and used as bait in a second screen, which identified two wheat NF-YB subunits, TaNF-YB2 and TaNF-YB4. Three-dimensional modelling of a TaNF-YB2/TaNF-YC15 dimer revealed structural determinants that may underlie interaction selectivity. The TaNF-YB4 gene was placed under the control of the strong constitutive polyubiquitin promoter from maize and introduced into wheat by biolistic bombardment. The growth and yield components of several independent transgenic lines with up-regulated levels of TaNF-YB4 were evaluated under well-watered conditions (T1-T3 generations) and under mild drought (T2 generation). Analysis of T2 plants was performed in large deep containers in conditions close to field trials. Under optimal watering conditions, transgenic wheat plants produced significantly more spikes but other yield components did not change. This resulted in a 20-30% increased grain yield compared with untransformed control plants. Under water-limited conditions transgenic lines maintained parity in yield performance.

Dinesh Yadav, Yuri Shavrukov, Natalia Bazanova, Larissa Chirkova, Nikolai Borisjuk, Nataliya Kovalchuk, Ainur Ismagul, Boris Parent, Peter Langridge, Maria Hrmova and Sergiy Lopat

Crossref

Adelaide Research & Scholarship

PubMed Central

ProdInra

Alignment uncertainty, regressive alignment and large scale deployment

Author: Floden Evan, 1985-
Publication venue: Universitat Pompeu Fabra
Publication date: 01/01/2018

A multiple sequence alignment (MSA) provides a description of the relationship between biological sequences where columns represent a shared ancestry through an implied set of evolutionary events. The majority of research in the field has focused on improving the accuracy of alignments within the progressive alignment framework and has allowed for powerful inferences including phylogenetic reconstruction, homology modelling and disease prediction. Notwithstanding this, when applied to modern genomics datasets - often comprising tens of thousands of sequences - new challenges arise in the construction of accurate MSA. These issues can be generalised to form three basic problems. Foremost, as the number of sequences increases, progressive alignment methodologies exhibit a dramatic decrease in alignment accuracy. Additionally, for any given dataset many possible MSA solutions exist, a problem which is exacerbated with an increasing number of sequences due to alignment uncertainty. Finally, technical difficulties hamper the deployment of such genomic analysis workflows - especially in a reproducible manner - often presenting a high barrier for even skilled practitioners. This work aims to address this trifecta of problems through a web server for fast homology extension based MSA, two new methods for improved phylogenetic bootstrap supports incorporating alignment uncertainty, a novel alignment procedure that improves large scale alignments termed regressive MSA and finally a workflow framework that enables the deployment of large scale reproducible analyses across clusters and clouds titled Nextflow. 
Together, this work can be seen to provide both conceptual and technical advances which deliver substantial improvements to existing MSA methods and the resulting inferences.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Tesis Doctorals en Xarxa