19 research outputs found

    Sequence embedding for fast construction of guide trees for multiple sequence alignment

    Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments. Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pairwise distances. Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
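    The embedding idea in this abstract can be illustrated with a minimal sketch. It assumes one common form of such methods: each sequence is represented by its distances to a small set of reference sequences, so only N × t distances are computed instead of N². The function names, the k-mer Jaccard distance, and the evenly spaced choice of references are illustrative assumptions, not the method or code distributed with the paper.

```python
import math
from typing import List


def kmer_set(seq: str, k: int = 3) -> set:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Jaccard distance between k-mer sets (an assumed stand-in distance)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def embed(seqs: List[str], n_refs: int) -> List[List[float]]:
    """Embed each sequence as its vector of distances to n_refs references.

    References are picked at evenly spaced positions in the input, which is
    a hypothetical choice made only for this sketch.
    """
    step = max(1, len(seqs) // n_refs)
    refs = seqs[::step][:n_refs]
    return [[kmer_distance(s, r) for r in refs] for s in seqs]


def euclidean(u: List[float], v: List[float]) -> float:
    """Euclidean distance between two embedded vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
vectors = embed(seqs, n_refs=2)
```

    Once embedded, the vectors can be clustered with any standard vector-space method; here similar sequences end up closer together (`euclidean(vectors[0], vectors[1])` is smaller than `euclidean(vectors[0], vectors[2])`).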

    Multidimensional Scaling Reveals the Main Evolutionary Pathways of Class A G-Protein-Coupled Receptors

    Class A G-protein-coupled receptors (GPCRs) constitute the largest family of transmembrane receptors in the human genome. Understanding the mechanisms that drove the evolution of such a large family would help explain the specificity of each GPCR sub-family, with applications to drug design. To gain evolutionary information on class A GPCRs, we explored their sequence space by metric multidimensional scaling (MDS) analysis. Three-dimensional mapping of human sequences shows a non-uniform distribution of GPCRs, organized in clusters that lie along four privileged directions. To interpret these directions, we projected supplementary sequences from different species onto the human space used as a reference. With this technique, we can easily monitor the evolutionary drift of several GPCR sub-families from cnidarians to humans. Results support a model of radiative evolution of class A GPCRs from a central node formed by peptide receptors. The privileged directions obtained from the MDS analysis are interpretable in terms of three main evolutionary pathways related to specific sequence determinants. The first pathway was initiated by a deletion in transmembrane helix 2 (TM2) and led to three sub-families by divergent evolution. The second pathway corresponds to the differentiation of the amine receptors. The third pathway corresponds to parallel evolution of several sub-families in relation to a covarion process involving proline residues in TM2 and TM5. As exemplified with GPCRs, the MDS projection technique is an important tool to compare orthologous sequence sets and to help decipher the mutational events that drove the evolution of protein families.

    A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

    Background: We propose a sequence clustering algorithm and compare its partition quality and execution time with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created. Results: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a CPU execution time comparable to that of CD-HIT-EST, which is much slower than UCLUST, and generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
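    The membership rule described in this abstract (compare each new sequence with cluster representatives, and open a new cluster if none is suitable) can be sketched as follows. This is an illustration only: the grammar-based metric is not specified here, so a k-mer Jaccard distance is swapped in as a placeholder, and the function names and the `threshold` parameter are assumptions.

```python
from typing import Callable, List


def kmer_set(seq: str, k: int = 3) -> set:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Jaccard distance between k-mer sets.

    Placeholder only: this is NOT the grammar-based metric from the paper.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def greedy_cluster(
    seqs: List[str],
    threshold: float,
    distance: Callable[[str, str], float] = kmer_distance,
) -> List[List[int]]:
    """Assign each sequence to the first cluster whose representative is
    within `threshold`; otherwise start a new cluster with the sequence
    as its representative. Returns clusters as lists of input indices."""
    representatives: List[str] = []
    clusters: List[List[int]] = []
    for idx, seq in enumerate(seqs):
        for c, rep in enumerate(representatives):
            if distance(seq, rep) <= threshold:
                clusters[c].append(idx)
                break
        else:
            # No suitable cluster was found: create a new one.
            representatives.append(seq)
            clusters.append([idx])
    return clusters


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
clusters = greedy_cluster(seqs, threshold=0.5)
```

    Because each sequence is compared only with representatives, the number of distance computations grows with the number of clusters rather than with the number of sequence pairs. The result depends on input order and on the assumed threshold.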

    Multiobjective characteristic-based framework for very-large multiple sequence alignment

    Rubio-Largo, Á., Vanneschi, L., Castelli, M., & Vega-Rodríguez, M. A. (2018). Multiobjective characteristic-based framework for very-large multiple sequence alignment. Applied Soft Computing Journal, 69, 719-736. [Advanced online publication on 27 June 2017] DOI: 10.1016/j.asoc.2017.06.022
    In the literature, we can find several heuristics for solving the multiple sequence alignment problem. The vast majority of them make use of flags to modify certain alignment parameters; however, if no flags are used, the aligner runs with the default parameter configuration, which is often not the optimal one. In this work, we propose a framework that, depending on the biological characteristics of the input dataset, runs the aligner with the best parameter configuration found for another dataset that has similar biological characteristics, improving the accuracy and conservation of the obtained alignment. To train the framework, we use three well-known multiobjective evolutionary algorithms: NSGA-II, IBEA, and MOEA/D. Then, we perform a comparative study between several aligners proposed in the literature and the characteristic-based versions of Kalign, MAFFT, and MUSCLE, when solving widely used benchmarks (PREFAB v4.0 and SABmark v1.65) and very large benchmarks with thousands of unaligned sequences (HomFam).

    Constitutive overexpression of the TaNF-YB4 gene in transgenic wheat significantly improves grain yield

    First published online: July 27, 2015
    Heterotrimeric nuclear factors Y (NF-Ys) are involved in regulation of various vital functions in all eukaryotic organisms. Although a number of NF-Y subunits have been characterized in model plants, only a few have been functionally evaluated in crops. In this work, a number of genes encoding NF-YB and NF-YC subunits were isolated from drought-tolerant wheat (Triticum aestivum L. cv. RAC875), and the impact of the overexpression of TaNF-YB4 in the Australian wheat cultivar Gladius was investigated. TaNF-YB4 was isolated as a result of two consecutive yeast two-hybrid (Y2H) screens, where ZmNF-YB2a was used as a starting bait. A new NF-YC subunit, designated TaNF-YC15, was isolated in the first Y2H screen and used as bait in a second screen, which identified two wheat NF-YB subunits, TaNF-YB2 and TaNF-YB4. Three-dimensional modelling of a TaNF-YB2/TaNF-YC15 dimer revealed structural determinants that may underlie interaction selectivity. The TaNF-YB4 gene was placed under the control of the strong constitutive polyubiquitin promoter from maize and introduced into wheat by biolistic bombardment. The growth and yield components of several independent transgenic lines with up-regulated levels of TaNF-YB4 were evaluated under well-watered conditions (T1-T3 generations) and under mild drought (T2 generation). Analysis of T2 plants was performed in large deep containers in conditions close to field trials. Under optimal watering conditions, transgenic wheat plants produced significantly more spikes but other yield components did not change. This resulted in a 20-30% increased grain yield compared with untransformed control plants. Under water-limited conditions transgenic lines maintained parity in yield performance.
    Dinesh Yadav, Yuri Shavrukov, Natalia Bazanova, Larissa Chirkova, Nikolai Borisjuk, Nataliya Kovalchuk, Ainur Ismagul, Boris Parent, Peter Langridge, Maria Hrmova and Sergiy Lopat

    Alignment uncertainty, regressive alignment and large scale deployment

    A multiple sequence alignment (MSA) provides a description of the relationship between biological sequences where columns represent a shared ancestry through an implied set of evolutionary events. The majority of research in the field has focused on improving the accuracy of alignments within the progressive alignment framework and has allowed for powerful inferences including phylogenetic reconstruction, homology modelling and disease prediction. Notwithstanding this, when applied to modern genomics datasets - often comprising tens of thousands of sequences - new challenges arise in the construction of accurate MSA. These issues can be generalised to form three basic problems. Foremost, as the number of sequences increases, progressive alignment methodologies exhibit a dramatic decrease in alignment accuracy. Additionally, for any given dataset many possible MSA solutions exist, a problem which is exacerbated with an increasing number of sequences due to alignment uncertainty. Finally, technical difficulties hamper the deployment of such genomic analysis workflows - especially in a reproducible manner - often presenting a high barrier for even skilled practitioners. This work aims to address this trifecta of problems through a web server for fast homology extension based MSA, two new methods for improved phylogenetic bootstrap supports incorporating alignment uncertainty, a novel alignment procedure that improves large scale alignments termed regressive MSA and finally a workflow framework that enables the deployment of large scale reproducible analyses across clusters and clouds titled Nextflow. 
    Together, this work can be seen to provide both conceptual and technical advances which deliver substantial improvements to existing MSA methods and the resulting inferences.