5 research outputs found

    High performance computing improvements on bioinformatics consistency-based multiple sequence alignment tools

    Multiple Sequence Alignment (MSA) is essential for a wide range of applications in Bioinformatics. Traditionally, the alignment accuracy was the main metric used to evaluate the goodness of MSA tools. However, with the growth of sequencing data, other features, such as performance and the capacity to align larger datasets, are gaining strength. To achieve these new requirements, without affecting accuracy, the use of high-performance computing (HPC) resources and techniques is crucial. In this paper, we apply HPC techniques in T-Coffee, one of the more accurate but less scalable MSA tools. We integrate three innovative solutions into T-Coffee: the Balanced Guide Tree to increase the parallelism/performance, the Optimized Library Method with the aim of enhancing the scalability and the Multiple Tree Alignment, which explores different alignments in parallel to improve the accuracy. The results obtained show that the resulting tool, MTA-TCoffee, is able to improve the scalability in both the execution time and also the number of sequences to be aligned. Furthermore, not only is the alignment accuracy not affected by these improvements, as would be expected, but it improves significantly. Finally, we emphasize that the presented methods are not just restricted to T-Coffee, but may be implemented in any other alignment tools that use similar algorithms (progressive alignment, consistency or guide trees). This work was supported by the Government of Spain TIN2011–28689-C02–02, TIN2010–12011-E, Consolider CSD2007–00050 and the CUR of GENCAT. Cedric Notredame is funded by the Plan Nacional BFU2011–28575 and the European Commission FP7, LEISHDRUG Project (No. 223414) and The Quantomics Project (KBBE-2A-222664).

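    Comparative in silico study of the excretory/secretory products of Echinococcus granulosus and Echinococcus multilocularis (original title: Estudo comparativo in silico dos produtos de excreção ou secreção de Echinococcus granulosus e Echinococcus multiloculares)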

    The larval stages (metacestodes) of Echinococcus granulosus and Echinococcus multilocularis cause different forms of echinococcosis in different species of intermediate hosts, including humans. Metacestodes are able to survive for years in the human host, largely owing to their secreted proteins, which have, for example, immunomodulatory and proteolytic activities. This work presents in silico predictions of the secretable protein sets (secretomes) of E. granulosus and E. multilocularis based on genomic data, compares these predicted secretomes, identifies possible mispredictions and presents alternatives for obtaining in silico predictions that are more representative of the real secretomes of these species. The initial secretome predictions for E. granulosus and E. multilocularis (662 and 669 proteins, respectively) gave similar values (approximately 6.4% of the predicted proteome), as expected for two closely related species. However, the comparative analysis of secreted/non-secreted ortholog pairs (PSNS) indicated possible problems in the prediction of non-classical secretion and in the annotation of genomic sequences. The WoLF PSORT software was used to implement the prediction of non-classical secretion, reducing the number of inconsistencies from 214 to 114 PSNS. The annotation of the E. granulosus and E. multilocularis genomes was refined using a combined strategy that drew on RNA-seq libraries, ESTs and the ab initio prediction data from the original genome sequencing. With the reannotated data, the predicted E. granulosus secretome fell from 662 proteins to 658, of which 43% kept the original sequences. In E. multilocularis, it fell from 669 proteins to 581, of which 48% kept the original sequences. The quality of the sequences improved with the refinement of the annotation, but the inconsistencies in the prediction of non-classical secretion remained in proportions similar to those for the unrefined sequences. This demonstrates the importance of algorithms adequately trained for the data being analyzed, or of alternative analysis workflows with well-delineated criteria, for obtaining data more representative of the real set of secreted proteins.