49 research outputs found

    External memory BWT and LCP computation for sequence collections with applications

    Sequencing technologies produce ever larger collections of biosequences, which have to be stored in compressed indices that support fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory, making the best possible use of the available RAM.
    Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections small enough that their BWTs can be computed in RAM using an optimal linear-time algorithm. It then merges the partial BWTs in external or semi-external memory, computing the LCP values in the process. The algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well-known problems in bioinformatics: the computation of maximal repeats, the all-pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.
    Conclusions: We prove that our algorithm performs O(n · maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences, but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
    Funding: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq); Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); University of Eastern Piedmont project "Behavioural Types for Dependability Analysis with Bayesian Networks"; Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) [2017/09105-0, 2018/21509-2]; PRIN grant [201534HNXC]; INdAM-GNCS Project 2019 "Innovative methods for the solution of medical and biological big data".
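
To make the two data structures named in this abstract concrete, here is a minimal Python sketch that builds the BWT and LCP array of a single short string by sorting its suffixes in RAM. It is not the external memory merge algorithm the abstract describes. The function name `bwt_lcp`, the `$` terminator, and the restriction to one string rather than a collection are assumptions made for illustration only.

```python
def bwt_lcp(text):
    """Naive BWT and LCP array of text + '$', built by sorting all suffixes.

    Illustrative only: quadratic or worse, entirely in RAM. Assumes '$'
    does not occur in text and sorts before every other character.
    """
    s = text + "$"
    # Suffix array: starting positions of the suffixes in lexicographic order.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wrapping around at 0).
    bwt = "".join(s[i - 1] for i in sa)
    # LCP[r]: length of the common prefix of the suffixes ranked r-1 and r.
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = s[sa[r - 1]:], s[sa[r]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[r] = k
    return bwt, lcp


bwt, lcp = bwt_lcp("banana")
print(bwt)       # annb$aa
print(lcp)       # [0, 0, 1, 3, 0, 0, 2]
print(max(lcp))  # 3, the maxlcp value that appears in the I/O bound
```

For real collections this approach does not scale, which is the motivation for splitting the input and merging partial BWTs as the abstract outlines.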

    Integrative analysis to select cancer candidate biomarkers to targeted validation

    Targeted proteomics has flourished as the method of choice for prospecting for and validating potential candidate biomarkers in many diseases. However, challenges remain because there are no standardized routines for prioritizing a limited number of proteins to be further validated in human samples. To help researchers identify the candidate biomarkers that best characterize their samples, a well-designed integrative analysis pipeline, comprising MS-based discovery, feature selection methods, clustering techniques, bioinformatic analyses, and targeted approaches, was applied to discovery-based proteomic data from the secretomes of three classes of human cell lines (carcinoma, melanoma, and non-cancerous). Three feature selection algorithms, namely Beta-binomial, Nearest Shrunken Centroids (NSC), and Support Vector Machine-Recursive Feature Elimination (SVM-RFE), indicated a panel of 137 candidate biomarkers for carcinoma and 271 for melanoma, which were differentially abundant between the tumor classes. We further tested the strength of the pipeline in selecting candidate biomarkers by immunoblotting, human tissue microarrays, label-free targeted MS, and functional experiments. In conclusion, the proposed integrative analysis was able to pre-qualify and prioritize candidate biomarkers from discovery-based proteomics to targeted MS.
    Funding: FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) [2009/54067-3; 2010/19278-0; 2011/22421-2; 2009/53839-2]; CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) [470567/2009-0; 470549/2011-4; 301702/2011-0; 470268/2013-]

    Métodos computacionais em Biologia: biomarcadores de câncer, redes de proteínas e transmissão lateral de genes

    Molecular Biology is a branch of science of great importance. Although it studies microscopic entities, the volume and complexity of the information involved are immense. Its applications are varied and can be of global interest, such as the spread of antibiotic resistance genes among bacteria and new methods for the diagnosis and prognosis of cancer. By understanding biomolecular mechanisms, scientists can define treatments for diseases, support the decisions made by patients, identify the influence of the intestinal microbiota on physical and psychological conditions, and find the causes and sources of microbial antibiotic resistance, among many other applications. Computer Science plays key roles in this context: it enables complex data analyses by specialists, creates models that simulate biological structures and processes, and provides algorithms for extracting the information encoded in biological data. During my doctorate, we explored those mechanisms at three main levels: the quantification of proteins from cells, the analysis of interactions that happen inside cells, and the comparison of genomes and their genetic history. This manuscript reports different projects, four of them already published in scientific journals. They comprise the discovery of candidate proteins for cancer biomarkers, the visual analysis of protein-protein interaction networks, and the visual analysis of lateral gene transfer in bacterial phylogenetic trees. Here, we explain these projects and the main findings associated with the use of computational methods. Among the results are an evaluation of the stability of ranking and signature methods applied to discovery proteomics data, a new approach to selecting candidate proteins from discovery to targeted proteomics, lists of candidate biomarkers for oral cancer, and new techniques for the visualization of biological networks and phylogenetic supertrees.

    A visual approach to comparative analysis of biomolecular networks with support of Venn diagrams

    Biological systems can be represented by networks that store not only connectivity information but also the features of their nodes. In molecular biology, these nodes may represent proteins, metabolites, and other types of molecules. Each molecule has features annotated and stored in databases such as Gene Ontology. Comparing such networks visually requires tools that let the user identify differences and similarities both in the annotations of the molecules (attributes) and in the known interactions between them (connections). In this master's dissertation, we sought to develop techniques that facilitate the comparison of these attributes while preserving the visualization of the networks in which the molecules are embedded. The result is the VisPipeline-MultiNetwork tool, which compares up to six networks using set operations on the networks and on their attributes. Unlike most known tools for visualizing biological networks, VisPipeline-MultiNetwork allows the creation of networks whose attributes are derived from the original networks through union, intersection, and exclusive-value operations. The networks are compared visually by displaying the outcome of these set operations with a side-by-side comparison method. The attributes stored in the network nodes are compared using Venn diagrams. To support this type of comparison, the InteractiVenn technique was developed, in which the user interacts with a Venn diagram by performing union operations between sets; each union applied to the sets is also applied to the corresponding shapes in the diagram. This feature distinguishes the technique from other tools for creating Venn diagrams. Together, these features let users compare networks from several perspectives. To illustrate the use of VisPipeline-MultiNetwork, two case studies were carried out in the biomolecular context. Additionally, a web tool for comparing lists of strings by means of Venn diagrams was developed. It also implements the InteractiVenn technique and was named the InteractiVenn website.
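
The set operations this abstract mentions (union, intersection, exclusive values, and merging sets by union) can be illustrated with a small Python sketch over lists of strings. This is not code from VisPipeline-MultiNetwork or InteractiVenn; the function names `venn_regions` and `merge_sets` and the example gene lists are hypothetical and chosen only for illustration.

```python
def venn_regions(named_sets):
    """Group items by the exact combination of sets that contain them.

    Each key is a frozenset of set names and corresponds to one region of a
    Venn diagram; a key with a single name holds that set's exclusive items.
    """
    regions = {}
    universe = set().union(*named_sets.values())
    for item in universe:
        members = frozenset(n for n, s in named_sets.items() if item in s)
        regions.setdefault(members, set()).add(item)
    return regions


def merge_sets(named_sets, names, merged_name):
    """Replace the chosen sets by their union (hypothetical helper)."""
    merged = set().union(*(named_sets[n] for n in names))
    result = {n: s for n, s in named_sets.items() if n not in names}
    result[merged_name] = merged
    return result


# Hypothetical example lists, not data from the dissertation.
lists = {
    "A": {"TP53", "EGFR", "MYC"},
    "B": {"EGFR", "MYC", "KRAS"},
    "C": {"MYC", "BRCA1"},
}
regions = venn_regions(lists)
print(regions[frozenset({"A"})])            # exclusive to A: {'TP53'}
print(regions[frozenset({"A", "B", "C"})])  # common to all: {'MYC'}

# Merging A and B by union reduces the diagram from three sets to two.
merged = merge_sets(lists, ["A", "B"], "A|B")
print(sorted(merged["A|B"]))  # ['EGFR', 'KRAS', 'MYC', 'TP53']
```

Keying the regions by a frozenset of names keeps the sketch independent of the number of input sets, so the same function serves before and after a merge.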

    A.H. Heberle Nurseries [catalog].

    192