14 research outputs found

    A network medicine approach to quantify distance between hereditary disease modules on the interactome

    We introduce a MeSH-based method that accurately quantifies similarity between heritable diseases at the molecular level. This method effectively brings together the existing information about diseases that is scattered across the vast corpus of biomedical literature. We prove that sets of MeSH terms provide a highly descriptive representation of heritable disease and that the structure of MeSH provides a natural way of combining individual MeSH vocabularies. We show that our measure can be used effectively in the prediction of candidate disease genes. We developed a web application to query more than 28.5 million relationships between 7,574 hereditary diseases (96% of OMIM) based on our similarity measure.

    Interspecies gene function prediction using semantic similarity

    Predicting protein function via downward random walks on a gene ontology

    Spotlite: Web Application and Augmented Algorithms for Predicting Co-Complexed Proteins from Affinity Purification – Mass Spectrometry Data

    Protein-protein interactions defined by affinity purification and mass spectrometry (APMS) approaches suffer from high false discovery rates. Consequently, the candidate interaction lists must be pruned of contaminants before network construction and interpretation, historically an expensive and time-intensive task. In recent years, numerous computational methods have been developed to identify genuine interactions from hundreds revealed by APMS experiments. Here, comparative analysis of several popular algorithms revealed complementarity in their classification accuracies, which is supported by their divergent scoring strategies. As such, we used two accurate and computationally efficient methods as features for machine learning using the Random Forest algorithm. Additionally, we developed novel mathematical models to include a variety of indirect data, such as mRNA co-expression, gene ontologies and homologous protein interactions as features within the classification problem. We show that our method, which we call Spotlite, outperforms existing methods on four diverse and public APMS datasets. Because implementation of existing APMS scoring methods requires computational expertise beyond many laboratories, we created a user-friendly and fast web application for APMS data scoring, analysis, annotation and network visualization, for use on new and existing data (http://152.19.87.94:8080/spotlite). The utility of Spotlite and its visualization platform for revealing physical, functional and disease-relevant characteristics within APMS data is established through a focused analysis of the KEAP1 E3 ubiquitin ligase.

    Validation of automatic similarity measures

    Master's thesis in Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2020. The ability to automatically compare two biomedical entities (e.g. diseases, biochemical pathways, papers) enables the use of computers to reason over scientific knowledge. As such, validating these measures is essential to ensure that the results they produce reflect the current community knowledge on the respective domain. Manual validation by experts is one of the strategies to assess whether a measure is sound and accurate. However, this is an inefficient process because of the secondary research required to do so, which means that compiling large datasets of human-curated similarity values is difficult. The “Manual Validation Helper Tool” (MVHT) is a web application created to accelerate this manual validation, coupled to a format that can accommodate different types of data in the form of annotations, from different domains or ontologies. MVHT was tested on four distinct datasets and one of them was given to pilot users so they could provide feedback on the application, as well as to gather a gold-standard of manual similarity. With their help the tool was optimized and is accessible to be used by creators of semantic similarity measures, who can share their datasets in a more practical way via generated URLs, which other people can visit and quickly start comparing pairs of entities.

    Hierarchical ensemble methods for protein function prediction

    Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high dimensional biomolecular data, the unbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods that showed significantly better performances than hierarchical-unaware “flat” prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term and then the resulting predictions are assembled in a “consensus” ensemble decision, taking into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Open problems of this exciting research area of computational biology are finally considered, outlining novel perspectives for future research.

    ANALYSIS AND SIMULATION OF TANDEM MASS SPECTROMETRY DATA

    This dissertation focuses on improvements to data analysis in mass spectrometry-based proteomics, which is the study of an organism’s full complement of proteins. One of the biggest surprises from the Human Genome Project was the relatively small number of genes (~20,000) encoded in our DNA. Since genes code for proteins, scientists expected more genes would be necessary to produce a diverse set of proteins to cover the many functions that support the complexity of life. Thus, there is intense interest in studying proteomics, including post-translational modifications (how proteins change after translation from their genes), and their interactions (e.g. proteins binding together to form complex molecular machines) to fill the void in molecular diversity. The goal of mass spectrometry in proteomics is to determine the abundance and amino acid sequence of every protein in a biological sample. A mass spectrometer can determine mass/charge ratios and abundance for fragments of short peptides (which are subsequences of a protein); sequencing algorithms determine which peptides are most likely to have generated the fragmentation patterns observed in the mass spectrum, and protein identity is inferred from the peptides. My work improves the computational tools for mass spectrometry by removing limitations on present algorithms, simulating mass spectrometry instruments to facilitate algorithm development, and creating algorithms that approximate isotope distributions, deconvolve chimeric spectra, and predict protein-protein interactions. While most sequencing algorithms attempt to identify a single peptide per mass spectrum, multiple peptides are often fragmented together. Here, I present a method to deconvolve these chimeric mass spectra into their individual peptide components by examining the isotopic distributions of their fragments. First, I derived the equation to calculate the theoretical isotope distribution of a peptide fragment. Next, for cases where elemental compositions are not known, I developed methods to approximate the isotope distributions. Ultimately, I created a non-negative least squares model that deconvolved chimeric spectra and increased peptide-spectrum-matches by 15-30%. To improve the operation of mass spectrometer instruments, I developed software that simulates liquid chromatography-mass spectrometry data and the subsequent execution of custom data acquisition algorithms. The software provides an opportunity for researchers to test, refine, and evaluate novel algorithms prior to implementation on a mass spectrometer. Finally, I created a logistic regression classifier for predicting protein-protein interactions defined by affinity purification and mass spectrometry (APMS). The classifier increased the area under the receiver operating characteristic curve by 16% compared to previous methods. Furthermore, I created a web application to facilitate APMS data scoring within the scientific community. Doctor of Philosophy