Metrics for GO based protein semantic similarity: a systematic evaluation
Abstract
Background: Several semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which approach to semantic similarity is best in this context, since there has been no conclusive evaluation of the various measures. A further open question is whether electronic annotations should or should not be used in semantic similarity calculations.
Results: We conducted a systematic evaluation of GO-based semantic similarity measures, using their relationship with sequence similarity to quantify performance, and assessed the influence of electronic annotations by testing the measures both with and without them. We verified that the relationship between semantic and sequence similarity is not linear, but is well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture identical behaviour but differ in resolution, we used resolution as the main criterion of evaluation.
Conclusions: This work provides a basis for comparing several semantic similarity measures and can aid researchers in choosing the most adequate measure for their work. We found that the hybrid simGIC was the measure with the best overall performance, followed by Resnik's measure using a best-match average combination approach. We also found that the average and maximum combination approaches are problematic, since both are inherently influenced by the number of terms being combined. We suspect that data circularity directly influences the results that include electronic annotations, as a consequence of functional inference from sequence similarity.
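The simGIC measure named above can be sketched compactly: it is an information-content-weighted Jaccard index over the two proteins' (ancestor-closed) GO term sets. The toy term frequencies and GO identifiers below are hypothetical, chosen only to make the arithmetic concrete; a real implementation would derive them from an annotation corpus.

```python
import math

# Hypothetical toy corpus: annotation counts per GO term, already
# propagated up the ontology so every term's ancestors are included.
TERM_FREQ = {"GO:0003674": 100, "GO:0005488": 60, "GO:0003677": 20, "GO:0016787": 25}
TOTAL = 100  # total annotated gene products in the toy corpus

def ic(term):
    """Information content: negative log of the term's annotation probability."""
    return -math.log(TERM_FREQ[term] / TOTAL)

def sim_gic(terms_a, terms_b):
    """simGIC: IC-weighted Jaccard over two ancestor-closed GO term sets."""
    inter = sum(ic(t) for t in terms_a & terms_b)
    union = sum(ic(t) for t in terms_a | terms_b)
    return inter / union if union else 0.0

# Two hypothetical proteins sharing the two most generic terms:
p1 = {"GO:0003674", "GO:0005488", "GO:0003677"}
p2 = {"GO:0003674", "GO:0005488", "GO:0016787"}
print(round(sim_gic(p1, p2), 3))  # → 0.146
```

Because the shared terms here are generic (low IC), they contribute little weight, which is exactly how simGIC discounts uninformative agreement.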
Applications of semantic similarity measures
There has been much interest in uncovering protein-protein interactions and
their underlying domain-domain interactions. Many experimental techniques
have been developed, for example yeast two-hybrid screening and tandem
affinity purification. Since it is time-consuming and expensive to perform
exhaustive experimental screens, in silico methods are used for predicting
interactions. However, all experimental and computational methods have
considerable false positive and false negative rates. Therefore, it is
necessary to validate experimentally determined and predicted interactions.
One possibility for the validation of interactions is the comparison of the
functions of the proteins or domains. Gene Ontology (GO) is widely accepted
as a standard vocabulary for functional terms, and is used for annotating
proteins and protein families with biological processes and their molecular
functions. This annotation can be used for a functional comparison of
interacting proteins or domains using semantic similarity measures.
Another application of semantic similarity measures is the prioritization
of disease genes. It is known that functionally similar proteins are often
involved in the same or similar diseases. Therefore, functional similarity
is used for predicting disease associations of proteins.
In the first part of my talk, I will introduce some semantic and functional
similarity measures that can be used for comparison of GO terms and
proteins or protein families. Then, I will show their application for
determining a confidence threshold for domain-domain interaction
predictions. Additionally, I will present FunSimMat
(http://www.funsimmat.de/), a comprehensive resource of functional
similarity values available on the web. In the last part, I will introduce
the problem of comparing diseases, and a first attempt to apply functional
similarity measures based on GO to this problem.
A benchmark for biomedical knowledge graph based similarity
Master's thesis in Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2020.
Biomedical knowledge graphs are crucial to support data-intensive applications in the life sciences and healthcare. One of the most common applications of knowledge graphs in the life sciences is to support the comparison of entities in the graph through their ontological descriptions. These descriptions support the calculation of semantic similarity between two entities, and finding their similarities and differences is a cornerstone technique for several applications, ranging from the prediction of protein-protein interactions to the discovery of associations between diseases and genes and the prediction of the cellular localization of proteins, among others. In the last decade there has been a considerable effort in developing semantic similarity measures for biomedical knowledge graphs, but research in this area has so far focused on the comparison of relatively small sets of entities.
Given the wide range of applications for semantic similarity measures, it is essential to support their large-scale evaluation. However, this is not trivial, since there is no gold standard for biological entity similarity. One possible solution is to compare these measures to other measures or proxies of similarity. Biological entities can be compared through different lenses, for instance the sequence and structural similarity of two proteins, or the metabolic pathways affected by two diseases. These measures relate to relevant characteristics of the underlying entities, so they can help in understanding how well semantic similarity approaches capture entity similarity. The goal of this work is to develop a benchmark for semantic similarity measures, composed of data sets and automated evaluation methods. The benchmark should support the large-scale evaluation of semantic similarity measures for biomedical entities, based on their correlation with different properties of those entities. To achieve this goal, a methodology for developing benchmark data sets for semantic similarity was devised and applied to two knowledge graphs: proteins annotated with the Gene Ontology and genes annotated with the Human Phenotype Ontology. The benchmark explores proxies of similarity based on protein sequence similarity, protein molecular function similarity, protein-protein interactions and phenotype-based gene similarity, and provides semantic similarity computations with representative state-of-the-art measures for a comparative evaluation. This resulted in a benchmark comprising 21 data sets of varying sizes, covering four species at different levels of annotation completeness, together with evaluation techniques fitted to the data sets' characteristics.
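The correlation-against-a-proxy evaluation described above is typically done with a rank correlation, since the relationship between semantic and proxy similarity need not be linear. The sketch below implements Spearman's rho from scratch over hypothetical paired scores (semantic similarity vs. a sequence-similarity proxy); the numbers are illustrative only.

```python
def rank(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical paired scores for five protein pairs:
sem = [0.91, 0.15, 0.64, 0.42, 0.80]  # semantic similarity
seq = [0.88, 0.10, 0.55, 0.60, 0.75]  # sequence-similarity proxy
print(round(spearman(sem, seq), 3))  # → 0.9
```

A high rho means the measure orders entity pairs the same way the proxy does, even if the two scales are very different.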
Self-adaptive GA, quantitative semantic similarity measures and ontology-based text clustering
As common clustering algorithms use the vector space model (VSM) to represent documents, the conceptual relationships between related terms that do not co-occur literally are ignored. To overcome this problem, this article proposes a genetic algorithm-based clustering technique, named GA clustering, in conjunction with ontology. In general, ontology measures can be partitioned into two categories: thesaurus-based methods and corpus-based methods. We take advantage of the hierarchical structure and broad-coverage taxonomy of WordNet as the thesaurus-based ontology. However, corpus-based methods are rather complicated to handle in practical applications, so we propose a transformed latent semantic analysis (LSA) model as the corpus-based method in this paper. Moreover, two hybrid strategies, combining the various similarity measures, are implemented in the clustering experiments. The results show that our GA clustering algorithm, in conjunction with the thesaurus-based and LSA-based methods, clearly outperforms GA clustering with other similarity measures. Moreover, the superiority of the proposed GA clustering algorithm over the commonly used k-means algorithm and the standard GA is demonstrated by improvements in clustering performance.
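The VSM limitation named in this abstract is easy to demonstrate: two documents about the same topic that share no literal terms get cosine similarity zero, while mapping terms to shared concepts (as a thesaurus such as WordNet allows) recovers the relatedness. The synonym table below is a hypothetical stand-in for a real WordNet synset lookup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two documents on the same topic with no literal term overlap:
d1 = {"car": 2, "engine": 1}
d2 = {"automobile": 2, "motor": 1}
print(cosine(d1, d2))  # → 0.0: pure VSM sees nothing in common

# Hypothetical synonym table standing in for WordNet synsets:
SYNSET = {"car": "vehicle", "automobile": "vehicle", "engine": "engine", "motor": "engine"}

def conceptualize(doc):
    """Collapse synonymous terms onto one concept, summing their weights."""
    out = {}
    for term, w in doc.items():
        c = SYNSET.get(term, term)
        out[c] = out.get(c, 0) + w
    return out

print(cosine(conceptualize(d1), conceptualize(d2)))  # ≈ 1.0 after concept mapping
```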
Ontology-based technical skill similarity
Online job boards have become a major platform for technical talent procurement and job search. These job portals have given rise to challenging matching and search problems. The core matching or search happens between the technical skills in the job requirements and the candidate's profile or keywords. The extensive list of technical skills and their polyonymous nature make direct keyword matching less effective. This results in substandard job matching or search results, which can miss a closely matching candidate because the candidate does not list the exact skills. It is important to use a semantic similarity measure between skills to improve the relevance of the results. This paper proposes a semantic similarity measure between technical skills using a knowledge-based approach. The approach builds an ontology using DBpedia and uses it to derive a similarity score: feature-based ontology similarity measures produce a score between two skills, and the ontology also helps in resolving a base skill from its multiple representations. The paper discusses the implementation of the custom ontology, the similarity-measuring system, and the performance of the system in comparing technical skills. The proposed approach performs better than the Resumatcher system in finding the similarity between skills.
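Feature-based ontology similarity, as used in this abstract, typically scores two concepts by comparing their feature sets, e.g. with the Jaccard index or its Tversky generalization. The skill feature sets below are hypothetical illustrations of what a DBpedia-derived ontology might attach to each skill, not the paper's actual data.

```python
def jaccard(a, b):
    """Feature-based similarity: shared features over all features."""
    return len(a & b) / len(a | b) if a | b else 0.0

def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index: generalizes Jaccard with weights on each set's distinctive features."""
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))

# Hypothetical feature sets for three technical skills:
postgres = {"database", "relational", "sql", "open-source"}
mysql    = {"database", "relational", "sql", "open-source", "oracle-owned"}
haskell  = {"programming-language", "functional", "compiled"}

print(round(jaccard(postgres, mysql), 2))    # → 0.8  (near-synonymous skills)
print(round(jaccard(postgres, haskell), 2))  # → 0.0  (unrelated skills)
```

The Tversky weights let the measure be asymmetric, which is useful when one skill (e.g. a specific tool) should count as a close match for a broader one but not vice versa.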
Dealing with uncertain entities in ontology alignment using rough sets
This is the author's accepted manuscript; the final published article is available from the publisher. Copyright © 2012 IEEE.
Ontology alignment facilitates the exchange of knowledge among heterogeneous data sources. Many approaches to ontology alignment use multiple similarity measures to map entities between ontologies. However, dealing with uncertain entities, for which the employed similarity measures produce conflicting results on the similarity of the mapped entities, remains a key challenge. This paper presents OARS, a rough-set based approach to ontology alignment that achieves a high degree of accuracy in situations where uncertainty arises from the conflicting results generated by different similarity measures. OARS employs a combinational approach and considers both lexical and structural similarity measures. OARS is extensively evaluated with the benchmark ontologies of the Ontology Alignment Evaluation Initiative (OAEI) 2010; it performs best in recall in comparison with a number of alignment systems while generating comparable precision.
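The rough-set idea behind this kind of approach can be illustrated in a few lines: treat each similarity measure as a vote, put mappings that every measure accepts into the lower approximation (definitely aligned), mappings that at least one measure accepts into the upper approximation, and leave the difference as the boundary region of uncertain, conflicting cases. This is a hedged sketch of the general rough-set notion, not the OARS implementation; the entity pairs, measures, and threshold are all hypothetical.

```python
THRESHOLD = 0.7  # hypothetical acceptance threshold per measure

# Hypothetical lexical and structural scores for candidate entity mappings:
scores = {
    ("Author", "Writer"):   {"lexical": 0.2, "structural": 0.9},   # conflicting
    ("Paper", "Article"):   {"lexical": 0.8, "structural": 0.85},  # agreement: yes
    ("Price", "Publisher"): {"lexical": 0.1, "structural": 0.2},   # agreement: no
}

def votes(measure_scores):
    """One boolean accept/reject vote per similarity measure."""
    return [s >= THRESHOLD for s in measure_scores.values()]

lower = {pair for pair, m in scores.items() if all(votes(m))}   # certain mappings
upper = {pair for pair, m in scores.items() if any(votes(m))}   # possible mappings
boundary = upper - lower  # conflicting evidence: needs further resolution

print(sorted(lower))     # mappings all measures agree on
print(sorted(boundary))  # uncertain mappings the measures disagree on
```

Only the boundary region needs extra evidence (e.g. structural context), which is what makes the decomposition useful when measures conflict.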