8 research outputs found

    From Theoretical Framework To Generic Semantic Measures Library

    International audience. Thanks to the ever-increasing use of the Semantic Web, a growing number of entities (e.g. documents) are characterized by non-ambiguous meanings. Based on this characterization, entities can subsequently be compared using semantic measures. A plethora of measures have been designed, given their critical importance in numerous treatments relying on ontologies. However, the improvement and use of semantic measures are currently hampered by the lack of a dedicated theoretical framework and of an extensive generic software solution. To meet these needs, this paper presents a unified theoretical framework of graph-based semantic measures, from which we developed the open source Semantic Measures Library and toolkit, a solution that paves the way for straightforward design, computation and analysis of semantic measures for both users and developers. Downloads, documentation and technical support are available at the dedicated website http://www.semantic-measures-library.org

    Sélection Robuste de Mesures de Similarité Sémantique à partir de Données Incertaines d'Expertise (Robust Selection of Semantic Similarity Measures from Uncertain Expert Data)

    National audience. Knowledge-based semantic measures are a cornerstone for exploiting ontologies, not only for exact inference or retrieval but also for data analysis and inexact search. Abstract theoretical frameworks have recently been proposed to study the large diversity of available measures; they demonstrate that groups of measures are particular instantiations of general parameterized functions. In this paper, we study how such frameworks can be used to support the selection and design of measures. Based on (i) a theoretical framework unifying the measures, (ii) a software solution implementing this framework and (iii) a domain-specific benchmark, we define a semi-supervised learning technique to identify the best measures for a concrete application. Next, considering uncertainty in both the experts' judgments and the measure selection process, we extend this proposal to the robust selection of the semantic measures that best resist these uncertainties, i.e. those least perturbed by errors in expert evaluation. We illustrate our approach through a real use case in the biomedical domain.
    Keywords: unifying framework, measure robustness, expert uncertainty, semantic similarity measures, ontologies
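The abstract above does not spell out its selection procedure, so the following is only a minimal sketch of one simple reading of "robust selection": among candidate measures, prefer the one whose correlation with expert scores degrades least when those scores are perturbed. The function names, the use of Pearson correlation, the worst-case criterion and the toy data are all assumptions made for illustration; this is not the semi-supervised technique of the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def robust_choice(measure_scores, expert_variants):
    """Pick the measure with the best worst-case correlation.

    measure_scores: {name: scores} for each candidate measure (hypothetical).
    expert_variants: plausible versions of the expert scores, e.g. the
    nominal judgments plus perturbed copies modelling expert uncertainty.
    """
    worst = {
        name: min(pearson(scores, variant) for variant in expert_variants)
        for name, scores in measure_scores.items()
    }
    best = max(worst, key=worst.get)
    return best, worst

# Toy data: nominal expert scores and one perturbed variant (assumed values).
expert_variants = [[1, 2, 3, 4], [1, 3, 2, 4]]
measure_scores = {
    "m1": [1, 2, 3, 4],      # perfect on the nominal scores
    "m2": [1, 2.5, 2.5, 4],  # slightly worse nominally, but more stable
}
best, worst = robust_choice(measure_scores, expert_variants)
```

In this toy case `m1` matches the nominal expert scores exactly but drops under the perturbed variant, so the worst-case criterion prefers `m2`.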

    Benchmarking a new semantic similarity measure using fuzzy clustering and reference sets: Application to cancer expression data

    International audience. Clustering algorithms rely on a similarity or distance measure that directs the grouping of similar objects into the same cluster and the separation of distant objects into distinct clusters. Our recently described semantic similarity measure (IntelliGO), which applies to the functional comparison of genes, is tested here for the first time in clustering experiments. The dataset is composed of genes contained in a benchmarking collection of reference sets (KEGG pathways). Heatmap visualization of hierarchical clustering illustrates the advantages of using the IntelliGO measure over three other similarity measures. Because genes often belong to more than one cluster in functional clustering, fuzzy C-means clustering is also applied to the dataset. The choice of the optimal number of clusters and the clustering performance are evaluated by the F-score method using the reference sets. Overlap analysis is proposed as a method for exploiting the matching between clusters and reference sets. Finally, our method is applied to a list of genes found dysregulated in colorectal cancer samples. In this case, the reference sets are provided by expression profiles. Overlap analysis between these profiles and the functional clusters obtained with fuzzy C-means clustering makes it possible to characterize subsets of genes displaying consistent function and expression profiles.
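As a rough illustration of how clusters can be scored against reference sets with an F-score, here is a minimal sketch. The abstract does not give the exact formulation used by the authors, so the standard harmonic mean of precision and recall is assumed; the gene identifiers, cluster names and the `best_matches` helper are hypothetical.

```python
def f_score(cluster, reference):
    """Harmonic mean of precision and recall between two gene sets."""
    overlap = len(cluster & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def best_matches(clusters, references):
    """For each reference set, return (best cluster name, its F-score)."""
    return {
        ref_name: max(
            ((name, f_score(genes, ref)) for name, genes in clusters.items()),
            key=lambda pair: pair[1],
        )
        for ref_name, ref in references.items()
    }

# Hypothetical data; g3 sits in two clusters, as fuzzy clustering permits.
clusters = {"c1": {"g1", "g2", "g3"}, "c2": {"g3", "g4"}}
references = {"pathway_A": {"g1", "g2"}, "pathway_B": {"g3", "g4", "g5"}}
matches = best_matches(clusters, references)
```

Averaging the best F-score over all reference sets for several candidate cluster counts is one plausible way to compare those counts, again as an assumption and not a description of the paper's protocol.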

    IntelliGO: a new vector-based semantic similarity measure including annotation origin

    International audience. The Gene Ontology (GO) is a well known controlled vocabulary describing the biological process, molecular function and cellular component aspects of gene annotation. It has become a widely used knowledge source in bioinformatics for annotating genes and measuring their semantic similarity. These measures generally involve the GO graph structure, the information content of GO aspects, or a combination of both. However, only a few of the semantic similarity measures described so far can handle GO annotations differently according to their origin (i.e. their evidence codes). RESULTS: We present here a new semantic similarity measure called IntelliGO which integrates several complementary properties in a novel vector space model. The coefficients associated with each GO term that annotates a given gene or protein include its information content as well as a customized value for each type of GO evidence code. The generalized cosine similarity measure, used for calculating the dot product between two vectors, has been rigorously adapted to the context of the GO graph. The IntelliGO similarity measure is tested on two benchmark datasets consisting of KEGG pathways and Pfam domains grouped as clans, considering the GO biological process and molecular function terms, respectively, for a total of 683 yeast and human genes and involving more than 67,900 pair-wise comparisons. The ability of the IntelliGO similarity measure to express the biological cohesion of sets of genes compares favourably to four existing similarity measures. For inter-set comparison, it consistently discriminates between distinct sets of genes. Furthermore, the IntelliGO similarity measure allows the influence of weights assigned to evidence codes to be checked. Finally, the results obtained with a complementary reference technique give intermediate but correct correlation values with the sequence similarity, Pfam, and Enzyme classifications when compared to previously published measures. CONCLUSIONS: The IntelliGO similarity measure provides a customizable and comprehensive method for quantifying gene similarity based on GO annotations. It also displays a robust set-discriminating power which suggests it will be useful for functional clustering. AVAILABILITY: An on-line version of the IntelliGO similarity measure is available at: http://bioinfo.loria.fr/Members/benabdsi/intelligo_project
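To make the idea of a generalized cosine similarity concrete, here is a minimal sketch in which the basis vectors (one per GO term) are not assumed orthogonal, so the dot product is weighted by a term-to-term similarity matrix. The matrix `S`, the coefficient vectors and the function names are hypothetical; how IntelliGO actually derives term similarities from the GO graph and weights from information content and evidence codes is not given in the abstract and is not reproduced here.

```python
import math

def generalized_dot(u, v, S):
    """Dot product in a non-orthogonal basis: sum_i sum_j u[i] * v[j] * S[i][j]."""
    return sum(
        u[i] * v[j] * S[i][j]
        for i in range(len(u))
        for j in range(len(v))
    )

def generalized_cosine(u, v, S):
    """Cosine similarity using the generalized dot product above."""
    denom = math.sqrt(generalized_dot(u, u, S)) * math.sqrt(generalized_dot(v, v, S))
    return generalized_dot(u, v, S) / denom if denom else 0.0

# Assumed similarities between three GO terms (symmetric, 1.0 on the diagonal).
S = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Assumed annotation coefficients: each gene is annotated by a single term.
gene_a = [1.0, 0.0, 0.0]
gene_b = [0.0, 1.0, 0.0]
gene_c = [0.0, 0.0, 1.0]

sim_ab = generalized_cosine(gene_a, gene_b, S)  # related terms
sim_ac = generalized_cosine(gene_a, gene_c, S)  # unrelated terms
```

With an identity matrix for `S` this reduces to the ordinary cosine, under which `gene_a` and `gene_b` would score zero despite being annotated by related terms.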

    Introducing semantic variables in mixed distance measures: Impact on hierarchical clustering

    It is now well established that taking into account the semantic information available for categorical variables appreciably improves the meaningfulness of the results of an analysis. The paper presents a generalization of Gibert's mixed metrics, which originally handled numerical and categorical variables, to also include semantic variables. Semantic variables are defined as categorical variables related to a reference ontology (ontologies are formal structures that model semantic relationships between the concepts of a certain domain). The superconcept-based distance (SCD) is introduced to compare semantic variables, taking into account the information provided by the reference ontology. A benchmark shows the good performance of SCD with respect to other proposals from the literature for comparing semantic features. Gibert's mixed metrics is then generalized to incorporate SCD. Finally, two real applications based on tourism data show the impact of the generalized Gibert's metrics on clustering procedures and, in consequence, the impact of taking the reference ontology into account in clustering. The main conclusion is that the reference ontology, when available, can appreciably improve the meaningfulness of the final clusters. Peer reviewed. Postprint (published version).