360 research outputs found

    An unsupervised data-driven method to discover equivalent relations in large linked datasets

    This article addresses a number of limitations of state-of-the-art methods of Ontology Alignment: 1) they primarily address concepts and entities while relations are less well-studied; 2) many build on the assumption of the ‘well-formedness’ of ontologies, which is not necessarily true in the domain of Linked Open Data; 3) few have looked at schema heterogeneity from a single source, which is also a common issue, particularly in very large Linked Datasets created automatically from heterogeneous resources or integrated from multiple datasets. We propose a domain- and language-independent and completely unsupervised method to align equivalent relations across schemata based on their shared instances. We introduce a novel similarity measure able to cope with unbalanced population of schema elements, an unsupervised technique to automatically decide the similarity threshold to assert equivalence for a pair of relations, and an unsupervised clustering process to discover groups of equivalent relations across different schemata. Although the method is designed for aligning relations within a single dataset, it can also be adapted for cross-dataset alignment where sameAs links between datasets have been established. Using three gold standards created based on DBpedia, we obtain encouraging results from a thorough evaluation involving four baseline similarity measures and over 15 comparative models based on variants of the proposed method. The proposed method improves significantly over baseline models in terms of F1 measure (mostly between 7% and 40%), and it always scores the highest precision and is also among the top performers in terms of recall. We also make public the datasets used in this work, which we believe make the largest collection of gold standards for evaluating relation alignment in the LOD context.
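
To make the idea of comparing relations "based on their shared instances" concrete, here is a minimal sketch. It uses plain Jaccard overlap of subject-object pairs as a stand-in; the abstract does not specify the proposed similarity measure, so this is not the authors' method. The relation names, the toy triples, and the helper `rank_candidates` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (subject, object)


def jaccard(a: Set[Pair], b: Set[Pair]) -> float:
    """Fraction of subject-object pairs that two relations share."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical toy data: subject-object pairs grouped by relation name.
triples: Dict[str, Set[Pair]] = {
    "dbo:birthPlace": {("Ada", "London"), ("Alan", "London"), ("Kurt", "Brno")},
    "dbp:placeOfBirth": {("Ada", "London"), ("Kurt", "Brno")},
    "dbo:deathPlace": {("Alan", "Wilmslow"), ("Kurt", "Princeton")},
}


def rank_candidates(relation: str) -> List[Tuple[str, float]]:
    """Rank every other relation by its instance overlap with `relation`."""
    base = triples[relation]
    scored = [
        (other, jaccard(base, pairs))
        for other, pairs in triples.items()
        if other != relation
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_candidates("dbo:birthPlace"):
        print(f"{name}: {score:.2f}")
```

Plain Jaccard penalises a pair of relations when one is much more heavily populated than the other, which is the kind of imbalance the abstract says the proposed measure is designed to cope with.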

    Parallel Mapper

    The construction of Mapper has emerged in the last decade as a powerful and effective topological data analysis tool that approximates and generalizes other topological summaries, such as the Reeb graph, the contour tree, and split and join trees. In this paper, we study the parallel analysis of the construction of Mapper. We give a provably correct parallel algorithm to execute Mapper on multiple processors and discuss the performance results that compare our approach to a reference sequential Mapper implementation. We report performance experiments that demonstrate the efficiency of our method.

    Una taxonomía multidimensional de Estados desarrollistas

    This paper proposes a new approach to the classification of Developmental States (DS) based on their public efforts to foster human development. We conceptualize DS within a multidimensional framework that includes three main dimensions (economic, social and democratic), and run a hierarchical cluster analysis for 112 countries in order to build a multidimensional taxonomy of DS. We propose a country classification and characterize three country groups with different developmental public efforts: i) the human development States; ii) the unbalanced developmental States; and iii) the non-developmental States. Our multidimensional taxonomy offers a more complex understanding of the variety of public efforts devoted to promoting human development, thus overcoming the restricted, purely economic conception of DS, which is mainly focused on the East Asian region. Key words: developmental states; multidimensional taxonomy; social equality and democratic participation; welfare states; economic growth.

    The 2D shape structure dataset: A user annotated open access database

    In this paper we present the 2D Shape Structure database, a public, user-generated dataset of 2D shape decompositions into a hierarchy of shape parts with geometric relationships retained. It is the outcome of a large-scale user study obtained by crowdsourcing, involving over 1200 shapes in 70 shape classes, and 2861 participants. A total of 41953 annotations have been collected, with at least 24 annotations per shape. For each shape, user decompositions into main shape, one or more levels of parts, and a level of details are available. This database reinforces a philosophy that understanding shape structure as a whole, rather than in the separated categories of parts decomposition, parts hierarchy, and analysis of relationships between parts, is crucial for full shape understanding. We provide initial statistical explorations of the data to determine representative ("mean") shape annotations and to determine the number of modes in the annotations. The primary goal of the paper is to make this rich and complex database openly available (through the website http://2dshapesstructure.github.io/index.html), providing the shape community with a ground truth of human perception of holistic shape structure.

    Recovering the number of clusters in data sets with noise features using feature rescaling factors

    In this paper we introduce three methods for re-scaling data sets aiming at improving the likelihood of clustering validity indexes to return the true number of spherical Gaussian clusters with additional noise features. Our methods obtain feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance at different clusters. We experiment with the Silhouette (using squared Euclidean, Manhattan, and the pth power of the Minkowski distance), Dunn’s, Calinski–Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set.
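
As an illustration of how a validity index is used to choose among candidate partitions, the sketch below computes the mean Silhouette width with squared Euclidean distance in plain Python. It covers only the index, not the feature re-scaling methods the abstract introduces, which are not described there. The toy points and both candidate labelings are hypothetical.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]


def sq_euclidean(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette(
    points: List[Point],
    labels: List[int],
    dist: Callable[[Point, Point], float] = sq_euclidean,
) -> float:
    """Mean Silhouette width of a partition (needs at least two clusters)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to each cluster, leaving p itself out.
        mean_to = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_to:
            scores.append(0.0)  # singleton cluster: width is 0 by convention
            continue
        a = mean_to[own]  # cohesion
        b = min(v for c, v in mean_to.items() if c != own)  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_clusters = [0, 0, 0, 1, 1, 1]
three_clusters = [0, 0, 1, 2, 2, 2]  # splits the first group

if __name__ == "__main__":
    print("k=2:", round(silhouette(points, two_clusters), 3))
    print("k=3:", round(silhouette(points, three_clusters), 3))
```

On this toy data the two-cluster labeling scores higher than the three-cluster one, so the index would select two clusters. Passing a different `dist` function swaps in another distance such as Manhattan.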

    MaxMin Linear Initialization for Fuzzy C-Means

    Clustering is an extensive research area in data science. The aim of clustering is to discover groups and to identify interesting patterns in datasets. Crisp (hard) clustering considers that each data point belongs to one and only one cluster. However, it is inadequate as some data points may belong to several clusters, as is the case in text categorization. Thus, we need more flexible clustering. Fuzzy clustering methods, where each data point can belong to several clusters, are an interesting alternative. Yet, seeding iterative fuzzy algorithms to achieve high quality clustering is an issue. In this paper, we propose a new linear and efficient initialization algorithm, MaxMin Linear, to deal with this problem. Then, we validate our theoretical results through extensive experiments on a variety of numerical real-world and artificial datasets. We also test several validity indices, including a new validity index that we propose, Transformed Standardized Fuzzy Difference (TSFD).
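
For readers unfamiliar with seeding, the sketch below shows the classic maxmin (farthest-first) heuristic for picking initial cluster centres. The abstract does not describe how MaxMin Linear works, so this should be read as the well-known baseline idea the name suggests, not as the proposed algorithm. The function `maxmin_seeds`, its `first` parameter, and the toy points are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def maxmin_seeds(points: List[Point], k: int, first: int = 0) -> List[int]:
    """Return indices of k seeds chosen by the farthest-first rule.

    Each new seed is the point whose distance to its nearest
    already-chosen seed is largest. Runs in O(n * k) distance evaluations.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    seeds = [first]
    # nearest[i] = distance from point i to its closest seed so far
    nearest = [sq_dist(p, points[first]) for p in points]
    while len(seeds) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], sq_dist(p, points[nxt]))
    return seeds


# Hypothetical toy data: three loose groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]

if __name__ == "__main__":
    print(maxmin_seeds(points, k=3))
```

The chosen points would then serve as the initial centres for an iterative algorithm such as fuzzy c-means. The result depends on the arbitrary choice of the first seed.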

    Chronic pelvic pain in women of reproductive and post-reproductive age : a population-based study

    Background: Epidemiological studies on chronic pelvic pain (CPP) have focused on women of reproductive age. We aimed to determine the prevalence of CPP in adult women and the differences in associated factors between women of reproductive age and older women. We also aimed to determine whether distinct subgroups existed among CPP cases. Methods: A cross-sectional postal survey was conducted among 5300 randomly selected women aged ≥25 years resident in the Grampian region, UK. Multivariable logistic regression was used to determine pregnancy-related and psychosocial factors associated with CPP. To identify subgroups of CPP cases, we performed cluster analysis using variables of pain severity, psychosocial factors and pain coping strategies. Results: Of 2088 participants, 309 (14.8%) reported CPP. CPP was significantly associated with being of reproductive age (odds ratio (OR) 2.43, 95% CI 1.69–3.48), multiple non-pain somatic symptoms (OR 3.58, 95% CI 2.23–5.75), having fatigue (OR mild 1.74, 95% CI 1.24–2.44; moderate/severe 1.82, 95% CI 1.25–2.63) and having depression (OR 1.61, 95% CI 1.09–2.38). CPP was less associated with multiple non-pain somatic symptoms in women of reproductive age compared to older women (interaction OR 0.51, 95% CI 0.28–0.92). We identified two clusters of CPP cases: those having little/no psychosocial distress and those having high psychosocial distress. Conclusion: CPP is common in both age groups, though women of reproductive age are more likely to report it. Heightened somatic awareness may be more strongly associated with CPP in older women. There are distinct groups of CPP cases characterized by the absence/presence of psychosocial distress.

    Contextual and Behavioral Customer Journey Discovery Using a Genetic Approach

    With the advent of new technologies and the increase in customers’ expectations, services are becoming more complex. This complexity calls for new methods to understand, analyze, and improve service delivery. Summarizing customers’ experience using representative journeys that are displayed on a Customer Journey Map (CJM) is one of these techniques. We propose a genetic algorithm that automatically builds a CJM from raw customer experience recorded in a database. Mining representative journeys can be seen as a clustering task where both the sequence of activities and some contextual data (e.g., demographics) are considered when measuring the similarity between journeys. We show that our genetic approach outperforms traditional ways of handling this clustering task. Moreover, we apply our algorithm on a real dataset to highlight the benefit of using a genetic approach.