Local matching learning of large scale biomedical ontologies
Large biomedical ontologies generally describe the same domain of interest, but use different modeling choices and vocabularies. Aligning these complex and heterogeneous ontologies is a tedious task. Matching systems must deliver high-quality results while coping with the large size of these resources. Ontology matching systems therefore have to address two problems: (i) handling the large size of the ontologies, and (ii) automating the alignment process. Ontology matching is difficult precisely because of this scale, and matching systems combine different types of matchers to deal with it. The main problems in aligning large biomedical ontologies are conceptual heterogeneity, the huge search space, and the reduced quality of the resulting alignments.
Ontology matching systems combine different matchers in order to reduce heterogeneity. This combination should define which matchers to combine and with what weights. Different matchers handle different types of heterogeneity, so matcher tuning should be automated by the matching system in order to obtain good matching quality. We propose an approach called "local matching learning" to address both the large size of the ontologies and the automation problem. We divide a large matching problem into a set of smaller local matching problems, and each local matching problem is aligned independently by a machine learning approach. We thereby reduce the huge search space to a set of smaller local matching tasks, each of which can be aligned efficiently to obtain better matching quality. Our partitioning approach relies on a novel multi-cut strategy that generates partitions that are neither oversized nor isolated, which allows us to overcome the problem of conceptual heterogeneity. The new partitioning algorithm is based on hierarchical agglomerative clustering (HAC) and generates a set of local matching tasks with a sufficient coverage ratio and no isolated partitions.
Each local matching task is aligned automatically using machine learning techniques: a local classifier aligns a single local matching task. The local classifiers are based on element-level and structure-level features, and the class attribute of each training set is labeled automatically using an external knowledge base. We apply a feature selection technique to each local classifier in order to select the matchers appropriate for its local matching task. This approach reduces alignment complexity and increases overall precision compared with traditional learning methods. We show that the partitioning approach outperforms current approaches in terms of precision, coverage ratio, and absence of isolated partitions. We evaluate the local matching learning approach through various experiments based on OAEI 2018 datasets and conclude that it pays off to split a large ontology matching task into a set of local matching tasks: the search space is reduced, which lowers the number of false negatives and false positives, and applying feature selection to each local classifier increases the recall of each local matching task.

Although a considerable body of research work has addressed the problem of ontology matching, few studies have tackled the large ontologies used in the biomedical domain. We introduce a fully automated local matching learning approach that breaks down a large ontology matching task into a set of independent local sub-matching tasks. This approach integrates a novel partitioning algorithm as well as a set of matching learning techniques. The partitioning method is based on hierarchical clustering and does not generate isolated partitions. The matching learning approach employs different techniques: (i) local matching tasks are independently and automatically aligned using their local classifiers, which are based on local training sets built from element-level and structure-level features, (ii) resampling techniques are used to balance each local training set, and (iii) feature selection techniques are used to automatically select the appropriate tuning parameters for each local matching context. Our local matching learning approach generates a set of combined alignments from each local matching task, and experiments show that a multiple local classifier approach outperforms conventional state-of-the-art approaches, which use a single classifier for the whole ontology matching task. In addition, focusing on context-aware local training sets based on local feature selection and resampling techniques significantly enhances the obtained results.
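A minimal sketch of this local matching learning pipeline, assuming scikit-learn is available; the concept embeddings, similarity features, and match labels below are hypothetical stand-ins for the thesis' real matchers and knowledge-base labeling. Concepts are first partitioned with hierarchical agglomerative clustering, and each local matching task then gets its own feature selection step and classifier.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Toy embeddings standing in for the concepts of a large source ontology.
concepts = rng.random((60, 8))

# 1) Partition the large matching problem into local sub-problems; every concept
#    belongs to exactly one cluster, so no partition is isolated.
partitions = AgglomerativeClustering(n_clusters=5).fit_predict(concepts)

# 2) For each partition, build a local training set of candidate correspondences
#    described by similarity features (one per matcher), select the most
#    informative features for this local context, and fit a local classifier.
local_classifiers = {}
for p in np.unique(partitions):
    X = rng.random((200, 6))                        # scores from 6 hypothetical matchers
    y = (X[:, :2].mean(axis=1) > 0.6).astype(int)   # placeholder match/no-match labels
    selector = SelectKBest(f_classif, k=3).fit(X, y)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(selector.transform(X), y)
    local_classifiers[p] = (selector, clf)

# 3) A new candidate correspondence is classified by its partition's local model.
pair = rng.random((1, 6))
selector, clf = local_classifiers[partitions[0]]
print("match" if clf.predict(selector.transform(pair))[0] else "no match")
```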
Survey: Models and Prototypes of Schema Matching
Schema matching is a critical problem in many applications that integrate data and information, achieve interoperability, or otherwise deal with schematic heterogeneity. Schema matching has evolved from manual, domain-specific work toward new models and methods that are semi-automatic and more general, so that users can be guided more effectively in generating a mapping between the elements of two schemas or ontologies. This paper summarizes the literature on schema matching models and prototypes from the last 25 years, describing the progress made as well as the research challenges and opportunities for new models, methods, and prototypes.
Valentine: Evaluating Matching Techniques for Dataset Discovery
Data scientists today search large data lakes to discover and integrate
datasets. In order to bring together disparate data sources, dataset discovery
methods rely on some form of schema matching: the process of establishing
correspondences between datasets. Traditionally, schema matching has been used
to find matching pairs of columns between a source and a target schema.
However, the use of schema matching in dataset discovery methods differs from
its original use. Nowadays schema matching serves as a building block for
indicating and ranking inter-dataset relationships. Surprisingly, although a
discovery method's success relies heavily on the quality of the underlying
matching algorithms, the latest discovery methods employ existing schema
matching algorithms in an ad-hoc fashion due to the lack of openly-available
datasets with ground truth, reference method implementations, and evaluation
metrics. In this paper, we aim to rectify the problem of evaluating the
effectiveness and efficiency of schema matching methods for the specific needs
of dataset discovery. To this end, we propose Valentine, an extensible
open-source experiment suite to execute and organize large-scale automated
matching experiments on tabular data. Valentine includes implementations of
seminal schema matching methods that we either implemented from scratch (due to
absence of open source code) or imported from open repositories. The
contributions of Valentine are: i) the definition of four schema matching
scenarios as encountered in dataset discovery methods, ii) a principled dataset
fabrication process tailored to the scope of dataset discovery methods and iii)
the most comprehensive evaluation of schema matching techniques to date,
offering insight into the strengths and weaknesses of existing techniques,
which can serve as a guide for employing schema matching in future dataset
discovery methods.
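As an illustration of the kind of matcher Valentine benchmarks (a toy instance-based baseline, not Valentine's actual API or any of its implemented methods), the sketch below ranks column pairs of two tables by the Jaccard overlap of their values:

```python
import pandas as pd

def rank_column_matches(source: pd.DataFrame, target: pd.DataFrame):
    """Return (source_col, target_col, score) triples, best matches first."""
    scores = []
    for s in source.columns:
        s_vals = set(source[s].dropna().astype(str))
        for t in target.columns:
            t_vals = set(target[t].dropna().astype(str))
            union = s_vals | t_vals
            jaccard = len(s_vals & t_vals) / len(union) if union else 0.0
            scores.append((s, t, jaccard))
    return sorted(scores, key=lambda x: x[2], reverse=True)

if __name__ == "__main__":
    movies = pd.DataFrame({"title": ["Alien", "Heat"], "year": [1979, 1995]})
    films = pd.DataFrame({"name": ["Alien", "Heat"], "released": [1979, 1995]})
    for s, t, score in rank_column_matches(movies, films):
        print(f"{s} <-> {t}: {score:.2f}")
```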
Reconciling Matching Networks of Conceptual Models
Conceptual models such as database schemas, ontologies or process models have been established as a means for effective engineering of information systems. Yet, for complex systems, conceptual models are created by a variety of stakeholders, which calls for techniques to manage consistency among the different views on a system. Techniques for model matching generate correspondences between the elements of conceptual models, thereby supporting effective model creation, utilization, and evolution. Although various automatic matching tools have been developed for different types of conceptual models, their results are often incomplete or erroneous. Automatically generated correspondences therefore need to be reconciled, i.e., validated by a human expert. We analyze the reconciliation process in a network setting, where a large number of conceptual models need to be matched. The network induced by the generated correspondences should then meet consistency expectations in terms of mutually reinforcing relations between the correspondences. We develop a probabilistic model to identify the most uncertain correspondences in order to guide the expert's validation work. We also show how to construct a set of high-quality correspondences, even if the expert does not validate all generated correspondences. We demonstrate the efficiency of our techniques on real-world datasets from the domains of schema matching and ontology alignment.
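A hedged sketch of the validation loop this suggests, with made-up probabilities standing in for the paper's network-level probabilistic model: repeatedly present the expert with the correspondence whose match probability is closest to 0.5, then keep the validated correspondences plus those the model is already confident about.

```python
def most_uncertain(correspondences):
    """correspondences: dict mapping (elem_a, elem_b) -> match probability."""
    return min(correspondences, key=lambda c: abs(correspondences[c] - 0.5))

def reconcile(correspondences, ask_expert, budget, accept_threshold=0.9):
    """Validate up to `budget` correspondences with the expert, then keep the
    validated ones plus those whose probability exceeds the threshold."""
    validated = {}
    remaining = dict(correspondences)
    for _ in range(min(budget, len(remaining))):
        c = most_uncertain(remaining)
        validated[c] = ask_expert(c)        # True/False from the human expert
        del remaining[c]
    kept = {c for c, ok in validated.items() if ok}
    kept |= {c for c, p in remaining.items() if p >= accept_threshold}
    return kept

# Example: three candidate correspondences, an oracle expert, a budget of 2.
probs = {("addr", "address"): 0.55, ("id", "identifier"): 0.95, ("id", "zip"): 0.48}
truth = {("addr", "address"): True, ("id", "identifier"): True, ("id", "zip"): False}
print(reconcile(probs, lambda c: truth[c], budget=2))
```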
Doctor of Philosophy
With the steady increase in online shopping, more and more consumers are resorting to Product Search Engines and shopping sites such as Yahoo! Shopping, Google Product Search, and Bing Shopping as their first stop for purchasing goods online. These sites act as intermediaries between shoppers and merchants and drive user experience by enabling faceted search, comparison of products based on their specifications, and ranking of products based on their attributes. The success of these systems relies heavily on the variety and quality of the products they present to users. In that sense, product catalogs are to online shopping what the Web index is to Web search. Therefore, comprehensive product catalogs are fundamental to the success of Product Search Engines. Given the large number of products and categories, and the speed at which they are released to the market, constructing catalogs and keeping them up to date becomes a challenging task, calling for automated techniques that do not rely on human intervention. The main goal of this dissertation is to automatically construct catalogs for product search engines. To achieve this goal, the following problems must be addressed by these search engines: (i) product synthesis, the creation of product instances that conform to the catalog schema; (ii) product discovery, the derivation of product instances for products whose schemata are not present in the catalog; and (iii) schema synthesis, the construction of schemata for new product categories. We propose an end-to-end framework that automates these tasks to a great extent. We present a detailed experimental evaluation using real data sets, which shows that our framework is effective, scales to a large number of products and categories, and is resilient to the noise inherent in Web data.
Evaluation of process model matching techniques
Business process models are commonly used to document a company's operations. They describe internal processes in a chronological and logical order. Business process model matching refers to the automatic detection of semantically similar correspondences between process models. The output of such matching techniques is the basis for many applications. Currently, most research effort is directed at improving the performance of these matching techniques. However, to support their further improvement, efficient and fair evaluation strategies are required. Moreover, information about the matching task, such as the complexity of a data set, has to be gathered. In the current literature, complexity is mostly associated with different levels of granularity, i.e., 1:m or n:m correspondences. However, the evaluation should also account for other complexity aspects of the matching task, for example the syntactical overlap of correspondences. Moreover, the evaluation of matching results strongly depends on the application. In this thesis, we therefore propose an application-dependent evaluation. On the one hand, we introduce a non-binary evaluation, which better reflects the uncertainty of a gold standard, and propose evaluation metrics based on this non-binary gold standard that take different application scenarios into account.
On the other hand, we propose a conceptually novel evaluation procedure that offers detailed information about the strengths and weaknesses of matchers without manually processing the matcher output. It therefore helps to find optimal application scenarios for specific matching techniques and can further serve as a basis for predictions about future matching tasks. We conduct experiments to show the insights gained from the introduced evaluation metrics and methods. Moreover, we apply the metrics to the OAEI 2016 and 2017 campaigns.
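One plausible way to realize such a non-binary evaluation, given as an illustration rather than the thesis' exact metric definitions: weight each gold-standard correspondence by its annotator support in [0, 1] and credit the matcher's output against that support instead of a crisp 0/1 gold standard.

```python
def non_binary_precision_recall(matches, gold_support):
    """matches: set of proposed correspondences.
    gold_support: dict mapping correspondence -> support in [0, 1],
    e.g. the fraction of annotators who confirmed the correspondence."""
    if not matches:
        return 0.0, 0.0
    credited = sum(gold_support.get(m, 0.0) for m in matches)
    precision = credited / len(matches)
    total_support = sum(gold_support.values())
    recall = credited / total_support if total_support else 0.0
    return precision, recall

gold = {("create order", "place order"): 1.0,
        ("check stock", "verify availability"): 0.6,
        ("ship goods", "deliver order"): 0.3}
proposed = {("create order", "place order"),
            ("check stock", "verify availability"),
            ("pay invoice", "settle bill")}
print(non_binary_precision_recall(proposed, gold))  # approx. (0.53, 0.84)
```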
Learning To Scale Up Search-Driven Data Integration
A recent movement to tackle the long-standing data integration problem is a compositional and iterative approach, termed “pay-as-you-go” data integration. Under this model, the objective is to immediately support queries over “partly integrated” data, and to enable the user community to drive integration of the data that relate to their actual information needs. Over time, data will be gradually integrated.
While the pay-as-you-go vision has been well-articulated for some time, only recently have we begun to understand how it can be manifested into a system implementation. One branch of this effort has focused on enabling queries through keyword search-driven data integration, in which users pose queries over partly integrated data encoded as a graph, receive ranked answers generated from data and metadata that is linked at query-time, and provide feedback on those answers. From this user feedback, the system learns to repair bad schema matches or record links.
Many real-world issues of uncertainty and diversity in search-driven integration remain open. Such tasks require a combination of human guidance and machine learning, and the challenge is how to make maximal use of limited human input. This thesis develops three methods to scale up search-driven integration by learning from expert feedback: (1) active learning techniques to repair links from small amounts of user feedback; (2) collaborative learning techniques to combine users' conflicting feedback; and (3) debugging techniques to identify where data experts could best improve integration quality. We implement these methods within the Q System, a prototype of search-driven integration, and validate their effectiveness over real-world datasets.
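A minimal sketch of point (1), pool-based active learning with uncertainty sampling; the link features and the simulated user below are placeholders, not the Q System's actual features or feedback interface.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.random((300, 5))                              # similarity features per candidate link
true_labels = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # hidden ground truth (the "user" answers)

# Seed with a few links of each class that the user has already labeled.
labeled = list(np.where(true_labels == 1)[0][:5]) + list(np.where(true_labels == 0)[0][:5])
pool = [i for i in range(len(X)) if i not in set(labeled)]

for _ in range(20):                                   # feedback budget: 20 questions
    clf = LogisticRegression().fit(X[labeled], true_labels[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(probs - 0.5)))] # the link the model is least sure of
    labeled.append(query)                             # ask the user; here the oracle answers
    pool.remove(query)

print("accuracy on the unqueried pool:",
      (clf.predict(X[pool]) == true_labels[pool]).mean())
```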
- …