
    XML Matchers: approaches and challenges

    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was investigated mainly for classical database models (e.g., E/R schemas, relational databases, etc.). However, in recent years, the widespread adoption of XML in the most disparate application fields has pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, which aim to find semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them to DTDs/XSDs; they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs affect the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers. Comment: 34 pages, 8 tables, 7 figures

    Schema-agnostic entity retrieval in highly heterogeneous semi-structured environments

    [no abstract]

    Reasoning & Querying – State of the Art

    Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the internet, where keyword search is used in many applications (e.g., search engines), has familiarized casual users with using keyword queries to retrieve information. Unlike this easy-to-use querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming to enable simple querying of semi-structured data, which is relevant, e.g., in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF.

    Detecting semantically related concepts in a SOA integration scenario

    In this paper, we present an approach to detecting semantically related concepts in a service-oriented environment. This method is essential when creating collaborative business processes. Standard enterprise application systems such as enterprise resource planning (ERP), customer relationship management (CRM), supply chain management (SCM), etc., offer many opportunities for application interoperability. System integrators assign a set of services from various application systems to the integration scenario. A well-defined discovery process can detect these services. Nevertheless, building an operable business process requires the mapping of these services in the data schema used in the business process. This mapping results in a global understanding of relevant business concepts in the integration scenario. This paper focuses on the identification of semantically relevant concepts in different schemas in the participating services. A short overview of our integration platform and methodology is also included.

    Approximating expressive queries on graph-modeled data: The GeX approach

    We present the GeX (Graph-eXplorer) approach for the approximate matching of complex queries on graph-modeled data. GeX generalizes existing approaches and provides a highly expressive graph-based query language that supports queries ranging from keyword-based to structured ones. The GeX query answering model gracefully blends label approximation with structural relaxation, under the primary objective of delivering meaningfully approximated results only. GeX implements ad-hoc data structures that are exploited by a top-k retrieval algorithm which enhances the approximate matching of complex queries. An extensive experimental evaluation on real-world datasets demonstrates the efficiency of the GeX query answering.


    This paper proposes a preliminary analysis of whether a schema matching approach can be applied to the comparison and possibly the selection of geospatial standards. Schema matching is tested in the context of underground utility network modelling and, as an initial experiment, three geospatial standards are compared with user requirements: CityGML UtilityNetwork ADE, infraGML and IFC. The schema comparison is enabled by XSD files and carried out from syntactic, structural and semantic points of view, making use of existing software. The findings of this preliminary investigation show that schema matching is applicable to the comparison of user needs and existing geospatial standards and does show some potential, but the matching results are varied and not easy to interpret. In particular, the similarity scores between user needs and standards are very low, and the comparison and selection are not straightforward. A strategy, in the form of an iterative process, is required. In this preliminary examination, the focus is on assessing the schema matching approach (which parameters to take into consideration, how to proceed, tools available, automation aspect); further work will include examining software options and performance, as well as exploring how to take the relatively complex preliminary results obtained here and use them to assist the selection of a specific standard.

    A Survey on Intent-based Diversification for Fuzzy Keyword Search

    Keyword search is the process of finding important and relevant information in various data repositories. Structured and semistructured data can be stored precisely. Fully unstructured documents can be annotated and stored in the form of metadata. Half of all web searches are for information exploration. In this paper, earlier works on the semantic meaning of keywords, based on their context in the specified documents, are thoroughly analyzed. In a tree data representation, the nodes are objects and could hold some intention. These nodes act as anchors for a Smallest Lowest Common Ancestor (SLCA) based pruning process. Nodes are clustered based on their features. A feature is a distinctive attribute: a quality, property or trait of something. Automatic text classification algorithms are the modern way to extract features. Summarization and segmentation produce n consecutive grams from various forms of documents. The set of items that describe and summarize one important aspect of a query is known as a facet. Instead of exact string matching, fuzzy mapping based on semantic correlation is the new trend, where the correlation is quantified by cosine similarity. Once the outlier is detected, the nearest neighbors of the selected points are mapped, with high probability, to the same hash code as the intent nodes. These methods collectively retrieve the relevant data and prune out the unnecessary data, and at the same time create a hash signature for the nearest neighbor search. This survey emphasizes the need for a framework for fuzzy-oriented keyword search.

    Social Network Data Management

    With the increasing usage of online social networks and the semantic web's graph-structured RDF framework, and the rising adoption of networks in various fields from biology to social science, there is a rapidly growing need for indexing, querying, and analyzing massive graph-structured data. Facebook has amassed over 500 million users creating huge volumes of highly connected data. Governments have made RDF datasets containing billions of triples available to the public. In the life sciences, researchers have started to connect disparate data sets of research results into one giant network of valuable information. Clearly, networks are becoming increasingly popular and growing rapidly in size, requiring scalable solutions for network data management. This thesis focuses on the following aspects of network data management. We present a hierarchical index structure for external memory storage of network data that aims to maximize data locality. We propose efficient algorithms to answer subgraph matching queries against network databases and discuss effective pruning strategies to improve performance. We show how adaptive cost models can speed up subgraph matching query answering by assigning budgets to index retrieval operations and adjusting the query plan while executing. We develop a cloud-oriented social network database, COSI, which handles massive network datasets too large for a single computer by partitioning the data across multiple machines and achieving high-performance query answering through asynchronous parallelization and cluster-aware heuristics. Tracking multiple standing queries against a social network database is much faster with our novel multi-view maintenance algorithm, which exploits common substructures between queries. To capture uncertainty inherent in social network querying, we define probabilistic subgraph matching queries over deterministic graph data and propose algorithms to answer them efficiently. Finally, we introduce a general relational machine learning framework and rule-based language, Probabilistic Soft Logic, to learn from and probabilistically reason about social network data and describe applications to information integration and information fusion.

    Merging business process models (Äriprotsessimudelite ühildamine)

    The electronic version of the dissertation does not include the publications. Companies with years of experience in business process management often own process repositories that may contain hundreds or even thousands of business process models. These models originate from different sources and have been created and modified by different parties with differing modeling skills and practices. One common practice is to create new models from existing ones by copying fragments and then modifying them. This in turn leads to a situation where the process model repository contains models with identical fragments that refer to the same subprocess. If such fragments are left unconsolidated, inconsistencies may arise in the repository: one and the same subprocess may be described differently in different processes. Companies often have models that serve similar goals but are intended for different customers, products, business units, or geographic regions. For example, the business processes for home insurance and car insurance have the same business goal. Naturally, the models of these processes contain several identical sub-fragments (such as checking policy data), yet the processes differ at several points. Managing these processes separately is inefficient and creates redundancy. In this doctoral thesis we sought an answer to the question: how can recurring model fragments be identified in a process model repository and, more generally, how can similarities in large business process model repositories be found and consolidated? The thesis introduces two complementary methods for consolidating business process models: merging process models into a single model, and extracting model fragments. The first takes two or more process models as input and constructs from them a single consolidated process model that contains the behavior of all input models. This approach lets analysts manage a whole family of similar models at once and change them in a synchronized manner. The second approach, subprocess extraction, involves identifying frequently occurring fragments (finding clones in process models) and encapsulating them as subprocesses.