
    Multilingual Schema Matching for Wikipedia Infoboxes

    Recent research has taken advantage of Wikipedia's multilingualism as a resource for cross-language information retrieval and machine translation, and has proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, and in particular, for enabling answers that straddle different languages. As a step towards supporting such queries, in this paper we propose a method for identifying mappings between attributes from infoboxes that come from pages in different languages. Our approach finds mappings in a completely automated fashion. Because it does not require training data, it is scalable: not only can it be used to find mappings between many language pairs, but it is also effective for languages that are under-represented and lack sufficient training samples. Another important benefit of our approach is that it does not depend on syntactic similarity between attribute names, and thus it can be applied to language pairs that have distinct morphologies. We have performed an extensive experimental evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and English. The results show that not only does our approach obtain high precision and recall, but it also outperforms state-of-the-art techniques. We also present a case study which demonstrates that the multilingual mappings we derive lead to substantial improvements in answer quality and coverage for structured queries over Wikipedia content. Comment: VLDB201
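    The abstract does not spell out what signal drives the matching, so the sketch below is only a plausible, training-free illustration, not the paper's actual algorithm: it scores attribute pairs by how similar their values are across inter-language-linked pages for the same entity, relying on the fact that numbers, dates, and proper names often survive translation. The corpus format, the similarity function, and the 0.5 acceptance threshold are all assumptions.

```python
# Hypothetical sketch: value-based matching of infobox attributes across
# languages. Entity pairs come from Wikipedia inter-language links; the
# input format and similarity function are assumptions, not the paper's.
from collections import defaultdict
from difflib import SequenceMatcher

def value_similarity(a: str, b: str) -> float:
    """Crude language-independent value similarity (dates, numbers,
    and proper names often survive translation largely intact)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_attributes(linked_pairs):
    """linked_pairs: iterable of (infobox_a, infobox_b) dicts mapping
    attribute name -> value, for the same entity in two languages."""
    scores = defaultdict(list)
    for box_a, box_b in linked_pairs:
        for attr_a, val_a in box_a.items():
            for attr_b, val_b in box_b.items():
                scores[(attr_a, attr_b)].append(value_similarity(val_a, val_b))

    # Average the evidence over all entity pairs, then greedily pick
    # one-to-one mappings with the strongest support.
    avg = {p: sum(s) / len(s) for p, s in scores.items()}
    mappings, used_a, used_b = [], set(), set()
    for (a, b), score in sorted(avg.items(), key=lambda kv: -kv[1]):
        if score >= 0.5 and a not in used_a and b not in used_b:
            mappings.append((a, b, score))
            used_a.add(a)
            used_b.add(b)
    return mappings

pairs = [({"name": "Brazil", "capital": "Brasília"},
          {"nome": "Brasil", "capital": "Brasília"})]
print(match_attributes(pairs))  # pairs ("capital","capital"), ("name","nome")
```

    Greedy one-to-one assignment is the simplest way to turn pairwise scores into a mapping; it is used here purely to keep the illustration short.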

    Doctor of Philosophy

    The explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: Web scale (i.e., the large and growing volume of data) and the heterogeneity of Web data. Because there is so much data, we need scalable techniques that require little or no manual intervention and that are robust to noisy data. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step toward solving these problems, we propose a general, prudent schema-matching framework that matches a large number of schemas effectively. Our comprehensive experiments on Web-form interfaces and Wikipedia infoboxes show that it can enable on-the-fly, automatic integration of large collections of structured Web data. Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Approaches exist that exploit information in Wikipedia to discover entity types and their associated schemas. However, due to inconsistencies, sparseness, and noise in community contributions, these approaches are error prone and require substantial human intervention. Given the schema heterogeneity in Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities
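    As a rough illustration of the clustering step described above, here is a minimal sketch that groups infoboxes by attribute overlap and infers a schema per cluster from frequently occurring attributes. The single-pass leader clustering, the Jaccard threshold, and the support cutoff are hypothetical simplifications, not the dissertation's actual method.

```python
# Hypothetical sketch: unsupervised schema discovery by clustering
# infoboxes on attribute overlap. Thresholds and the clustering
# strategy are illustrative choices only.
from collections import Counter

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def cluster_infoboxes(infoboxes, threshold=0.5):
    """infoboxes: list of sets of attribute names. Single-pass leader
    clustering: attach each infobox to the most similar cluster, or
    start a new one when nothing is similar enough."""
    clusters = []  # each cluster: {"members": [...], "attrs": set}
    for attrs in infoboxes:
        best, best_sim = None, threshold
        for c in clusters:
            sim = jaccard(attrs, c["attrs"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"members": [attrs], "attrs": set(attrs)})
        else:
            best["members"].append(attrs)
            best["attrs"] |= attrs
    return clusters

def infer_schema(cluster, min_support=0.3):
    """Keep attributes appearing in at least min_support of members."""
    counts = Counter(a for m in cluster["members"] for a in m)
    n = len(cluster["members"])
    return sorted(a for a, c in counts.items() if c / n >= min_support)

boxes = [{"name", "capital", "population"},
         {"name", "capital", "area"},
         {"name", "birth_date", "occupation"}]
print([infer_schema(c) for c in cluster_infoboxes(boxes)])
```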

    Database Integration: the Key to Data Interoperability

    Most new databases are no longer built from scratch; instead, they reuse existing data from several autonomous data stores. To facilitate application development, the data to be reused should preferably be redefined as a virtual database, providing for the logical unification of the underlying data sets. This unification process is called database integration. This chapter provides a global picture of the issues raised and the approaches that have been proposed to tackle them

    Database Integration: an Overview of Issues and Approaches

    In many large companies the widespread use of computers has led to the installation of a number of different application-specific databases. As company structures evolve, boundaries between departments move, creating new business units. Their new applications will use existing data from various data stores, rather than new data entering the organization. Hence, the ability to make data stores interoperable becomes a crucial factor for the development of new information systems. Data interoperability may come in various degrees. At the lowest level, commercial gateways connect specific pairs of database management systems (DBMSs). Software providing facilities for defining persistent views over different databases [6] simplifies access to distant data but does not support automatic enforcement of consistency constraints among different databases. Full interoperability is achieved by distributed or federated database systems, which support the integration of existing data into virtual databases (i.e., databases which are logically defined but not physically materialized). The latter allow existing databases to remain under the control of their respective owners, thus supporting a harmonious coexistence of scalable data integration and site autonomy requirements [9]. Federated systems are very popular today. However, before they become marketable, many issues remain to be solved. Design issues focus on either human-centered aspects (cooperative work, including autonomy issues and negotiation procedures) or database-centered aspects (data integration, schema/database evolution). Operational issues investigate system interoperability mainly in terms of support for new transaction types, new query processing algorithms, security concerns, etc. General overviews may be found elsewhere [4, 9]. This paper is devoted to database integration, possibly the most critical issue. Simply stated, database integration is the process which takes as input a set of databases and produces as output a single unified description of the input schemas (the integrated schema) together with the associated mapping information supporting integrated access to existing data through the integrated schema. As such, database integration is also used in the process of re-engineering an existing legacy system. Database integration has attracted many diverse and diverging contributions. The purpose, and the main intended contribution, of this article is to provide a clear picture of the existing approaches and current solutions, and of what remains to be achieved

    Ontology alignment based on word embedding and random forest classification.

    Ontology alignment is crucial for integrating heterogeneous data sources and forms an important component for realising the goals of the semantic web. Accordingly, several ontology alignment techniques have been proposed and used for discovering correspondences between the concepts (or entities) of different ontologies. However, these techniques mostly depend on string-based similarities, which are unable to handle the vocabulary mismatch problem. Also, determining which similarity measures to use and how to effectively combine them in alignment systems are challenges that have persisted in this area. In this work, we introduce a random forest classifier approach for ontology alignment which relies on word embedding to discover semantic similarities between concepts. Specifically, we combine string-based and semantic similarity measures to form feature vectors that are used by the classifier model to determine when concepts match. By harnessing background knowledge and relying on minimal information from the ontologies, our approach can deal with knowledge-light ontological resources. It also eliminates the need for learning the aggregation weights of multiple similarity measures. Our experiments using the Ontology Alignment Evaluation Initiative (OAEI) datasets and real-world ontologies highlight the utility of our approach and show that it can outperform state-of-the-art alignment systems
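    The general recipe in this abstract (form a feature vector from string-based and embedding-based similarities, then let a random forest decide when concepts match) can be sketched as follows. The two-feature design, the toy embeddings, and the training pairs are illustrative assumptions; the paper's actual feature set and pre-trained embeddings are not reproduced here.

```python
# Hypothetical sketch: string similarity + embedding cosine similarity
# as features for a random forest match/no-match classifier. The toy
# embeddings stand in for real pre-trained word vectors.
import numpy as np
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def features(label_a, label_b, emb):
    """emb: dict mapping a concept label to its word embedding."""
    string_sim = SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()
    semantic_sim = cosine(emb[label_a], emb[label_b])
    return [string_sim, semantic_sim]

# Toy embeddings in place of real pre-trained vectors.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["Author", "Writer", "Price", "Cost"]}

# Labeled concept pairs: 1 = match, 0 = no match.
pairs = [("Author", "Writer", 1), ("Price", "Cost", 1),
         ("Author", "Price", 0), ("Writer", "Cost", 0)]
X = np.array([features(a, b, emb) for a, b, _ in pairs])
y = np.array([m for _, _, m in pairs])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([features("Author", "Writer", emb)]))
```

    Letting the forest weigh the features is what removes the need to hand-tune aggregation weights for the individual similarity measures.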

    Intelligent Information Access to Linked Data - Weaving the Cultural Heritage Web

    The subject of the dissertation is an information alignment experiment between two cultural heritage information systems (ALAP): the Perseus Digital Library and Arachne. In modern societies, information integration is gaining importance for many tasks such as business decision making or even catastrophe management. It is beyond doubt that the information available in digital form can offer users new ways of interaction. Also, in the humanities and cultural heritage communities, more and more information is being published online. But in many situations the way that information has been made publicly available is disruptive to the research process due to its heterogeneity and distribution. Therefore, integrated information will be a key factor in pursuing successful research, and the need for information alignment is widely recognized. ALAP is an attempt to integrate information from Perseus and Arachne, not only on a schema level, but also by performing entity resolution. To that end, technical peculiarities and philosophical implications of the concepts of identity and co-reference are discussed. Multiple approaches to information integration and entity resolution are discussed and evaluated. The methodology used to implement ALAP is mainly rooted in the fields of information retrieval and knowledge discovery. First, an exploratory analysis was performed on both information systems to get a first impression of the data. After that, (semi-)structured information from both systems was extracted and normalized. Then, a clustering algorithm was used to reduce the number of needed entity comparisons. Finally, a thorough matching was performed within the different clusters. ALAP helped with identifying challenges and highlighted the opportunities that arise during the attempt to align cultural heritage information systems
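    The pipeline outlined above (normalize, cluster to reduce comparisons, then match within clusters) follows the classic blocking pattern from entity resolution. The sketch below illustrates that pattern only; the record fields, the blocking key, and the match rule are hypothetical, not ALAP's actual choices.

```python
# Hypothetical sketch of a blocking-based alignment pipeline: normalize
# records, group them into blocks so only plausible pairs are compared,
# then match thoroughly within each block.
from collections import defaultdict
from difflib import SequenceMatcher

def normalize(record):
    return {k: v.strip().lower() for k, v in record.items()}

def block_key(record):
    # Cheap blocking key: first three letters of the title.
    return record.get("title", "")[:3]

def match(r1, r2, threshold=0.85):
    return SequenceMatcher(None, r1["title"], r2["title"]).ratio() >= threshold

def align(records_a, records_b):
    blocks = defaultdict(lambda: ([], []))
    for r in map(normalize, records_a):
        blocks[block_key(r)][0].append(r)
    for r in map(normalize, records_b):
        blocks[block_key(r)][1].append(r)
    # Only compare records that share a block.
    return [(a, b) for side_a, side_b in blocks.values()
            for a in side_a for b in side_b if match(a, b)]

perseus = [{"title": "Apollo Belvedere"}]
arachne = [{"title": "apollo belvedere "}]
print(align(perseus, arachne))
```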

    Distributed databases

    Module 3 of the book Database Architecture. UOC, 2012.

    Multilingual Experimental Literature and Transnational Feminist Solidarities: Erín Moure and Kathy Acker

    The impulse toward multilingual writing has arisen as a prominent trend in contemporary women’s writing. Criticism and notions of the literary have to respond to, among other things, the fact that we live in a world where a significant portion of the population is at least partially bi- or multilingual (Camboni 34). Being responsive to the increasing multilingualism of writers necessitates new strategies for reading the polyvocality of texts (Eagleton and Friedman 3). This paper considers the ways multilingual writing creates “small scale modes of listening” (Maguire xix) that tune the reader to languages, identities, and cultures under erasure. Erín Moure’s multilingual repertoire includes writing in French, Galician, Spanish, Portuguese, Portunhol, and Romanian, with fragments of Polish, Ukrainian, Yiddish, Hebrew, Russian, and Latin marking the Ukrainian setting in which the Elisa Sampedrín stories take place, as well as quotations in German and some Kanji and Greek. Kathy Acker employs a sophisticated multilingual register in her late work, which includes French, Spanish, German, and Farsi. This paper explores the ways multilingual writing creates the conditions for subaltern audibility, thereby setting the grounds for transnational feminist solidarities