Doctor of Philosophy

Abstract

The explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: its scale (a large and growing volume of data) and the heterogeneity of Web data. Because the volume of data is so large, scalable techniques are needed that require little or no manual intervention and that are robust to noisy data. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step toward solving these problems, we propose a general, prudent schema-matching framework that matches a large number of schemas effectively. Our comprehensive experiments on Web-form interfaces and Wikipedia infoboxes show that this framework can enable on-the-fly, automatic integration of large collections of structured Web data.

Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Existing approaches exploit information in Wikipedia to discover entity types and their associated schemas. However, due to the inconsistencies, sparseness, and noise that arise from community contributions, these approaches are error-prone and require substantial human intervention. Given the schema heterogeneity in Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities.
