6 research outputs found

    Publish-Time Data Integration for Open Data Platforms

    Platforms for the publication and collaborative management of data, such as Data.gov or Google Fusion Tables, are a new trend on the web. They manage very large corpora of datasets, but often lack an integrated schema, ontology, or even common publication standards. This results in inconsistent names for attributes of the same meaning, which constrains the discovery of relationships between datasets as well as their reusability. Existing data integration techniques focus on reuse-time, i.e., they are applied when a user wants to combine a specific set of datasets or integrate them with an existing database. In contrast, this paper investigates a novel method of data integration at publish-time, where the publisher is provided with suggestions on how to integrate the new dataset with the corpus as a whole, without resorting to a manually created mediated schema or ontology for the platform. We propose data-driven algorithms that suggest alternative attribute names for a newly published dataset, based on attribute and instance statistics maintained on the corpus. We evaluate the proposed algorithms using real-world corpora based on the Open Data Platform opendata.socrata.com and on relational data extracted from Wikipedia. We report on the system's response time and on the results of an extensive crowdsourcing-based evaluation of the quality of the generated attribute name alternatives.
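The publish-time suggestion idea can be sketched as follows. This is a minimal, hypothetical simplification, not the paper's actual algorithms: candidate attribute names from the corpus are scored by the overlap of their previously observed instance values with the values of the newly published column (the `suggest_names` helper, the Jaccard scoring, and the toy corpus are all illustrative assumptions):

```python
# Hypothetical sketch: score corpus attribute names by the Jaccard overlap
# of their observed instance values with the new column's values.

def jaccard(a, b):
    """Jaccard similarity of two value sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def suggest_names(new_values, corpus_stats, k=3):
    """Rank corpus attribute names by value overlap with the new column.

    corpus_stats: dict mapping attribute name -> set of instance values
    seen for that name across the corpus (the maintained statistics).
    """
    scored = [(jaccard(set(new_values), vals), name)
              for name, vals in corpus_stats.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:k] if score > 0]

# Toy corpus: the same concept published under inconsistent names.
corpus = {
    "country":    {"Germany", "France", "USA", "Japan"},
    "nation":     {"Germany", "France", "Brazil"},
    "population": {"83000000", "67000000"},
}
print(suggest_names(["Germany", "USA", "Italy"], corpus))
```

A publisher naming a column "land" with those values would thus be offered "country" and "nation" as corpus-consistent alternatives.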

    Query-Time Data Integration

    Today, data is collected at ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources preclude up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state of the art by producing a set of individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions are the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
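The top-k augmentation goal of covering all query entities with few sources while keeping the k solutions mutually diverse can be sketched, very roughly, as a greedy set cover that penalizes sources already used in earlier solutions. All names, the toy data, and the scoring heuristic below are illustrative assumptions, not the DrillBeyond implementation:

```python
# Hypothetical sketch of top-k entity augmentation: each source provides
# values for a subset of the query entities; a solution is a small set of
# sources covering all entities, and later solutions avoid earlier sources.

def greedy_cover(entities, sources, banned):
    """Greedy set cover that prefers sources not in 'banned'."""
    uncovered, solution = set(entities), []
    while uncovered:
        useful = {s: vals for s, vals in sources.items() if vals & uncovered}
        if not useful:
            return None  # some entities cannot be covered at all
        # Prefer unused sources (diversity), then maximum coverage (minimality).
        best = max(useful, key=lambda s: (s not in banned,
                                          len(useful[s] & uncovered)))
        solution.append(best)
        uncovered -= sources[best]
    return solution

def top_k_augmentations(entities, sources, k=3):
    results, used = [], set()
    for _ in range(k):
        sol = greedy_cover(entities, sources, banned=used)
        if sol is None or sol in results:
            break
        results.append(sol)
        used.update(sol)  # penalize these sources in later rounds
    return results

entities = {"Berlin", "Paris", "Rome"}
sources = {
    "src_a": {"Berlin", "Paris", "Rome"},
    "src_b": {"Berlin", "Paris"},
    "src_c": {"Rome"},
    "src_d": {"Berlin", "Paris", "Rome"},
}
print(top_k_augmentations(entities, sources))
```

Each returned solution is internally consistent (one small group of sources covers everything), while the diversity penalty drives the k alternatives apart, mirroring the trade-off described above.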

    Closing Information Gaps with Need-driven Knowledge Sharing

    Systems for asynchronous knowledge sharing, such as intranets, wikis, or file servers, often suffer from a lack of user contributions. A major reason is that information providers are decoupled from information seekers and therefore have little awareness of their information needs. Central questions of knowledge management are thus which knowledge is particularly valuable and by which means knowledge holders can be motivated to share it. This thesis proposes the approach of need-driven knowledge sharing (NKS), which consists of three elements. First, indicators of information need are collected, in particular search queries, and their aggregation yields a continuous forecast of the organizational information need (OIN). By matching this forecast against the information available in personal and shared information spaces, organizational information gaps (OIG) are derived, which point to missing information. These gaps are made transparent by means of so-called mediation services and mediation spaces, which help to create awareness of organizational information needs and to steer knowledge sharing. The concrete realization of NKS is illustrated by three different applications, all of which build on established knowledge management systems. Inverse Search is a tool that suggests to knowledge holders documents from their personal information space to share, in order to close organizational information gaps. Woogle extends conventional wiki systems with steering instruments for detecting and prioritizing missing information, so that the evolution of wiki content can be shaped in a demand-oriented way. In a similar way, Semantic Need, an extension for Semantic MediaWiki, steers the capture of structured semantic data based on information needs expressed as structured queries. The implementation and evaluation of the three tools show that need-driven knowledge sharing is technically feasible and can be an important complement to knowledge management. Moreover, the concept of mediation services and mediation spaces provides a framework for analyzing and designing tools according to the NKS principles. Finally, the approach presented here also offers impulses for the further development of Internet services and infrastructures such as Wikipedia or the Semantic Web.
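The core step of deriving information gaps from aggregated search queries can be sketched as follows; the `information_gaps` function, the demand threshold, and the toy data are illustrative assumptions, not the thesis' actual mediation services:

```python
# Hypothetical sketch: aggregate the query log into a demand profile,
# check each demanded topic against the shared information space, and
# report unmet demand (gaps) ranked by frequency.

from collections import Counter

def information_gaps(query_log, documents, min_demand=2):
    """Queries asked at least min_demand times with no matching document."""
    demand = Counter(q.lower().strip() for q in query_log)

    def covered(query):
        # Crude coverage test: substring match against document texts.
        return any(query in doc.lower() for doc in documents)

    gaps = [(count, q) for q, count in demand.items()
            if count >= min_demand and not covered(q)]
    return [q for count, q in sorted(gaps, reverse=True)]

log = ["VPN setup", "vpn setup", "expense policy", "expense policy",
       "office plants"]
docs = ["How to file your expense policy claims", "Onboarding guide"]
print(information_gaps(log, docs))
```

Here "vpn setup" is demanded repeatedly but undocumented, so it surfaces as a gap that a mediation service could show to potential knowledge providers; "expense policy" is already covered and "office plants" falls below the demand threshold.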

    User satisfaction model to measure open government data usage

    The open government data (OGD) initiative is undertaken by governments to promote transparency, social control, and citizen participation in policy making. The use of OGD in Malaysia is still at an early stage and faces problems such as low participation, security issues, and lack of awareness. While most research in Information and Communication Technology (ICT) underpinned by Expectation Confirmation Theory (ECT) focuses on user satisfaction and on determining users' reuse intention, this study focuses on the direct antecedents of OGD users' intention to use and its influence on OGD users' satisfaction, an area where research is still scarce. This research aims to examine an ECT model of users' satisfaction mediated by the intention to use open government data (OGD). The objectives of this research are threefold: (1) to design an integrated ECT and TAM model for explaining OGD satisfaction, (2) to examine the mediating role of citizens' behavioural intention between expectation, confirmation, perceived performance, incentive on usage, perceived risk, and citizens' satisfaction with open government data, and (3) to validate the impact of incentive on usage and perceived risk in explaining the new ECT model in the OGD context. Data were collected from a sample of 250 OGD users in Malaysia. Empirical evidence was gathered through self-administered questionnaires using a Likert scale. The data were analysed using Partial Least Squares Structural Equation Modelling (PLS-SEM) to test the model, and the final model was verified by experts in the area. Results revealed that expectation has a significant relationship with confirmation, whereas perceived performance showed an insignificant relationship with confirmation, which is a unique finding. Additionally, confirmation, expectation, perceived performance, incentive on usage, and perceived risk have significant relationships with the intention to use OGD. Meanwhile, the analysis showed that intention to use mediates the relationship between confirmation, expectation, perceived performance, incentive on usage, perceived risk, and satisfaction with the use of OGD. This study suggests that users' expectations of OGD must be met to create stronger intention and satisfaction. The implications of the study are to improve data service quality, support the development of innovative services, increase data transparency, and boost potential investment.
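For illustration only, the kind of mediation effect tested here (X influencing Y through a mediator M, e.g. expectation influencing satisfaction through intention to use) can be approximated as the product of two ordinary least-squares slopes. This simplified sketch ignores the direct path, is not the PLS-SEM procedure used in the study, and all data below are synthetic:

```python
# Illustrative two-regression mediation sketch (Baron-Kenny style product
# of slopes), NOT the PLS-SEM analysis of the study; all values are made up.

def slope(x, y):
    """OLS slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

def indirect_effect(x, m, y):
    """Indirect effect of x on y through mediator m (product a * b)."""
    a = slope(x, m)  # path X -> M (e.g. expectation -> intention to use)
    b = slope(m, y)  # path M -> Y (e.g. intention to use -> satisfaction)
    return a * b

# Synthetic Likert-style scores, constructed so the mediation is perfect.
expectation  = [1, 2, 3, 4, 5]
intention    = [2, 4, 6, 8, 10]
satisfaction = [6, 12, 18, 24, 30]
print(indirect_effect(expectation, intention, satisfaction))  # -> 6.0
```

A non-zero product of the two paths is what "intention to use mediates the relationship" amounts to at its simplest; PLS-SEM additionally handles latent constructs and significance testing.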