
    Methoden zur Identifikation relevanter Datenquellen: Eine Literaturanalyse

    The growing volume of enterprise data to be processed and the increasing strategic relevance of information management create a need to systematize the process of identifying and selecting relevant data sources. Although data source selection is a central task in managing the information economy (within strategic information management), an up-to-date review of the state of research on existing methods for selecting relevant data sources is missing. This paper closes that research gap and presents the current state of research on methods for identifying and selecting relevant data sources. Through a literature analysis, a total of 37 scientific publications were identified that describe eight methods for data source selection. The identified methods were then assigned to the categories of automated, semi-automated, and manual approaches. A growing trend toward automated methods for data source selection was also observed.

    On optimality of jury selection in crowdsourcing

    Recent advances in crowdsourcing technologies enable computationally challenging tasks (e.g., sentiment analysis and entity resolution) to be performed by Internet workers, driven mainly by monetary incentives. A fundamental question is: how should workers be selected, so that the tasks at hand can be accomplished successfully and economically? In this paper, we study the Jury Selection Problem (JSP): Given a monetary budget, and a set of decision-making tasks (e.g., “Is Bill Gates still the CEO of Microsoft now?”), return the set of workers (called jury), such that their answers yield the highest “Jury Quality” (or JQ). Existing JSP solutions make use of the Majority Voting (MV) strategy, which uses the answer chosen by the largest number of workers. We show that MV does not yield the best solution for JSP. We further prove that among all voting strategies (including deterministic and randomized strategies), Bayesian Voting (BV) can optimally solve JSP. We then examine how to solve JSP based on BV. This is technically challenging, since computing the JQ with BV is NP-hard. We solve this problem by proposing an approximate algorithm that is computationally efficient. Our approximate JQ computation algorithm is also highly accurate, and its error is proved to be bounded within 1%. We extend our solution by considering the task owner’s “belief” (or prior) on the answers of the tasks. Experiments on synthetic and real datasets show that our new approach is consistently better than the best JSP solution known.
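    To make the Jury Quality notion concrete, the following is a minimal sketch (not the paper's algorithm; the function name, the worker-independence assumption, and the tie-breaking rule are ours) that computes JQ under Majority Voting for a single binary task by brute-force enumeration over worker outcomes:

    ```python
    from itertools import product

    def jq_majority_vote(accuracies):
        """Jury Quality under Majority Voting for one binary task:
        the probability that the majority of the jury's answers equals
        the true answer, where worker i answers correctly with
        probability accuracies[i], independently of the others."""
        n = len(accuracies)
        jq = 0.0
        # Enumerate every correct/incorrect outcome pattern of the n workers.
        for pattern in product([True, False], repeat=n):
            prob = 1.0
            for correct, p in zip(pattern, accuracies):
                prob *= p if correct else (1.0 - p)
            # The majority vote is correct when more than half the
            # workers answered correctly (ties count as incorrect here,
            # a simplifying assumption).
            if 2 * sum(pattern) > n:
                jq += prob
        return jq

    # Three workers, each 80% accurate:
    # JQ(MV) = 0.8^3 + 3 * 0.8^2 * 0.2 = 0.896
    print(round(jq_majority_vote([0.8, 0.8, 0.8]), 3))
    ```

    The enumeration is exponential in jury size, which hints at why exact JQ computation under richer strategies such as Bayesian Voting becomes hard and motivates the paper's approximate algorithm.
    
    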

    Augmenting cross-domain knowledge bases using web tables

    Cross-domain knowledge bases are increasingly used for a large variety of applications. As the usefulness of a knowledge base for many of these applications increases with its completeness, augmenting knowledge bases with new knowledge is an important task. A source for this new knowledge could be in the form of web tables, which are relational HTML tables extracted from the Web. This thesis researches data integration methods for cross-domain knowledge base augmentation from web tables. Existing methods have focused on the task of slot filling static data. We research methods that additionally enable augmentation in the form of slot filling time-dependent data and entity expansion. When augmenting knowledge bases using time-dependent web table data, we require time-aware fusion methods. They identify from a set of conflicting web table values the one that is valid given a certain temporal scope. A primary concern of time-aware fusion is therefore the estimation of temporal scope annotations, which web table data lacks. We introduce two time-aware fusion approaches. In the first, we extract timestamps from the table and its context to exploit as temporal scopes, additionally introducing approaches to reduce the sparsity and noisiness of these timestamps. We introduce a second time-aware fusion method that exploits a temporal knowledge base to propagate temporal scopes to web table data, reducing the dependence on noisy and sparse timestamps. Entity expansion enriches a knowledge base with previously unknown long-tail entities. It is a task that to our knowledge has not been researched before. We introduce the Long-Tail Entity Extraction Pipeline, the first system that can perform entity expansion from web table data. The pipeline works by employing identity resolution twice, once to disambiguate between entity occurrences within web tables, and once between entities created from web tables and existing entities in the knowledge base. 
In addition to identifying new long-tail entities, the pipeline also creates their descriptions according to the knowledge base schema. By running the pipeline on a large-scale web table corpus, we profile the potential of web tables for the task of entity expansion. We find that, given certain classes, we can enrich a knowledge base with tens or even hundreds of thousands of new entities and corresponding facts. Finally, we introduce a weak supervision approach for long-tail entity extraction, where supervision in the form of a large number of manually labeled matching and non-matching pairs is substituted with a small set of bold matching rules built using the knowledge base schema. Using this, we can reduce the supervision effort required to train our pipeline to enable cross-domain entity expansion at web-scale. In the context of this research, we created and published two datasets. The Time-Dependent Ground Truth contains time-dependent knowledge with more than one million temporal facts and corresponding temporal scope annotations. It could potentially be employed for a large variety of tasks that consider the temporal aspect of data. We also built the Web Tables for Long-Tail Entity Extraction gold standard, the first benchmark for the task of entity expansion from web tables.
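    As a toy illustration of what time-aware fusion decides, here is a minimal sketch under a simplified data model of our own (year-granular scopes, a latest-start tie-break; the thesis's actual scope estimation and fusion methods are more involved): given conflicting web-table values with temporal scopes, it selects the value valid at a query time.

    ```python
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class WebTableValue:
        value: str
        start: int           # year the value became valid (temporal scope)
        end: Optional[int]   # year it stopped being valid; None = still valid

    def fuse_time_aware(candidates, query_year):
        """From a set of conflicting web-table values, return the one
        whose temporal scope covers query_year (half-open [start, end)),
        breaking ties in favor of the most recently started scope."""
        valid = [c for c in candidates
                 if c.start <= query_year
                 and (c.end is None or query_year < c.end)]
        if not valid:
            return None
        return max(valid, key=lambda c: c.start).value

    # Conflicting values for "CEO of Microsoft" found in web tables:
    ceos = [WebTableValue("Steve Ballmer", 2000, 2014),
            WebTableValue("Satya Nadella", 2014, None)]
    print(fuse_time_aware(ceos, 2016))  # Satya Nadella
    ```

    In the thesis's setting the hard part is precisely that such clean scope annotations are absent from web tables and must be estimated from noisy timestamps or propagated from a temporal knowledge base.
    
    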

    Characterizing and selecting fresh data sources

    No full text