Dataset search: a survey

Author: Chapman Adriane, Groth Paul, Ibáñez-Gonzalez Luis-Daniel, Kacprzak Emilia, Koesten Laura, Konstantinidis George, Simperl Elena
Publication date: 03/01/2019

Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts to data marketplaces, open data portals and data communities. Google recently released a beta search service for datasets, which allows users to discover data stored in various online repositories via keyword queries. These developments foreshadow an emerging research field around dataset search or retrieval that broadly encompasses frameworks, methods and tools that help match a user data need against a collection of datasets. Here, we survey the state of the art of research and commercial systems in dataset retrieval. We identify what makes dataset search a research field in its own right, with unique challenges and methods, and highlight open problems. We look at approaches and implementations from related areas that dataset search is drawing upon, including information retrieval, databases, entity-centric and tabular search, in order to identify possible paths to resolve these open problems as well as immediate next steps that will take the field forward. Comment: 20 pages, 153 references
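As a rough illustration of matching a keyword query against a collection of datasets, the sketch below ranks datasets by how many query terms appear in their metadata. It is a minimal toy under stated assumptions, not a method from the survey: the function names, the catalog entries and the term-overlap score are all hypothetical.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace into a set of terms."""
    return set(text.lower().split())


def search_datasets(query, datasets):
    """Rank datasets by the number of query terms found in their metadata.

    `datasets` maps a dataset name to a free-text description
    (both hypothetical; real systems use richer metadata and scoring).
    """
    query_terms = tokenize(query)
    scored = []
    for name, description in datasets.items():
        overlap = len(query_terms & tokenize(name + " " + description))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical catalog used only for illustration.
catalog = {
    "city-air-quality": "hourly air quality measurements for european cities",
    "bus-timetables": "public transport timetables for city bus routes",
    "rainfall-daily": "daily rainfall measurements by weather station",
}
results = search_datasets("air quality measurements", catalog)
print(results)
```

Plain term overlap ignores synonyms, the dataset contents themselves and provenance, which is one way to see why the abstract treats dataset search as a problem with its own challenges.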

arXiv.org e-Print Archive

King's Research Portal

International Migration, Integration and Social Cohesion online publications

UvA-DARE

Distributed Optimization and Data Market Design

Author: London Palma Alise den Nijs
Publication date: 01/01/2017

We consider algorithms for distributed optimization and their applications. In this thesis, we propose a new approach for distributed optimization based on an emerging area of theoretical computer science – local computation algorithms. The approach is fundamentally different from existing methodologies and provides a number of benefits, such as robustness to link failure and adaptivity to dynamic settings. Specifically, we develop an algorithm, LOCO, that given a convex optimization problem P with n variables and a “sparse” linear constraint matrix with m constraints, provably finds a solution as good as that of the best online algorithm for P using only O(log(n + m)) messages with high probability. The approach is not iterative and communication is restricted to a localized neighborhood. In addition to analytic results, we show numerically that the performance improvements over classical approaches for distributed optimization are significant, e.g., it uses orders of magnitude less communication than ADMM. We also consider the operations of a geographically distributed cloud data market. We consider design decisions that include which data to purchase (data purchasing) and where to place or replicate the data for delivery (data placement). We show that a joint approach to data purchasing and data placement within a cloud data market improves operating costs. This problem can be viewed as a facility location problem, and is thus NP-hard. However, we give a provably optimal algorithm for the case of a data market consisting of a single data center, and then generalize the result from the single data center setting in order to develop a near-optimal, polynomial-time algorithm for a geo-distributed data market. The resulting design, Datum, decomposes the joint purchasing and placement problem into two subproblems, one for data purchasing and one for data placement, using a transformation of the underlying bandwidth costs. 
We show, via a case study, that Datum is near-optimal (within 1.6%) in practical settings.

Caltech Theses and Dissertations

Recommender systems and market approaches for industrial data management

Author: Jess Torben
Publication venue: University of Cambridge
Publication date: 08/12/2017

Industrial companies are dealing with an increasing data overload problem in all aspects of their business: vast amounts of data are generated in and outside each company. Determining which data is relevant and how to get it to the right users is becoming increasingly difficult. There are a large number of datasets to be considered, and an even higher number of combinations of datasets that each user could be using. Current techniques to address this data overload problem necessitate detailed analysis. These techniques have limited scalability due to their manual effort and their complexity, which makes them impractical for a large number of datasets. Search, the alternative used by many users, is limited by the user’s knowledge about the available data and does not consider the relevance or costs of providing these datasets. Recommender systems and so-called market approaches have previously been used to solve this type of resource allocation problem, as shown for example in allocation of equipment for production processes in manufacturing or for spare part supplier selection. They can therefore also be seen as a potential application for the problem of data overload. This thesis introduces the so-called RecorDa approach: an architecture using market approaches and recommender systems on their own or by combining them into one system. Its purpose is to identify which data is more relevant for a user’s decision and improve allocation of relevant data to users. Using a combination of case studies and experiments, this thesis develops and tests the approach. It further compares RecorDa to search and other mechanisms. The results indicate that RecorDa can provide significant benefit to users with easier and more flexible access to relevant datasets compared to other techniques, such as search in these databases. It is able to provide a fast increase in precision and recall of relevant datasets while still keeping high novelty and coverage of a large variety of datasets.
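For readers unfamiliar with the precision and recall measures mentioned above, the sketch below computes both for one set of recommended datasets. It uses the standard set-based definitions; the function name and the dataset names are hypothetical and are not taken from the thesis.

```python
def precision_recall(recommended, relevant):
    """Return (precision, recall) for a set of recommended datasets.

    precision = relevant items among the recommendations / all recommendations
    recall    = relevant items among the recommendations / all relevant items
    """
    recommended = set(recommended)
    relevant = set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical dataset names, for illustration only.
precision, recall = precision_recall(
    recommended=["sensor_logs", "supplier_prices", "hr_records", "maintenance_reports"],
    relevant=["sensor_logs", "maintenance_reports", "spare_part_orders"],
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here two of the four recommended datasets are relevant, and two of the three relevant datasets were recommended.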

Apollo (Cambridge)