Dataset search: a survey
Generating value from data requires the ability to find, access and make
sense of datasets. There are many efforts underway to encourage data sharing
and reuse, from scientific publishers asking authors to submit data alongside
manuscripts to data marketplaces, open data portals and data communities.
Google recently released a beta search service for datasets, which allows users
to discover data stored in various online repositories via keyword queries.
These developments foreshadow an emerging research field around dataset search
or retrieval that broadly encompasses frameworks, methods and tools that help
match a user data need against a collection of datasets. Here, we survey the
state of the art of research and commercial systems in dataset retrieval. We
identify what makes dataset search a research field in its own right, with
unique challenges and methods, and highlight open problems. We look at
approaches and implementations from related areas that dataset search draws
upon, including information retrieval, databases, and entity-centric and
tabular search, in order to identify possible paths to resolve these open
problems, as well as immediate next steps that will take the field forward.
Comment: 20 pages, 153 references
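The core task named above, matching a keyword query against a collection of dataset descriptions, can be sketched with a minimal TF-IDF ranker. The toy corpus and dataset names below are assumptions for illustration; real dataset search services index far richer metadata:

```python
import math
from collections import Counter

# Hypothetical dataset metadata: id -> description (a toy corpus for illustration).
datasets = {
    "air-quality": "hourly air quality measurements for european cities",
    "gdp-stats": "annual gdp statistics by country and region",
    "city-pop": "population counts for cities worldwide",
}

def tokenize(text):
    return text.lower().split()

docs = {d: tokenize(t) for d, t in datasets.items()}

# Inverse document frequency over the toy corpus.
n = len(docs)
df = Counter(t for toks in docs.values() for t in set(toks))
idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}

def score(query, toks):
    """TF-IDF dot-product score of a keyword query against one description."""
    tf = Counter(toks)
    return sum(tf[t] * idf.get(t, 0.0) for t in tokenize(query))

def search(query):
    """Rank all datasets by their score for the query, best first."""
    return sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)

print(search("city population"))  # the population dataset ranks first
```

This plain term-matching is exactly where the survey's open problems begin: a "data need" is rarely captured well by keywords alone, which motivates the entity-centric and tabular search techniques it reviews.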
Distributed Optimization and Data Market Design
We consider algorithms for distributed optimization and their applications. In this thesis, we propose a new approach for distributed optimization based on an emerging area of theoretical computer science – local computation algorithms. The approach is fundamentally different from existing methodologies and provides a number of benefits, such as robustness to link failure and adaptivity to dynamic settings. Specifically, we develop an algorithm, LOCO, that given a convex optimization problem P with n variables and a “sparse” linear constraint matrix with m constraints, provably finds a solution as good as that of the best online algorithm for P using only O(log(n + m)) messages with high probability. The approach is not iterative, and communication is restricted to a localized neighborhood. In addition to analytic results, we show numerically that the performance improvements over classical approaches for distributed optimization are significant: for example, LOCO uses orders of magnitude less communication than ADMM.
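The classical baseline mentioned here, ADMM, is iterative and exchanges messages every round, which is the communication cost LOCO avoids. A minimal consensus-ADMM sketch on a toy problem makes the contrast concrete; the local objectives and penalty parameter are illustrative assumptions, not taken from the thesis:

```python
# Consensus ADMM, the classical distributed-optimization baseline compared
# against in the thesis. Each node i holds f_i(x) = 0.5 * (x - a_i)^2, so the
# consensus optimum is the mean of the a_i. Toy data, not the thesis's setup.
a = [1.0, 4.0, 7.0, 12.0]   # hypothetical local data, one value per node
rho = 1.0                   # ADMM penalty parameter
x = [0.0] * len(a)          # local primal variables
u = [0.0] * len(a)          # scaled dual variables
z = 0.0                     # global consensus variable

for _ in range(200):        # every round costs one message per node
    # local step: closed-form minimizer of f_i(x) + (rho/2)(x - z + u_i)^2
    x = [(ai + rho * (z - ui)) / (1 + rho) for ai, ui in zip(a, u)]
    # global averaging step (the coordination that costs communication)
    z = sum(xi + ui for xi, ui in zip(x, u)) / len(a)
    # dual update
    u = [ui + xi - z for ui, xi in zip(u, x)]

print(round(z, 4))  # converges to mean(a) = 6.0
```

The point of the comparison: this scheme needs a message per node per iteration until convergence, whereas LOCO's non-iterative, neighborhood-local design bounds total communication at O(log(n + m)) messages with high probability.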
We also consider the operations of a geographically distributed cloud data market, with design decisions that include which data to purchase (data purchasing) and where to place or replicate the data for delivery (data placement). We show that a joint approach to data purchasing and data placement within a cloud data market improves operating costs. This problem can be viewed as a facility location problem, and is thus NP-hard. However, we give a provably optimal algorithm for the case of a data market consisting of a single data center, and then generalize the result from the single data center setting in order to develop a near-optimal, polynomial-time algorithm for a geo-distributed data market. The resulting design, Datum, decomposes the joint purchasing and placement problem into two subproblems, one for data purchasing and one for data placement, using a transformation of the underlying bandwidth costs. We show, via a case study, that Datum is near-optimal (within 1.6%) in practical settings.
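The facility-location view of joint purchasing and placement can be illustrated with a tiny brute-force instance. The costs below and the exhaustive solver are hypothetical illustrations of the problem structure, not Datum's algorithm:

```python
from itertools import combinations

# Toy uncapacitated facility-location view of joint purchasing/placement.
# open_cost[j]: cost to purchase and place a data copy at data center j
# serve[i][j]:  bandwidth cost for client region i to fetch from center j
open_cost = [4.0, 3.0, 5.0]
serve = [
    [1.0, 4.0, 6.0],
    [5.0, 1.0, 3.0],
    [6.0, 5.0, 1.0],
]

def total_cost(open_set):
    """Cost of opening a set of centers and serving each client cheaply."""
    if not open_set:
        return float("inf")
    fixed = sum(open_cost[j] for j in open_set)
    routing = sum(min(row[j] for j in open_set) for row in serve)
    return fixed + routing

# Brute force over all subsets of centers: exponential in the number of
# centers, which is why the joint problem is NP-hard in general and why a
# near-optimal polynomial-time decomposition like Datum's is needed.
centers = range(len(open_cost))
best = min(
    (frozenset(s) for r in range(1, len(open_cost) + 1)
     for s in combinations(centers, r)),
    key=total_cost,
)
print(sorted(best), total_cost(best))
```

In this instance opening only the middle data center is cheapest; the exhaustive search is fine for three centers but infeasible at geo-distributed scale.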
Recommender systems and market approaches for industrial data management
Industrial companies are dealing with an increasing data overload problem in all
aspects of their business: vast amounts of data are generated in and outside each
company. Determining which data is relevant and how to get it to the right users is
becoming increasingly difficult. There are a large number of datasets to be
considered, and an even higher number of combinations of datasets that each user
could be using.
Current techniques to address this data overload problem require detailed
analysis. They scale poorly because of the manual effort and complexity
involved, which makes them impractical for a large number of datasets.
Search, the alternative used by many users, is limited by the user’s knowledge
about the available data and does not consider the relevance or costs of providing
these datasets.
Recommender systems and so-called market approaches have previously been
used to solve this type of resource allocation problem, as shown for example in
allocation of equipment for production processes in manufacturing or for spare part
supplier selection. They can therefore also be seen as potential solutions to
the problem of data overload.
This thesis introduces the so-called RecorDa approach: an architecture that
uses market approaches and recommender systems, either on their own or
combined into one system. Its purpose is to identify which data is most
relevant for a user's decision and to improve the allocation of relevant data
to users.
Using a combination of case studies and experiments, this thesis develops and
tests the approach. It further compares RecorDa to search and other mechanisms.
The results indicate that RecorDa can provide significant benefit to users,
offering easier and more flexible access to relevant datasets than other
techniques, such as searching these databases. It achieves a fast increase in
precision and recall of relevant datasets while maintaining high novelty and
coverage across a large variety of datasets.
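The evaluation metrics named above (precision and recall of relevant datasets, plus catalogue coverage) can be made concrete with a small sketch. The recommendation lists, relevance judgments, and dataset catalogue below are hypothetical:

```python
# Hypothetical per-user recommendation lists and relevance judgments,
# to make the reported metrics concrete.
recommended = {"u1": ["d1", "d2", "d3"], "u2": ["d2", "d4", "d5"]}
relevant = {"u1": {"d1", "d3", "d6"}, "u2": {"d4"}}
catalogue = {"d1", "d2", "d3", "d4", "d5", "d6", "d7"}

def precision(recs, rel):
    """Fraction of recommended datasets that are relevant."""
    return len(set(recs) & rel) / len(recs)

def recall(recs, rel):
    """Fraction of relevant datasets that were recommended."""
    return len(set(recs) & rel) / len(rel)

# Macro-averaged precision and recall across users.
p = sum(precision(recommended[u], relevant[u]) for u in recommended) / len(recommended)
r = sum(recall(recommended[u], relevant[u]) for u in recommended) / len(recommended)

# Catalogue coverage: fraction of all datasets appearing in any user's list.
covered = set().union(*recommended.values())
coverage = len(covered) / len(catalogue)

print(round(p, 3), round(r, 3), round(coverage, 3))
```

A recommender like RecorDa is judged on all three at once: high precision and recall mean users see the relevant datasets, while high coverage means the system does not collapse onto a few popular datasets.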