7,628 research outputs found
Uncertainty in Crowd Data Sourcing under Structural Constraints
Applications extracting data from crowdsourcing platforms must deal with the
uncertainty of crowd answers in two different ways: first, by deriving
estimates of the correct value from the answers; second, by choosing crowd
questions whose answers are expected to minimize this uncertainty relative to
the overall data collection goal. Such problems are already challenging when we
assume that questions are unrelated and answers are independent, but they are
even more complicated when we assume that the unknown values follow hard
structural constraints (such as monotonicity).
In this vision paper, we examine how to formally address this issue with an
approach inspired by [Amsterdamer et al., 2013]. We describe a generalized
setting where we model constraints as linear inequalities, and use them to
guide the choice of crowd questions and the processing of answers. We present
the main challenges arising in this setting, and propose directions to solve
them.Comment: 8 pages, vision paper. To appear at UnCrowd 201
Structurally Tractable Uncertain Data
Many data management applications must deal with data which is uncertain,
incomplete, or noisy. However, on existing uncertain data representations, we
cannot tractably perform the important query evaluation tasks of determining
query possibility, certainty, or probability: these problems are hard on
arbitrary uncertain input instances. We thus ask whether we could restrict the
structure of uncertain data so as to guarantee the tractability of exact query
evaluation. We present our tractability results for tree and tree-like
uncertain data, and a vision for probabilistic rule reasoning. We also study
uncertainty about order, proposing a suitable representation, and study
uncertain data conditioned by additional observations.Comment: 11 pages, 1 figure, 1 table. To appear in SIGMOD/PODS PhD Symposium
201
Focused Proofreading: Efficiently Extracting Connectomes from Segmented EM Images
Identifying complex neural circuitry from electron microscopic (EM) images
may help unlock the mysteries of the brain. However, identifying this circuitry
requires time-consuming, manual tracing (proofreading) due to the size and
intricacy of these image datasets, thus limiting state-of-the-art analysis to
very small brain regions. Potential avenues to improve scalability include
automatic image segmentation and crowd sourcing, but current efforts have had
limited success. In this paper, we propose a new strategy, focused
proofreading, that works with automatic segmentation and aims to limit
proofreading to the regions of a dataset that are most impactful to the
resulting circuit. We then introduce a novel workflow, which exploits
biological information such as synapses, and apply it to a large dataset in the
fly optic lobe. With our techniques, we achieve significant tracing speedups of
3-5x without sacrificing the quality of the resulting circuit. Furthermore, our
methodology makes the task of proofreading much more accessible and hence
potentially enhances the effectiveness of crowd sourcing
Engineering Crowdsourced Stream Processing Systems
A crowdsourced stream processing system (CSP) is a system that incorporates
crowdsourced tasks in the processing of a data stream. This can be seen as
enabling crowdsourcing work to be applied on a sample of large-scale data at
high speed, or equivalently, enabling stream processing to employ human
intelligence. It also leads to a substantial expansion of the capabilities of
data processing systems. Engineering a CSP system requires the combination of
human and machine computation elements. From a general systems theory
perspective, this means taking into account inherited as well as emerging
properties from both these elements. In this paper, we position CSP systems
within a broader taxonomy, outline a series of design principles and evaluation
metrics, present an extensible framework for their design, and describe several
design patterns. We showcase the capabilities of CSP systems by performing a
case study that applies our proposed framework to the design and analysis of a
real system (AIDR) that classifies social media messages during time-critical
crisis events. Results show that compared to a pure stream processing system,
AIDR can achieve a higher data classification accuracy, while compared to a
pure crowdsourcing solution, the system makes better use of human workers by
requiring much less manual work effort
T-Crowd: Effective Crowdsourcing for Tabular Data
Crowdsourcing employs human workers to solve computer-hard problems, such as
data cleaning, entity resolution, and sentiment analysis. When crowdsourcing
tabular data, e.g., the attribute values of an entity set, a worker's answers
on the different attributes (e.g., the nationality and age of a celebrity star)
are often treated independently. This assumption is not always true and can
lead to suboptimal crowdsourcing performance. In this paper, we present the
T-Crowd system, which takes into consideration the intricate relationships
among tasks, in order to converge faster to their true values. Particularly,
T-Crowd integrates each worker's answers on different attributes to effectively
learn his/her trustworthiness and the true data values. The attribute
relationship information is also used to guide task allocation to workers.
Finally, T-Crowd seamlessly supports categorical and continuous attributes,
which are the two main datatypes found in typical databases. Our extensive
experiments on real and synthetic datasets show that T-Crowd outperforms
state-of-the-art methods in terms of truth inference and reducing the cost of
crowdsourcing
Crowdsourcing in Computer Vision
Computer vision systems require large amounts of manually annotated data to
properly learn challenging visual concepts. Crowdsourcing platforms offer an
inexpensive method to capture human knowledge and understanding, for a vast
number of visual perception tasks. In this survey, we describe the types of
annotations computer vision researchers have collected using crowdsourcing, and
how they have ensured that this data is of high quality while annotation effort
is minimized. We begin by discussing data collection on both classic (e.g.,
object recognition) and recent (e.g., visual story-telling) vision tasks. We
then summarize key design decisions for creating effective data collection
interfaces and workflows, and present strategies for intelligently selecting
the most important data instances to annotate. Finally, we conclude with some
thoughts on the future of crowdsourcing in computer vision.Comment: A 69-page meta review of the field, Foundations and Trends in
Computer Graphics and Vision, 201
Communication Network Design: Balancing Modularity and Mixing via Optimal Graph Spectra
By leveraging information technologies, organizations now have the ability to
design their communication networks and crowdsourcing platforms to pursue
various performance goals, but existing research on network design does not
account for the specific features of social networks, such as the notion of
teams. We fill this gap by demonstrating how desirable aspects of
organizational structure can be mapped parsimoniously onto the spectrum of the
graph Laplacian allowing the specification of structural objectives and build
on recent advances in non-convex programming to optimize them. This design
framework is general, but we focus here on the problem of creating graphs that
balance high modularity and low mixing time, and show how "liaisons" rather
than brokers maximize this objective
Using hybrid algorithmic-crowdsourcing methods for academic knowledge acquisition
such as Figures, Tables, Deļ¬nitions, Algo- rithms, etc., which are called Knowledge Cells hereafter. An advanced academic search engine which could take advantage of Knowledge Cells and their various relation- ships to obtain more accurate search results is expected. Further, itās expected to provide a ļ¬ne-grained search regard- ing to Knowledge Cells for deep-level information discovery and exploration. Therefore, it is important to identify and extract the Knowledge Cells and their various relationships which are often intrinsic and implicit in articles. With the exponential growth of scientiļ¬c publications, discovery and acquisition of such useful academic knowledge impose some practical challenges For example, existing algorithmic meth- ods can hardly extend to handle diverse layouts of journals, nor to scale up to process massive documents. As crowd- sourcing has become a powerful paradigm for large scale problem-solving especially for tasks that are difļ¬cult for computers but easy for human, we consider the problem of academic knowledge discovery and acquisition as a crowd- sourced database problem and show a hybrid framework to integrate the accuracy of crowdsourcing workers and the speed of automatic algorithms. In this paper, we introduce our current system implementation, a platform for academic knowledge discovery and acquisition (PANDA), as well as some interesting observations and promising future directions.Peer reviewe
OPENāEnabling Non-expert Users to Extract, Integrate, and Analyze Open Data
Government initiatives for more transparency and participation have lead to an increasing amount of structured data on the web in recent years. Many of these datasets have great potential. For example, a situational analysis and meaningful visualization of the data can assist in pointing out social or economic issues and raising peopleās awareness. Unfortunately, the ad-hoc analysis of this so-called Open Data can prove very complex and time-consuming, partly due to a lack of efficient system support.On the one hand, search functionality is required to identify relevant datasets. Common document retrieval techniques used in web search, however, are not optimized for Open Data and do not address the semantic ambiguity inherent in it. On the other hand, semantic integration is necessary to perform analysis tasks across multiple datasets. To do so in an ad-hoc fashion, however, requires more flexibility and easier integration than most data integration systems provide. It is apparent that an optimal management system for Open Data must combine aspects from both classic approaches. In this article, we propose OPEN, a novel concept for the management and situational analysis of Open Data within a single system. In our approach, we extend a classic database management system, adding support for the identification and dynamic integration of public datasets. As most web users lack the experience and training required to formulate structured queries in a DBMS, we add support for non-expert users to our system, for example though keyword queries. Furthermore, we address the challenge of indexing Open Data
- ā¦