Ground truth? Concept-based communities versus the external classification of physics manuscripts
Community detection techniques are widely used to infer hidden structures
within interconnected systems. Despite demonstrating high accuracy on
benchmarks, they reproduce the external classification for many real-world
systems with a significant level of discrepancy. A widely accepted reason
behind such an outcome is the unavoidable loss of non-topological information
(such as node attributes) encountered when the original complex system is
represented as a network. In this article we emphasize that the observed
discrepancies may also be caused by a different reason: the external
classification itself. To this end we use scientific publication data which i)
exhibit a well defined modular structure and ii) hold an expert-made
classification of research articles. Having represented the articles and the
extracted scientific concepts both as a bipartite network and as its unipartite
projection, we applied modularity optimization to uncover the inner thematic
structure. The resulting clusters are shown to partly reflect the author-made
classification, although some significant discrepancies are observed. A
detailed analysis of these discrepancies shows that they carry essential
information about the system, mainly related to the use of similar techniques
and methods across different (sub)disciplines, that is otherwise omitted when
only the external classification is considered. Comment: 15 pages, 2 figures
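The pipeline described in this abstract (article-concept bipartite network, unipartite projection, modularity as the clustering objective) can be illustrated with a minimal sketch. The toy articles, concept names, and both partitions below are invented for illustration and do not come from the paper. The code only evaluates the modularity score that an optimizer would maximize; it does not perform the optimization itself.

```python
from collections import defaultdict
from itertools import combinations


def project(article_concepts):
    """Project the bipartite article-concept network onto articles.

    The weight of an article pair is the number of concepts they share.
    """
    weights = {}
    for a, b in combinations(sorted(article_concepts), 2):
        shared = len(article_concepts[a] & article_concepts[b])
        if shared:
            weights[(a, b)] = shared
    return weights


def modularity(weights, community):
    """Newman modularity Q of a partition of a weighted undirected graph."""
    m = sum(weights.values())          # total edge weight
    internal = defaultdict(float)      # edge weight inside each community
    degree_sum = defaultdict(float)    # summed node strength per community
    for (a, b), w in weights.items():
        degree_sum[community[a]] += w
        degree_sum[community[b]] += w
        if community[a] == community[b]:
            internal[community[a]] += w
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical toy data: two themes that share one method ("monte carlo").
articles = {
    "a1": {"qubit", "entanglement"},
    "a2": {"qubit", "entanglement", "monte carlo"},
    "a3": {"galaxy", "redshift", "monte carlo"},
    "a4": {"galaxy", "redshift"},
}
weights = project(articles)
thematic = {"a1": 0, "a2": 0, "a3": 1, "a4": 1}
shuffled = {"a1": 0, "a3": 0, "a2": 1, "a4": 1}
q_thematic = modularity(weights, thematic)
q_shuffled = modularity(weights, shuffled)
```

In this toy case the shared method creates the only edge between the two themes, which is the kind of cross-discipline link the abstract attributes the discrepancies to.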
Finish Them!: Pricing Algorithms for Human Computation
Given a batch of human computation tasks, a commonly ignored aspect is how
the price (i.e., the reward paid to human workers) of these tasks must be set
or varied in order to meet latency or cost constraints. Often, the price is set
up-front and not modified, leading to either a much higher monetary cost than
needed (if the price is set too high), or to a much larger latency than
expected (if the price is set too low). Leveraging a pricing model from prior
work, we develop algorithms to optimally set and then vary price over time in
order to meet (a) a user-specified deadline while minimizing total monetary
cost, or (b) a user-specified monetary budget constraint while minimizing total
elapsed time. We leverage techniques from decision theory (specifically, Markov
Decision Processes) for both these problems, and demonstrate that our
techniques lead to up to a 30% reduction in cost over schemes proposed in prior
work. Furthermore, we develop techniques to speed up the computation, enabling
users to leverage the price-setting algorithms on the fly.
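The deadline variant of this problem can be sketched as a finite-horizon Markov Decision Process solved by backward induction. This is only an illustration under strong assumptions that are not taken from the paper: at most one task completes per time step, the completion probability `accept_prob(price)` is a made-up function, and each task left unfinished at the deadline incurs a hypothetical `penalty`.

```python
def plan_prices(n_tasks, n_steps, prices, accept_prob, penalty):
    """Backward induction over states (steps left, tasks left).

    value[t][n]  : minimal expected cost with t steps left and n open tasks
    policy[t][n] : price to post in that state (None if nothing to decide)
    """
    value = [[0.0] * (n_tasks + 1) for _ in range(n_steps + 1)]
    policy = [[None] * (n_tasks + 1) for _ in range(n_steps + 1)]
    for n in range(n_tasks + 1):
        value[0][n] = penalty * n  # assumed cost of missing the deadline
    for t in range(1, n_steps + 1):
        for n in range(1, n_tasks + 1):
            best_cost = None
            for p in prices:
                q = accept_prob(p)  # assumed chance one task completes
                cost = (q * (p + value[t - 1][n - 1])
                        + (1 - q) * value[t - 1][n])
                if best_cost is None or cost < best_cost:
                    best_cost = cost
                    policy[t][n] = p
            value[t][n] = best_cost
    return value, policy


# Hypothetical setup: three price levels, linear acceptance probability.
value, policy = plan_prices(
    n_tasks=1, n_steps=10, prices=[1, 2, 3],
    accept_prob=lambda p: p / 4, penalty=10,
)
```

Under these assumptions the resulting policy posts the highest price when only one step remains and drops to the lowest price when the deadline is far away, which matches the intuition in the abstract that price should vary over time.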
Robust Plackett–Luce model for k-ary crowdsourced preferences
© 2017, The Author(s). The aggregation of k-ary preferences is an emerging ranking problem, which plays an important role in several aspects of our daily life, such as ordinal peer grading and online product recommendation. At the same time, crowdsourcing has become a popular way to provide a plethora of k-ary preferences for this ranking problem, due to convenient platforms and low costs. However, k-ary preferences from crowdsourced workers are often noisy, which inevitably degrades the performance of traditional aggregation models. To address this challenge, in this paper, we present a RObust PlAckett–Luce (ROPAL) model. Specifically, to ensure robustness, ROPAL integrates the Plackett–Luce model with a denoising vector. Based on the Kendall-tau distance, this vector corrects k-ary crowdsourced preferences with a certain probability. In addition, we propose an online Bayesian inference to make ROPAL scalable to large-scale preferences. We conduct comprehensive experiments on simulated and real-world datasets. Empirical results on “massive synthetic” and “real-world” datasets show that ROPAL with online Bayesian inference achieves substantial improvements in robustness and noisy worker detection over current approaches.
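For readers unfamiliar with the base model, the standard Plackett–Luce probability of a k-ary ranking can be sketched as follows. This shows only the plain model, not ROPAL's denoising vector or its online Bayesian inference; the item names and scores are invented for illustration.

```python
import math
from itertools import permutations


def plackett_luce_log_prob(ranking, score):
    """Log-probability of `ranking` (best item first) under Plackett-Luce.

    At each position the next item is chosen from the items not yet
    ranked, with probability proportional to exp(score[item]).
    """
    log_prob = 0.0
    for i, item in enumerate(ranking):
        remaining = ranking[i:]
        log_norm = math.log(sum(math.exp(score[r]) for r in remaining))
        log_prob += score[item] - log_norm
    return log_prob


# Hypothetical latent scores for three items.
score = {"a": 1.0, "b": 0.0, "c": -1.0}
total = sum(math.exp(plackett_luce_log_prob(list(p), score))
            for p in permutations(score))
```

Summing the probabilities over all orderings of the items gives 1 up to floating-point error, which is a quick sanity check that the function defines a valid distribution over rankings.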
Crowdsourcing Without a Crowd: Reliable Online Species Identification Using Bayesian Models to Minimize Crowd Size
We present an incremental Bayesian model that resolves key issues of crowd size and data quality for consensus labeling. We evaluate our method using data collected from a real-world citizen science program, BeeWatch, which invites members of the public in the United Kingdom to classify (label) photographs of bumblebees as one of 22 possible species. The biological recording domain poses two key and hitherto unaddressed challenges for consensus models of crowdsourcing: (1) the large number of potential species makes classification difficult, and (2) this is compounded by limited crowd availability, stemming from both the inherent difficulty of the task and the lack of relevant skills among the general public. We demonstrate that consensus labels can be reliably found in such circumstances with very small crowd sizes of around three to five users (i.e., through group sourcing). Our incremental Bayesian model, which minimizes crowd size by re-evaluating the quality of the consensus label following each species identification solicited from the crowd, is competitive with a Bayesian approach that uses a larger but fixed crowd size and outperforms majority voting. These results have important ecological applicability: biological recording programs such as BeeWatch can sustain themselves when resources such as taxonomic experts to confirm identifications by photo submitters are scarce (as is typically the case), and feedback can be provided to submitters in a timely fashion. More generally, our model provides benefits to any crowdsourced consensus labeling task where there is a cost (financial or otherwise) associated with soliciting a label.
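The idea of re-evaluating the consensus after each solicited label can be sketched with a deliberately simple model. The assumptions here are not from the paper: every user has the same fixed `accuracy`, errors are spread uniformly over the other labels, the prior is uniform, and solicitation stops once the top posterior reaches a hypothetical `threshold`.

```python
def update_posterior(posterior, vote, accuracy):
    """Bayes update of the label posterior after one user's vote."""
    k = len(posterior)
    unnormalized = {
        label: p * (accuracy if label == vote else (1 - accuracy) / (k - 1))
        for label, p in posterior.items()
    }
    z = sum(unnormalized.values())
    return {label: v / z for label, v in unnormalized.items()}


def incremental_consensus(votes, labels, accuracy=0.7, threshold=0.95):
    """Consume votes one at a time; stop as soon as the consensus is confident.

    Returns (consensus label, number of votes used, final posterior).
    """
    posterior = {label: 1 / len(labels) for label in labels}  # uniform prior
    used = 0
    for vote in votes:
        posterior = update_posterior(posterior, vote, accuracy)
        used += 1
        if max(posterior.values()) >= threshold:
            break
    best = max(posterior, key=posterior.get)
    return best, used, posterior


# Hypothetical example with 22 candidate species, as in BeeWatch.
labels = [f"species_{i}" for i in range(22)]
votes = ["species_3", "species_3", "species_3", "species_7"]
best, used, posterior = incremental_consensus(votes, labels)
```

With 22 candidate labels, two agreeing votes already push the posterior above the threshold in this toy setting, so the remaining votes are never requested. Real user accuracies vary by person and species, so the stopping point would differ in practice.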
Hierarchical Entity Resolution using an Oracle
In many applications, entity references (i.e., records) and entities need to be organized to capture diverse relationships like type-subtype, is-A (mapping entities to types), and duplicate (mapping records to entities) relationships. However, automatic identification of such relationships is often inaccurate due to noise and heterogeneous representation of records across sources. Similarly, manual maintenance of these relationships is infeasible and does not scale to large datasets. In this work, we circumvent these challenges by considering weak supervision in the form of an oracle to formulate a novel hierarchical ER task. In this setting, records are clustered in a tree-like structure containing records at leaf-level and capturing record-entity (duplicate), entity-type (is-A) and subtype-supertype relationships. For effective use of supervision, we leverage triplet comparison oracle queries that take three records as input and output the most similar pair(s). We develop HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identified hierarchical structure. We show theoretically and empirically that HierER is effective under different similarity noise models and demonstrate empirically that HierER can scale up to million-size datasets.
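The triplet comparison query described above can be illustrated with a simulated oracle. This is not HierER itself, only a sketch of the query interface, and it assumes a hypothetical ground-truth hierarchy in which each record is labeled with its path from the root type down to its entity. Similarity between two records is taken to be the length of their shared path prefix.

```python
from itertools import combinations


def common_prefix_len(path_a, path_b):
    """Number of leading hierarchy levels two records share."""
    length = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        length += 1
    return length


def triplet_oracle(x, y, z, path):
    """Return the most similar pair(s) among three records.

    Ties are all returned, mirroring the "pair(s)" wording of the abstract.
    """
    pairs = list(combinations((x, y, z), 2))
    sims = {pair: common_prefix_len(path[pair[0]], path[pair[1]])
            for pair in pairs}
    best = max(sims.values())
    return [pair for pair in pairs if sims[pair] == best]


# Hypothetical records: r1 and r2 are duplicates of the same entity.
path = {
    "r1": ("device", "phone", "iphone 12"),
    "r2": ("device", "phone", "iphone 12"),
    "r3": ("device", "phone", "pixel 6"),
    "r4": ("device", "laptop", "thinkpad"),
    "r5": ("device", "tablet", "ipad"),
}
duplicates_first = triplet_oracle("r1", "r2", "r3", path)
same_type = triplet_oracle("r1", "r3", "r4", path)
all_tied = triplet_oracle("r1", "r4", "r5", path)
```

A querying strategy would call such an oracle sparingly and use the answers to decide which records belong under the same entity or type.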