1,323 research outputs found

Ground truth? Concept-based communities versus the external classification of physics manuscripts

Author: Boyarsky Alexey
Garlaschelli Diego
Gemmetto Valerio
Palchykov Vasyl
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Community detection techniques are widely used to infer hidden structures within interconnected systems. Despite demonstrating high accuracy on benchmarks, they reproduce the external classification for many real-world systems with a significant level of discrepancy. A widely accepted reason behind such an outcome is the unavoidable loss of non-topological information (such as node attributes) encountered when the original complex system is represented as a network. In this article we emphasize that the observed discrepancies may also be caused by a different reason: the external classification itself. To this end we use scientific publication data which i) exhibit a well-defined modular structure and ii) hold an expert-made classification of research articles. Having represented the articles and the extracted scientific concepts both as a bipartite network and as its unipartite projection, we applied modularity optimization to uncover the inner thematic structure. The resulting clusters are shown to partly reflect the author-made classification, although some significant discrepancies are observed. A detailed analysis of these discrepancies shows that they carry essential information about the system, mainly related to the use of similar techniques and methods across different (sub)disciplines, that is otherwise omitted when only the external classification is considered. Comment: 15 pages, 2 figures
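The abstract above mentions projecting an article-concept bipartite network onto articles and optimizing modularity. As a rough illustration only, the following sketch computes the modularity score of a given partition on such a projection. The function names, the shared-concept edge weighting, and the toy data are assumptions made for this example; the paper's actual weighting scheme and optimizer are not described in the abstract.

```python
from collections import defaultdict
from itertools import combinations


def project_articles(article_concepts):
    """Unipartite projection: link two articles by their number of shared concepts.

    The shared-concept count is an assumed weighting, chosen for simplicity.
    """
    weights = {}
    for a, b in combinations(sorted(article_concepts), 2):
        shared = len(set(article_concepts[a]) & set(article_concepts[b]))
        if shared:
            weights[(a, b)] = shared
    return weights


def modularity(weights, community):
    """Newman modularity Q = sum_c [ L_c / m - (d_c / (2 m))^2 ] for a weighted graph."""
    strength = defaultdict(float)   # weighted degree of each node
    total = 0.0                     # m: total edge weight
    internal = defaultdict(float)   # L_c: edge weight inside each community
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
        total += w
        if community[a] == community[b]:
            internal[community[a]] += w
    if total == 0:
        return 0.0
    degree = defaultdict(float)     # d_c: summed strength per community
    for node, s in strength.items():
        degree[community[node]] += s
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)


# Hypothetical toy data: four articles and the concepts extracted from them.
articles = {"a1": ["x", "y"], "a2": ["x", "y"], "a3": ["z"], "a4": ["z"]}
projection = project_articles(articles)
split = {"a1": 0, "a2": 0, "a3": 1, "a4": 1}
merged = {name: 0 for name in articles}
```

A modularity optimizer would search over partitions such as `split` and `merged` and keep the one with the highest score.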

arXiv.org e-Print Archive

Archivio della ricerca della Scuola IMT Alti Studi Lucca

Finish Them!: Pricing Algorithms for Human Computation

Author: Gao Yihan
Parameswaran Aditya
Publication date: 26/08/2014

Given a batch of human computation tasks, a commonly ignored aspect is how the price (i.e., the reward paid to human workers) of these tasks must be set or varied in order to meet latency or cost constraints. Often, the price is set up-front and not modified, leading to either a much higher monetary cost than needed (if the price is set too high), or to a much larger latency than expected (if the price is set too low). Leveraging a pricing model from prior work, we develop algorithms to optimally set and then vary price over time in order to meet (a) a user-specified deadline while minimizing total monetary cost, or (b) a user-specified monetary budget constraint while minimizing total elapsed time. We leverage techniques from decision theory (specifically, Markov Decision Processes) for both these problems, and demonstrate that our techniques lead to up to 30% reduction in cost over schemes proposed in prior work. Furthermore, we develop techniques to speed up the computation, enabling users to leverage the price setting algorithms on-the-fly
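To give a feel for how a Markov Decision Process can drive deadline-constrained pricing, here is a minimal sketch under strong simplifying assumptions that are not taken from the paper: time is discrete, at most one task completes per step, each candidate price has a known per-step completion probability, the price is paid only on completion, and every task left at the deadline incurs a fixed penalty. The function `plan_prices` and all of its parameters are hypothetical.

```python
from functools import lru_cache


def plan_prices(num_tasks, num_steps, completion_prob, penalty):
    """Backward induction over states (tasks_left, steps_left).

    completion_prob maps each candidate price to the assumed probability
    that one task completes during a step at that price.
    Returns (expected total cost, price to post in the initial state).
    """

    @lru_cache(maxsize=None)
    def value(tasks_left, steps_left):
        if tasks_left == 0:
            return 0.0, None                    # nothing left to pay for
        if steps_left == 0:
            return penalty * tasks_left, None   # deadline missed
        best = None
        for price, p in sorted(completion_prob.items()):
            done = price + value(tasks_left - 1, steps_left - 1)[0]
            waiting = value(tasks_left, steps_left - 1)[0]
            cost = p * done + (1.0 - p) * waiting
            if best is None or cost < best[0]:
                best = (cost, price)
        return best

    return value(num_tasks, num_steps)


# Hypothetical price menu: a cheap, slow price and an expensive, fast one.
menu = {1: 0.2, 5: 0.9}
urgent_cost, urgent_price = plan_prices(1, 1, menu, penalty=100.0)
relaxed_cost, relaxed_price = plan_prices(1, 50, menu, penalty=100.0)
```

With one step left the sketch picks the expensive price to avoid the penalty, while with fifty steps left it starts at the cheap price, which mirrors the idea of varying price over time.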

arXiv.org e-Print Archive

CiteSeerX

Robust Plackett–Luce model for k-ary crowdsourced preferences

Author: Han B
Pan Y
Tsang IW
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2018

© 2017, The Author(s). The aggregation of k-ary preferences is an emerging ranking problem, which plays an important role in several aspects of our daily life, such as ordinal peer grading and online product recommendation. At the same time, crowdsourcing has become a trendy way to provide a plethora of k-ary preferences for this ranking problem, due to convenient platforms and low costs. However, k-ary preferences from crowdsourced workers are often noisy, which inevitably degrades the performance of traditional aggregation models. To address this challenge, in this paper, we present a RObust PlAckett–Luce (ROPAL) model. Specifically, to ensure the robustness, ROPAL integrates the Plackett–Luce model with a denoising vector. Based on the Kendall-tau distance, this vector corrects k-ary crowdsourced preferences with a certain probability. In addition, we propose an online Bayesian inference to make ROPAL scalable to large-scale preferences. We conduct comprehensive experiments on simulated and real-world datasets. Empirical results on “massive synthetic” and “real-world” datasets show that ROPAL with online Bayesian inference achieves substantial improvements in robustness and noisy worker detection over current approaches
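The two standard building blocks named in this abstract can be sketched briefly: the Plackett–Luce probability of a k-ary ranking and the Kendall-tau distance between two rankings. This is not the ROPAL model itself, whose denoising vector and online inference are not specified here. The function names and the toy scores are illustrative assumptions.

```python
from itertools import combinations


def plackett_luce_probability(ranking, scores):
    """Probability of `ranking` (best item first) under a Plackett-Luce model.

    Each position is filled by choosing among the items not yet placed,
    with probability proportional to the item's positive score.
    """
    remaining = sum(scores[item] for item in ranking)
    probability = 1.0
    for item in ranking:
        probability *= scores[item] / remaining
        remaining -= scores[item]
    return probability


def kendall_tau_distance(ranking_a, ranking_b):
    """Number of item pairs that the two rankings order differently."""
    position_b = {item: index for index, item in enumerate(ranking_b)}
    return sum(
        1
        for x, y in combinations(ranking_a, 2)   # x precedes y in ranking_a
        if position_b[x] > position_b[y]          # but follows it in ranking_b
    )


# Hypothetical item scores for a 3-ary preference.
scores = {"a": 2.0, "b": 1.0, "c": 1.0}
p_abc = plackett_luce_probability(["a", "b", "c"], scores)
```

A noisy worker's ranking can then be compared with a reference ranking via `kendall_tau_distance`, where larger values indicate more pairwise disagreements.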

OPUS - University of Technology Sydney

Crowdsourcing Without a Crowd: Reliable Online Species Identification Using Bayesian Models to Minimize Crowd Size

Author: Comont Richard
Lambin Christopher
Mellish Chris
O’Mahony Elaine
Robinson Anne-Marie
Sharma Nirwan
Siddharthan Advaith
Van Der Wal René
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/07/2016

We present an incremental Bayesian model that resolves key issues of crowd size and data quality for consensus labeling. We evaluate our method using data collected from a real-world citizen science program, BeeWatch, which invites members of the public in the United Kingdom to classify (label) photographs of bumblebees as one of 22 possible species. The biological recording domain poses two key and hitherto unaddressed challenges for consensus models of crowdsourcing: (1) the large number of potential species makes classification difficult, and (2) this is compounded by limited crowd availability, stemming from both the inherent difficulty of the task and the lack of relevant skills among the general public. We demonstrate that consensus labels can be reliably found in such circumstances with very small crowd sizes of around three to five users (i.e., through group sourcing). Our incremental Bayesian model, which minimizes crowd size by re-evaluating the quality of the consensus label following each species identification solicited from the crowd, is competitive with a Bayesian approach that uses a larger but fixed crowd size and outperforms majority voting. These results have important ecological applicability: biological recording programs such as BeeWatch can sustain themselves when resources such as taxonomic experts to confirm identifications by photo submitters are scarce (as is typically the case), and feedback can be provided to submitters in a timely fashion. More generally, our model provides benefits to any crowdsourced consensus labeling task where there is a cost (financial or otherwise) associated with soliciting a label
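The idea of re-evaluating the consensus after each solicited label can be sketched as a sequential Bayesian update with an early stop. The sketch below assumes a uniform prior over classes, a single shared worker accuracy, and errors spread uniformly over the wrong classes. These assumptions, the function name `incremental_consensus`, and the default parameter values are illustrative and are not taken from the paper.

```python
def incremental_consensus(labels, num_classes, accuracy=0.7, threshold=0.9):
    """Update a posterior over classes one label at a time.

    Stops as soon as the most probable class reaches `threshold`.
    Returns (best class, its posterior probability, labels consumed).
    """
    posterior = [1.0 / num_classes] * num_classes           # uniform prior
    wrong = (1.0 - accuracy) / (num_classes - 1)            # assumed uniform errors
    used = 0
    for label in labels:
        used += 1
        posterior = [
            p * (accuracy if cls == label else wrong)
            for cls, p in enumerate(posterior)
        ]
        total = sum(posterior)
        posterior = [p / total for p in posterior]          # renormalize
        best = max(range(num_classes), key=posterior.__getitem__)
        if posterior[best] >= threshold:
            return best, posterior[best], used              # confident enough
    best = max(range(num_classes), key=posterior.__getitem__)
    return best, posterior[best], used


# Hypothetical label streams for a 22-class task (class indices 0..21).
agreeing = incremental_consensus([3, 3, 3, 3, 3], num_classes=22)
conflicting = incremental_consensus([3, 5, 3, 3, 3], num_classes=22)
```

Under these assumed parameters, two agreeing labels suffice to cross the threshold, while one disagreement causes a third label to be requested.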

Aberdeen University Research

Crossref

Open Research Online (The Open University)

Ground truth? Concept-based communities versus the external classification of physics manuscripts

Publication venue: Springer
Publication date: 20/08/2016

Springer - Publisher Connector

Hierarchical Entity Resolution using an Oracle

Author: Barna Saha
Divesh Srivastava
Donatella Firmani
Sainyam Galhotra
Publication venue: New York
Publication date: 01/01/2022

In many applications, entity references (i.e., records) and entities need to be organized to capture diverse relationships like type-subtype, is-A (mapping entities to types), and duplicate (mapping records to entities) relationships. However, automatic identification of such relationships is often inaccurate due to noise and heterogeneous representation of records across sources. Similarly, manual maintenance of these relationships is infeasible and does not scale to large datasets. In this work, we circumvent these challenges by considering weak supervision in the form of an oracle to formulate a novel hierarchical ER task. In this setting, records are clustered in a tree-like structure containing records at leaf-level and capturing record-entity (duplicate), entity-type (is-A) and subtype-supertype relationships. For effective use of supervision, we leverage triplet comparison oracle queries that take three records as input and output the most similar pair(s). We develop HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identified hierarchical structure. We show theoretically and empirically that HierER is effective under different similarity noise models and demonstrate empirically that HierER can scale up to million-size datasets
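The triplet comparison query described above (three records in, most similar pair or pairs out) can be simulated against a known hierarchy, which is a common way to test querying strategies. The sketch below assumes each record is labeled with its root-to-leaf path and defines similarity as the number of shared leading path elements. Both the path representation and the function names are assumptions for this example, and the HierER querying strategy itself is not reproduced.

```python
from itertools import combinations


def shared_depth(path_a, path_b):
    """Length of the common prefix of two root-to-leaf paths."""
    depth = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        depth += 1
    return depth


def triplet_oracle(paths, a, b, c):
    """Return the pair(s) among (a, b, c) whose paths share the deepest ancestor."""
    pairs = list(combinations((a, b, c), 2))
    depths = [shared_depth(paths[x], paths[y]) for x, y in pairs]
    deepest = max(depths)
    return [pair for pair, depth in zip(pairs, depths) if depth == deepest]


# Hypothetical ground truth: record -> (type, subtype, entity).
paths = {
    "r1": ("animal", "dog", "rex"),
    "r2": ("animal", "dog", "rex"),
    "r3": ("animal", "cat", "tom"),
    "r4": ("animal", "bird", "tweety"),
}
duplicates = triplet_oracle(paths, "r1", "r2", "r3")
tie = triplet_oracle(paths, "r1", "r3", "r4")
```

When all three records are equally related, as in `tie`, the simulated oracle returns every pair, matching the "pair(s)" wording of the abstract.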

Archivio della ricerca- Università di Roma La Sapienza