3,695 research outputs found

    ํฌ๋ผ์šฐ๋“œ์†Œ์‹ฑ ์‹œ์Šคํ…œ์—์„œ์˜ ๋น ๋ฅด๊ณ  ์‹ ๋ขฐ์„ฑ ๋†’์€ ์ถ”๋ก  ์•Œ๊ณ ๋ฆฌ์ฆ˜

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Dept. of Electrical and Computer Engineering, February 2021. Advisor: ์ •๊ต๋ฏผ.
    As the need for large-scale labeled data grows in various fields, the appearance of web-based crowdsourcing systems offers a promising way to exploit the wisdom of crowds efficiently, in a short time and on a relatively low budget. Despite their efficiency, crowdsourcing systems have an inherent problem: responses from workers can be unreliable, since workers are low-paid and bear little responsibility. Although simple majority voting is a natural remedy, various studies have sought better ways to aggregate noisy responses and obtain more reliable results. In this dissertation, we propose novel iterative message-passing algorithms that infer the ground truths from noisy answers and can be applied directly to real crowdsourcing systems. While EM-based algorithms have drawn attention in crowdsourcing systems for their useful inference techniques, our proposed algorithms produce faster and more reliable answers through an iterative scheme based on low-rank matrix approximation. We show that the performance of our proposed iterative algorithms is order-optimal, outperforming majority voting and EM-based algorithms. Unlike other studies that address only simple binary-choice (yes/no) questions, our work covers more complex task types, including multiple-choice questions, short-answer questions, K-approval voting, and real-valued vector regression.
    Contents:
    1 Introduction
    2 Background
        2.1 Crowdsourcing Systems for Binary-choice Questions
            2.1.1 Majority Voting
            2.1.2 Expectation Maximization
            2.1.3 Message Passing
    3 Crowdsourcing Systems for Multiple-choice Questions
        3.1 Related Work
        3.2 Problem Setup
        3.3 Inference Algorithm
            3.3.1 Task Allocation
            3.3.2 Multiple Iterative Algorithm
            3.3.3 Task Allocation for General Setting
        3.4 Applications
        3.5 Analysis of Algorithms
            3.5.1 Quality of Workers
            3.5.2 Bound on the Average Error Probability
            3.5.3 Proof of the Error Bounds
            3.5.4 Proof of Sub-Gaussianity
        3.6 Experimental Results
            3.6.1 Comparison with Other Algorithms
            3.6.2 Adaptive Scenario
            3.6.3 Simulations on a Set of Various D Values
        3.7 Conclusion
    4 Crowdsourcing Systems for Multiple-choice Questions with K-Approval Voting
        4.1 Related Work
        4.2 Problem Setup
            4.2.1 Problem Definition
            4.2.2 Worker Model for Various (D, K)
        4.3 Inference Algorithm
        4.4 Analysis of Algorithms
            4.4.1 Worker Model
            4.4.2 Quality of Workers
            4.4.3 Bound on the Average Error Probability
            4.4.4 Proof of the Error Bounds
            4.4.5 Proof of Sub-Gaussianity
            4.4.6 Phase Transition
        4.5 Experimental Results
            4.5.1 Performance on the Average Error with q and l
            4.5.2 Relationship between Reliability and y-message
            4.5.3 Performance on the Average Error with Various (D, K) Pairs
        4.6 Conclusion
    5 Crowdsourcing Systems for Real-valued Vector Regression
        5.1 Related Work
        5.2 Problem Setup
        5.3 Inference Algorithm
            5.3.1 Task Message
            5.3.2 Worker Message
        5.4 Analysis of Algorithms
            5.4.1 Worker Model
            5.4.2 Oracle Estimator
            5.4.3 Bound on the Average Error Probability
        5.5 Experimental Results
            5.5.1 Real Crowdsourcing Data
            5.5.2 Verification of the Error Bounds with Synthetic Data
        5.6 Conclusion
    6 Conclusions
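    For the simplest binary-choice setting, the iterative message-passing idea the abstract describes can be sketched roughly as below. This is a hypothetical Karger-Oh style sketch, not the dissertation's exact algorithms; the function and variable names are illustrative. Task messages weigh each worker's vote by the other workers' estimated reliability, and worker messages score each worker by agreement with the current task beliefs.

    ```python
    import random
    from collections import defaultdict

    def iterative_inference(answers, n_iter=10, seed=0):
        """Estimate binary ground truths (+1/-1) from noisy worker answers.

        answers: dict mapping (task, worker) -> +1 or -1.
        Illustrative message-passing sketch, not the dissertation's algorithm.
        """
        rng = random.Random(seed)
        tasks = defaultdict(list)    # task -> workers who answered it
        workers = defaultdict(list)  # worker -> tasks they answered
        for (i, j) in answers:
            tasks[i].append(j)
            workers[j].append(i)

        # Worker messages start near 1 with small noise to break symmetry.
        y = {(j, i): 1.0 + rng.gauss(0, 0.1) for (i, j) in answers}
        x = {}
        for _ in range(n_iter):
            # Task update: weigh each vote by the *other* workers' reliability.
            for (i, j) in answers:
                x[(i, j)] = sum(answers[(i, jp)] * y[(jp, i)]
                                for jp in tasks[i] if jp != j)
            # Worker update: agreement with the other tasks' current beliefs.
            for (i, j) in answers:
                y[(j, i)] = sum(answers[(ip, j)] * x[(ip, j)]
                                for ip in workers[j] if ip != i)
        # Final decision: sign of the reliability-weighted vote.
        est = {}
        for i in tasks:
            s = sum(answers[(i, j)] * y[(j, i)] for j in tasks[i])
            est[i] = 1 if s >= 0 else -1
        return est
    ```

    On a toy instance where two workers always answer correctly and one always answers incorrectly, the scheme learns a negative reliability for the bad worker, so its votes end up flipping sign rather than merely being outvoted.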

    Optimal Inference in Crowdsourced Classification via Belief Propagation

    Crowdsourcing systems are popular for solving large-scale labelling tasks with low-paid workers. We study the problem of recovering the true labels from possibly erroneous crowdsourced labels under the popular Dawid-Skene model. To address this inference problem, several algorithms have recently been proposed, but the best known guarantee is still significantly larger than the fundamental limit. We close this gap by introducing a tighter lower bound on the fundamental limit and proving that Belief Propagation (BP) exactly matches this lower bound. The guaranteed optimality of BP is the strongest in the sense that it is information-theoretically impossible for any other algorithm to correctly label a larger fraction of the tasks. Experimental results suggest that BP is close to optimal for all regimes considered and improves upon competing state-of-the-art algorithms.
    Comment: This article is partially based on preliminary results published in the proceedings of the 33rd International Conference on Machine Learning (ICML 2016).
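    For intuition about the Dawid-Skene setting, here is a minimal EM sketch for its one-coin special case, where each worker j is assumed correct with a single probability p_j. This illustrates the EM-style baseline such papers compare against, not the paper's BP algorithm; all names and the initialization are assumptions.

    ```python
    import numpy as np

    def em_one_coin(A, n_iter=20):
        """EM under the one-coin Dawid-Skene model for binary labels.

        A: (n_tasks, n_workers) array with entries +1, -1, or 0 (no answer).
        Returns (estimated labels, estimated worker reliabilities).
        Illustrative sketch only.
        """
        answered = (A != 0)
        n, m = A.shape
        p = np.full(m, 0.7)                       # initial reliability guess
        for _ in range(n_iter):
            # E-step: posterior log-odds that each task's true label is +1,
            # summing each worker's vote weighted by its log-odds of being right.
            w = np.log(p / (1.0 - p))
            q = 1.0 / (1.0 + np.exp(-(A * w).sum(axis=1)))
            # M-step: expected fraction of correct answers per worker.
            agree = np.where(A == 1, q[:, None], 1.0 - q[:, None])
            p = (agree * answered).sum(axis=0) / answered.sum(axis=0)
            p = np.clip(p, 0.05, 0.95)            # keep the log-odds finite
        return np.where(q >= 0.5, 1, -1), p
    ```

    An adversarial worker's estimated reliability drops below 1/2, so the E-step automatically flips the sign of its votes instead of discarding them.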

    Iterative Bayesian Learning for Crowdsourced Regression

    Crowdsourcing platforms have emerged as popular venues for purchasing human intelligence at low cost for large volumes of tasks. As many low-paid workers are prone to giving noisy answers, a common practice is to add redundancy by assigning multiple workers to each task and then simply averaging their answers. However, to fully harness the wisdom of the crowd, one needs to learn the heterogeneous quality of each worker. We resolve this fundamental challenge in crowdsourced regression tasks, i.e., where answers take continuous labels, and identifying good or bad workers is much less trivial than in a classification setting with discrete labels. In particular, we introduce a Bayesian iterative scheme and show that it provably achieves the optimal mean squared error. Our evaluations on synthetic and real-world datasets support our theoretical results and show the superiority of the proposed scheme.
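    The redundancy-plus-reweighting idea for continuous labels can be sketched with a simple alternating scheme, assuming each worker's answer is the truth plus Gaussian noise of unknown per-worker variance. This is an illustrative sketch, not the paper's Bayesian algorithm; all names are made up.

    ```python
    def iterative_regression(answers, n_iter=10):
        """Aggregate continuous answers by re-estimating worker noise.

        answers: dict mapping (task, worker) -> float observation.
        Alternates between a precision-weighted average per task and a
        residual-based noise-variance estimate per worker. Sketch only.
        """
        tasks = sorted({i for i, _ in answers})
        workers = sorted({j for _, j in answers})
        var = {j: 1.0 for j in workers}           # initial noise variances
        mu = {}
        for _ in range(n_iter):
            # Step 1: precision-weighted estimate of each task's value.
            for i in tasks:
                obs = [(answers[(i, j)], 1.0 / var[j])
                       for j in workers if (i, j) in answers]
                total = sum(w for _, w in obs)
                mu[i] = sum(a * w for a, w in obs) / total
            # Step 2: re-estimate each worker's noise from its residuals.
            for j in workers:
                res = [(answers[(i, j)] - mu[i]) ** 2
                       for i in tasks if (i, j) in answers]
                var[j] = max(sum(res) / len(res), 1e-6)  # avoid zero variance
        return mu, var
    ```

    After a few iterations, precise workers receive large weights and noisy workers are effectively down-weighted, so the task estimates converge toward the precise workers' consensus rather than the plain average.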
    • โ€ฆ
    corecore