Impact of the Number of Votes on the Reliability and Validity of Subjective Speech Quality Assessment in the Crowdsourcing Approach
The subjective quality of transmitted speech is traditionally assessed in a
controlled laboratory environment according to ITU-T Rec. P.800. In turn, with
crowdsourcing, crowdworkers participate in a subjective online experiment using
their own listening device, and in their own working environment. Despite such
less controllable conditions, the increased use of crowdsourcing micro-task
platforms for quality assessment tasks has pushed a high demand for
standardized methods, resulting in ITU-T Rec. P.808. This work investigates the
impact of the number of judgments on the reliability and the validity of
quality ratings collected through crowdsourcing-based speech quality
assessments, as an input to ITU-T Rec. P.808. Three crowdsourcing experiments
on different platforms were conducted to evaluate the overall quality of three
different speech datasets, using the Absolute Category Rating procedure. For
each dataset, the Mean Opinion Scores (MOS) are calculated using differing
numbers of crowdsourcing judgements. The results are then compared to MOS
values collected in a standard laboratory experiment to assess the validity of
the crowdsourcing approach as a function of the number of votes. In addition, the
reliability of the average scores is analyzed by checking inter-rater
reliability, gain in certainty, and the confidence of the MOS. The results
provide a suggestion on the required number of votes per condition and allow
its impact on validity and reliability to be modelled.
Comment: This paper has been accepted for publication in the 2020 Twelfth
International Conference on Quality of Multimedia Experience (QoMEX).
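The dependence of MOS reliability on the number of votes can be illustrated with a minimal sketch: compute the Mean Opinion Score of a condition together with a normal-approximation 95% confidence interval, whose half-width shrinks roughly with the square root of the number of votes. The function and the example ratings below are illustrative, not the paper's data or method.

```python
import math
import random

def mos_with_ci(ratings, z=1.96):
    """Mean Opinion Score with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    mos = sum(ratings) / n
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mos, half_width

# Illustrative ACR ratings (1..5) for one condition; more votes -> tighter CI.
random.seed(0)
votes = [random.choice([3, 4, 4, 5]) for _ in range(120)]
for n in (10, 30, 120):
    mos, hw = mos_with_ci(votes[:n])
    print(f"n={n:3d}  MOS={mos:.2f}  95% CI = +/-{hw:.2f}")
```

The half-width printed for n=120 is noticeably tighter than for n=10, which is the trade-off the paper quantifies when recommending a number of votes per condition.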
Towards speech quality assessment using a crowdsourcing approach: evaluation of standardized methods
Subjective speech quality assessment has traditionally been carried out in laboratory environments under controlled conditions. With the advent of crowdsourcing platforms, tasks that require human intelligence can be resolved by crowd workers over the Internet. Crowdsourcing also offers a new paradigm for speech quality assessment, promising higher ecological validity of the quality judgments at the expense of potentially lower reliability. This paper compares laboratory-based and crowdsourcing-based speech quality assessments in terms of comparability of results and efficiency. For this purpose, three pairs of listening-only tests were carried out on three different crowdsourcing platforms, following ITU-T Recommendation P.808. In each test, listeners judge the overall quality of the speech sample following the Absolute Category Rating procedure. We compare the results of the crowdsourcing approach with the results of standard laboratory tests performed according to ITU-T Recommendation P.800. Results show that in most cases both paradigms lead to comparable results. Notable differences are discussed with respect to their sources, and conclusions are drawn that establish practical guidelines for crowdsourcing-based speech quality assessment.
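Comparability between the two paradigms is typically quantified per condition, for example with a correlation and an RMSE between laboratory and crowdsourcing MOS vectors. A minimal sketch of that comparison (the MOS values below are invented for illustration; this is not the paper's evaluation code):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root-mean-square error between paired MOS values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Invented per-condition MOS values, for illustration only.
lab_mos = [1.8, 2.5, 3.1, 3.9, 4.4]
crowd_mos = [2.0, 2.6, 3.0, 3.7, 4.2]
print(f"r = {pearson(lab_mos, crowd_mos):.3f}, RMSE = {rmse(lab_mos, crowd_mos):.3f}")
```

A high correlation with a small RMSE is what "comparable results" means operationally in this kind of study.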
An Open source Implementation of ITU-T Recommendation P.808 with Validation
The ITU-T Recommendation P.808 provides a crowdsourcing approach for
conducting a subjective assessment of speech quality using the Absolute
Category Rating (ACR) method. We provide an open-source implementation of the
ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extended
our implementation to include Degradation Category Ratings (DCR) and Comparison
Category Ratings (CCR) test methods. We also significantly speed up the test
process by integrating the participant qualification step into the main rating
task compared to a two-stage qualification and rating solution. We provide
program scripts for creating and executing the subjective test, and data
cleansing and analyzing the answers to avoid operational errors. To validate
the implementation, we compare the Mean Opinion Scores (MOS) collected through
our implementation with MOS values from a standard laboratory experiment
conducted based on the ITU-T Rec. P.800. We also evaluate the reproducibility
of the result of the subjective speech quality assessment through crowdsourcing
using our implementation. Finally, we quantify the impact of parts of the
system designed to improve the reliability: environmental tests, gold and
trapping questions, rating patterns, and a headset usage test.
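The kind of data cleansing described above can be sketched as a filter that discards submissions failing trapping or gold-question checks. The field names and thresholds are illustrative assumptions, not the reference implementation's schema:

```python
def clean_submissions(submissions, max_gold_errors=0):
    """Keep only submissions that pass the trapping question and gold checks.

    Each submission is a dict with (illustrative) fields:
      'trap_passed': bool, 'gold_errors': int, 'ratings': list of ACR scores.
    """
    kept = []
    for sub in submissions:
        if not sub["trap_passed"]:
            continue  # failed the trapping (attention) question
        if sub["gold_errors"] > max_gold_errors:
            continue  # too many wrong answers on gold-standard clips
        kept.append(sub)
    return kept

subs = [
    {"trap_passed": True, "gold_errors": 0, "ratings": [4, 5, 4]},
    {"trap_passed": False, "gold_errors": 0, "ratings": [1, 1, 1]},
    {"trap_passed": True, "gold_errors": 2, "ratings": [3, 3, 3]},
]
print(len(clean_submissions(subs)))  # only the first submission survives
```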
Application of Just-Noticeable Difference in Quality as Environment Suitability Test for Crowdsourcing Speech Quality Assessment Task
Crowdsourcing micro-task platforms facilitate subjective media quality
assessment by providing access to a highly scalable, geographically
distributed and demographically diverse pool of crowd workers. Those workers
participate in the experiment remotely from their own working environment,
using their own hardware. In the case of speech quality assessment, preliminary
work showed that environmental noise at the listener's side and the listening
device (loudspeaker or headphone) significantly affect perceived quality, and
consequently the reliability and validity of subjective ratings. As a
consequence, ITU-T Rec. P.808 specifies requirements for the listening
environment of crowd workers when assessing speech quality. In this paper, we
propose a new Just Noticeable Difference of Quality (JNDQ) test as a remote
screening method for assessing the suitability of the work environment for
participating in speech quality assessment tasks. In a laboratory experiment,
participants performed this JNDQ test with different listening devices in
different listening environments, including a silent room according to ITU-T
Rec. P.800 and a simulated background noise scenario. Results show a
significant impact of the environment and the listening device on the JNDQ
threshold. Thus, the combination of listening device and background noise needs
to be screened in a crowdsourcing speech quality test. We propose a minimum
threshold of our JNDQ test as an easily applicable screening method for this
purpose.
Comment: This paper has been accepted for publication in the 2020 Twelfth
International Conference on Quality of Multimedia Experience (QoMEX).
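A screening rule of the kind proposed here can be sketched as: estimate a worker's JNDQ threshold as the smallest quality difference they detect reliably, and admit them only if that threshold is below a cutoff. The detection data, the 75% criterion, and the cutoff below are illustrative assumptions, not the paper's calibrated values:

```python
def jndq_threshold(detections, criterion=0.75):
    """Smallest quality difference detected at or above the criterion rate.

    `detections` maps a quality difference (e.g. in MOS units) to the
    fraction of trials in which the worker noticed the degradation.
    Returns None if no tested level reaches the criterion.
    """
    for diff in sorted(detections):
        if detections[diff] >= criterion:
            return diff
    return None

def passes_screening(detections, cutoff=0.5, criterion=0.75):
    """Admit the worker only if their JNDQ threshold is at most `cutoff`."""
    thr = jndq_threshold(detections, criterion)
    return thr is not None and thr <= cutoff

# Illustrative detection rates for one worker in a quiet vs a noisy environment.
quiet = {0.25: 0.80, 0.5: 0.95, 1.0: 1.00}
noisy = {0.25: 0.40, 0.5: 0.60, 1.0: 0.90}
print(passes_screening(quiet), passes_screening(noisy))  # prints: True False
```

The noisy environment raises the threshold, so the same worker fails the screen there, which is exactly the device-plus-environment effect the paper reports.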
KonVid-150k: a dataset for no-reference video quality assessment of videos in-the-wild.
Video quality assessment (VQA) methods focus on particular degradation types, usually artificially induced on a small set of reference videos. Hence, most traditional VQA methods under-perform in-the-wild. Deep learning approaches have had limited success due to the small size and diversity of existing VQA datasets, whether artificially or authentically distorted. We introduce a new in-the-wild VQA dataset that is substantially larger and more diverse: KonVid-150k. It consists of a coarsely annotated set of 153,841 videos with five quality ratings each, and 1,596 videos with a minimum of 89 ratings each. Additionally, we propose new efficient VQA approaches (MLSP-VQA) relying on multi-level spatially pooled deep features (MLSP). They are exceptionally well suited for training at scale, compared to deep transfer learning approaches. Our best method, MLSP-VQA-FF, improves the Spearman rank-order correlation coefficient (SRCC) on the commonly used KoNViD-1k in-the-wild benchmark dataset to 0.82. It surpasses the best existing deep-learning model (0.80 SRCC) and the best hand-crafted feature-based method (0.78 SRCC). We further investigate how alternative approaches perform under different levels of label noise and dataset size, showing that MLSP-VQA-FF is overall the best method for videos in-the-wild. Finally, we show that the MLSP-VQA models trained on KonVid-150k set a new state of the art for cross-test performance on KoNViD-1k and LIVE-Qualcomm, with 0.83 and 0.64 SRCC, respectively. For KoNViD-1k, this inter-dataset testing outperforms intra-dataset experiments, showing excellent generalization.
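SRCC, the metric used throughout this abstract, is simply the Pearson correlation of the rank-transformed scores, so it rewards any monotonic agreement between predicted and subjective quality. A minimal pure-Python version with average ranks for ties:

```python
def _ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def srcc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Any monotonic relation, even a nonlinear one, yields SRCC = 1.
print(srcc([1.0, 2.0, 3.0], [1.0, 10.0, 100.0]))  # prints: 1.0
```

In practice `scipy.stats.spearmanr` is the usual choice; the hand-rolled version above only makes the definition explicit.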
Mining and quality assessment of mashup model patterns with the crowd: A feasibility study
Pattern mining, that is, the automated discovery of patterns from data, is a mathematically complex and computationally demanding problem that is generally not manageable by humans. In this article, we focus on small datasets and study whether it is possible to mine patterns with the help of the crowd by means of a set of controlled experiments on a common crowdsourcing platform. We specifically concentrate on mining model patterns from a dataset of real mashup models taken from Yahoo! Pipes and cover the entire pattern mining process, including pattern identification and quality assessment. The results of our experiments show that a sensible design of crowdsourcing tasks may indeed enable the crowd to identify patterns from small datasets (40 models). The results, however, also show that designing tasks for assessing pattern quality, i.e., deciding which patterns to retain for further processing and use, is much harder (our experiments fail to elicit assessments from the crowd that are similar to those by an expert). The problem is relevant in general to model-driven development (e.g., UML, business processes, scientific workflows), in that reusable model patterns encode valuable modeling and domain knowledge, such as best practices, organizational conventions, or technical choices, that modelers can benefit from when designing their own models.
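The crowd-versus-expert agreement question this study raises is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch for binary retain/discard decisions (the label vectors are invented for illustration; the study does not report this exact computation):

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two label sequences over the same items."""
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Invented retain/discard decisions on 8 candidate patterns.
expert = ["retain", "retain", "discard", "retain",
          "discard", "discard", "retain", "discard"]
crowd = ["retain", "discard", "discard", "retain",
         "retain", "discard", "retain", "discard"]
print(f"kappa = {cohens_kappa(expert, crowd):.2f}")  # prints: kappa = 0.50
```

A kappa near zero would correspond to the paper's negative finding: crowd quality assessments no better than chance agreement with the expert.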
Dynamic Estimation of Rater Reliability using Multi-Armed Bandits
One of the critical success factors for supervised machine learning is the quality of target values, or predictions, associated with training instances. Predictions can be discrete labels (such as a binary variable specifying whether a blog post is positive or negative) or continuous ratings (for instance, how boring a video is on a 10-point scale). In some areas, predictions are readily available, while in others, the effort of human workers has to be involved. For instance, in the task of emotion recognition from speech, a large corpus of speech recordings is usually available, and humans denote which emotions are present in which recordings.
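One simple instance of the multi-armed bandit framing in the title is an epsilon-greedy selector over raters: mostly query the rater whose estimated reliability is highest, occasionally a random one to keep exploring. The reward definition and parameters below are illustrative assumptions, not the paper's algorithm:

```python
import random

class EpsilonGreedyRaterSelector:
    """Treat raters as bandit arms; balance exploiting the most reliable
    rater found so far against exploring the others."""

    def __init__(self, rater_ids, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {r: 0 for r in rater_ids}
        self.reliability = {r: 0.0 for r in rater_ids}  # running mean reward

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.counts))  # explore
        return max(self.reliability, key=self.reliability.get)  # exploit

    def update(self, rater, reward):
        """Reward could be, e.g., agreement with a gold label or consensus."""
        self.counts[rater] += 1
        n = self.counts[rater]
        self.reliability[rater] += (reward - self.reliability[rater]) / n

# Simulated raters whose true agreement rates differ; the selector should
# concentrate its queries on the most reliable one.
sel = EpsilonGreedyRaterSelector(["r1", "r2", "r3"])
true_quality = {"r1": 0.9, "r2": 0.5, "r3": 0.3}
for _ in range(500):
    r = sel.select()
    sel.update(r, 1.0 if sel.rng.random() < true_quality[r] else 0.0)
print(max(sel.reliability, key=sel.reliability.get))
```

The incremental-mean update in `update` is the standard constant-memory way to track each arm's average reward.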
Best Practices and Recommendations for Crowdsourced QoE - Lessons learned from the Qualinet Task Force Crowdsourcing
Crowdsourcing is a popular approach that outsources tasks via the Internet to a large number of users. Commercial crowdsourcing platforms provide a global pool of users employed for performing short and simple online tasks. For quality assessment of multimedia services and applications, crowdsourcing enables new possibilities by moving the subjective test into the crowd, resulting in larger diversity of the test subjects, faster turnover of test campaigns, and reduced costs due to low reimbursement of the participants. Further, crowdsourcing allows additional features, such as real-life environments, to be addressed easily. This white paper summarizes the recommendations and best practices for crowdsourced quality assessment of multimedia applications from the Qualinet Task Force on "Crowdsourcing". The European Network on Quality of Experience in Multimedia Systems and Services Qualinet (COST Action IC 1003, see www.qualinet.eu) established this task force in 2012; it has more than 30 members. The recommendation paper resulted from the experience in designing, implementing, and conducting crowdsourcing experiments, as well as from the analysis of the crowdsourced user ratings and context data.