Impact of the Number of Votes on the Reliability and Validity of Subjective Speech Quality Assessment in the Crowdsourcing Approach
The subjective quality of transmitted speech is traditionally assessed in a
controlled laboratory environment according to ITU-T Rec. P.800. In turn, with
crowdsourcing, crowdworkers participate in a subjective online experiment using
their own listening device, and in their own working environment. Despite such
less controllable conditions, the increased use of crowdsourcing micro-task
platforms for quality assessment tasks has pushed a high demand for
standardized methods, resulting in ITU-T Rec. P.808. This work investigates the
impact of the number of judgments on the reliability and the validity of
quality ratings collected through crowdsourcing-based speech quality
assessments, as an input to ITU-T Rec. P.808. Three crowdsourcing experiments
on different platforms were conducted to evaluate the overall quality of three
different speech datasets, using the Absolute Category Rating procedure. For
each dataset, the Mean Opinion Scores (MOS) are calculated using differing
numbers of crowdsourcing judgements. The results are then compared to MOS
values collected in a standard laboratory experiment to assess the validity of
the crowdsourcing approach as a function of the number of votes. In addition, the
reliability of the average scores is analyzed by checking inter-rater
reliability, gain in certainty, and the confidence of the MOS. The results
provide a suggestion on the required number of votes per condition and allow
its impact on validity and reliability to be modelled.
Comment: This paper has been accepted for publication in the 2020 Twelfth
International Conference on Quality of Multimedia Experience (QoMEX).
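The dependence of MOS reliability on the number of votes can be illustrated with a minimal sketch: compute the Mean Opinion Score of a condition together with a normal-approximation 95% confidence interval, whose half-width shrinks roughly with the square root of the number of votes. The function and the example ratings below are illustrative, not the paper's data or method.

```python
import math
import random

def mos_with_ci(ratings, z=1.96):
    """Mean Opinion Score with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    mos = sum(ratings) / n
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mos, half_width

# Illustrative ACR ratings (1..5) for one condition; more votes -> tighter CI.
random.seed(0)
votes = [random.choice([3, 4, 4, 5]) for _ in range(120)]
for n in (10, 30, 120):
    mos, hw = mos_with_ci(votes[:n])
    print(f"n={n:3d}  MOS={mos:.2f}  95% CI = +/-{hw:.2f}")
```

The half-width printed for n=120 is noticeably tighter than for n=10, which is the trade-off the paper quantifies when recommending a number of votes per condition.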
Towards speech quality assessment using a crowdsourcing approach: evaluation of standardized methods
Subjective speech quality assessment has traditionally been carried out in laboratory environments under controlled conditions. With the advent of crowdsourcing platforms, tasks that require human intelligence can be resolved by crowd workers over the Internet. Crowdsourcing also offers a new paradigm for speech quality assessment, promising higher ecological validity of the quality judgments at the expense of potentially lower reliability. This paper compares laboratory-based and crowdsourcing-based speech quality assessments in terms of comparability of results and efficiency. For this purpose, three pairs of listening-only tests were carried out on three different crowdsourcing platforms, following ITU-T Recommendation P.808. In each test, listeners judge the overall quality of the speech sample following the Absolute Category Rating procedure. We compare the results of the crowdsourcing approach with the results of standard laboratory tests performed according to ITU-T Recommendation P.800. Results show that in most cases both paradigms lead to comparable results. Notable differences are discussed with respect to their sources, and conclusions are drawn that establish practical guidelines for crowdsourcing-based speech quality assessment.
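Comparability between the two paradigms is typically quantified per condition, for example with a correlation and an RMSE between laboratory and crowdsourcing MOS vectors. A minimal sketch of that comparison (the MOS values below are invented for illustration; this is not the paper's evaluation code):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root-mean-square error between paired MOS values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Invented per-condition MOS values, for illustration only.
lab_mos = [1.8, 2.5, 3.1, 3.9, 4.4]
crowd_mos = [2.0, 2.6, 3.0, 3.7, 4.2]
print(f"r = {pearson(lab_mos, crowd_mos):.3f}, RMSE = {rmse(lab_mos, crowd_mos):.3f}")
```

A high correlation with a small RMSE is what "comparable results" means operationally in this kind of study.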
An Open source Implementation of ITU-T Recommendation P.808 with Validation
The ITU-T Recommendation P.808 provides a crowdsourcing approach for
conducting a subjective assessment of speech quality using the Absolute
Category Rating (ACR) method. We provide an open-source implementation of the
ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extended
our implementation to include Degradation Category Ratings (DCR) and Comparison
Category Ratings (CCR) test methods. We also significantly speed up the test
process by integrating the participant qualification step into the main rating
task compared to a two-stage qualification and rating solution. We provide
program scripts for creating and executing the subjective test, and data
cleansing and analyzing the answers to avoid operational errors. To validate
the implementation, we compare the Mean Opinion Scores (MOS) collected through
our implementation with MOS values from a standard laboratory experiment
conducted based on the ITU-T Rec. P.800. We also evaluate the reproducibility
of the result of the subjective speech quality assessment through crowdsourcing
using our implementation. Finally, we quantify the impact of parts of the
system designed to improve the reliability: environmental tests, gold and
trapping questions, rating patterns, and a headset usage test.
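The kind of data cleansing described above can be sketched as a filter that discards submissions failing trapping or gold-question checks. The field names and thresholds are illustrative assumptions, not the reference implementation's schema:

```python
def clean_submissions(submissions, max_gold_errors=0):
    """Keep only submissions that pass the trapping question and gold checks.

    Each submission is a dict with (illustrative) fields:
      'trap_passed': bool, 'gold_errors': int, 'ratings': list of ACR scores.
    """
    kept = []
    for sub in submissions:
        if not sub["trap_passed"]:
            continue  # failed the trapping (attention) question
        if sub["gold_errors"] > max_gold_errors:
            continue  # too many wrong answers on gold-standard clips
        kept.append(sub)
    return kept

subs = [
    {"trap_passed": True, "gold_errors": 0, "ratings": [4, 5, 4]},
    {"trap_passed": False, "gold_errors": 0, "ratings": [1, 1, 1]},
    {"trap_passed": True, "gold_errors": 2, "ratings": [3, 3, 3]},
]
print(len(clean_submissions(subs)))  # only the first submission survives
```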
Application of Just-Noticeable Difference in Quality as Environment Suitability Test for Crowdsourcing Speech Quality Assessment Task
Crowdsourcing micro-task platforms facilitate subjective media quality
assessment by providing access to a highly scalable, geographically
distributed and demographically diverse pool of crowd workers. Those workers
participate in the experiment remotely from their own working environment,
using their own hardware. In the case of speech quality assessment, preliminary
work showed that environmental noise at the listener's side and the listening
device (loudspeaker or headphone) significantly affect perceived quality, and
consequently the reliability and validity of subjective ratings. As a
consequence, ITU-T Rec. P.808 specifies requirements for the listening
environment of crowd workers when assessing speech quality. In this paper, we
propose a new Just Noticeable Difference of Quality (JNDQ) test as a remote
screening method for assessing the suitability of the work environment for
participating in speech quality assessment tasks. In a laboratory experiment,
participants performed this JNDQ test with different listening devices in
different listening environments, including a silent room according to ITU-T
Rec. P.800 and a simulated background noise scenario. Results show a
significant impact of the environment and the listening device on the JNDQ
threshold. Thus, the combination of listening device and background noise needs
to be screened in a crowdsourcing speech quality test. We propose a minimum
threshold of our JNDQ test as an easily applicable screening method for this
purpose.
Comment: This paper has been accepted for publication in the 2020 Twelfth
International Conference on Quality of Multimedia Experience (QoMEX).
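A screening rule of the kind proposed here can be sketched as: estimate a worker's JNDQ threshold as the smallest quality difference they detect reliably, and admit them only if that threshold is below a cutoff. The detection data, the 75% criterion, and the cutoff below are illustrative assumptions, not the paper's calibrated values:

```python
def jndq_threshold(detections, criterion=0.75):
    """Smallest quality difference detected at or above the criterion rate.

    `detections` maps a quality difference (e.g. in MOS units) to the
    fraction of trials in which the worker noticed the degradation.
    Returns None if no tested level reaches the criterion.
    """
    for diff in sorted(detections):
        if detections[diff] >= criterion:
            return diff
    return None

def passes_screening(detections, cutoff=0.5, criterion=0.75):
    """Admit the worker only if their JNDQ threshold is at most `cutoff`."""
    thr = jndq_threshold(detections, criterion)
    return thr is not None and thr <= cutoff

# Illustrative detection rates for one worker in a quiet vs a noisy environment.
quiet = {0.25: 0.80, 0.5: 0.95, 1.0: 1.00}
noisy = {0.25: 0.40, 0.5: 0.60, 1.0: 0.90}
print(passes_screening(quiet), passes_screening(noisy))  # prints: True False
```

The noisy environment raises the threshold, so the same worker fails the screen there, which is exactly the device-plus-environment effect the paper reports.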
KonVid-150k: a dataset for no-reference video quality assessment of videos in-the-wild.
Video quality assessment (VQA) methods focus on particular degradation types, usually artificially induced on a small set of reference videos. Hence, most traditional VQA methods under-perform in-the-wild. Deep learning approaches have had limited success due to the small size and diversity of existing VQA datasets, whether artificially or authentically distorted. We introduce a new in-the-wild VQA dataset that is substantially larger and more diverse: KonVid-150k. It consists of a coarsely annotated set of 153,841 videos with five quality ratings each, and 1,596 videos with a minimum of 89 ratings each. Additionally, we propose new efficient VQA approaches (MLSP-VQA) relying on multi-level spatially pooled deep features (MLSP). They are exceptionally well suited for training at scale, compared to deep transfer learning approaches. Our best method, MLSP-VQA-FF, improves the Spearman rank-order correlation coefficient (SRCC) on the commonly used KoNViD-1k in-the-wild benchmark dataset to 0.82. It surpasses the best existing deep-learning model (0.80 SRCC) and the best hand-crafted feature-based method (0.78 SRCC). We further investigate how alternative approaches perform under different levels of label noise and dataset size, showing that MLSP-VQA-FF is overall the best method for videos in-the-wild. Finally, we show that the MLSP-VQA models trained on KonVid-150k set a new state of the art for cross-test performance on KoNViD-1k and LIVE-Qualcomm, with 0.83 and 0.64 SRCC, respectively. For KoNViD-1k, this inter-dataset testing outperforms intra-dataset experiments, showing excellent generalization.
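SRCC, the metric used throughout this abstract, is simply the Pearson correlation of the rank-transformed scores, so it rewards any monotonic agreement between predicted and subjective quality. A minimal pure-Python version with average ranks for ties:

```python
def _ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def srcc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Any monotonic relation, even a nonlinear one, yields SRCC = 1.
print(srcc([1.0, 2.0, 3.0], [1.0, 10.0, 100.0]))  # prints: 1.0
```

In practice `scipy.stats.spearmanr` is the usual choice; the hand-rolled version above only makes the definition explicit.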
Mining and quality assessment of mashup model patterns with the crowd: A feasibility study
Pattern mining, that is, the automated discovery of patterns from data, is a mathematically complex and computationally demanding problem that is generally not manageable by humans. In this article, we focus on small datasets and study whether it is possible to mine patterns with the help of the crowd by means of a set of controlled experiments on a common crowdsourcing platform. We specifically concentrate on mining model patterns from a dataset of real mashup models taken from Yahoo! Pipes and cover the entire pattern mining process, including pattern identification and quality assessment. The results of our experiments show that a sensible design of crowdsourcing tasks may indeed enable the crowd to identify patterns from small datasets (40 models). The results, however, also show that designing tasks for assessing pattern quality, i.e., deciding which patterns to retain for further processing and use, is much harder (our experiments fail to elicit assessments from the crowd that are similar to those by an expert). The problem is relevant in general to model-driven development (e.g., UML, business processes, scientific workflows), in that reusable model patterns encode valuable modeling and domain knowledge, such as best practices, organizational conventions, or technical choices, that modelers can benefit from when designing their own models.
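The crowd-versus-expert agreement question this study raises is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch for binary retain/discard decisions (the label vectors are invented for illustration; the study does not report this exact computation):

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two label sequences over the same items."""
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Invented retain/discard decisions on 8 candidate patterns.
expert = ["retain", "retain", "discard", "retain",
          "discard", "discard", "retain", "discard"]
crowd = ["retain", "discard", "discard", "retain",
         "retain", "discard", "retain", "discard"]
print(f"kappa = {cohens_kappa(expert, crowd):.2f}")  # prints: kappa = 0.50
```

A kappa near zero would correspond to the paper's negative finding: crowd quality assessments no better than chance agreement with the expert.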
Dynamic Estimation of Rater Reliability using Multi-Armed Bandits
One of the critical success factors for supervised machine learning is the quality of target values, or predictions, associated with training instances. Predictions can be discrete labels (such as a binary variable specifying whether a blog post is positive or negative) or continuous ratings (for instance, how boring a video is on a 10-point scale). In some areas, predictions are readily available, while in others, the effort of human workers has to be involved. For instance, in the task of emotion recognition from speech, a large corpus of speech recordings is usually available, and humans denote which emotions are present in which recordings.
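One simple instance of the multi-armed bandit framing in the title is an epsilon-greedy selector over raters: mostly query the rater whose estimated reliability is highest, occasionally a random one to keep exploring. The reward definition and parameters below are illustrative assumptions, not the paper's algorithm:

```python
import random

class EpsilonGreedyRaterSelector:
    """Treat raters as bandit arms; balance exploiting the most reliable
    rater found so far against exploring the others."""

    def __init__(self, rater_ids, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {r: 0 for r in rater_ids}
        self.reliability = {r: 0.0 for r in rater_ids}  # running mean reward

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.counts))  # explore
        return max(self.reliability, key=self.reliability.get)  # exploit

    def update(self, rater, reward):
        """Reward could be, e.g., agreement with a gold label or consensus."""
        self.counts[rater] += 1
        n = self.counts[rater]
        self.reliability[rater] += (reward - self.reliability[rater]) / n

# Simulated raters whose true agreement rates differ; the selector should
# concentrate its queries on the most reliable one.
sel = EpsilonGreedyRaterSelector(["r1", "r2", "r3"])
true_quality = {"r1": 0.9, "r2": 0.5, "r3": 0.3}
for _ in range(500):
    r = sel.select()
    sel.update(r, 1.0 if sel.rng.random() < true_quality[r] else 0.0)
print(max(sel.reliability, key=sel.reliability.get))
```

The incremental-mean update in `update` is the standard constant-memory way to track each arm's average reward.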
Best Practices and Recommendations for Crowdsourced QoE - Lessons learned from the Qualinet Task Force Crowdsourcing
Crowdsourcing is a popular approach that outsources tasks via the Internet to a large number of users. Commercial crowdsourcing platforms provide a global pool of users employed for performing short and simple online tasks. For quality assessment of multimedia services and applications, crowdsourcing enables new possibilities by moving the subjective test into the crowd, resulting in larger diversity of the test subjects, faster turnover of test campaigns, and reduced costs due to low reimbursement of the participants. Further, crowdsourcing allows additional features, such as real-life environments, to be addressed easily. This white paper summarizes the recommendations and best practices for crowdsourced quality assessment of multimedia applications from the Qualinet Task Force on "Crowdsourcing". The European Network on Quality of Experience in Multimedia Systems and Services Qualinet (COST Action IC 1003, see www.qualinet.eu) established this task force in 2012; it has more than 30 members. The recommendation paper resulted from the experience in designing, implementing, and conducting crowdsourcing experiments, as well as from the analysis of the crowdsourced user ratings and context data.