691 research outputs found

    Impact of the Number of Votes on the Reliability and Validity of Subjective Speech Quality Assessment in the Crowdsourcing Approach

    The subjective quality of transmitted speech is traditionally assessed in a controlled laboratory environment according to ITU-T Rec. P.800. With crowdsourcing, in contrast, crowdworkers participate in a subjective online experiment using their own listening device in their own working environment. Despite these less controllable conditions, the increased use of crowdsourcing micro-task platforms for quality assessment tasks has created a high demand for standardized methods, resulting in ITU-T Rec. P.808. This work investigates the impact of the number of judgments on the reliability and the validity of quality ratings collected through crowdsourcing-based speech quality assessments, as an input to ITU-T Rec. P.808. Three crowdsourcing experiments on different platforms were conducted to evaluate the overall quality of three different speech datasets, using the Absolute Category Rating procedure. For each dataset, the Mean Opinion Scores (MOS) are calculated using differing numbers of crowdsourcing judgments. The results are then compared to MOS values collected in a standard laboratory experiment, to assess the validity of the crowdsourcing approach as a function of the number of votes. In addition, the reliability of the average scores is analyzed by checking inter-rater reliability, gain in certainty, and the confidence of the MOS. The results provide a suggestion on the required number of votes per condition, and allow its impact on validity and reliability to be modeled. Comment: This paper has been accepted for publication in the 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX).
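
    As a rough illustration of how the confidence of a MOS depends on the number of votes, the sketch below computes the mean rating and an approximate 95% confidence interval for growing subsets of votes. The function name `mos_with_ci`, the normal-approximation factor 1.96, and the vote values are assumptions made for this example; they are not taken from the paper or from ITU-T Rec. P.808.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumption: the interval uses the normal approximation z * s / sqrt(n),
    which is only a rough guide for small numbers of votes.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    sd = statistics.stdev(ratings)  # sample standard deviation (n - 1)
    return mos, z * sd / math.sqrt(n)


# Hypothetical ACR votes (1 = bad ... 5 = excellent) for one test condition.
votes = [4, 3, 5, 4, 4, 2, 3, 4, 5, 3, 4, 4]

# Recompute the MOS and its interval as more votes become available.
for n in (4, 8, 12):
    mos, ci = mos_with_ci(votes[:n])
    print(f"n={n:2d}  MOS={mos:.2f}  CI95=+/-{ci:.2f}")
```

    For ratings with a similar spread, the interval half-width shrinks roughly with the square root of the number of votes, which is one way to reason about how many votes per condition are needed.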

    Towards speech quality assessment using a crowdsourcing approach: evaluation of standardized methods

    Subjective speech quality assessment has traditionally been carried out in laboratory environments under controlled conditions. With the advent of crowdsourcing platforms, tasks that need human intelligence can be completed by crowd workers over the Internet. Crowdsourcing also offers a new paradigm for speech quality assessment, promising higher ecological validity of the quality judgments at the expense of potentially lower reliability. This paper compares laboratory-based and crowdsourcing-based speech quality assessments in terms of comparability of results and efficiency. For this purpose, three pairs of listening-only tests have been carried out using three different crowdsourcing platforms and following the ITU-T Recommendation P.808. In each test, listeners judge the overall quality of the speech sample following the Absolute Category Rating procedure. We compare the results of the crowdsourcing approach with the results of standard laboratory tests performed according to the ITU-T Recommendation P.800. Results show that in most cases, both paradigms lead to comparable results. Notable differences are discussed with respect to their sources, and conclusions are drawn that establish practical guidelines for crowdsourcing-based speech quality assessment.

    An Open source Implementation of ITU-T Recommendation P.808 with Validation

    The ITU-T Recommendation P.808 provides a crowdsourcing approach for conducting a subjective assessment of speech quality using the Absolute Category Rating (ACR) method. We provide an open-source implementation of the ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extended our implementation to include the Degradation Category Rating (DCR) and Comparison Category Rating (CCR) test methods. We also significantly speed up the test process by integrating the participant qualification step into the main rating task, compared to a two-stage qualification and rating solution. We provide program scripts for creating and executing the subjective test, and for cleaning the data and analyzing the answers, to avoid operational errors. To validate the implementation, we compare the Mean Opinion Scores (MOS) collected through our implementation with MOS values from a standard laboratory experiment conducted based on the ITU-T Rec. P.800. We also evaluate the reproducibility of the results of the subjective speech quality assessment through crowdsourcing using our implementation. Finally, we quantify the impact of parts of the system designed to improve the reliability: environmental tests, gold and trapping questions, rating patterns, and a headset usage test.

    Application of Just-Noticeable Difference in Quality as Environment Suitability Test for Crowdsourcing Speech Quality Assessment Task

    Crowdsourcing micro-task platforms facilitate subjective media quality assessment by providing access to a highly scalable, geographically distributed and demographically diverse pool of crowd workers. Those workers participate in the experiment remotely from their own working environment, using their own hardware. In the case of speech quality assessment, preliminary work showed that environmental noise at the listener's side and the listening device (loudspeaker or headphone) significantly affect perceived quality, and consequently the reliability and validity of subjective ratings. As a consequence, ITU-T Rec. P.808 specifies requirements for the listening environment of crowd workers when assessing speech quality. In this paper, we propose a new Just Noticeable Difference of Quality (JNDQ) test as a remote screening method for assessing the suitability of the work environment for participating in speech quality assessment tasks. In a laboratory experiment, participants performed this JNDQ test with different listening devices in different listening environments, including a silent room according to ITU-T Rec. P.800 and a simulated background noise scenario. Results show a significant impact of the environment and the listening device on the JNDQ threshold. Thus, the combination of listening device and background noise needs to be screened in a crowdsourcing speech quality test. We propose a minimum threshold of our JNDQ test as an easily applicable screening method for this purpose. Comment: This paper has been accepted for publication in the 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX).

    KonVid-150k: a dataset for no-reference video quality assessment of videos in-the-wild.

    Video quality assessment (VQA) methods focus on particular degradation types, usually artificially induced on a small set of reference videos. Hence, most traditional VQA methods under-perform in-the-wild. Deep learning approaches have had limited success due to the small size and limited diversity of existing VQA datasets, whether artificially or authentically distorted. We introduce a new in-the-wild VQA dataset that is substantially larger and more diverse: KonVid-150k. It consists of a coarsely annotated set of 153,841 videos having five quality ratings each, and 1,596 videos with a minimum of 89 ratings each. Additionally, we propose new efficient VQA approaches (MLSP-VQA) relying on multi-level spatially pooled deep-features (MLSP). They are exceptionally well suited for training at scale, compared to deep transfer learning approaches. Our best method, MLSP-VQA-FF, improves the Spearman rank-order correlation coefficient (SRCC) performance metric on the commonly used KoNViD-1k in-the-wild benchmark dataset to 0.82. It surpasses the best existing deep-learning model (0.80 SRCC) and hand-crafted feature-based method (0.78 SRCC). We further investigate how alternative approaches perform under different levels of label noise and dataset size, showing that MLSP-VQA-FF is the overall best method for videos in-the-wild. Finally, we show that the MLSP-VQA models trained on KonVid-150k set the new state-of-the-art for cross-test performance on KoNViD-1k and LIVE-Qualcomm with a 0.83 and 0.64 SRCC, respectively. For KoNViD-1k this inter-dataset testing outperforms intra-dataset experiments, showing excellent generalization.
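
    For readers unfamiliar with the SRCC metric quoted above, the following minimal sketch computes it as the Pearson correlation of the ranks, with tied values sharing their average rank. The helper names `average_ranks` and `srcc` and the sample scores are assumptions made for illustration; this is not the evaluation code of the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical subjective scores and model predictions for five videos.
subjective = [3.1, 4.2, 2.0, 4.8, 3.6]
predicted = [2.9, 4.0, 2.5, 4.4, 4.1]
print(f"SRCC = {srcc(subjective, predicted):.2f}")
```

    Because only the ordering matters, a model can reach a high SRCC even when its predicted scores are on a different scale from the subjective ratings.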

    Mining and quality assessment of mashup model patterns with the crowd: A feasibility study

    Pattern mining, that is, the automated discovery of patterns from data, is a mathematically complex and computationally demanding problem that is generally not manageable by humans. In this article, we focus on small datasets and study whether it is possible to mine patterns with the help of the crowd by means of a set of controlled experiments on a common crowdsourcing platform. We specifically concentrate on mining model patterns from a dataset of real mashup models taken from Yahoo! Pipes and cover the entire pattern mining process, including pattern identification and quality assessment. The results of our experiments show that a sensible design of crowdsourcing tasks may indeed enable the crowd to identify patterns from small datasets (40 models). The results, however, also show that designing tasks for assessing the quality of patterns, in order to decide which patterns to retain for further processing and use, is much harder (our experiments fail to elicit assessments from the crowd that are similar to those by an expert). The problem is relevant in general to model-driven development (e.g., UML, business processes, scientific workflows), in that reusable model patterns encode valuable modeling and domain knowledge, such as best practices, organizational conventions, or technical choices, that modelers can benefit from when designing their own models.

    Dynamic Estimation of Rater Reliability using Multi-Armed Bandits

    One of the critical success factors for supervised machine learning is the quality of target values, or predictions, associated with training instances. Predictions can be discrete labels (such as a binary variable specifying whether a blog post is positive or negative) or continuous ratings (for instance, how boring a video is on a 10-point scale). In some areas, predictions are readily available, while in others, the effort of human workers has to be involved. For instance, in the task of emotion recognition from speech, a large corpus of speech recordings is usually available, and humans denote which emotions are present in which recordings.

    Best Practices and Recommendations for Crowdsourced QoE - Lessons learned from the Qualinet Task Force Crowdsourcing

    Crowdsourcing is a popular approach that outsources tasks via the Internet to a large number of users. Commercial crowdsourcing platforms provide a global pool of users employed for performing short and simple online tasks. For quality assessment of multimedia services and applications, crowdsourcing enables new possibilities by moving the subjective test into the crowd, resulting in larger diversity of the test subjects, faster turnover of test campaigns, and reduced costs due to the low reimbursement of the participants. Further, crowdsourcing makes it easy to address additional features like real-life environments. This white paper summarizes the recommendations and best practices for crowdsourced quality assessment of multimedia applications from the Qualinet Task Force on “Crowdsourcing”. The European Network on Quality of Experience in Multimedia Systems and Services Qualinet (COST Action IC 1003, see www.qualinet.eu) established this task force, which has more than 30 members, in 2012. The recommendation paper resulted from the experience in designing, implementing, and conducting crowdsourcing experiments as well as the analysis of the crowdsourced user ratings and context data.