    An Open source Implementation of ITU-T Recommendation P.808 with Validation

    The ITU-T Recommendation P.808 provides a crowdsourcing approach for conducting a subjective assessment of speech quality using the Absolute Category Rating (ACR) method. We provide an open-source implementation of the ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extended our implementation to include the Degradation Category Rating (DCR) and Comparison Category Rating (CCR) test methods. We also significantly speed up the test process by integrating the participant qualification step into the main rating task, rather than using a two-stage qualification-and-rating procedure. We provide program scripts for creating and executing the subjective test, and for cleansing and analyzing the answers, to avoid operational errors. To validate the implementation, we compare the Mean Opinion Scores (MOS) collected through our implementation with MOS values from a standard laboratory experiment conducted according to the ITU-T Rec. P.800. We also evaluate the reproducibility of crowdsourced subjective speech quality assessments carried out with our implementation. Finally, we quantify the impact of the parts of the system designed to improve reliability: environmental tests, gold and trapping questions, rating patterns, and a headset usage test.
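
    The MOS values referred to above are typically computed per condition as the arithmetic mean of the ACR votes, reported together with a confidence interval. The following is a minimal Python sketch of that computation, with a hypothetical data layout; it is not taken from the P.808 toolkit itself.

        # Minimal sketch: per-condition MOS with a 95% confidence interval
        # from ACR votes on the 1-5 scale. The dict layout is hypothetical.
        import math

        def mos_with_ci(ratings_per_condition):
            """ratings_per_condition: condition id -> list of 1..5 votes."""
            results = {}
            for cond, votes in ratings_per_condition.items():
                n = len(votes)
                mean = sum(votes) / n
                var = sum((v - mean) ** 2 for v in votes) / (n - 1)  # sample variance
                ci95 = 1.96 * math.sqrt(var / n)  # normal approximation
                results[cond] = (mean, ci95)
            return results

        ratings = {"clean": [5, 4, 5, 4, 5], "degraded": [2, 3, 2, 2, 3]}
        for cond, (mos, ci) in mos_with_ci(ratings).items():
            print(f"{cond}: MOS = {mos:.2f} +/- {ci:.2f}")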

    Deepfake detection: humans vs. machines

    Deepfake videos, in which a person's face is automatically swapped with someone else's, are becoming easier to generate, with increasingly realistic results. In response to the threat such manipulations pose to our trust in video evidence, several large datasets of deepfake videos and many methods to detect them were proposed recently. However, it is still unclear how realistic deepfake videos are to an average person, and whether detection algorithms are significantly better than humans at spotting them. In this paper, we present a subjective study conducted in a crowdsourcing-like scenario, which systematically evaluates how hard it is for humans to tell whether a video is a deepfake or not. For the evaluation, we used 120 different videos (60 deepfakes and 60 originals) manually pre-selected from the Facebook deepfake database provided in Kaggle's Deepfake Detection Challenge 2020. For each video, a simple question, "Is the face of the person in the video real or fake?", was answered on average by 19 naïve subjects. The results of the subjective evaluation were compared with the performance of two different state-of-the-art deepfake detection methods, based on the Xception and EfficientNet (B4 variant) neural networks, which were pre-trained on two other large public databases: the Google subset of FaceForensics++ and the recent Celeb-DF dataset. The evaluation demonstrates that while human perception is very different from machine perception, both are successfully, though in different ways, fooled by deepfakes. Specifically, algorithms struggle to detect precisely those deepfake videos that human subjects found very easy to spot.
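
    One way to make such a human-versus-machine comparison concrete is to score both on the same videos with ROC AUC, treating the fraction of subjects voting "fake" as the human score. The sketch below uses invented labels and scores; it does not reproduce the paper's actual protocol or results.

        # Minimal sketch: ROC AUC for human votes vs. a detector's scores
        # on the same videos. All numbers below are made up for illustration.

        def auc(labels, scores):
            """ROC AUC via the rank-sum (Mann-Whitney) formulation."""
            pos = [s for l, s in zip(labels, scores) if l == 1]
            neg = [s for l, s in zip(labels, scores) if l == 0]
            wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
            return wins / (len(pos) * len(neg))

        labels = [1, 1, 0, 0, 1, 0]                          # 1 = deepfake, 0 = original
        human_fake_votes = [0.9, 0.3, 0.1, 0.2, 0.8, 0.4]    # fraction of subjects voting "fake"
        detector_scores  = [0.7, 0.95, 0.05, 0.3, 0.6, 0.1]  # classifier probability of "fake"

        print("human AUC:   ", auc(labels, human_fake_votes))
        print("detector AUC:", auc(labels, detector_scores))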

    Analysis of Problem Tokens to Rank Factors Impacting Quality in VoIP Applications

    User-perceived quality of experience (QoE) in Internet telephony systems is commonly evaluated using subjective ratings aggregated into a Mean Opinion Score (MOS). In such systems, while user MOS can be tracked on an ongoing basis, it gives no insight into which factors of a call induced a perceived degradation in QoE -- it does not tell us what caused a user to have a sub-optimal experience. For effective planning of product improvements, we are interested in understanding the impact of each of these degrading factors, allowing the estimation of the return (i.e., the improvement in user QoE) for a given investment. To obtain such insights, we advocate the use of an end-of-call "problem token questionnaire" (PTQ), which probes the user about common call quality issues (e.g., distorted audio or frozen video) they may have experienced. In this paper, we show the efficacy of this questionnaire using data from over 700,000 end-of-call surveys gathered from Skype (a large commercial VoIP application). We present a method to rank call quality and reliability issues and address the challenge of isolating independent factors impacting QoE. Finally, we present representative examples of how these problem tokens have proven useful in practice.
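
    As a rough illustration of how problem tokens can be turned into a ranking, one can weight each token's reporting frequency by the average rating drop observed when it is present. The survey schema and numbers below are hypothetical and are not Skype's actual telemetry format or the paper's ranking method.

        # Minimal sketch: rank problem tokens by reporting frequency times
        # the mean call-rating drop when the token is present. Hypothetical data.
        surveys = [
            {"rating": 2, "tokens": {"distorted audio"}},
            {"rating": 5, "tokens": set()},
            {"rating": 1, "tokens": {"distorted audio", "frozen video"}},
            {"rating": 4, "tokens": set()},
            {"rating": 3, "tokens": {"frozen video"}},
        ]

        overall = sum(s["rating"] for s in surveys) / len(surveys)
        all_tokens = set().union(*(s["tokens"] for s in surveys))
        impact = []
        for tok in all_tokens:
            hits = [s["rating"] for s in surveys if tok in s["tokens"]]
            freq = len(hits) / len(surveys)           # how often the issue is reported
            drop = overall - sum(hits) / len(hits)    # rating drop when it is reported
            impact.append((tok, freq, drop))

        for tok, freq, drop in sorted(impact, key=lambda t: t[1] * t[2], reverse=True):
            print(f"{tok}: frequency {freq:.2f}, rating drop {drop:.2f}")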

    Objective and subjective evaluation of light field image compression algorithms

    This paper reports the results of subjective and objective quality assessments of responses to a grand challenge on light field image compression. The goal of the challenge was to collect and evaluate new compression algorithms for light field images. In total, seven proposals were received, of which five were accepted for further evaluation. Conventional metrics were used for the objective evaluations, whereas the double stimulus continuous quality scale (DSCQS) method was selected for the subjective assessments. Results show competitive performance among the submitted proposals; however, at low bit rates, one proposal outperforms the others.
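
    For context, PSNR is a typical example of the "conventional metrics" used in objective evaluations of this kind; the paper's exact metric set is not listed here. A minimal sketch, assuming 8-bit images stored as NumPy arrays:

        # Minimal PSNR sketch for 8-bit images; illustrative only.
        import numpy as np

        def psnr(reference, distorted, peak=255.0):
            """Peak signal-to-noise ratio in dB between two same-shape images."""
            diff = reference.astype(np.float64) - distorted.astype(np.float64)
            mse = np.mean(diff ** 2)
            return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

        ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)      # toy reference
        noise = np.random.randint(-5, 6, ref.shape)                    # toy distortion
        dist = np.clip(ref.astype(int) + noise, 0, 255).astype(np.uint8)
        print(f"PSNR: {psnr(ref, dist):.2f} dB")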

    Impact of interactivity on the assessment of quality of experience for light field content

    The recent advances in light field imaging are changing the way visual content is captured, processed, and consumed. Storage and delivery systems for light field images rely on efficient compression algorithms, and such algorithms must additionally take into account the feature-rich rendering of light field content. A proper evaluation of visual quality is therefore essential to design and improve coding solutions for light field content, and the design of subjective tests should likewise reflect the light field rendering process. This paper presents and compares two methodologies to assess the quality of experience in light field imaging. The first methodology uses an interactive approach, allowing subjects to engage with the light field content while assessing it. The second, on the other hand, is completely passive, ensuring that all subjects have the same experience. Advantages and drawbacks of each approach are compared through a statistical analysis of the results, and conclusions are drawn. The obtained results provide useful insights for the future design of evaluation techniques for light field content.
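
    A common way to compare scores from two such methodologies statistically is to correlate the MOS they produce on the same contents, e.g., with the Pearson linear correlation coefficient (PLCC). The sketch below uses invented scores and is not the paper's actual analysis.

        # Minimal sketch: PLCC between MOS from an interactive and a
        # passive methodology on the same contents. Scores are invented.
        import math

        def pearson(x, y):
            n = len(x)
            mx, my = sum(x) / n, sum(y) / n
            cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
            sx = math.sqrt(sum((a - mx) ** 2 for a in x))
            sy = math.sqrt(sum((b - my) ** 2 for b in y))
            return cov / (sx * sy)

        interactive_mos = [4.1, 3.2, 2.5, 1.8, 4.5]
        passive_mos     = [4.3, 3.0, 2.8, 1.6, 4.4]
        print(f"PLCC between methodologies: {pearson(interactive_mos, passive_mos):.3f}")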

    Crowdsourcing evaluation of high dynamic range compression

    Crowdsourcing is becoming a popular, cost-effective alternative to lab-based evaluations for subjective quality assessment. However, crowd-based evaluations are constrained by the limited display devices available to typical online workers, which makes the evaluation of high dynamic range (HDR) content a challenging task. In this paper, we investigate the feasibility of using low dynamic range versions of original HDR content, obtained with tone mapping operators (TMOs), in crowdsourcing evaluations. We conducted two crowdsourcing experiments employing workers from the Microworkers platform. In the first experiment, we evaluate five HDR images encoded at different bit rates with the upcoming JPEG XT coding standard. To find the most suitable TMO, we create eleven tone-mapped versions of these five HDR images using eleven different TMOs. The crowdsourcing results are compared to a reference ground truth obtained via a subjective assessment of the same HDR images on a Dolby 'Pulsar' HDR monitor in a laboratory environment. The second crowdsourcing evaluation uses semantic differentiators to better understand the characteristics of the eleven TMOs. The crowdsourcing evaluations show that some TMOs are more suitable than others for the evaluation of HDR image compression.
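
    To illustrate what a TMO does, the sketch below implements Reinhard's well-known global operator, which compresses luminance with an L / (1 + L) curve; the eleven TMOs actually compared in the paper are not reproduced here.

        # Minimal sketch of a global tone mapping operator (Reinhard's
        # L / (1 + L) curve); illustrative, not one of the paper's TMOs.
        import numpy as np

        def reinhard_tmo(hdr, key=0.18):
            """Map linear HDR luminance to [0, 1] low dynamic range."""
            eps = 1e-6
            log_avg = np.exp(np.mean(np.log(hdr + eps)))  # geometric mean luminance
            scaled = key * hdr / log_avg                  # scale to the scene "key"
            return scaled / (1.0 + scaled)                # compressive curve

        hdr = np.random.uniform(0.01, 1000.0, (4, 4))     # toy HDR luminance map
        ldr = reinhard_tmo(hdr)
        print(ldr.round(3))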

    Quality Evaluation of HEVC and VP9 Video Compression in Real-Time Applications

    Video consumption over the Internet has increased significantly in recent years and occupies the majority of overall data traffic. To decrease the load on the Internet infrastructure and reduce the bandwidth consumed by video, higher-efficiency video codecs, such as H.265/HEVC and VP9, have been developed. The availability of these two new competing video coding formats raises the question of which is more efficient in terms of rate-distortion performance, and by how much they outperform the current state-of-the-art coding standard, H.264/AVC. This paper provides an answer to this difficult question for low-delay video applications, e.g., real-time video streaming/conferencing or video surveillance. The benchmarking of HEVC and VP9 video compression was conducted by means of subjective evaluations, assuming web browser playback, an uncontrolled environment, and HD video content. Considering a wide range of bit rates, from very low to high, corresponding to low quality up to transparent quality (when compared to the original video), the results show a clear advantage for HEVC, with average bit rate savings of 59.5% compared to AVC and 42.4% compared to VP9.
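
    Bit rate savings of this kind are usually derived by comparing rate-quality curves at equal quality. The sketch below estimates savings by interpolating rate-MOS curves in the log-rate domain; the numbers are invented and the paper's exact averaging method is not specified here.

        # Minimal sketch: bit rate savings at equal MOS by interpolating
        # each codec's rate-MOS curve in log-rate. All data are invented.
        import numpy as np

        def log_rate_at_mos(bitrates, mos, target):
            """Interpolate the log bit rate needed to reach a target MOS."""
            return float(np.interp(target, mos, np.log(bitrates)))

        hevc_rates, hevc_mos = [500, 1000, 2000, 4000], [2.1, 3.0, 3.9, 4.5]
        avc_rates,  avc_mos  = [500, 1000, 2000, 4000], [1.5, 2.3, 3.1, 4.0]

        savings = []
        for target in np.linspace(2.3, 4.0, 10):   # MOS range covered by both codecs
            r_hevc = log_rate_at_mos(hevc_rates, hevc_mos, target)
            r_avc = log_rate_at_mos(avc_rates, avc_mos, target)
            savings.append(1.0 - np.exp(r_hevc - r_avc))   # 1 - rate ratio
        print(f"average bit rate saving of HEVC over AVC: {100 * np.mean(savings):.1f}%")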

    Survey of Web-based Crowdsourcing Frameworks for Subjective Quality Assessment

    The popularity of crowdsourcing for performing various tasks online has increased significantly in the past few years. The low cost and flexibility of crowdsourcing have, in particular, attracted researchers in the field of subjective multimedia evaluation and Quality of Experience (QoE). Since online assessment of multimedia content is challenging, several dedicated frameworks were created to aid in the design of such tests, including support for testing methodologies like ACR, DCR, and PC, setting up the tasks, training sessions, screening of the subjects, and storage of the resulting data. In this paper, we focus on web-based frameworks for multimedia quality assessment that support commonly used crowdsourcing platforms such as Amazon Mechanical Turk and Microworkers. We provide a detailed overview of these crowdsourcing frameworks and evaluate them, to aid researchers in the field of QoE assessment in selecting frameworks and crowdsourcing platforms adequate for their experiments.