An Open source Implementation of ITU-T Recommendation P.808 with Validation
The ITU-T Recommendation P.808 provides a crowdsourcing approach for
conducting a subjective assessment of speech quality using the Absolute
Category Rating (ACR) method. We provide an open-source implementation of the
ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extended
our implementation to include Degradation Category Ratings (DCR) and Comparison
Category Ratings (CCR) test methods. We also significantly speed up the test
process, compared to a two-stage qualification-and-rating solution, by
integrating the participant qualification step into the main rating task. We
provide scripts for creating and executing the subjective test, and for
cleansing the data and analyzing the answers to avoid operational errors. To validate
the implementation, we compare the Mean Opinion Scores (MOS) collected through
our implementation with MOS values from a standard laboratory experiment
conducted based on the ITU-T Rec. P.800. We also evaluate the reproducibility
of the results of subjective speech quality assessment through crowdsourcing
using our implementation. Finally, we quantify the impact of the parts of the
system designed to improve reliability: environmental tests, gold and
trapping questions, rating patterns, and a headset usage test.
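The answer-cleansing and scoring pipeline described above can be sketched as follows. This is a minimal illustration, not the released implementation: the record fields, the 80% gold-accuracy threshold, and the per-clip MOS aggregation are all assumptions.

```python
# Hypothetical sketch of gold/trapping-question screening followed by MOS
# computation; field names and the accuracy threshold are assumptions.
from collections import defaultdict

def clean_and_score(answers, min_gold_accuracy=0.8):
    """answers: dicts with 'worker', 'clip', 'rating' (1-5 ACR scale),
    plus 'is_gold' and 'gold_correct' flags for gold/trapping items."""
    # 1) Screen out workers who fail too many gold/trapping questions.
    gold_hits, gold_total = defaultdict(int), defaultdict(int)
    for a in answers:
        if a.get("is_gold"):
            gold_total[a["worker"]] += 1
            gold_hits[a["worker"]] += int(a["gold_correct"])
    rejected = {w for w in gold_total
                if gold_hits[w] / gold_total[w] < min_gold_accuracy}

    # 2) Compute the Mean Opinion Score per clip from the remaining ratings.
    ratings = defaultdict(list)
    for a in answers:
        if not a.get("is_gold") and a["worker"] not in rejected:
            ratings[a["clip"]].append(a["rating"])
    return {clip: sum(r) / len(r) for clip, r in ratings.items()}
```

In this sketch, a worker who misses more than 20% of the gold questions has all of their ratings discarded before the MOS is averaged per clip.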
Deepfake detection: humans vs. machines
Deepfake videos, where a person's face is automatically swapped with a face
of someone else, are becoming easier to generate with more realistic results.
In response to the threat such manipulations can pose to our trust in video
evidence, several large datasets of deepfake videos and many methods to detect
them were proposed recently. However, it is still unclear how realistic
deepfake videos are for an average person and whether the algorithms are
significantly better than humans at detecting them. In this paper, we present a
subjective study conducted in a crowdsourcing-like scenario, which
systematically evaluates how hard it is for humans to tell whether a video is a
deepfake or not. For the evaluation, we used 120 different videos (60 deepfakes
and 60 originals) manually pre-selected from the Facebook deepfake database,
which was provided in Kaggle's Deepfake Detection Challenge 2020. For each
video, a simple question, "Is the face of the person in the video real or
fake?", was answered on average by 19 naïve subjects. The results of the subjective
evaluation were compared with the performance of two different state-of-the-art
deepfake detection methods, based on Xception and EfficientNets (B4 variant)
neural networks, which were pre-trained on two other large public databases:
the Google subset of FaceForensics++ and the recent Celeb-DF dataset. The
evaluation demonstrates that while human perception differs greatly from
machine perception, both are successfully fooled by deepfakes, though in
different ways. Specifically, the algorithms struggle to detect precisely those
deepfake videos that human subjects found very easy to spot.
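The human-versus-machine comparison above can be sketched in a few lines. The majority-vote aggregation of subject answers and the 0.5 decision threshold on the detectors' fake-probability scores are assumptions for illustration, not the paper's exact protocol.

```python
# Illustrative comparison of human and machine deepfake decisions;
# the aggregation rule and the 0.5 threshold are assumed, not from the paper.
def majority_vote(ratings_per_video):
    """Each inner list holds per-subject True/False ('fake') answers."""
    return [sum(r) > len(r) / 2 for r in ratings_per_video]

def threshold(scores, t=0.5):
    """Turn a detector's fake-probability scores into binary decisions."""
    return [s >= t for s in scores]

def accuracy(decisions, labels):
    """decisions/labels: parallel lists of booleans (True = 'fake')."""
    return sum(d == l for d, l in zip(decisions, labels)) / len(labels)
```

Comparing `accuracy(majority_vote(...), labels)` against `accuracy(threshold(...), labels)` per video, rather than in aggregate, is what exposes the disagreement the abstract describes: the two decision sources can reach similar overall accuracy while failing on different videos.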
Analysis of Problem Tokens to Rank Factors Impacting Quality in VoIP Applications
User-perceived quality-of-experience (QoE) in internet telephony systems is
commonly evaluated using subjective ratings computed as a Mean Opinion Score
(MOS). In such systems, while user MOS can be tracked on an ongoing basis, it
does not give insight into which factors of a call induced any perceived
degradation in QoE -- it does not tell us what caused a user to have a
sub-optimal experience. For effective planning of product improvements, we are
interested in understanding the impact of each of these degrading factors,
allowing the estimation of the return (i.e., the improvement in user QoE) for a
given investment. To obtain such insights, we advocate the use of an
end-of-call "problem token questionnaire" (PTQ) which probes the user about
common call quality issues (e.g., distorted audio or frozen video) which they
may have experienced. In this paper, we show the efficacy of this questionnaire
using data from over 700,000 end-of-call surveys collected from Skype
(a large commercial VoIP application). We present a method to rank call quality
and reliability issues and address the challenge of isolating independent
factors impacting the QoE. Finally, we present representative examples of how
these problem tokens have proven useful in practice.
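A ranking of the kind described above can be sketched as follows. The statistic used here (how often a token is reported, weighted by the MOS degradation of the calls that report it) is an assumed simplification of the paper's method, and the token names are illustrative.

```python
# Hypothetical sketch of ranking call-quality issues from problem tokens;
# the impact statistic is an assumed simplification, not the paper's method.
from collections import defaultdict

def rank_tokens(surveys):
    """surveys: list of (mos_rating, set_of_reported_tokens) per call."""
    overall = sum(m for m, _ in surveys) / len(surveys)
    sums, counts = defaultdict(float), defaultdict(int)
    for mos, tokens in surveys:
        for t in tokens:
            sums[t] += mos
            counts[t] += 1
    # Impact = how often the token is reported, times the average MOS drop
    # (overall mean minus the mean MOS of calls reporting that token).
    impact = {t: counts[t] * (overall - sums[t] / counts[t]) for t in counts}
    return sorted(impact, key=impact.get, reverse=True)
```

Ranking by frequency alone would overweight common but mild annoyances; weighting by the associated MOS drop is one simple way to approximate the expected QoE return of fixing each factor.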
Objective and subjective evaluation of light field image compression algorithms
This paper reports the results of subjective and objective quality assessments of responses to a grand challenge on light field image compression. The goal of the challenge was to collect and evaluate new compression algorithms for light field images. In total, seven proposals were received, of which five were accepted for further evaluation. For the objective evaluations, conventional metrics were used, whereas the double-stimulus continuous quality scale method was selected for the subjective assessments. Results show competitive performance among the submitted proposals; however, at low bit rates, one proposal outperforms the others.
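The double-stimulus continuous quality scale scoring used above reduces to a simple difference computation: each subject rates both the reference and the processed image on a continuous 0-100 scale, and the per-subject differences are averaged. A minimal sketch, with invented sample ratings:

```python
# Sketch of DSCQS difference scoring: average the per-subject gap between
# reference and processed ratings. The ratings in the test are invented.
def dscqs_dmos(pairs):
    """pairs: list of (reference_score, processed_score), each on 0-100."""
    diffs = [ref - proc for ref, proc in pairs]
    return sum(diffs) / len(diffs)
```

A larger difference score indicates a larger perceived quality loss relative to the reference, which is what lets compressed proposals be compared against one another at each bit rate.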
Impact of interactivity on the assessment of quality of experience for light field content
The recent advances in light field imaging are changing the way in which visual content is captured, processed, and consumed. Storage and delivery systems for light field images rely on efficient compression algorithms. Such algorithms must additionally take into account the feature-rich rendering of light field content. Therefore, a proper evaluation of visual quality is essential to design and improve coding solutions for light field content. Consequently, the design of subjective tests should also reflect the light field rendering process. This paper presents and compares two methodologies for assessing the quality of experience in light field imaging. The first methodology uses an interactive approach, allowing subjects to engage with the light field content while assessing it. The second, on the other hand, is completely passive, ensuring that all subjects have the same experience. Advantages and drawbacks of each approach are compared through statistical analysis of the results, and conclusions are drawn. The obtained results provide useful insights for the future design of evaluation techniques for light field content.
Crowdsourcing evaluation of high dynamic range compression
Crowdsourcing is becoming a popular, cost-effective alternative to lab-based evaluations for subjective quality assessment. However, crowd-based evaluations are constrained by the limited capabilities of the display devices used by typical online workers, which makes the evaluation of high dynamic range (HDR) content a challenging task. In this paper, we investigate the feasibility of using low dynamic range versions of original HDR content, obtained with tone mapping operators (TMOs), in crowdsourcing evaluations. We conducted two crowdsourcing experiments employing workers from the Microworkers platform. In the first experiment, we evaluate five HDR images encoded at different bit rates with the upcoming JPEG XT coding standard. To find the most suitable TMO, we create eleven tone-mapped versions of these five HDR images using eleven different TMOs. The crowdsourcing results are compared to a reference ground truth obtained via a subjective assessment of the same HDR images on a Dolby `Pulsar' HDR monitor in a laboratory environment. The second crowdsourcing evaluation uses semantic differentiators to better understand the characteristics of the eleven TMOs. The crowdsourcing evaluations show that some TMOs are more suitable than others for the evaluation of HDR image compression.
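Validations like the one above typically quantify how well crowdsourced scores track the laboratory ground truth with a correlation coefficient. A self-contained sketch of the Pearson correlation often reported in such comparisons; the score vectors in the test are invented, not the paper's data:

```python
# Pearson correlation between two score vectors, e.g. crowdsourced MOS vs.
# laboratory MOS for the same conditions. Sample values are illustrative.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A correlation near 1.0 across conditions would indicate that a tone-mapped, crowd-rated proxy preserves the quality ordering established on the reference HDR monitor.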
Quality Evaluation of HEVC and VP9 Video Compression in Real-Time Applications
Video consumption over the Internet has increased significantly in recent years and now accounts for the majority of overall data traffic. To decrease the load on the Internet infrastructure and reduce the bandwidth consumed by video, higher-efficiency video codecs, such as H.265/HEVC and VP9, have been developed. The availability of these two new competing video coding formats raises the question of which is more efficient in terms of rate-distortion, and by how much they outperform the current state-of-the-art coding standard, H.264/AVC. This paper provides an answer to this difficult question for low-delay video applications, e.g., real-time video streaming/conferencing or video surveillance. The benchmarking of HEVC and VP9 video compression was conducted by means of subjective evaluations, assuming web-browser playback, an uncontrolled environment, and HD video content. Considering a wide range of bit rates, from very low to high, corresponding to low quality up to transparent quality (when compared to the original video), the results show a clear advantage of HEVC, with average bit rate savings of 59.5% compared to AVC and 42.4% compared to VP9.
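The bit-rate-savings figure reported above is usually derived with Bjøntegaard-delta (BD-rate) metrics over fitted rate-quality curves; a much-simplified version of the idea, using linear interpolation and invented sample curves, can be sketched as follows:

```python
# Simplified equal-quality bit-rate comparison between two codecs.
# Real benchmarks use BD-rate over fitted curves; this linear-interpolation
# version and the sample rate-MOS points are illustrative only.
def rate_at_quality(points, target_mos):
    """points: (bitrate_kbps, mos) pairs sorted by bitrate; linear interp."""
    for (r0, q0), (r1, q1) in zip(points, points[1:]):
        if q0 <= target_mos <= q1:
            frac = (target_mos - q0) / (q1 - q0)
            return r0 + frac * (r1 - r0)
    raise ValueError("target quality outside the measured range")

def savings_percent(ref_points, test_points, target_mos):
    """Percentage of bit rate the test codec saves at equal quality."""
    r_ref = rate_at_quality(ref_points, target_mos)
    r_test = rate_at_quality(test_points, target_mos)
    return 100.0 * (r_ref - r_test) / r_ref
```

Averaging this saving over several quality levels approximates the single percentage figures quoted for HEVC versus AVC and VP9.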
Survey of Web-based Crowdsourcing Frameworks for Subjective Quality Assessment
The popularity of crowdsourcing for performing various tasks online has increased significantly in the past few years. The low cost and flexibility of crowdsourcing have, in particular, attracted researchers in the field of subjective multimedia evaluation and Quality of Experience (QoE). Since online assessment of multimedia content is challenging, several dedicated frameworks have been created to aid in the design of the tests, including support for testing methodologies such as ACR, DCR, and PC, setting up the tasks, training sessions, screening of the subjects, and storage of the resulting data. In this paper, we focus on web-based frameworks for multimedia quality assessment that support commonly used crowdsourcing platforms such as Amazon Mechanical Turk and Microworkers. We provide a detailed overview of these crowdsourcing frameworks and evaluate them to aid researchers in the field of QoE assessment in selecting frameworks and crowdsourcing platforms adequate for their experiments.