Subjective Annotation for a Frame Interpolation Benchmark using Artefact Amplification
Current benchmarks for optical flow algorithms evaluate the estimation either
directly by comparing the predicted flow fields with the ground truth or
indirectly by using the predicted flow fields for frame interpolation and then
comparing the interpolated frames with the actual frames. In the latter case,
objective quality measures such as the mean squared error are typically
employed. However, it is well known that for image quality assessment, the
actual quality experienced by the user cannot be fully deduced from such simple
measures. Hence, we conducted a subjective quality assessment crowdsourcing
study for the interpolated frames provided by one of the optical flow
benchmarks, the Middlebury benchmark. We collected forced-choice paired
comparisons between interpolated images and corresponding ground truth. To
increase the sensitivity of observers when judging minute differences in paired
comparisons, we introduced a new method to the field of full-reference quality
assessment, called artefact amplification. From the crowdsourcing data, we
reconstructed absolute quality scale values according to Thurstone's model. As
a result, we obtained a re-ranking of the 155 participating algorithms w.r.t.
the visual quality of the interpolated frames. This re-ranking not only shows
the necessity of visual quality assessment as another evaluation metric for
optical flow and frame interpolation benchmarks, the results also provide the
ground truth for designing novel image quality assessment (IQA) methods
dedicated to perceptual quality of interpolated images. As a first step, we
proposed such a new full-reference method, called WAE-IQA. By weighting the
local differences between an interpolated image and its ground truth, WAE-IQA
performed slightly better than the currently best FR-IQA approach from the
literature.

Comment: arXiv admin note: text overlap with arXiv:1901.0536
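Reconstructing absolute scale values from forced-choice paired comparisons, as described above, is classically done with Thurstone's model. The following is a minimal sketch of Thurstone Case V scaling on synthetic win counts; it is an illustration of the technique, not the study's actual analysis code:

```python
from statistics import NormalDist

import numpy as np

def thurstone_case_v(wins):
    """Reconstruct absolute quality scale values from forced-choice
    paired-comparison counts using Thurstone's Case V model.

    wins[i][j] = number of times item i was preferred over item j.
    Returns one zero-mean scale value per item."""
    wins = np.asarray(wins, dtype=float)
    totals = wins + wins.T
    # Proportion of times i beat j; undecided pairs default to 0.5,
    # and clipping avoids infinite z-scores for unanimous pairs.
    with np.errstate(divide="ignore", invalid="ignore"):
        p = np.where(totals > 0, wins / totals, 0.5)
    p = np.clip(p, 0.01, 0.99)
    # Case V: the scale difference s_i - s_j is the normal quantile of p_ij.
    z = np.vectorize(NormalDist().inv_cdf)(p)
    np.fill_diagonal(z, 0.0)
    # Least-squares solution: average each row of the z-matrix.
    return z.mean(axis=1)

# Synthetic example: three items, item 2 clearly preferred overall.
wins = [[0, 10, 2],
        [20, 0, 5],
        [28, 25, 0]]
scales = thurstone_case_v(wins)
print(scales)  # scale values increase from item 0 to item 2
```

The row-averaging step is the standard least-squares solution under Case V (equal, uncorrelated discriminal dispersions); crowdsourcing studies often add outlier screening before this step.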
DeepFL-IQA: Weak Supervision for Deep IQA Feature Learning
Multi-level deep-features have been driving state-of-the-art methods for
aesthetics and image quality assessment (IQA). However, most IQA benchmarks are
comprised of artificially distorted images, for which features derived from
ImageNet under-perform. We propose a new IQA dataset and a weakly supervised
feature learning approach to train features more suitable for IQA of
artificially distorted images. The dataset, KADIS-700k, is far more extensive
than similar works, consisting of 140,000 pristine images and 25 distortion
types, totaling 700k distorted versions. Our weakly supervised feature learning
is designed as multi-task learning, using eleven existing
full-reference IQA metrics as proxies for differential mean opinion scores. We
also introduce a benchmark database, KADID-10k, of artificially degraded
images, each subjectively annotated by 30 crowd workers. We make use of our
derived image feature vectors for (no-reference) image quality assessment by
training and testing a shallow regression network on this database and five
other benchmark IQA databases. Our method, termed DeepFL-IQA, performs better
than other feature-based no-reference IQA methods and also better than all
tested full-reference IQA methods on KADID-10k. For the other five benchmark
IQA databases, DeepFL-IQA matches the performance of the best existing
end-to-end deep learning-based methods on average.

Comment: dataset url: http://database.mmsp-kn.d
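The weak-supervision idea above, using several full-reference metrics as proxy labels in place of scarce subjective scores, can be sketched with a toy multi-output regressor. Everything here is synthetic (random features, a linear model standing in for the deep network); it only illustrates fitting one shared model to eleven proxy targets jointly:

```python
import numpy as np

# Toy proxy-label setup: each of n distorted images has a d-dimensional
# feature vector and eleven "weak" labels, standing in for the scores of
# eleven full-reference IQA metrics.
rng = np.random.default_rng(0)
n, d, n_metrics = 500, 32, 11

W_true = rng.normal(size=(d, n_metrics))            # hidden feature-to-metric map
X = rng.normal(size=(n, d))                         # image feature vectors
Y = X @ W_true + rng.normal(scale=0.05, size=(n, n_metrics))  # noisy proxy labels

# One shared model predicts all eleven targets: closed-form multi-output
# ridge regression, W = (X^T X + lam I)^{-1} X^T Y.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

Y_pred = X @ W
mse = float(np.mean((Y_pred - Y) ** 2))
print(round(mse, 4))  # small residual: shared features fit all 11 proxies
```

In the paper's setting the shared part is a deep feature extractor rather than a linear map, but the multi-task structure, one set of features serving all proxy metrics, is the same.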
Critical analysis on the reproducibility of visual quality assessment using deep features
Data used to train supervised machine learning models are commonly split into
independent training, validation, and test sets. In this paper we illustrate
that intricate cases of data leakage have occurred in the no-reference video
and image quality assessment literature. We show that the performance results
of several recently published journal papers that are well above the best
performances in related works, cannot be reached. Our analysis shows that
information from the test set was inappropriately used in the training process
in different ways. When correcting for the data leakage, the performances of
the approaches drop below the state-of-the-art by a large margin. Additionally,
we investigate end-to-end variations to the discussed approaches, which do not
improve upon the original.

Comment: 20 pages, 7 figures, PLOS ONE journal. arXiv admin note: substantial
text overlap with arXiv:2005.0440
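One common form of the data leakage discussed above is preprocessing fitted on the full dataset before splitting, so that test-set statistics influence training. This minimal sketch (synthetic data, not any specific paper's pipeline) contrasts the leaky and the correct normalization protocol:

```python
import numpy as np

# Synthetic dataset of 1000 samples with 8 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))
idx = rng.permutation(len(X))
train, test = idx[:800], idx[800:]

# Leaky protocol: normalization statistics computed over ALL samples,
# including the test split -- test information reaches the training data.
mu_leaky, sd_leaky = X.mean(axis=0), X.std(axis=0)
X_test_leaky = (X[test] - mu_leaky) / sd_leaky

# Correct protocol: statistics from the training split only, then
# applied unchanged to the held-out test split.
mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
X_test_ok = (X[test] - mu) / sd

print(np.allclose(X_test_ok, X_test_leaky))  # False: the protocols differ
```

The same principle applies to any fitted step (feature selection, PCA, early stopping on the test set): everything learned from data must be learned from the training split alone.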
SUR-Net: Predicting the Satisfied User Ratio Curve for Image Compression with Deep Learning
The Satisfied User Ratio (SUR) curve for a lossy image compression scheme, e.g., JPEG, characterizes the probability distribution of the Just Noticeable Difference (JND) level, the smallest distortion level that can be perceived by a subject. We propose the first deep learning approach to predict such SUR curves. Instead of the direct approach of regressing the SUR
curve itself for a given reference image, our model is trained on pairs of images, original and compressed. Relying on a Siamese
Convolutional Neural Network (CNN), feature pooling, a fully connected regression-head, and transfer learning, we achieved
a good prediction performance. Experiments on the MCL-JCI dataset showed a mean Bhattacharyya distance between the
predicted and the original JND distributions of only 0.072.
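The Bhattacharyya distance used as the evaluation measure above compares two probability distributions over the same support. A minimal implementation for discrete distributions, with hypothetical JND-level histograms for illustration:

```python
import math

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability
    distributions p and q over the same support:
    D_B = -ln( sum_i sqrt(p_i * q_i) )."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return -math.log(bc)

# Hypothetical JND-level histograms over five distortion levels
# (these numbers are illustrative, not from the MCL-JCI dataset).
p = [0.10, 0.25, 0.40, 0.20, 0.05]
q = [0.05, 0.20, 0.45, 0.25, 0.05]
print(round(bhattacharyya_distance(p, q), 4))
```

The distance is 0 exactly when the two distributions coincide and grows as their overlap shrinks, which makes it a natural fit for comparing predicted and measured JND distributions.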
KonVid-150k: a dataset for no-reference video quality assessment of videos in-the-wild
Video quality assessment (VQA) methods focus on particular degradation types, usually artificially induced on a small set of reference videos. Hence, most traditional VQA methods under-perform in-the-wild. Deep learning approaches have had limited success due to the small size and diversity of existing VQA datasets, whether artificially or authentically distorted. We introduce a new in-the-wild VQA dataset that is substantially larger and more diverse: KonVid-150k. It consists of a coarsely annotated set of 153,841 videos with five quality ratings each, and 1,596 videos with a minimum of 89 ratings each. Additionally, we propose new efficient VQA approaches (MLSP-VQA) relying on multi-level spatially pooled deep features (MLSP). They are exceptionally well suited for training at scale, compared to deep transfer learning approaches. Our best method, MLSP-VQA-FF, improves the Spearman rank-order correlation coefficient (SRCC) performance metric on the commonly used KoNViD-1k in-the-wild benchmark dataset to 0.82, surpassing the best existing deep-learning model (0.80 SRCC) and hand-crafted feature-based method (0.78 SRCC). We further investigate how alternative approaches perform under different levels of label noise and dataset size, showing that MLSP-VQA-FF is the overall best method for videos in-the-wild. Finally, we show that the MLSP-VQA models trained on KonVid-150k set a new state-of-the-art for cross-test performance on KoNViD-1k and LIVE-Qualcomm, with 0.83 and 0.64 SRCC, respectively. For KoNViD-1k this inter-dataset testing outperforms intra-dataset experiments, showing excellent generalization.
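The SRCC figures quoted throughout these abstracts are the Pearson correlation computed on ranks. A small self-contained implementation (with average ranks for ties), applied to hypothetical predicted vs. subjective scores:

```python
import numpy as np

def srcc(x, y):
    """Spearman rank-order correlation coefficient: the Pearson
    correlation of the two samples' ranks (ties get their average rank)."""
    def ranks(a):
        a = np.asarray(a, dtype=float)
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(1, len(a) + 1)
        # Average the ranks within each group of tied values.
        for v in np.unique(a):
            mask = a == v
            r[mask] = r[mask].mean()
        return r
    rx, ry = ranks(x), ranks(y)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical predicted vs. subjective quality scores for six videos.
pred = [2.1, 3.4, 1.0, 4.8, 4.0, 2.9]
mos  = [2.0, 3.0, 1.5, 5.0, 4.2, 2.8]
print(round(srcc(pred, mos), 3))  # 1.0: the two orderings agree exactly
```

Because SRCC depends only on the ordering, a model can score well on it even when its raw predictions are biased or nonlinearly scaled, which is why it is the standard cross-dataset metric in VQA and IQA.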