9,314 research outputs found

    Recover Subjective Quality Scores from Noisy Measurements

    Full text link
    Simple quality metrics such as PSNR are known to not correlate well with subjective quality when tested across a wide spectrum of video content or quality regime. Recently, efforts have been made in designing objective quality metrics trained on subjective data (e.g. VMAF), demonstrating better correlation with video quality perceived by human. Clearly, the accuracy of such a metric heavily depends on the quality of the subjective data that it is trained on. In this paper, we propose a new approach to recover subjective quality scores from noisy raw measurements, using maximum likelihood estimation, by jointly estimating the subjective quality of impaired videos, the bias and consistency of test subjects, and the ambiguity of video contents all together. We also derive closed-from expression for the confidence interval of each estimate. Compared to previous methods which partially exploit the subjective information, our approach is able to exploit the information in full, yielding tighter confidence interval and better handling of outliers without the need for z-scoring or subject rejection. It also handles missing data more gracefully. Finally, as side information, it provides interesting insights on the test subjects and video contents.Comment: 16 pages; abridged version appeared in Data Compression Conference (DCC) 201

    Kuvanlaatukokemuksen arvionnin instrumentit

    Get PDF
    This dissertation describes the instruments available for image quality evaluation, develops new methods for subjective image quality evaluation and provides image and video databases for the assessment and development of image quality assessment (IQA) algorithms. The contributions of the thesis are based on six original publications. The first publication introduced the VQone toolbox for subjective image quality evaluation. It created a platform for free-form experimentation with standardized image quality methods and was the foundation for later studies. The second publication focused on the dilemma of reference in subjective experiments by proposing a new method for image quality evaluation: the absolute category rating with dynamic reference (ACR-DR). The third publication presented a database (CID2013) in which 480 images were evaluated by 188 observers using the ACR-DR method proposed in the prior publication. Providing databases of image files along with their quality ratings is essential in the field of IQA algorithm development. The fourth publication introduced a video database (CVD2014) based on having 210 observers rate 234 video clips. The temporal aspect of the stimuli creates peculiar artifacts and degradations, as well as challenges to experimental design and video quality assessment (VQA) algorithms. When the CID2013 and CVD2014 databases were published, most state-of-the-art I/VQAs had been trained on and tested against databases created by degrading an original image or video with a single distortion at a time. The novel aspect of CID2013 and CVD2014 was that they consisted of multiple concurrent distortions. To facilitate communication and understanding among professionals in various fields of image quality as well as among non-professionals, an attribute lexicon of image quality, the image quality wheel, was presented in the fifth publication of this thesis. Reference wheels and terminology lexicons have a long tradition in sensory evaluation contexts, such as taste experience studies, where they are used to facilitate communication among interested stakeholders; however, such an approach has not been common in visual experience domains, especially in studies on image quality. The sixth publication examined how the free descriptions given by the observers influenced the ratings of the images. Understanding how various elements, such as perceived sharpness and naturalness, affect subjective image quality can help to understand the decision-making processes behind image quality evaluation. Knowing the impact of each preferential attribute can then be used for I/VQA algorithm development; certain I/VQA algorithms already incorporate low-level human visual system (HVS) models in their algorithms.Väitöskirja tarkastelee ja kehittää uusia kuvanlaadun arvioinnin menetelmiä, sekä tarjoaa kuva- ja videotietokantoja kuvanlaadun arviointialgoritmien (IQA) testaamiseen ja kehittämiseen. Se, mikä koetaan kauniina ja miellyttävänä, on psykologisesti kiinnostava kysymys. Työllä on myös merkitystä teollisuuteen kameroiden kuvanlaadun kehittämisessä. Väitöskirja sisältää kuusi julkaisua, joissa tarkastellaan aihetta eri näkökulmista. I. julkaisussa kehitettiin sovellus keräämään ihmisten antamia arvioita esitetyistä kuvista tutkijoiden vapaaseen käyttöön. Se antoi mahdollisuuden testata standardoituja kuvanlaadun arviointiin kehitettyjä menetelmiä ja kehittää niiden pohjalta myös uusia menetelmiä luoden perustan myöhemmille tutkimuksille. II. julkaisussa kehitettiin uusi kuvanlaadun arviointimenetelmä. Menetelmä hyödyntää sarjallista kuvien esitystapaa, jolla muodostettiin henkilöille mielikuva kuvien laatuvaihtelusta ennen varsinaista arviointia. Tämän todettiin vähentävän tulosten hajontaa ja erottelevan pienempiä kuvanlaatueroja. III. julkaisussa kuvaillaan tietokanta, jossa on 188 henkilön 480 kuvasta antamat laatuarviot ja niihin liittyvät kuvatiedostot. Tietokannat ovat arvokas työkalu pyrittäessä kehittämään algoritmeja kuvanlaadun automaattiseen arvosteluun. Niitä tarvitaan mm. opetusmateriaalina tekoälyyn pohjautuvien algoritmien kehityksessä sekä vertailtaessa eri algoritmien suorituskykyä toisiinsa. Mitä paremmin algoritmin tuottama ennuste korreloi ihmisten antamiin laatuarvioihin, sen parempi suorituskyky sillä voidaan sanoa olevan. IV. julkaisussa esitellään tietokanta, jossa on 210 henkilön 234 videoleikkeestä tekemät laatuarviot ja niihin liittyvät videotiedostot. Ajallisen ulottuvuuden vuoksi videoärsykkeiden virheet ovat erilaisia kuin kuvissa, mikä tuo omat haasteensa videoiden laatua arvioiville algoritmeille (VQA). Aikaisempien tietokantojen ärsykkeet on muodostettu esimerkiksi sumentamalla yksittäistä kuvaa asteittain, jolloin ne sisältävät vain yksiulotteisia vääristymiä. Nyt esitetyt tietokannat poikkeavat aikaisemmista ja sisältävät useita samanaikaisia vääristymistä, joiden interaktio kuvanlaadulle voi olla merkittävää. V. julkaisussa esitellään kuvanlaatuympyrä (image quality wheel). Se on kuvanlaadun käsitteiden sanasto, joka on kerätty analysoimalla 146 henkilön tuottamat 39 415 kuvanlaadun sanallista kuvausta. Sanastoilla on pitkät perinteet aistinvaraisen arvioinnin tutkimusperinteessä, mutta niitä ei ole aikaisemmin kehitetty kuvanlaadulle. VI. tutkimuksessa tutkittiin, kuinka arvioitsijoiden antamat käsitteet vaikuttavat kuvien laadun arviointiin. Esimerkiksi kuvien arvioitu terävyys tai luonnollisuus auttaa ymmärtämään laadunarvioinnin taustalla olevia päätöksentekoprosesseja. Tietoa voidaan käyttää esimerkiksi kuvan- ja videonlaadun arviointialgoritmien (I/VQA) kehitystyössä

    Scene-Dependency of Spatial Image Quality Metrics

    Get PDF
    This thesis is concerned with the measurement of spatial imaging performance and the modelling of spatial image quality in digital capturing systems. Spatial imaging performance and image quality relate to the objective and subjective reproduction of luminance contrast signals by the system, respectively; they are critical to overall perceived image quality. The Modulation Transfer Function (MTF) and Noise Power Spectrum (NPS) describe the signal (contrast) transfer and noise characteristics of a system, respectively, with respect to spatial frequency. They are both, strictly speaking, only applicable to linear systems since they are founded upon linear system theory. Many contemporary capture systems use adaptive image signal processing, such as denoising and sharpening, to optimise output image quality. These non-linear processes change their behaviour according to characteristics of the input signal (i.e. the scene being captured). This behaviour renders system performance “scene-dependent” and difficult to measure accurately. The MTF and NPS are traditionally measured from test charts containing suitable predefined signals (e.g. edges, sinusoidal exposures, noise or uniform luminance patches). These signals trigger adaptive processes at uncharacteristic levels since they are unrepresentative of natural scene content. Thus, for systems using adaptive processes, the resultant MTFs and NPSs are not representative of performance “in the field” (i.e. capturing real scenes). Spatial image quality metrics for capturing systems aim to predict the relationship between MTF and NPS measurements and subjective ratings of image quality. They cascade both measures with contrast sensitivity functions that describe human visual sensitivity with respect to spatial frequency. The most recent metrics designed for adaptive systems use MTFs measured using the dead leaves test chart that is more representative of natural scene content than the abovementioned test charts. This marks a step toward modelling image quality with respect to real scene signals. This thesis presents novel scene-and-process-dependent MTFs (SPD-MTF) and NPSs (SPDNPS). They are measured from imaged pictorial scene (or dead leaves target) signals to account for system scene-dependency. Further, a number of spatial image quality metrics are revised to account for capture system and visual scene-dependency. Their MTF and NPS parameters were substituted for SPD-MTFs and SPD-NPSs. Likewise, their standard visual functions were substituted for contextual detection (cCSF) or discrimination (cVPF) functions. In addition, two novel spatial image quality metrics are presented (the log Noise Equivalent Quanta (NEQ) and Visual log NEQ) that implement SPD-MTFs and SPD-NPSs. The metrics, SPD-MTFs and SPD-NPSs were validated by analysing measurements from simulated image capture pipelines that applied either linear or adaptive image signal processing. The SPD-NPS measures displayed little evidence of measurement error, and the metrics performed most accurately when they used SPD-NPSs measured from images of scenes. The benefit of deriving SPD-MTFs from images of scenes was traded-off, however, against measurement bias. Most metrics performed most accurately with SPD-MTFs derived from dead leaves signals. Implementing the cCSF or cVPF did not increase metric accuracy. The log NEQ and Visual log NEQ metrics proposed in this thesis were highly competitive, outperforming metrics of the same genre. They were also more consistent than the IEEE P1858 Camera Phone Image Quality (CPIQ) metric when their input parameters were modified. The advantages and limitations of all performance measures and metrics were discussed, as well as their practical implementation and relevant applications

    Subjective Image Quality Evaluation Using the Softcopy Quality Ruler Method

    Get PDF
    Image quality is in essence based on subjective experience, and the involvement of human subjects is therefore necessary in one way or the other in order to make reliable assessments of it. This topic is of interest for Axis Communications AB -- the market leader in surveillance cameras. In this thesis, the softcopy quality ruler method described in ISO 20462 is evaluated through a study of one specific image quality problem; human preference in the noise reduction -- texture loss space. The method involves quality or attribute judgment by comparison of test images to a series of ordered, univariate (variation in one attribute only) reference images, called ''ruler images''. Sharpness is used as reference attribute for the ruler images. For the purpose of the thesis project, a working environment for performing subjective studies was set up. A total of 47 persons (observers) were invited to judge a set of test images varying in noise level and amount of noise reduction as well as scene content. From the data, confidence intervals were calculated using the non-parametric bootstrap method as well as using the normal distribution. The two methods gave very similar results, and thus it could be confirmed that the data is well described by a normal distribution. The results of the study indicate that it is possible to determine an optimum level of noise reduction for a given noise level. For increasing noise levels, the variances of the observer judgments increase, while they decrease for increasing noise reduction levels. This difference may be explained by making the observation that increasing noise reduction levels result in increasing texture blur, which appears more similar to sharpness loss in comparison with noise. It is reasonable to assume that the uncertainty in observer judgments should be lower when comparing images with similar degradations. Therefore, it might be argued that the variability of the judgments depends to a large extent on the perceived similarities between the ruler and test images, in terms of the attribute(s) considered. If the ruler images appear similar to the test images, the variances of the judgments will be lower than for less similar ruler images and test images, and also more strongly dependent on the number of observers. In order to reduce the number of observers when testing some particular image quality attribute(s), care should be taken to use ruler images varying in an attribute similar in appearance to the test images. Also noticeable are discrepancies between experienced (having experience in judging or evaluating images) and unexperienced observers, as concluded by a Welch t-test. When judging image quality, experienced observers at Axis, such as imaging engineers, camera designers etc., seem more tolerant to noise than unexperienced observers

    A case study in identifying acceptable bitrates for human face recognition tasks

    Get PDF
    Face recognition from images or video footage requires a certain level of recorded image quality. This paper derives acceptable bitrates (relating to levels of compression and consequently quality) of footage with human faces, using an industry implementation of the standard H.264/MPEG-4 AVC and the Closed-Circuit Television (CCTV) recording systems on London buses. The London buses application is utilized as a case study for setting up a methodology and implementing suitable data analysis for face recognition from recorded footage, which has been degraded by compression. The majority of CCTV recorders on buses use a proprietary format based on the H.264/MPEG-4 AVC video coding standard, exploiting both spatial and temporal redundancy. Low bitrates are favored in the CCTV industry for saving storage and transmission bandwidth, but they compromise the image usefulness of the recorded imagery. In this context, usefulness is determined by the presence of enough facial information remaining in the compressed image to allow a specialist to recognize a person. The investigation includes four steps: (1) Development of a video dataset representative of typical CCTV bus scenarios. (2) Selection and grouping of video scenes based on local (facial) and global (entire scene) content properties. (3) Psychophysical investigations to identify the key scenes, which are most affected by compression, using an industry implementation of H.264/MPEG-4 AVC. (4) Testing of CCTV recording systems on buses with the key scenes and further psychophysical investigations. The results showed a dependency upon scene content properties. Very dark scenes and scenes with high levels of spatial–temporal busyness were the most challenging to compress, requiring higher bitrates to maintain useful information

    Evaluation of changes in image appearance with changes in displayed image size

    Get PDF
    This research focused on the quantification of changes in image appearance when images are displayed at different image sizes on LCD devices. The final results provided in calibrated Just Noticeable Differences (JNDs) on relevant perceptual scales, allowing the prediction of sharpness and contrast appearance with changes in the displayed image size. A series of psychophysical experiments were conducted to enable appearance predictions. Firstly, a rank order experiment was carried out to identify the image attributes that were most affected by changes in displayed image size. Two digital cameras, exhibiting very different reproduction qualities, were employed to capture the same scenes, for the investigation of the effect of the original image quality on image appearance changes. A wide range of scenes with different scene properties was used as a test-set for the investigation of image appearance changes with scene type. The outcomes indicated that sharpness and contrast were the most important attributes for the majority of scene types and original image qualities. Appearance matching experiments were further conducted to quantify changes in perceived sharpness and contrast with respect to changes in the displayed image size. For the creation of sharpness matching stimuli, a set of frequency domain filters were designed to provide equal intervals in image quality, by taking into account the system’s Spatial Frequency Response (SFR) and the observation distance. For the creation of contrast matching stimuli, a series of spatial domain S-shaped filters were designed to provide equal intervals in image contrast, by gamma adjustments. Five displayed image sizes were investigated. Observers were always asked to match the appearance of the smaller version of each stimulus to its larger reference. Lastly, rating experiments were conducted to validate the derived JNDs in perceptual quality for both sharpness and contrast stimuli. Data obtained by these experiments finally converted into JND scales for each individual image attribute. Linear functions were fitted to the final data, which allowed the prediction of image appearance of images viewed at larger sizes than these investigated in this research

    Perceptual Ruler for Quantifying Speech Intelligibility in Cocktail Party Scenarios

    Get PDF
    Systems designed to enhance intelligibility of speech in noise are difficult to evaluate quantitatively because intelligibility is subjective and often requires feedback from large populations for consistent evaluations. Attempts to quantify the evaluation have included related measures such as the Speech Intelligibility Index. These require separating speech and noise signals, which precludes its use on experimental recordings. This thesis develops a procedure using an Intelligibility Ruler (IR) for efficiently quantifying intelligibility. A calibrated Mean Opinion Score (MOS) method is also implemented in order to compare repeatability over a population of 24 subjective listeners. Results showed that subjects using the IR consistently estimated SII values of the test samples with an average standard deviation of 0.0867 between subjects on a scale from zero to one and R2=0.9421. After a calibration procedure from a subset of subjects, the MOS method yielded similar results with an average standard deviation of 0.07620 and R2=0.9181.While results suggest good repeatability of the IR method over a broad range of subjects, the calibrated MOS method is capable of producing results more closely related to actual SII values and is a simpler procedure for human subjects

    A Review of Spinal ROM Measurement Tools

    Get PDF
    Physical therapists rely on measurements to communicate with one another, establish patient status, predict treatment response, document treatment efficacy, and claim scientific credibility for the profession. Therefore, the quality of measurements should be of great concern to physical therapists and, hence, therapists should be able to examine the quality of measurement tools they are using critically. A variety of measurement tools are being utilized in physical therapy to quantify spinal mobility; however, there is no clarity as to which of the tools are optimal. In particular, the spinal range of motion measurement tools will be examined because of the high occurrence and high cost of low back injuries. The spinal range of motion measurement tools reviewed in this study include goniometers, flexible rulers, inclinometers, motion analysis systems, the Isotechnologies B-200, and the Spinoscope. The use of each of these measurement tools has advantages and disadvantages in a clinical setting. The reliability and validity of a measurement tool should be the most important considerations, but individual clinical needs and available resources also need to be considered when choosing an appropriate spinal range of motion measurement tool. If all these factors are considered, the author recommends the use of inclinometers since many studies show the inclinometer to be both reliable and valid. The EDI 320, in particular, is recommended for its ease of application. Finally, even if a tool is shown to be reliable and valid, established protocols for measurement techniques should be followed by each clinical staff member

    Describing Subjective Experiment Consistency by pp-Value P-P Plot

    Full text link
    There are phenomena that cannot be measured without subjective testing. However, subjective testing is a complex issue with many influencing factors. These interplay to yield either precise or incorrect results. Researchers require a tool to classify results of subjective experiment as either consistent or inconsistent. This is necessary in order to decide whether to treat the gathered scores as quality ground truth data. Knowing if subjective scores can be trusted is key to drawing valid conclusions and building functional tools based on those scores (e.g., algorithms assessing the perceived quality of multimedia materials). We provide a tool to classify subjective experiment (and all its results) as either consistent or inconsistent. Additionally, the tool identifies stimuli having irregular score distribution. The approach is based on treating subjective scores as a random variable coming from the discrete Generalized Score Distribution (GSD). The GSD, in combination with a bootstrapped G-test of goodness-of-fit, allows to construct pp-value P-P plot that visualizes experiment's consistency. The tool safeguards researchers from using inconsistent subjective data. In this way, it makes sure that conclusions they draw and tools they build are more precise and trustworthy. The proposed approach works in line with expectations drawn solely on experiment design descriptions of 21 real-life multimedia quality subjective experiments.Comment: 11 pages, 3 figures. Accepted to 28th ACM International Conference on Multimedia (MM '20). For associated data sets, source codes and documentation, see https://github.com/Qub3k/subjective-exp-consistency-chec
    corecore