
    Edge-Aware Extended Star-Tetrix Transforms for CFA-Sampled Raw Camera Image Compression

    Codecs using spectral-spatial transforms efficiently compress raw camera images captured with a color filter array (CFA-sampled raw images) by changing their RGB color space into a decorrelated color space. This study describes two types of spectral-spatial transform, called extended Star-Tetrix transforms (XSTTs), and their edge-aware versions, called edge-aware XSTTs (EXSTTs), which require no extra bits (side information) and little extra complexity. They are obtained by (i) extending the Star-Tetrix transform (STT), one of the latest spectral-spatial transforms, to a new version of our previously proposed wavelet-based spectral-spatial transform and a simpler version, (ii) considering that each 2-D predict step of the wavelet transform is a combination of two 1-D diagonal or horizontal-vertical transforms, and (iii) weighting the transforms along the edge directions in the images. Compared with the XSTTs, the EXSTTs decorrelate CFA-sampled raw images better: they reduce the difference in energy between the two green components by about 3.38--30.08% for high-quality camera images and 8.97--14.47% for mobile phone images. Experiments on JPEG 2000-based lossless and lossy compression of CFA-sampled raw images show better performance than conventional methods. For high-quality camera images, the XSTTs/EXSTTs produce results equal to or better than the conventional methods: especially for images with many edges, the type-I EXSTT improves the average lossless bitrate by about 0.03--0.19 bpp, and the XSTTs improve the average Bjøntegaard delta peak signal-to-noise ratio by about 0.16--0.96 dB. For mobile phone images, our previous work performs best, whereas the XSTTs/EXSTTs show trends similar to those for high-quality camera images.
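    As a rough illustration of the edge-aware weighting idea above (a minimal sketch, not the authors' exact XSTT/EXSTT lifting steps), the following predicts one green component from the other while biasing the prediction toward the direction with the smaller local gradient; the array layout and the weighting rule are assumptions made for illustration.

```python
import numpy as np

def edge_aware_predict(G1, G2, eps=1e-6):
    """Detail (high-pass) band of G2 predicted from G1 with edge-aware weights.

    G1, G2: 2-D arrays holding the two green components of a CFA image,
    brought onto a common grid. The weighting rule below is a generic
    illustration, not the exact EXSTT lifting step.
    """
    # Local gradients of G1 along the two candidate prediction directions.
    d_v = np.abs(np.roll(G1, 1, axis=0) - np.roll(G1, -1, axis=0))  # vertical
    d_h = np.abs(np.roll(G1, 1, axis=1) - np.roll(G1, -1, axis=1))  # horizontal
    # Edge-aware weights: the smoother direction (smaller gradient) gets
    # the larger weight, so the prediction follows edges instead of crossing them.
    w_v = d_h / (d_v + d_h + eps)
    w_h = d_v / (d_v + d_h + eps)
    pred = w_v * 0.5 * (np.roll(G1, 1, axis=0) + np.roll(G1, -1, axis=0)) \
         + w_h * 0.5 * (np.roll(G1, 1, axis=1) + np.roll(G1, -1, axis=1))
    return G2 - pred  # small when G2 is well predicted along edges
```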

    Pruned Lightweight Encoders for Computer Vision

    Latency-critical computer vision systems, such as autonomous driving or drone control, require fast image or video compression when offloading neural network inference to a remote computer. To ensure low latency on a near-sensor edge device, we propose the use of lightweight encoders with constant bitrate and pruned encoding configurations, namely ASTC and JPEG XS. Pruning introduces significant distortion, which we show can be recovered by retraining the neural network with compressed data after decompression. Such an approach does not modify the network architecture or require coding format modifications. By retraining with compressed datasets, we reduced the classification accuracy and segmentation mean intersection over union (mIoU) degradation due to ASTC compression to 4.9-5.0 percentage points (pp) and 4.4-4.0 pp, respectively. With the same method, the mIoU loss due to JPEG XS compression at the main profile was reduced to 2.7-2.3 pp. In terms of encoding speed, our ASTC encoder implementation is 2.3x faster than JPEG. Even though the JPEG XS reference encoder requires optimizations to reach low latency, we showed that disabling significance flag coding saves 22-23% of encoding time at the cost of 0.4-0.3 pp of mIoU after retraining.
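    A minimal sketch of the retraining idea (assuming a PyTorch-style dataset of PIL images and labels; Pillow's JPEG codec stands in for the ASTC / JPEG XS encoders, which are external tools):

```python
import io
from PIL import Image
from torch.utils.data import Dataset

class CompressedViewDataset(Dataset):
    """Wraps a dataset of (PIL image, label) pairs and passes each image
    through a lossy encode/decode cycle, so the network is retrained on
    the same kind of degraded inputs it will see after on-device
    compression. The wrapper itself is codec-agnostic."""

    def __init__(self, base_dataset, quality=50):
        self.base = base_dataset
        self.quality = quality

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, label = self.base[idx]
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=self.quality)  # lossy encode
        buf.seek(0)
        degraded = Image.open(buf).convert("RGB")           # decode back
        return degraded, label
```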

    Compression, pose tracking, and halftoning

    In this thesis, we discuss image compression, pose tracking, and halftoning. Although these areas seem unrelated at first glance, they can be connected through video coding as an application scenario. Our first contribution is an image compression algorithm based on a rectangular subdivision scheme which stores only a small subset of the image points. From these points, the remainder of the image is reconstructed using partial differential equations. Afterwards, we present a pose tracking algorithm that is able to follow the 3-D position and orientation of multiple objects simultaneously. The algorithm can deal with noisy sequences, and naturally handles both occlusions between different objects and occlusions occurring within kinematic chains. Our third contribution is a halftoning algorithm based on electrostatic principles, which can easily be adjusted to different settings through a number of extensions; examples include modifications to handle varying dot sizes or hatching. In the final part of the thesis, we show how to combine our image compression, pose tracking, and halftoning algorithms into novel video compression codecs. In each of these four topics, our algorithms yield excellent results that outperform those of other state-of-the-art algorithms.
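    The PDE-based reconstruction from a sparse set of stored points can be sketched as homogeneous diffusion inpainting; this toy version omits the thesis' rectangular subdivision scheme and whatever (likely anisotropic) differential operator it actually uses.

```python
import numpy as np

def diffusion_inpaint(values, mask, iters=2000):
    """Reconstruct an image from a sparse set of stored pixels by
    homogeneous diffusion, i.e. solving the Laplace equation with the
    stored pixels as Dirichlet data. `values` holds the stored grey
    values (entries outside `mask` are ignored), `mask` is True at
    stored pixels. Plain Jacobi iteration for clarity; real PDE codecs
    use much faster solvers."""
    u = values.astype(float).copy()
    for _ in range(iters):
        p = np.pad(u, 1, mode="edge")                # replicate the boundary
        avg = 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] +   # up + down
                      p[1:-1, :-2] + p[1:-1, 2:])    # left + right
        u = np.where(mask, values, avg)              # keep stored pixels fixed
    return u
```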

    Joint Demosaicking / Rectification of Fisheye Camera Images using Multi-color Graph Laplacian Regularization

    To compose one 360 degrees image from multiple viewpoint images taken from different fisheye cameras on a rig for viewing on a head-mounted display (HMD), a conventional processing pipeline first performs demosaicking on each fisheye camera's Bayer-patterned grid, then translates demosaicked pixels from the camera grid to a rectified image grid. By performing two image interpolation steps in sequence, interpolation errors can accumulate, and acquisition noise in each captured pixel can pollute its neighbors, resulting in correlated noise. In this paper, a joint processing framework is proposed that performs demosaicking and grid-to-grid mapping simultaneously, thus limiting noise pollution to one interpolation. Specifically, a reverse mapping function is first obtained from a regular on-grid location in the rectified image to an irregular off-grid location in the camera's Bayer-patterned image. For each pair of adjacent pixels in the rectified grid, its gradient is estimated using the pair's neighboring pixel gradients in three colors in the Bayer-patterned grid. A similarity graph is constructed based on the estimated gradients, and pixels are interpolated in the rectified grid directly via graph Laplacian regularization (GLR). To establish ground truth for objective testing, a large dataset containing pairs of simulated images both in the fisheye camera grid and the rectified image grid is built. Experiments show that the proposed joint demosaicking / rectification method outperforms competing schemes that execute demosaicking and rectification in sequence in both objective and subjective measures
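    A toy, single-channel version of the graph Laplacian regularized reconstruction step might look as follows, assuming the sampling matrix H, the observations y, and the gradient-based similarity weights W have already been built; the paper's multi-color gradient estimation and graph construction are omitted.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

def glr_interpolate(H, y, W, mu=0.1):
    """Solve min_x ||H x - y||^2 + mu * x^T L x (graph Laplacian regularized
    reconstruction). H is a sparse m x n sampling matrix mapping the n
    rectified-grid pixels to the m observed Bayer samples, y holds the
    observations, and W is a sparse symmetric n x n matrix of similarity
    weights (e.g. derived from estimated gradients)."""
    degree = np.asarray(W.sum(axis=1)).ravel()
    L = diags(degree) - W                 # combinatorial graph Laplacian
    A = (H.T @ H + mu * L).tocsc()
    b = H.T @ y
    return spsolve(A, b)                  # reconstructed rectified-grid pixels
```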

    Methods for Light Field Display Profiling and Scalable Super-Multiview Video Coding

    Light field 3D displays reproduce the light field of real or synthetic scenes, as observed by multiple viewers, without the necessity of wearing 3D glasses. Reproducing light fields is a technically challenging task in terms of optical setup, content creation, and distributed rendering, among others; however, the impressive visual quality of hologram-like scenes, in full color, at real-time frame rates, and over a very wide field of view justifies the complexity involved. Seeing objects popping far out from the screen plane without glasses impresses even those viewers who have experienced other 3D displays before.

    Content for these displays can be either synthetic or real. The creation of synthetic (rendered) content is relatively well understood and used in practice. Depending on the technique used, rendering has its own complexities, quite similar to the complexity of rendering techniques for 2D displays. While rendering can be used in many use cases, the holy grail of all 3D display technologies is to become the future 3D TV, ending up in every living room and showing realistic 3D content without glasses. Capturing, transmitting, and rendering live scenes as light fields is extremely challenging, and it is necessary if we are to experience light field 3D television showing real people and natural scenes, or realistic 3D video conferencing with real eye contact.

    In order to provide the required realism, light field displays aim to provide a wide field of view (up to 180°) while reproducing up to ~80 megapixels nowadays. Building gigapixel light field displays is realistic in the next few years. Likewise, capturing live light fields involves using many synchronized cameras that cover the same wide field of view as the display and provide the same high pixel count. Therefore, light field capture and content creation have to be well optimized with respect to the targeted display technologies. Two major challenges in this process are addressed in this dissertation.

    The first challenge is how to characterize the display in terms of its capabilities to create light fields, that is, how to profile the display in question. In clearer terms, this boils down to finding the equivalent spatial resolution, which is similar to the screen resolution of 2D displays, and the angular resolution, which describes the smallest angle whose color the display can control individually. The light field is formalized as a 4D approximation of the plenoptic function in terms of geometrical optics, through spatially localized and angularly directed light rays in the so-called ray space. Plenoptic sampling theory provides the conditions required to sample and reconstruct light fields. Subsequently, light field displays can be characterized in the Fourier domain by the effective display bandwidth they support. In the thesis, a methodology for display-specific light field analysis is proposed. It regards the display as a signal processing channel and analyzes it as such in the spectral domain. As a result, one is able to derive the display throughput (i.e. the display bandwidth) and, subsequently, the optimal camera configuration to efficiently capture and filter light fields before displaying them.

    While the geometrical topology of optical light sources in projection-based light field displays can be used to theoretically derive the display bandwidth and its spatial and angular resolution, in many cases this topology is not available to the user. Furthermore, there are many implementation details which cause the display to deviate from its theoretical model. In such cases, profiling light field displays in terms of spatial and angular resolution has to be done by measurement. Measurement methods that involve the display showing specific test patterns, which are then captured by a single static or moving camera, are proposed in the thesis. Determining the effective spatial and angular resolution of a light field display is then based on an automated frequency-domain analysis of the captured images as they are reproduced by the display. The analysis reveals the empirical limits of the display in terms of pass-band in both the spatial and the angular dimension. Furthermore, the spatial resolution measurements are validated by subjective tests confirming that the results are in line with the smallest features human observers can perceive on the same display. The resolution values obtained can be used to design the optimal capture setup for the display in question.

    The second challenge is related to the massive number of captured views and pixels that have to be transmitted to the display. This clearly requires effective and efficient compression techniques to fit in the available bandwidth, as an uncompressed representation of such a super-multiview video could easily consume ~20 gigabits per second with today's displays. Due to the high number of light rays to be captured, transmitted, and rendered, distributed systems are necessary for both capturing and rendering the light field. During the first attempts to implement real-time light field capture, transmission, and rendering using a brute-force approach, limitations became apparent. Still, because dense multi-camera light field capture and light-ray interpolation give the best achievable image quality, this approach was chosen as the basis of further work, despite the massive amount of bandwidth needed. Decompressing all camera images in all rendering nodes, however, is prohibitively time consuming and does not scale. After analyzing the light field interpolation process and the data-access patterns typical of a distributed light field rendering system, an approach to reduce the amount of data required in the rendering nodes is proposed. This approach needs only rectangular parts (typically vertical bars, in the case of a horizontal-parallax-only light field display) of the captured images to be available in the rendering nodes, which can be exploited to reduce the time spent on decompressing the video streams. However, partial decoding is not readily supported by common image / video codecs. In the thesis, approaches aimed at achieving partial decoding are proposed for H.264, HEVC, JPEG, and JPEG 2000, and the results are compared.

    The results of the thesis on display profiling facilitate the design of optimal camera setups for capturing scenes to be reproduced on 3D light field displays. The developed super-multiview content encoding also facilitates light field rendering in real time. This makes live light field transmission and real-time teleconferencing possible in a scalable way, using any number of cameras, and at the spatial and angular resolution the display actually needs to achieve a compelling visual experience.
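    To make the ~20 gigabits per second figure concrete, a back-of-the-envelope calculation with purely illustrative parameters (the camera count, resolution, chroma format, and frame rate below are not taken from the thesis) is:

```python
def raw_lightfield_bitrate_gbps(views, width, height, bytes_per_pixel, fps):
    """Uncompressed bitrate of a super-multiview stream, in Gbit/s."""
    return views * width * height * bytes_per_pixel * fps * 8 / 1e9

# Illustrative numbers only: 72 cameras at 1280x720, YUV 4:2:0
# (1.5 bytes/pixel), 25 fps already lands near ~20 Gbit/s.
print(raw_lightfield_bitrate_gbps(72, 1280, 720, 1.5, 25))  # ~19.9
```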