154 research outputs found

    High Performance Multiview Video Coding

    Following the standardization of High Efficiency Video Coding (HEVC) in 2013, the multiview extension of HEVC (MV-HEVC) was published in 2014. It brought significantly better compression performance, around 50%, for multiview and 3D videos compared to multiple independent single-view HEVC coding. However, the extremely high computational complexity of MV-HEVC demands significant optimization of the encoder. To tackle this problem, this work investigates the possibilities of using modern parallel computing platforms and tools such as single-instruction-multiple-data (SIMD) instructions, multi-core CPUs, massively parallel GPUs, and computer clusters to significantly enhance MVC encoder performance. These computing tools have very different computing characteristics, and misusing them may yield little performance improvement or sometimes even a reduction. To achieve the best possible encoding performance from modern computing tools, different levels of parallelism inside a typical MVC encoder are identified and analyzed. Novel optimization techniques at various levels of abstraction are proposed: non-aggregation massively parallel motion estimation (ME) and disparity estimation (DE) in the prediction unit (PU), fractional and bi-directional ME/DE acceleration through SIMD, quantization parameter (QP)-based early termination for the coding tree unit (CTU), optimized resource-scheduled wave-front parallel processing for CTUs, and workload-balanced, cluster-based multiple-view parallel processing. The results show that the proposed parallel optimization techniques significantly improve execution time with insignificant loss of coding efficiency. This, in turn, shows that modern parallel computing platforms, with appropriate platform-specific algorithm design, are valuable tools for improving the performance of computationally intensive applications.
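
As a rough illustration of the wave-front parallel processing mentioned in this abstract, the sketch below groups CTUs by the earliest step at which they could be processed, assuming the usual HEVC wavefront dependency (a CTU waits for its left neighbour and for the above-right CTU of the previous row). The function name `wavefront_schedule` is hypothetical, and this is plain wavefront ordering, not the resource-scheduled variant the thesis proposes.

```python
def wavefront_schedule(rows, cols):
    """Group CTU coordinates (row, col) by earliest processing step.

    Assumes the standard wavefront dependency: CTU (r, c) needs
    (r, c-1) and (r-1, c+1), so each row starts two CTUs behind the
    row above it. The earliest step is therefore c + 2*r.
    """
    steps = {}
    for r in range(rows):
        for c in range(cols):
            t = c + 2 * r  # earliest step for this CTU
            steps.setdefault(t, []).append((r, c))
    # CTUs within one step have no mutual dependencies and can run in parallel.
    return [steps[t] for t in sorted(steps)]


if __name__ == "__main__":
    for step, ctus in enumerate(wavefront_schedule(3, 6)):
        print(step, ctus)
```

Each inner list holds CTUs that could be dispatched to separate threads at the same time; the number of steps is `cols + 2 * (rows - 1)`.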

    Distributed Video Coding for Multiview and Video-plus-depth Coding


    Scalable light field representation and coding

    This Thesis aims to advance the state-of-the-art in light field representation and coding. In this context, proposals to improve functionalities like light field random access and scalability are also presented. As the light field representation constrains the coding approach to be used, several light field coding techniques to exploit the inherent characteristics of the most popular types of light field representations are proposed and studied, which are normally based on micro-images or sub-aperture-images. To encode micro-images, two solutions are proposed, aiming to exploit the redundancy between neighboring micro-images using a high order prediction model, where the model parameters are either explicitly transmitted or inferred at the decoder, respectively. In both cases, the proposed solutions are able to outperform low order prediction solutions. To encode sub-aperture-images, an HEVC-based solution that exploits their inherent intra and inter redundancies is proposed. In this case, the light field image is encoded as a pseudo video sequence, where the scanning order is signaled, allowing the encoder and decoder to optimize the reference picture lists to improve coding efficiency. A novel hybrid light field representation coding approach is also proposed, by exploiting the combined use of both micro-image and sub-aperture-image representation types, instead of using each representation individually. In order to aid the fast deployment of the light field technology, this Thesis also proposes scalable coding and representation approaches that enable adequate compatibility with legacy displays (e.g., 2D, stereoscopic or multiview) and with future light field displays, while maintaining high coding efficiency. 
Additionally, viewpoint random access, which improves light field navigation and reduces decoding delay, is enabled with a flexible trade-off between coding efficiency and viewpoint random access.

    Discontinuity-Aware Base-Mesh Modeling of Depth for Scalable Multiview Image Synthesis and Compression

    This thesis is concerned with the challenge of deriving disparity from sparsely communicated depth for performing disparity-compensated view synthesis for compression and rendering of multiview images. The modeling of depth is essential for deducing disparity at view locations where depth is not available and is also critical for visibility reasoning and occlusion handling. This thesis first explores disparity derivation methods and disparity-compensated view synthesis approaches. Investigations reveal the merits of adopting a piece-wise continuous mesh description of depth for deriving disparity at target view locations to enable disparity-compensated backward warping of texture. Visibility information can be reasoned due to the correspondence relationship between views that a mesh model provides, while the connectivity of a mesh model assists in resolving depth occlusion. The recent JPEG 2000 Part-17 extension defines tools for scalable coding of discontinuous media using breakpoint-dependent DWT, where breakpoints describe discontinuity boundary geometry. This thesis proposes a method to efficiently reconstruct depth coded using JPEG 2000 Part-17 as a piece-wise continuous mesh, where discontinuities are driven by the encoded breakpoints. Results show that the proposed mesh can accurately represent decoded depth while its complexity scales along with decoded depth quality. The piece-wise continuous mesh model anchored at a single viewpoint or base-view can be augmented to form a multi-layered structure where the underlying layers carry depth information of regions that are occluded at the base-view. Such a consolidated mesh representation is termed a base-mesh model and can be projected to many viewpoints, to deduce complete disparity fields between any pair of views that are inherently consistent. 
Experimental results demonstrate the superior performance of the base-mesh model in multiview synthesis and compression compared to other state-of-the-art methods, including the JPEG Pleno light field codec. The proposed base-mesh model departs greatly from conventional pixel-wise or block-wise depth models and their forward depth mapping for deriving disparity ingrained in existing multiview processing systems. When performing disparity-compensated view synthesis, there can be regions for which reference texture is unavailable, and inpainting is required. A new depth-guided texture inpainting algorithm is proposed to restore occluded texture in regions where depth information is either available or can be inferred using the base-mesh model.
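
For readers unfamiliar with how disparity follows from depth, the sketch below shows the textbook relation for a rectified, parallel camera pair: disparity equals focal length times baseline divided by depth. The function and parameter names are hypothetical, and this pixel-wise formula is only the conventional baseline; it is not the mesh-based derivation the thesis proposes.

```python
def disparity_from_depth(depth_m, focal_px, baseline_m):
    """Disparity in pixels for a rectified, parallel two-camera setup.

    Assumes a pinhole model: disparity = focal_px * baseline_m / depth_m.
    depth_m must be positive.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


if __name__ == "__main__":
    # Nearer points shift more between views than distant ones.
    for z in (1.0, 2.0, 10.0):
        print(z, disparity_from_depth(z, focal_px=1000.0, baseline_m=0.1))
```

The inverse dependence on depth is why depth discontinuities translate directly into disparity discontinuities and occlusions in the synthesized view.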

    Hierarchical representation and coding of surfaces using 3-D polygon meshes


    Codage de cartes de profondeur par deformation de courbes elastiques

    In multiple-view video plus depth, depth maps can be represented by means of grayscale images, and the corresponding temporal sequence can be thought of as a standard grayscale video sequence. However, depth maps have different properties from natural images: they present large areas of smooth surfaces separated by sharp edges. Arguably the most important information lies in object contours; consequently, an interesting approach consists in performing a lossless coding of the contour map, possibly followed by a lossy coding of per-object depth values. In this context, we propose a new technique for the lossless coding of object contours, based on the elastic deformation of curves. A continuous evolution of elastic deformations between two reference contour curves can be modelled, and an elastically deformed version of the reference contours can be sent to the decoder with an extremely small coding cost and used as side information to improve the lossless coding of the actual contour. After the main discontinuities have been captured by the contour description, the depth field inside each region is rather smooth. We proposed and tested two different techniques for the coding of the depth field inside each region. The first technique performs the shape-adaptive wavelet transform followed by the shape-adaptive version of SPIHT. The second technique performs a prediction of the depth field from its subsampled version and the set of coded contours. It is generally recognized that high-quality view rendering at the receiver side is possible only by preserving the contour information, since distortions on edges during the encoding step would cause a noticeable degradation of the synthesized view and of the 3D perception.
We investigated this claim by conducting a subjective quality assessment test to compare an object-based technique and a hybrid block-based technique for the coding of depth maps.

    Fehlerkaschierte Bildbasierte Darstellungsverfahren

    Creating photo-realistic images has been one of the major goals in computer graphics since its early days. Instead of modeling the complexity of nature with standard modeling tools, image-based approaches aim at exploiting real-world footage directly, as they are photo-realistic by definition. A drawback of these approaches has always been that the composition or combination of different sources is a non-trivial task, often resulting in annoying visible artifacts. In this thesis we focus on different techniques to diminish visible artifacts when combining multiple images in a common image domain. The results are either novel images, when dealing with the composition task of multiple images, or novel video sequences rendered in real-time, when dealing with video footage from multiple cameras.

    Monocular depth estimation in images and sequences using occlusion cues

    When humans observe a scene, they are able to perfectly distinguish the different parts composing it. Moreover, humans can easily reconstruct the spatial position of these parts and conceive a consistent structure. The mechanisms involving visual perception have been studied since the beginning of neuroscience but, still today, not all the processes composing it are known. In usual situations, humans can make use of three different methods to estimate the scene structure. The first one is the so-called divergence, and it makes use of both eyes. When objects lie in front of the observer at a distance of up to a hundred meters, subtle differences in the image formation in each eye can be used to determine depth. When objects are not in the field of view of both eyes, other mechanisms must be used. In these cases, both visual cues and prior learned information can be used to determine depth. Even if these mechanisms are less accurate than divergence, humans can almost always infer the correct depth structure when using them. As examples of visual cues, occlusion, perspective, and object size provide a lot of information about the structure of the scene. A priori information depends on each observer, but it is normally used subconsciously by humans to detect commonly known regions such as the sky, the ground, or different types of objects. In recent years, since technology has been able to handle the processing burden of vision systems, many efforts have been devoted to designing automated scene-interpreting systems. In this thesis we address the problem of depth estimation using only one point of view and using only occlusion depth cues. The objective of the thesis is to detect occlusions present in the scene and combine them with a segmentation system so as to generate a relative depth-order map for a scene. We explore both static and dynamic situations, such as single images, frames inside sequences, or full video sequences.
In the case where a full image sequence is available, a system exploiting motion information to recover depth structure is also designed. Results are promising and competitive with respect to the state-of-the-art literature, but there is still much room for improvement when compared to human depth perception performance.

    Methods for Real-time Visualization and Interaction with Landforms

    This thesis presents methods to enrich data modeling and analysis in the geoscience domain with a particular focus on geomorphological applications. First, a short overview of the relevant characteristics of the used remote sensing data and basics of its processing and visualization are provided. Then, two new methods for the visualization of vector-based maps on digital elevation models (DEMs) are presented. The first method uses a texture-based approach that generates a texture from the input maps at runtime taking into account the current viewpoint. In contrast to that, the second method utilizes the stencil buffer to create a mask in image space that is then used to render the map on top of the DEM. A particular challenge in this context is posed by the view-dependent level-of-detail representation of the terrain geometry. After suitable visualization methods for vector-based maps have been investigated, two landform mapping tools for the interactive generation of such maps are presented. The user can carry out the mapping directly on the textured digital elevation model and thus benefit from the 3D visualization of the relief. Additionally, semi-automatic image segmentation techniques are applied in order to reduce the amount of user interaction required and thus make the mapping process more efficient and convenient. The challenge in the adaption of the methods lies in the transfer of the algorithms to the quadtree representation of the data and in the application of out-of-core and hierarchical methods to ensure interactive performance. Although high-resolution remote sensing data are often available today, their effective resolution at steep slopes is rather low due to the oblique acquisition angle. For this reason, remote sensing data are suitable to only a limited extent for visualization as well as landform mapping purposes. 
To provide an easy way to supply additional imagery, an algorithm for registering uncalibrated photos to a textured digital elevation model is presented. A particular challenge in registering the images is posed by large variations in the photos concerning resolution, lighting conditions, seasonal changes, etc. The registered photos can be used to increase the visual quality of the textured DEM, in particular at steep slopes. To this end, a method is presented that combines several georegistered photos to textures for the DEM. The difficulty in this compositing process is to create a consistent appearance and avoid visible seams between the photos. In addition to that, the photos also provide valuable means to improve landform mapping. To this end, an extension of the landform mapping methods is presented that allows the utilization of the registered photos during mapping. This way, a detailed and exact mapping becomes feasible even at steep slopes.

    Action recognition in visual sensor networks: a data fusion perspective

    Visual Sensor Networks have emerged as a new technology to bring computer vision algorithms to the real world. However, they impose restrictions on the computational resources and bandwidth available to solve target problems. This thesis is concerned with the definition of new efficient algorithms to perform Human Action Recognition with Visual Sensor Networks. Human Action Recognition systems apply sequence modelling methods to integrate the temporal sensor measurements available. Among sequence modelling methods, the Hidden Conditional Random Field has shown strong performance in sequence classification tasks, outperforming many other methods. However, a parameter estimation procedure with feature and model selection properties has not been proposed. This thesis fills this gap by proposing a new objective function to optimize during training. The L2 regularizer employed in the standard objective function is replaced by an overlapping group-L1 regularizer that produces feature and model selection effects in the optima. A gradient-based search strategy is proposed to find the optimal parameters of the objective function. Experimental evidence shows that Hidden Conditional Random Fields with their parameters estimated using the proposed method have a higher predictive accuracy than those estimated with the standard method, with a smaller inference cost. This thesis also deals with the problem of human action recognition from multiple cameras, with the focus on reducing the amount of network bandwidth required. A multiple-view dimensionality reduction framework is developed to obtain similar low-dimensional representations for the motion descriptors extracted from multiple cameras. An alternative is also proposed: predicting the action class locally at each camera with the motion descriptors extracted from each view, then integrating the different action decisions to make a global decision on the action performed.
The reported experiments show that the proposed framework has a predictive performance similar to 3D state-of-the-art methods, but with a lower computational complexity and lower bandwidth requirements.
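
To make the regularizer swap described in this abstract concrete, the sketch below contrasts a squared-L2 penalty with an overlapping group penalty. It assumes the common group-lasso form, a sum of Euclidean norms over index groups that may share parameters; the abstract does not give the exact definition, so the function names, the grouping, and the weight `lam` are all hypothetical.

```python
import math


def l2_penalty(w, lam):
    """Standard squared-L2 penalty: lam * sum of squared weights."""
    return lam * sum(x * x for x in w)


def overlapping_group_l1_penalty(w, groups, lam):
    """Assumed group-L1 form: lam * sum over groups of the L2 norm.

    `groups` is a list of index lists; an index may appear in several
    groups (the overlap). A group contributes zero only when all of its
    weights are zero, which is what encourages whole groups to vanish.
    """
    return lam * sum(math.sqrt(sum(w[i] ** 2 for i in g)) for g in groups)


if __name__ == "__main__":
    w = [3.0, 4.0, 0.0]
    groups = [[0, 1], [1, 2]]  # index 1 is shared by both groups
    print(l2_penalty(w, 1.0))
    print(overlapping_group_l1_penalty(w, groups, 1.0))
```

Because the group norm is not differentiable at zero, gradient-based training with such a penalty usually needs a subgradient or proximal step; which variant the thesis uses is not stated here.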