76 research outputs found

    Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer

    Full text link
    Semantic annotations are vital for training models for object recognition, semantic segmentation or scene understanding. Unfortunately, pixelwise annotation of images at very large scale is labor-intensive and only little labeled data is available, particularly at instance level and for street scenes. In this paper, we propose to tackle this problem by lifting the semantic instance labeling task from 2D into 3D. Given reconstructions from stereo or laser data, we annotate static 3D scene elements with rough bounding primitives and develop a model which transfers this information into the image domain. We leverage our method to obtain 2D labels for a novel suburban video dataset which we have collected, resulting in 400k semantic and instance image annotations. A comparison of our method to state-of-the-art label transfer baselines reveals that 3D information enables more efficient annotation while at the same time resulting in improved accuracy and time-coherent labels.Comment: 10 pages in Conference on Computer Vision and Pattern Recognition (CVPR), 201

    Scalable light field representation and coding

    Get PDF
    This Thesis aims to advance the state-of-the-art in light field representation and coding. In this context, proposals to improve functionalities like light field random access and scalability are also presented. As the light field representation constrains the coding approach to be used, several light field coding techniques to exploit the inherent characteristics of the most popular types of light field representations are proposed and studied, which are normally based on micro-images or sub-aperture-images. To encode micro-images, two solutions are proposed, aiming to exploit the redundancy between neighboring micro-images using a high order prediction model, where the model parameters are either explicitly transmitted or inferred at the decoder, respectively. In both cases, the proposed solutions are able to outperform low order prediction solutions. To encode sub-aperture-images, an HEVC-based solution that exploits their inherent intra and inter redundancies is proposed. In this case, the light field image is encoded as a pseudo video sequence, where the scanning order is signaled, allowing the encoder and decoder to optimize the reference picture lists to improve coding efficiency. A novel hybrid light field representation coding approach is also proposed, by exploiting the combined use of both micro-image and sub-aperture-image representation types, instead of using each representation individually. In order to aid the fast deployment of the light field technology, this Thesis also proposes scalable coding and representation approaches that enable adequate compatibility with legacy displays (e.g., 2D, stereoscopic or multiview) and with future light field displays, while maintaining high coding efficiency. Additionally, viewpoint random access, allowing to improve the light field navigation and to reduce the decoding delay, is also enabled with a flexible trade-off between coding efficiency and viewpoint random access.Esta Tese tem como objetivo avançar o estado da arte em representação e codificação de campos de luz. Neste contexto, são também apresentadas propostas para melhorar funcionalidades como o acesso aleatório ao campo de luz e a escalabilidade. Como a representação do campo de luz limita a abordagem de codificação a ser utilizada, são propostas e estudadas várias técnicas de codificação de campos de luz para explorar as características inerentes aos seus tipos mais populares de representação, que são normalmente baseadas em micro-imagens ou imagens de sub-abertura. Para codificar as micro-imagens, são propostas duas soluções, visando explorar a redundância entre micro-imagens vizinhas utilizando um modelo de predição de alta ordem, onde os parâmetros do modelo são explicitamente transmitidos ou inferidos no decodificador, respetivamente. Em ambos os casos, as soluções propostas são capazes de superar as soluções de predição de baixa ordem. Para codificar imagens de sub-abertura, é proposta uma solução baseada em HEVC que explora a inerente redundância intra e inter deste tipo de imagens. Neste caso, a imagem do campo de luz é codificada como uma pseudo-sequência de vídeo, onde a ordem de varrimento é sinalizada, permitindo ao codificador e decodificador otimizar as listas de imagens de referência para melhorar a eficiência da codificação. Também é proposta uma nova abordagem de codificação baseada na representação híbrida do campo de luz, explorando o uso combinado dos tipos de representação de micro-imagem e sub-imagem, em vez de usar cada representação individualmente. A fim de facilitar a rápida implantação da tecnologia de campo de luz, esta Tese também propõe abordagens escaláveis de codificação e representação que permitem uma compatibilidade adequada com monitores tradicionais (e.g., 2D, estereoscópicos ou multivista) e com futuros monitores de campo de luz, mantendo ao mesmo tempo uma alta eficiência de codificação. Além disso, o acesso aleatório de pontos de vista, permitindo melhorar a navegação no campo de luz e reduzir o atraso na descodificação, também é permitido com um equilíbrio flexível entre eficiência de codificação e acesso aleatório de pontos de vista

    Rich probabilistic models for semantic labeling

    Get PDF
    Das Ziel dieser Monographie ist es die Methoden und Anwendungen des semantischen Labelings zu erforschen. Unsere Beiträge zu diesem sich rasch entwickelten Thema sind bestimmte Aspekte der Modellierung und der Inferenz in probabilistischen Modellen und ihre Anwendungen in den interdisziplinären Bereichen der Computer Vision sowie medizinischer Bildverarbeitung und Fernerkundung

    Tree-structured graphical models and image analysis

    Get PDF

    JOINT CODING OF MULTIMODAL BIOMEDICAL IMAGES US ING CONVOLUTIONAL NEURAL NETWORKS

    Get PDF
    The massive volume of data generated daily by the gathering of medical images with different modalities might be difficult to store in medical facilities and share through communication networks. To alleviate this issue, efficient compression methods must be implemented to reduce the amount of storage and transmission resources required in such applications. However, since the preservation of all image details is highly important in the medical context, the use of lossless image compression algorithms is of utmost importance. This thesis presents the research results on a lossless compression scheme designed to encode both computerized tomography (CT) and positron emission tomography (PET). Different techniques, such as image-to-image translation, intra prediction, and inter prediction are used. Redundancies between both image modalities are also investigated. To perform the image-to-image translation approach, we resort to lossless compression of the original CT data and apply a cross-modality image translation generative adversarial network to obtain an estimation of the corresponding PET. Two approaches were implemented and evaluated to determine a PET residue that will be compressed along with the original CT. In the first method, the residue resulting from the differences between the original PET and its estimation is encoded, whereas in the second method, the residue is obtained using encoders inter-prediction coding tools. Thus, in alternative to compressing two independent picture modalities, i.e., both images of the original PET-CT pair solely the CT is independently encoded alongside with the PET residue, in the proposed method. Along with the proposed pipeline, a post-processing optimization algorithm that modifies the estimated PET image by altering the contrast and rescaling the image is implemented to maximize the compression efficiency. Four different versions (subsets) of a publicly available PET-CT pair dataset were tested. The first proposed subset was used to demonstrate that the concept developed in this work is capable of surpassing the traditional compression schemes. The obtained results showed gains of up to 8.9% using the HEVC. On the other side, JPEG2k proved not to be the most suitable as it failed to obtain good results, having reached only -9.1% compression gain. For the remaining (more challenging) subsets, the results reveal that the proposed refined post-processing scheme attains, when compared to conventional compression methods, up 6.33% compression gain using HEVC, and 7.78% using VVC

    Factor Graphs for Computer Vision and Image Processing

    No full text
    Factor graphs have been used extensively in the decoding of error correcting codes such as turbo codes, and in signal processing. However, while computer vision and pattern recognition are awash with graphical model usage, it is some-what surprising that factor graphs are still somewhat under-researched in these communities. This is surprising because factor graphs naturally generalise both Markov random fields and Bayesian networks. Moreover, they are useful in modelling relationships between variables that are not necessarily probabilistic and allow for efficient marginalisation via a sum-product of probabilities. In this thesis, we present and illustrate the utility of factor graphs in the vision community through some of the field’s popular problems. The thesis does so with a particular focus on maximum a posteriori (MAP) inference in graphical structures with layers. To this end, we are able to break-down complex problems into factored representations and more computationally realisable constructions. Firstly, we present a sum-product framework that uses the explicit factorisation in local subgraphs from the partitioned factor graph of a layered structure to perform inference. This provides an efficient method to perform inference since exact inference is attainable in the resulting local subtrees. Secondly, we extend this framework to the entire graphical structure without partitioning, and discuss preliminary ways to combine outputs from a multilevel construction. Lastly, we further our endeavour to combine evidence from different methods through a simplicial spanning tree reparameterisation of the factor graph in a way that ensures consistency, to produce an ensembled and improved result. Throughout the thesis, the underlying feature we make use of is to enforce adjacency constraints using Delaunay triangulations computed by adding points dynamically, or using a convex hull algorithm. The adjacency relationships from Delaunay triangulations aid the factor graph approaches in this thesis to be both efficient and competitive for computer vision tasks. This is because of the low treewidth they provide in local subgraphs, as well as the reparameterised interpretation of the graph they form through the spanning tree of simplexes. While exact inference is known to be intractable for junction trees obtained from the loopy graphs in computer vision, in this thesis we are able to effect exact inference on our spanning tree of simplexes. More importantly, the approaches presented here are not restricted to the computer vision and image processing fields, but are extendable to more general applications that involve distributed computations

    An Introduction to Neural Data Compression

    Full text link
    Neural compression is the application of neural networks and other machine learning methods to data compression. Recent advances in statistical machine learning have opened up new possibilities for data compression, allowing compression algorithms to be learned end-to-end from data using powerful generative models such as normalizing flows, variational autoencoders, diffusion probabilistic models, and generative adversarial networks. The present article aims to introduce this field of research to a broader machine learning audience by reviewing the necessary background in information theory (e.g., entropy coding, rate-distortion theory) and computer vision (e.g., image quality assessment, perceptual metrics), and providing a curated guide through the essential ideas and methods in the literature thus far

    Optical flow estimation via steered-L1 norm

    Get PDF
    Global variational methods for estimating optical flow are among the best performing methods due to the subpixel accuracy and the ‘fill-in’ effect they provide. The fill-in effect allows optical flow displacements to be estimated even in low and untextured areas of the image. The estimation of such displacements are induced by the smoothness term. The L1 norm provides a robust regularisation term for the optical flow energy function with a very good performance for edge-preserving. However this norm suffers from several issues, among these is the isotropic nature of this norm which reduces the fill-in effect and eventually the accuracy of estimation in areas near motion boundaries. In this paper we propose an enhancement to the L1 norm that improves the fill-in effect for this smoothness term. In order to do this we analyse the structure tensor matrix and use its eigenvectors to steer the smoothness term into components that are ‘orthogonal to’ and ‘aligned with’ image structures. This is done in primal-dual formulation. Results show a reduced end-point error and improved accuracy compared to the conventional L1 norm

    Optical flow estimation via steered-L1 norm

    Get PDF
    Global variational methods for estimating optical flow are among the best performing methods due to the subpixel accuracy and the ‘fill-in’ effect they provide. The fill-in effect allows optical flow displacements to be estimated even in low and untextured areas of the image. The estimation of such displacements are induced by the smoothness term. The L1 norm provides a robust regularisation term for the optical flow energy function with a very good performance for edge-preserving. However this norm suffers from several issues, among these is the isotropic nature of this norm which reduces the fill-in effect and eventually the accuracy of estimation in areas near motion boundaries. In this paper we propose an enhancement to the L1 norm that improves the fill-in effect for this smoothness term. In order to do this we analyse the structure tensor matrix and use its eigenvectors to steer the smoothness term into components that are ‘orthogonal to’ and ‘aligned with’ image structures. This is done in primal-dual formulation. Results show a reduced end-point error and improved accuracy compared to the conventional L1 norm

    Deep Visual Parsing with Limited Supervision

    Get PDF
    Scene parsing entails interpretation of the visual world in terms of meaningful semantic concepts. Automatically performing such analysis with machine learning techniques is not a purely scientific endeavour. It holds transformative potential for emerging technologies, such as autonomous driving and robotics, where deploying a human expert can be economically unfeasible or hazardous. Recent methods based on deep learning have made substantial progress towards realising this potential. However, to achieve high accuracy on application-specific formulations of the scene parsing task, such as semantic segmentation, deep learning models require significant amounts of high-quality dense annotation. Obtaining such supervision with human labour is costly and time-consuming. Therefore, reducing the need for precise annotation without sacrificing model accuracy is essential when it comes to deploying these models at scale. In this dissertation, we advance towards this goal by progressively reducing the amount of required supervision in the context of semantic image segmentation. In this task, we aim to label every pixel in the image with its semantic category. We formulate and implement four novel deep learning techniques operating under varying levels of task supervision: First, we develop a recurrent model for instance segmentation, which sequentially predicts one object mask at a time. Sequential models have provision for exploiting the temporal context: segmenting prominent instances first may disambiguate mask prediction for hard objects (e.g. due to occlusion) later on. However, such advantageous ordering of prediction is typically unavailable. Our proposed actor-critic framework discovers such orderings and provides empirical accuracy benefits compared to a baseline without such capacity. Second, we consider weakly supervised semantic segmentation. This problem setting requires the model to produce object masks with only image-level labels available as the training supervision. In contrast to previous works, we approach this problem with a practical single-stage model. Despite its simple design, it produces highly accurate segmentation, competitive with, or even improving upon several multi-stage methods. Reducing the amount of supervision further, we next study unsupervised domain adaptation. In this scenario, there are no labels available for real-world data. Instead, we may only use the labels of synthetically generated visual scenes. We propose a novel approach, which adapts the segmentation model trained on synthetic data to unlabelled real-world images using pseudo labels. Crucially, we construct these pseudo annotation by leveraging equivariance of the semantic segmentation task to similarity transformations. At the time of publication, our adaptation framework achieved state-of-the-art accuracy, in some benchmarks even substantially surpassing that of previous art. Last, we present an unsupervised technique for representation learning. We define the desired representation to be useful for the task of video object segmentation, which requires establishing dense object-level correspondences in video sequences. Learning such features efficiently in a fully convolutional regime is prone to degenerate solutions. Yet our approach circumvents them with a simple and effective mechanism based on the already familiar model equivariance to similarity transformations. We empirically show that our framework attains new state-of-the-art video segmentation accuracy at a significantly reduced computational cost
    corecore