76 research outputs found
Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer
Semantic annotations are vital for training models for object recognition,
semantic segmentation, and scene understanding. Unfortunately, pixelwise
annotation of images at very large scale is labor-intensive, and little
labeled data is available, particularly at the instance level and for street
scenes. In this paper, we propose to tackle this problem by lifting the
semantic instance labeling task from 2D into 3D. Given reconstructions from
stereo or laser data, we annotate static 3D scene elements with rough bounding
primitives and develop a model which transfers this information into the image
domain. We leverage our method to obtain 2D labels for a novel suburban video
dataset which we have collected, resulting in 400k semantic and instance image
annotations. A comparison of our method to state-of-the-art label transfer
baselines reveals that 3D information enables more efficient annotation while
at the same time resulting in improved accuracy and time-coherent labels.
Comment: 10 pages, in Conference on Computer Vision and Pattern Recognition (CVPR), 201
Scalable light field representation and coding
This Thesis aims to advance the state-of-the-art in light field representation and coding. In this context, proposals to improve functionalities like light field random access and scalability are also presented. As the light field representation constrains the coding approach to be used, several light field coding techniques are proposed and studied to exploit the inherent characteristics of the most popular types of light field representations, which are normally based on micro-images or sub-aperture images.
To encode micro-images, two solutions are proposed that exploit the redundancy between neighboring micro-images using a high-order prediction model, where the model parameters are explicitly transmitted or inferred at the decoder, respectively. In both cases, the proposed solutions outperform low-order prediction solutions.
To encode sub-aperture-images, an HEVC-based solution that exploits their inherent intra and inter redundancies is proposed. In this case, the light field image is encoded as a pseudo video sequence, where the scanning order is signaled, allowing the encoder and decoder to optimize the reference picture lists to improve coding efficiency.
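The pseudo video sequence idea above depends on the scanning order in which sub-aperture images are fed to the encoder. As a minimal sketch, assuming a serpentine (snake) scan, which is one common choice for such a signalled order (the thesis's actual orders are not specified here), the ordering can be generated as:

```python
# Hypothetical sketch: ordering the sub-aperture images of a light field
# into a pseudo video sequence using a serpentine (snake) scan.

def serpentine_scan(rows, cols):
    """Return (row, col) coordinates of sub-aperture images in serpentine
    order: left-to-right on even rows, right-to-left on odd rows."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return order

# Scan order for a 3x3 grid of sub-aperture views:
print(serpentine_scan(3, 3))
```

Keeping consecutive frames spatially adjacent in this way lets the encoder's reference picture lists exploit the strong inter-view redundancy.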
A novel hybrid light field representation coding approach is also proposed, by exploiting the combined use of both micro-image and sub-aperture-image representation types, instead of using each representation individually.
In order to aid the fast deployment of the light field technology, this Thesis also proposes scalable coding and representation approaches that enable adequate compatibility with legacy displays (e.g., 2D, stereoscopic or multiview) and with future light field displays, while maintaining high coding efficiency. Additionally, viewpoint random access, which improves light field navigation and reduces the decoding delay, is also enabled with a flexible trade-off between coding efficiency and viewpoint random access.
Rich probabilistic models for semantic labeling
The goal of this monograph is to explore the methods and applications of semantic labeling. Our contributions to this rapidly developing topic concern particular aspects of modelling and inference in probabilistic models, and their applications in the interdisciplinary areas of computer vision, medical image processing, and remote sensing.
JOINT CODING OF MULTIMODAL BIOMEDICAL IMAGES USING CONVOLUTIONAL NEURAL NETWORKS
The massive volume of data generated daily by the gathering of medical images with
different modalities might be difficult to store in medical facilities and share through
communication networks. To alleviate this issue, efficient compression methods
must be implemented to reduce the amount of storage and transmission resources
required in such applications. However, since the preservation of all image details
is highly important in the medical context, the use of lossless image compression
algorithms is of utmost importance.
This thesis presents the research results on a lossless compression scheme designed
to encode both computerized tomography (CT) and positron emission tomography
(PET) images. Different techniques, such as image-to-image translation, intra
prediction, and inter prediction, are used, and redundancies between the two
image modalities are also investigated. In the image-to-image translation
approach, we losslessly compress the original CT data and apply a cross-modality
image-translation generative adversarial network to obtain an estimate of the
corresponding PET.
Two approaches were implemented and evaluated to determine a PET residue
that is compressed along with the original CT. In the first method, the
residue resulting from the differences between the original PET and its
estimation is encoded, whereas in the second method the residue is obtained
using the encoder's inter-prediction coding tools. Thus, instead of compressing
two independent image modalities, i.e., both images of the original PET-CT pair,
the proposed method independently encodes only the CT alongside the PET residue.
Along with the proposed pipeline, a post-processing optimization algorithm that
modifies the estimated PET image by altering its contrast and rescaling the image
is implemented to maximize compression efficiency.
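The first residue approach described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: `estimate_pet_from_ct` is a placeholder standing in for the cross-modality translation GAN, and the arrays are toy data.

```python
import numpy as np

def estimate_pet_from_ct(ct):
    # Placeholder for the image-to-image translation GAN (hypothetical).
    return ct.astype(np.int32) // 2

def encode_residue(pet, ct):
    """Residue to be compressed alongside the losslessly coded CT."""
    return pet.astype(np.int32) - estimate_pet_from_ct(ct)

def decode_pet(residue, ct):
    """Lossless reconstruction: the estimate plus the residue recovers PET."""
    return estimate_pet_from_ct(ct) + residue

ct = np.array([[100, 120], [140, 160]], dtype=np.uint16)
pet = np.array([[55, 70], [60, 90]], dtype=np.uint16)
res = encode_residue(pet, ct)
assert np.array_equal(decode_pet(res, ct), pet)  # exact recovery
```

Because the decoder can regenerate the same estimate from the losslessly coded CT, only the (typically low-entropy) residue needs to be transmitted for the PET.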
Four different versions (subsets) of a publicly available PET-CT pair dataset
were tested. The first subset was used to demonstrate that the concept
developed in this work is capable of surpassing traditional compression schemes.
The results showed gains of up to 8.9% over HEVC. JPEG 2000, on the other hand,
proved less suitable, failing to obtain good results and reaching only -9.1%
compression gain. For the remaining (more challenging) subsets, the results
reveal that the proposed refined post-processing scheme attains, when compared
to conventional compression methods, up to 6.33% compression gain using HEVC
and 7.78% using VVC.
Factor Graphs for Computer Vision and Image Processing
Factor graphs have been used extensively in the decoding of error
correcting codes such as turbo codes, and in signal processing.
However, while computer vision and pattern recognition are awash
with graphical models, factor graphs remain somewhat
under-researched in these communities. This is surprising because
factor graphs naturally generalise both Markov random fields and
Bayesian networks. Moreover, they are useful in modelling
relationships between variables that are not necessarily
probabilistic and allow for efficient marginalisation via the
sum-product algorithm.
In this thesis, we present and illustrate the utility of factor
graphs in the vision community through some of the field’s
popular problems. The thesis does so with a particular focus on
maximum a posteriori (MAP) inference in graphical
structures with layers. To this end, we are able to break down
complex problems into factored representations and more
computationally realisable constructions. Firstly, we present a
sum-product framework that uses the explicit factorisation
in local subgraphs from the partitioned factor graph of a layered
structure to perform inference. This provides an efficient method
to perform inference since exact inference is attainable in the
resulting local subtrees. Secondly, we extend this framework to
the entire graphical structure without partitioning, and discuss
preliminary ways to combine outputs from a multilevel
construction. Lastly, we further our endeavour to combine
evidence from different methods through
a simplicial spanning tree reparameterisation of the factor graph
in a way that ensures consistency, to produce an ensembled and
improved result. Throughout the thesis, the underlying feature we
make use of is to enforce adjacency constraints using Delaunay
triangulations computed by adding points dynamically, or using a
convex hull algorithm. The adjacency relationships from Delaunay
triangulations aid the factor graph approaches in this thesis to
be both efficient and
competitive for computer vision tasks. This is because of the low
treewidth they provide in local subgraphs, as well as the
reparameterised interpretation of the graph they form through the
spanning tree of simplexes. While exact inference is known to be
intractable for junction trees obtained from the loopy graphs in
computer vision, in this thesis we are able to effect exact
inference on our spanning tree of simplexes. More importantly,
the approaches presented here are not restricted to the computer
vision and image processing fields, but are extendable to more
general applications that involve distributed computations.
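The sum-product marginalisation central to this thesis can be illustrated on the smallest possible tree-structured factor graph: two variables x1 and x2 with unary factors g1, g2 and a pairwise factor f. This toy sketch (with made-up factor values) is exact because the graph is a tree, which is the same property the thesis exploits in its low-treewidth local subtrees.

```python
import numpy as np

# Tree factor graph: x1 -- f(x1, x2) -- x2, with unary factors g1, g2.
g1 = np.array([0.6, 0.4])            # unary factor over x1
g2 = np.array([0.3, 0.7])            # unary factor over x2
f = np.array([[0.9, 0.1],
              [0.2, 0.8]])           # pairwise factor f(x1, x2)

# Message from x1 through f to x2: sum over x1 of g1(x1) * f(x1, x2)
msg_to_x2 = g1 @ f

# Marginal at x2: local factor times incoming message, normalised.
marg_x2 = g2 * msg_to_x2
marg_x2 /= marg_x2.sum()
print(marg_x2)
```

On loopy graphs the same message-passing rules no longer yield exact marginals, which is why restructuring inference around trees of simplexes matters.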
An Introduction to Neural Data Compression
Neural compression is the application of neural networks and other machine
learning methods to data compression. Recent advances in statistical machine
learning have opened up new possibilities for data compression, allowing
compression algorithms to be learned end-to-end from data using powerful
generative models such as normalizing flows, variational autoencoders,
diffusion probabilistic models, and generative adversarial networks. The
present article aims to introduce this field of research to a broader machine
learning audience by reviewing the necessary background in information theory
(e.g., entropy coding, rate-distortion theory) and computer vision (e.g., image
quality assessment, perceptual metrics), and providing a curated guide through
the essential ideas and methods in the literature thus far.
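The rate-distortion trade-off mentioned above is the objective most neural compression methods optimise: a Lagrangian L = R + λ·D, where the rate R is the negative log2-likelihood of the quantised latents under an entropy model and D is a distortion such as mean squared error. The toy entropy model and values below are illustrative only:

```python
import numpy as np

def rate_bits(latents, pmf):
    """Rate in bits: -sum log2 p(z) over the discrete latent codes z."""
    return -np.sum(np.log2(pmf[latents]))

def rd_lagrangian(x, x_hat, latents, pmf, lam):
    """Rate-distortion Lagrangian L = R + lam * D (D = mean squared error)."""
    return rate_bits(latents, pmf) + lam * np.mean((x - x_hat) ** 2)

pmf = np.array([0.5, 0.25, 0.125, 0.125])   # toy entropy model over 4 symbols
latents = np.array([0, 1, 2, 0])            # quantised latent codes
x = np.array([1.0, 2.0, 3.0, 4.0])          # original signal
x_hat = np.array([1.1, 1.9, 3.2, 3.8])      # reconstruction
L = rd_lagrangian(x, x_hat, latents, pmf, lam=10.0)
print(L)  # rate = 7 bits, distortion = 0.025, so L = 7.25
```

Sweeping λ traces out the rate-distortion curve; in learned codecs both the entropy model and the reconstruction are trained jointly against this objective.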
Optical flow estimation via steered-L1 norm
Global variational methods for estimating optical flow are among the best performing methods due to the subpixel accuracy and the ‘fill-in’ effect they provide. The fill-in effect allows optical flow displacements to be estimated even in low-textured and untextured areas of the image; the estimation of such displacements is induced by the smoothness term. The L1 norm provides a robust regularisation term for the optical flow energy function with very good edge-preserving performance. However, this norm suffers from several issues, among them its isotropic nature, which reduces the fill-in effect and eventually the accuracy of estimation in areas near motion boundaries. In this paper we propose an enhancement to the L1 norm that improves the fill-in effect of this smoothness term. To do so, we analyse the structure tensor matrix and use its eigenvectors to steer the smoothness term into components that are ‘orthogonal to’ and ‘aligned with’ image structures. This is done in a primal-dual formulation. Results show a reduced end-point error and improved accuracy compared to the conventional L1 norm.
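The steering step described above can be sketched as follows: build the structure tensor from image gradients and take its eigenvectors, whose directions are aligned with and orthogonal to local image structure. This is a simplified illustration, not the paper's method; the Gaussian smoothing of tensor entries and the per-pixel windows used in practice are replaced by a whole-image average.

```python
import numpy as np

def structure_tensor_directions(img):
    """Eigen-decomposition of a (globally averaged) structure tensor."""
    gy, gx = np.gradient(img.astype(float))   # gradients along rows, cols
    J = np.array([[np.mean(gx * gx), np.mean(gx * gy)],
                  [np.mean(gx * gy), np.mean(gy * gy)]])
    eigvals, eigvecs = np.linalg.eigh(J)
    # Columns of eigvecs give the directions orthogonal to / aligned with
    # dominant image structure; the smoothness term is steered along them.
    return eigvals, eigvecs

# Image with a pure horizontal gradient (vertical structure):
img = np.tile(np.arange(8.0), (8, 1))
vals, vecs = structure_tensor_directions(img)
print(vals)  # one zero eigenvalue (along edges), one nonzero (across them)
```

Penalising flow variation more strongly along the small-eigenvalue direction preserves motion boundaries while letting the fill-in effect act along structures.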
Deep Visual Parsing with Limited Supervision
Scene parsing entails interpretation of the visual world in terms of meaningful semantic concepts. Automatically performing such analysis with machine learning techniques is not a purely scientific endeavour. It holds transformative potential for emerging technologies, such as autonomous driving and robotics, where deploying a human expert can be economically unfeasible or hazardous. Recent methods based on deep learning have made substantial progress towards realising this potential. However, to achieve high accuracy on application-specific formulations of the scene parsing task, such as semantic segmentation, deep learning models require significant amounts of high-quality dense annotation. Obtaining such supervision with human labour is costly and time-consuming. Therefore, reducing the need for precise annotation without sacrificing model accuracy is essential when it comes to deploying these models at scale.
In this dissertation, we advance towards this goal by progressively reducing the amount of required supervision in the context of semantic image segmentation. In this task, we aim to label every pixel in the image with its semantic category. We formulate and implement four novel deep learning techniques operating under varying levels of task supervision:
First, we develop a recurrent model for instance segmentation, which sequentially predicts one object mask at a time. Sequential models have provision for exploiting the temporal context: segmenting prominent instances first may disambiguate mask prediction for hard objects (e.g. due to occlusion) later on. However, such advantageous ordering of prediction is typically unavailable. Our proposed actor-critic framework discovers such orderings and provides empirical accuracy benefits compared to a baseline without such capacity.
Second, we consider weakly supervised semantic segmentation. This problem setting requires the model to produce object masks with only image-level labels available as the training supervision. In contrast to previous works, we approach this problem with a practical single-stage model. Despite its simple design, it produces highly accurate segmentation, competitive with, or even improving upon several multi-stage methods.
Reducing the amount of supervision further, we next study unsupervised domain adaptation. In this scenario, there are no labels available for real-world data. Instead, we may only use the labels of synthetically generated visual scenes. We propose a novel approach, which adapts the segmentation model trained on synthetic data to unlabelled real-world images using pseudo labels.
Crucially, we construct these pseudo annotations by leveraging the equivariance of the semantic segmentation task to similarity transformations. At the time of publication, our adaptation framework achieved state-of-the-art accuracy, in some benchmarks even substantially surpassing that of previous art.
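The equivariance idea above can be sketched in a few lines: for a similarity transform T (here a horizontal flip), a consistent segmentation model should satisfy predict(T(x)) = T(predict(x)), so averaging the two views gives more reliable pseudo labels. This is a hypothetical illustration; `predict` is a stand-in for the segmentation network, not the dissertation's model.

```python
import numpy as np

def predict(x):
    # Placeholder network: per-pixel class scores of shape (H, W, C).
    return np.stack([x, 1.0 - x], axis=-1)

def flip(a):
    return a[:, ::-1]   # horizontal flip (a similarity transformation)

def pseudo_label(x):
    scores = predict(x)
    scores_t = flip(predict(flip(x)))       # predict on T(x), then undo T
    avg = 0.5 * (scores + scores_t)         # transform-consistent scores
    return np.argmax(avg, axis=-1)          # hard pseudo label per pixel

x = np.random.rand(4, 4)
labels = pseudo_label(x)
assert labels.shape == (4, 4)
```

Pixels where the two views disagree can additionally be masked out, so that only transform-consistent predictions supervise the adaptation.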
Last, we present an unsupervised technique for representation learning. We define the desired representation to be useful for the task of video object segmentation, which requires establishing dense object-level correspondences in video sequences.
Learning such features efficiently in a fully convolutional regime is prone to degenerate solutions. Yet our approach circumvents them with a simple and effective mechanism based on the already familiar model equivariance to similarity transformations.
We empirically show that our framework attains new state-of-the-art video segmentation accuracy at a significantly reduced computational cost.