Deep Bilateral Learning for Real-Time Image Enhancement
Performance is a critical challenge in mobile image processing. Given a
reference imaging pipeline, or even human-adjusted pairs of images, we seek to
reproduce the enhancements and enable real-time evaluation. For this, we
introduce a new neural network architecture inspired by bilateral grid
processing and local affine color transforms. Using pairs of input/output
images, we train a convolutional neural network to predict the coefficients of
a locally-affine model in bilateral space. Our architecture learns to make
local, global, and content-dependent decisions to approximate the desired image
transformation. At runtime, the neural network consumes a low-resolution
version of the input image, produces a set of affine transformations in
bilateral space, upsamples those transformations in an edge-preserving fashion
using a new slicing node, and then applies those upsampled transformations to
the full-resolution image. Our algorithm processes high-resolution images on a
smartphone in milliseconds, provides a real-time viewfinder at 1080p
resolution, and matches the quality of state-of-the-art approximation
techniques on a large class of image operators. Unlike previous work, our model
is trained off-line from data and therefore does not require access to the
original operator at runtime. This allows our model to learn complex,
scene-dependent transformations for which no reference implementation is
available, such as the photographic edits of a human retoucher.
Comment: 12 pages, 14 figures, Siggraph 201
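The runtime operation the abstract describes, slicing a low-resolution grid of affine color transforms and applying the sliced transform at every full-resolution pixel, can be sketched as follows. This is a simplified illustration, not the paper's implementation: the grid shape, the luminance guide, and the nearest-neighbour lookup are assumptions for brevity, whereas the actual slicing node uses a learned guidance map and trilinear interpolation so the step stays differentiable and edge-preserving.

```python
import numpy as np

def slice_affine(grid, image):
    """Apply a bilateral grid of local affine color transforms.

    grid:  (gh, gw, gd, 3, 4) array of per-cell 3x4 affine matrices
           (illustrative shape, not taken from the paper).
    image: (H, W, 3) float image with values in [0, 1].
    """
    gh, gw, gd = grid.shape[:3]
    H, W, _ = image.shape
    guide = image.mean(axis=2)                 # toy luminance guide map
    out = np.empty_like(image)
    for y in range(H):
        for x in range(W):
            gy = min(int(y / H * gh), gh - 1)  # spatial grid cell
            gx = min(int(x / W * gw), gw - 1)
            gz = min(int(guide[y, x] * gd), gd - 1)  # range (intensity) cell
            A = grid[gy, gx, gz]               # 3x4 affine transform
            out[y, x] = A[:, :3] @ image[y, x] + A[:, 3]
    return out
```

With a grid of identity transforms the slice is a no-op, which makes the behaviour easy to sanity-check before training anything.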
A review of technical factors to consider when designing neural networks for semantic segmentation of Earth Observation imagery
Semantic segmentation (classification) of Earth Observation imagery is a
crucial task in remote sensing. This paper presents a comprehensive review of
technical factors to consider when designing neural networks for this purpose.
The review focuses on Convolutional Neural Networks (CNNs), Recurrent Neural
Networks (RNNs), Generative Adversarial Networks (GANs), and transformer
models, discussing prominent design patterns for these ANN families and their
implications for semantic segmentation. Common pre-processing techniques for
ensuring optimal data preparation are also covered. These include methods for
image normalization and chipping, as well as strategies for addressing data
imbalance in training samples, and techniques for overcoming limited data,
including augmentation techniques, transfer learning, and domain adaptation. By
encompassing both the technical aspects of neural network design and the
data-related considerations, this review provides researchers and practitioners
with a comprehensive and up-to-date understanding of the factors involved in
designing effective neural networks for semantic segmentation of Earth
Observation imagery.
Comment: 145 pages with 32 figure
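Among the pre-processing steps the review covers, "chipping" cuts a large Earth Observation raster into fixed-size tiles a network can ingest. A minimal sketch, with illustrative chip size and stride (not values from the review); edge regions that would run past the raster are simply skipped here, though real pipelines often pad them instead.

```python
import numpy as np

def chip_image(image, chip_size=256, stride=256):
    """Cut a large raster into fixed-size chips.

    image: (H, W, bands) array.
    A stride smaller than chip_size yields overlapping chips, a common
    way to increase the number of training samples.
    """
    H, W = image.shape[:2]
    chips = []
    for top in range(0, H - chip_size + 1, stride):
        for left in range(0, W - chip_size + 1, stride):
            chips.append(image[top:top + chip_size,
                               left:left + chip_size])
    return chips
```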
Deep Representation Learning with Limited Data for Biomedical Image Synthesis, Segmentation, and Detection
Biomedical imaging requires accurate expert annotation and interpretation that can aid medical staff and clinicians in automating differential diagnosis and treating underlying health conditions. With the advent of deep learning, training on large image datasets has become the standard for reaching expert-level performance in non-invasive biomedical imaging tasks. However, large publicly available datasets are often lacking, which makes it harder to train a deep learning model to learn intrinsic representations. Representation learning with limited data has introduced new learning techniques, such as generative adversarial networks, semi-supervised learning, and self-supervised learning, that can be applied to various biomedical applications. For example, ophthalmologists use color funduscopy (CF) and fluorescein angiography (FA) to diagnose retinal degenerative diseases. However, fluorescein angiography requires injecting a dye, which can cause adverse reactions in patients. To alleviate this, a non-invasive technique is needed that can synthesize fluorescein angiography from fundus images. Similarly, color funduscopy and optical coherence tomography (OCT) are used to semantically segment the vasculature and fluid build-up in spatial and volumetric retinal imaging, which can help with the future prognosis of diseases. Although many automated techniques have been proposed for medical image segmentation, the main drawback is the model's precision in pixel-wise predictions. Another critical challenge in the biomedical imaging field is accurately segmenting and quantifying the dynamic behavior of calcium signals in cells. Calcium imaging is a widely utilized approach to studying subcellular calcium activity and cell function; however, large datasets have created a profound need for fast, accurate, and standardized analyses of calcium signals.
For example, image sequences of calcium signals in colonic pacemaker cells (interstitial cells of Cajal, ICC) suffer from motion artifacts and high periodic and sensor noise, making it difficult to accurately segment and quantify calcium signal events. Moreover, it is time-consuming and tedious to annotate such a large volume of calcium image stacks or videos and extract their associated spatiotemporal maps. To address these problems, we propose various deep representation learning architectures that utilize limited labels and annotations to tackle the critical challenges in these biomedical applications. To this end, we detail our proposed semi-supervised, generative adversarial, and transformer-based architectures for individual learning tasks such as retinal image-to-image translation, vessel and fluid segmentation from fundus and OCT images, breast micro-mass segmentation, and sub-cellular calcium event tracking from videos with spatiotemporal map quantification. We also illustrate two multi-modal multi-task learning frameworks with applications that can be extended to other domains of biomedical research. The main idea is to incorporate each of these as individual modules into our proposed multi-modal frameworks to solve the existing challenges with 1) fluorescein angiography synthesis, 2) retinal vessel and fluid segmentation, 3) breast micro-mass segmentation, and 4) dynamic quantification of calcium imaging datasets.
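The image-to-image translation task mentioned above (synthesizing angiography from fundus images) is typically trained with a composite objective: a pixel-wise reconstruction term plus an adversarial term that rewards fooling a discriminator. The toy function below illustrates that idea only; the weights, names, and exact formulation are assumptions, not the thesis's actual losses.

```python
import numpy as np

def translation_loss(fake_fa, real_fa, disc_score_fake, adv_weight=0.1):
    """Toy fundus-to-angiography translation objective.

    fake_fa, real_fa: synthesized and ground-truth angiography images.
    disc_score_fake:  discriminator's probability that fake_fa is real.
    Combines L1 reconstruction with a non-saturating adversarial term.
    """
    recon = np.mean(np.abs(fake_fa - real_fa))          # pixel fidelity
    adversarial = -np.log(max(disc_score_fake, 1e-8))   # fool the critic
    return recon + adv_weight * adversarial
```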
HandyPose and VehiPose: Pose Estimation of Flexible and Rigid Objects
Pose estimation is an important and challenging task in computer vision. Hand pose estimation has drawn increasing attention during the past decade and has been utilized in a wide range of applications, including augmented reality, virtual reality, human-computer interaction, and action recognition. Hand pose estimation is more challenging than general human body pose estimation due to the large number of degrees of freedom and the frequent occlusions of joints. To address these challenges, we propose HandyPose, a single-pass, end-to-end trainable architecture for hand pose estimation. Adopting an encoder-decoder framework with multi-level features, our method achieves high accuracy in hand pose estimation while maintaining manageable size complexity and modularity of the network. HandyPose takes a multi-scale approach to representing context by incorporating spatial information at various levels of the network to mitigate the loss of resolution due to pooling. Our advanced multi-level waterfall architecture leverages the efficiency of progressive cascade filtering while maintaining larger fields-of-view through the concatenation of multi-level features from different levels of the network in the waterfall module. The decoder incorporates both the waterfall and multi-scale features to generate accurate joint heatmaps in a single stage. Recent developments in computer vision and deep learning have achieved significant progress in human pose estimation, but little of this work has been applied to vehicle pose. We also propose VehiPose, an efficient architecture for vehicle pose estimation based on a multi-scale deep learning approach that achieves high accuracy while maintaining manageable network complexity and modularity. The VehiPose architecture combines an encoder-decoder design with a waterfall atrous convolution module for multi-scale feature representation.
It incorporates contextual information across scales and performs the localization of vehicle keypoints in an end-to-end trainable network. HandyPose uses VehiPose as its baseline and improves performance by incorporating multi-level features from different levels of the backbone and introducing novel multi-level modules. HandyPose and VehiPose more thoroughly leverage image contextual information and address the loss of spatial resolution caused by successive pooling, while maintaining manageable size complexity and modularity of the network and preserving spatial information at various levels. Our results demonstrate state-of-the-art performance on popular datasets and show that HandyPose and VehiPose are robust and efficient architectures for hand and vehicle pose estimation.
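The "waterfall" idea described above can be contrasted with parallel atrous pyramids in a few lines: instead of every branch filtering the same input, each branch filters the previous branch's output, and all branch outputs are concatenated. The 1-D NumPy stand-in below is only a sketch of that cascade pattern; the real module uses 2-D atrous convolutions with learned kernels.

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """'Same'-padded 1-D convolution with a dilation rate
    (toy stand-in for a 2-D atrous convolution)."""
    k = len(kernel)
    pad = (k - 1) * rate // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += kernel[j] * xp[i + j * rate]
    return out

def waterfall(x, kernels, rates):
    """Waterfall cascade: each branch consumes the PREVIOUS branch's
    output (progressively enlarging the field-of-view), and every
    branch output is kept and concatenated."""
    feats, h = [], x
    for kernel, rate in zip(kernels, rates):
        h = dilated_conv1d(h, kernel, rate)
        feats.append(h)
    return np.concatenate(feats)
```

With identity kernels the cascade passes the signal through unchanged, which makes the concatenation structure easy to verify.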
Local-to-Global Information Communication for Real-Time Semantic Segmentation Network Search
Neural Architecture Search (NAS) has shown great potential in automatically
designing neural network architectures for real-time semantic segmentation.
Unlike previous works that utilize a simplified search space with a
cell-sharing strategy, we introduce a new search space in which a lightweight
model can be searched more effectively by replacing the cell-sharing manner
with a cell-independent one. Based on this, local-to-global information
communication is achieved through two well-designed modules. For local
information exchange, a graph convolutional network (GCN) guided module is
seamlessly integrated as a communication mechanism between cells. For global
information aggregation, we propose a novel densely connected fusion module
(cell) that automatically aggregates long-range multi-level features in the
network. In addition, a latency-oriented constraint is incorporated into the
search process to balance accuracy and latency. We name the proposed framework
Local-to-Global Information Communication Network Search (LGCNet). Extensive
experiments on the Cityscapes and CamVid datasets demonstrate that LGCNet
achieves a new state-of-the-art trade-off between accuracy and speed. In
particular, on the Cityscapes dataset, LGCNet achieves a new best performance
of 74.0% mIoU at 115.2 FPS on a Titan Xp.
Comment: arXiv admin note: text overlap with arXiv:1909.0679
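A latency-oriented constraint of the kind the abstract mentions can be illustrated as a scoring rule that trades accuracy against measured latency. The functional form, budget, and weight below are purely illustrative assumptions; LGCNet folds a latency term into its differentiable search objective rather than scoring discrete candidates like this.

```python
def latency_aware_score(miou, latency_ms, budget_ms=10.0, weight=0.05):
    """Toy latency-constrained objective: reward segmentation accuracy
    (mIoU) but penalise candidates exceeding a latency budget."""
    penalty = weight * max(0.0, latency_ms - budget_ms)
    return miou - penalty

def pick_architecture(candidates, budget_ms=10.0):
    """candidates: list of (name, miou, latency_ms) tuples."""
    return max(candidates,
               key=lambda c: latency_aware_score(c[1], c[2], budget_ms))[0]
```

Under such a rule a slightly less accurate but much faster model wins once the slower one blows the budget, which is the trade-off the paper targets.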
Semantic Image Segmentation and Other Dense Per-Pixel Tasks: Practical Approaches
Computer vision-based, deep learning-driven applications and devices are now a part of our everyday life: from modern smartphones with an ever-increasing number of cameras and other sensors to autonomous vehicles such as driverless cars and self-piloting drones. Even though a large portion of the algorithms behind those systems has been known for decades, the computational power and the abundance of labelled data were lacking until recently. Now, following Occam's razor, we should start re-thinking those algorithms and strive towards their further simplification, both to improve our own understanding and to expand the realm of their practical applications. With those goals in mind, in this work we will concentrate on a particular type of computer vision task that predicts a certain quantity of interest for each pixel in the input image – the so-called dense per-pixel tasks. This choice is not by chance: while a huge amount of work has concentrated on per-image tasks such as image classification, with levels of performance reaching nearly 100%, dense per-pixel tasks bring a different set of challenges that traditionally require more computational resources and more complicated approaches. Throughout this thesis, our focus will be on reducing these computational requirements and presenting simple approaches to build practical vision systems that can be used in a variety of settings – e.g. indoors or outdoors, on low-resolution or high-resolution images, solving a single task or multiple tasks at once, running on modern GPU cards or on embedded devices such as the Jetson TX. In the first part of the manuscript we will adapt an existing powerful but slow semantic segmentation network into a faster and competitive one through a manual re-design and analysis of its building blocks. With this approach, we will achieve a nearly 3× decrease in the number of parameters and in the runtime of the network with equally high accuracy.
In the second part we will extend this compact network to solve multiple dense per-pixel tasks at once, still in real time. We will also demonstrate the value of predicting multiple quantities at once, for example by creating a 3D semantic reconstruction of the scene. In the third part, we will move away from manual design and instead rely on reinforcement learning to automatically traverse the search space of compact semantic segmentation architectures. While the majority of architecture search methods are computationally extremely expensive even for image classification, we will present a solution that requires only 2 generic GPU cards. Finally, in the last part we will extend our automatic architecture search solution to discover tiny but still competitive networks with less than 300K parameters, taking only 1.5MB of disk space.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
End-to-End Learning of Semantic Grid Estimation Deep Neural Network with Occupancy Grids
We propose the semantic grid, a spatial 2D map of the environment around an autonomous vehicle, consisting of cells that represent the semantic information of the corresponding region, such as car, road, vegetation, or bikes. It integrates an occupancy grid, which computes the grid states with a Bayesian filter approach, with semantic segmentation information from monocular RGB images, obtained with a deep neural network. The network fuses the two sources of information and can be trained in an end-to-end manner. The output of the neural network is refined with a conditional random field. The proposed method is tested on several datasets (KITTI, Inria-Chroma, and SYNTHIA), and different deep neural network architectures are compared.
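The Bayesian filter step behind an occupancy grid is commonly run in log-odds form, where fusing a new measurement is just an addition per cell. The abstract only states that grid states come from a Bayesian filter, so the sketch below shows the generic textbook update, with an assumed inverse sensor model probability, not the paper's specific formulation.

```python
import math

def logodds(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))

def update_cell(prior_logodds, p_hit):
    """One Bayesian update for a single occupancy-grid cell:
    add the log-odds of the inverse sensor model (p_hit is the
    probability the cell is occupied given the new measurement)."""
    return prior_logodds + logodds(p_hit)

def to_prob(l):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))
```

Starting from an uninformed prior (log-odds 0, i.e. probability 0.5), repeated "occupied" measurements drive the cell's probability toward 1.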
On the Synergies between Machine Learning and Binocular Stereo for Depth Estimation from Images: a Survey
Stereo matching is one of the longest-standing problems in computer vision
with close to 40 years of studies and research. Throughout the years the
paradigm has shifted from local, pixel-level decisions to various forms of
discrete and continuous optimization to data-driven, learning-based methods.
Recently, the rise of machine learning and the rapid proliferation of deep
learning enhanced stereo matching with new exciting trends and applications
unthinkable until a few years ago. Interestingly, the relationship between
these two worlds is two-way. While machine, and especially deep, learning
advanced the state-of-the-art in stereo matching, stereo itself enabled new
ground-breaking methodologies such as self-supervised monocular depth
estimation based on deep networks. In this paper, we review recent research in
the field of learning-based depth estimation from single and binocular images
highlighting the synergies, the successes achieved so far and the open
challenges the community is going to face in the immediate future.
Comment: Accepted to TPAMI. Paper version of our CVPR 2019 tutorial:
"Learning-based depth estimation from stereo and monocular images: successes,
limitations and future challenges"
(https://sites.google.com/view/cvpr-2019-depth-from-image/home)
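The self-supervised monocular depth estimation the survey highlights rests on a simple idea: if predicted disparity is correct, warping one view of a rectified stereo pair reconstructs the other, so the photometric reconstruction error can replace ground-truth depth as the training signal. The 1-D, nearest-neighbour, L1 sketch below is a deliberately minimal illustration of that loss; real methods use bilinear sampling, SSIM terms, and occlusion handling.

```python
import numpy as np

def warp_1d(right, disparity):
    """Reconstruct the left scanline from the right one: for a
    rectified pair, left[x] corresponds to right[x - d(x)].
    Nearest-neighbour sampling with edge clamping."""
    n = len(right)
    idx = np.clip(np.round(np.arange(n) - disparity).astype(int), 0, n - 1)
    return right[idx]

def photometric_loss(left, right, disparity):
    """L1 photometric reconstruction error: the self-supervised
    training signal, needing only the image pair itself."""
    return np.mean(np.abs(left - warp_1d(right, disparity)))
```

A correct disparity drives the loss to zero, while a wrong one leaves a residual, which is exactly the gradient signal such networks train on.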