Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation
Recently, Neural Architecture Search (NAS) has successfully identified neural
network architectures that exceed human-designed ones on large-scale image
classification. In this paper, we study NAS for semantic image segmentation.
Existing works often focus on searching the repeatable cell structure, while
hand-designing the outer network structure that controls the spatial resolution
changes. This choice simplifies the search space but becomes increasingly
problematic for dense image prediction, which exhibits many more network-level
architectural variations. Therefore, we propose to search the network-level
structure in addition to the cell-level structure, which forms a hierarchical
architecture search space. We present a network level search space that
includes many popular designs, and develop a formulation that allows efficient
gradient-based architecture search (3 P100 GPU days on Cityscapes images). We
demonstrate the effectiveness of the proposed method on the challenging
Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Auto-DeepLab, our
architecture searched specifically for semantic image segmentation, attains
state-of-the-art performance without any ImageNet pretraining.
Comment: To appear in CVPR 2019 as oral. Code for Auto-DeepLab released at
https://github.com/tensorflow/models/tree/master/research/deepla
Real-time Ultrasound-enhanced Multimodal Imaging of Tongue using 3D Printable Stabilizer System: A Deep Learning Approach
Despite renewed awareness of the importance of articulation, it remains a
challenge for instructors to handle the pronunciation needs of language
learners. Pedagogical tools for pronunciation teaching and learning remain
relatively scarce. In contrast to inefficient traditional pronunciation
instruction, such as listening and repeating, new approaches employ electronic
visual feedback (EVF) systems such as ultrasound technology.
Recently, an ultrasound-enhanced multimodal method has been developed for
visualizing tongue movements of a language learner overlaid on the face-side of
the speaker's head. That system was evaluated for several language courses via
a blended learning paradigm at the university level. The results indicated
that visualizing the articulators as biofeedback significantly improves
articulation learning efficiency. Despite the successful use of multimodal
techniques for pronunciation training, previous systems still require manual
work and human intervention. In this article, we contribute to this growing
body of research by addressing the difficulties of previous approaches: we
propose a comprehensive, automatic, real-time multimodal pronunciation
training system that benefits from powerful artificial
intelligence techniques. The main objective of this research was to combine the
advantages of ultrasound technology, three-dimensional printing, and deep
learning algorithms to enhance the performance of previous systems. Our
preliminary pedagogical evaluation of the proposed system revealed a
significant improvement in flexibility, control, robustness, and autonomy.
Comment: 12 figures, 1 table
MultiNet++: Multi-Stream Feature Aggregation and Geometric Loss Strategy for Multi-Task Learning
Multi-task learning is commonly used in autonomous driving for solving
various visual perception tasks. It offers significant benefits in terms of
both performance and computational complexity. Current work on multi-task
learning networks focuses on processing a single input image, and there is no
known implementation of multi-task learning that handles a sequence of images. In
this work, we propose a multi-stream multi-task network to take advantage of
using feature representations from preceding frames in a video sequence for
joint learning of segmentation, depth, and motion. The weights of the current
and previous encoder are shared so that features computed in the previous frame
can be leveraged without additional computation. In addition, we propose to use
the geometric mean of task losses as a better alternative to the weighted
average of task losses. The proposed loss function facilitates better handling
of the difference in convergence rates of different tasks. Experimental results
on KITTI, Cityscapes and SYNTHIA datasets demonstrate that the proposed
strategies outperform various existing multi-task learning solutions.
Comment: Accepted for CVPR 2019 Workshop on Autonomous Driving (WAD). Demo
video can be accessed at https://youtu.be/E378PzLq7l
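The geometric-mean combination of task losses described in the abstract above can be sketched as follows; the function name and the plain-Python, log-space form are illustrative assumptions, not the authors' implementation:

```python
import math

def geometric_mean_loss(task_losses):
    """Combine per-task losses via their geometric mean, computed in
    log-space for numerical stability. Assumes every loss is strictly
    positive."""
    return math.exp(sum(math.log(l) for l in task_losses) / len(task_losses))
```

Unlike a weighted average, the geometric mean is scale-balanced: a task whose loss is already small contributes proportionally less, which matches the abstract's motivation of handling different convergence rates across tasks.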
Online Normalization for Training Neural Networks
Online Normalization is a new technique for normalizing the hidden
activations of a neural network. Like Batch Normalization, it normalizes the
sample dimension. While Online Normalization does not use batches, it is as
accurate as Batch Normalization. We resolve a theoretical limitation of Batch
Normalization by introducing an unbiased technique for computing the gradient
of normalized activations. Online Normalization works with automatic
differentiation by adding statistical normalization as a primitive. This
technique can be used in cases not covered by some other normalizers, such as
recurrent networks, fully connected networks, and networks with activation
memory requirements prohibitive for batching. We show its applications to image
classification, image segmentation, and language modeling. We present formal
proofs and experimental results on ImageNet, CIFAR, and PTB datasets.
Comment: Published at the Conference on Neural Information Processing Systems
(NeurIPS 2019), Vancouver, Canada. Code:
https://github.com/Cerebras/online-normalizatio
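The batch-free statistic tracking the abstract describes can be illustrated with a minimal sketch. The class name, the decay value, and the use of running mean/variance only (omitting the paper's unbiased gradient correction) are simplifying assumptions of this illustration, not the released code:

```python
import math

class OnlineNorm1D:
    """Forward-pass sketch of online normalization: per-feature running
    statistics are updated one sample at a time, so no batch is needed."""

    def __init__(self, num_features, decay=0.99, eps=1e-5):
        self.mu = [0.0] * num_features   # running mean per feature
        self.var = [1.0] * num_features  # running variance per feature
        self.decay = decay
        self.eps = eps

    def __call__(self, x):
        # normalize with the statistics accumulated so far
        y = [(xi - m) / math.sqrt(v + self.eps)
             for xi, m, v in zip(x, self.mu, self.var)]
        # then fold the new sample into the running statistics
        self.mu = [self.decay * m + (1 - self.decay) * xi
                   for m, xi in zip(self.mu, x)]
        self.var = [self.decay * v + (1 - self.decay) * (xi - m) ** 2
                    for v, m, xi in zip(self.var, self.mu, x)]
        return y
```

Because each call consumes a single sample, the same layer works for recurrent networks and other settings where batching is impractical, as the abstract notes.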
Identity-Preserving Realistic Talking Face Generation
Speech-driven facial animation is useful for a variety of applications such
as telepresence, chatbots, etc. The necessary attributes of having a realistic
face animation are (1) audio-visual synchronization, (2) identity preservation of
the target individual, (3) plausible mouth movements, and (4) presence of natural
eye blinks. The existing methods mostly address audio-visual lip
synchronization, and few recent works have addressed the synthesis of natural
eye blinks for overall video realism. In this paper, we propose a method for
identity-preserving realistic facial animation from speech. We first generate
person-independent facial landmarks from audio using DeepSpeech features for
invariance to different voices, accents, etc. To add realism, we impose eye
blinks on the facial landmarks using unsupervised learning and retarget the
person-independent landmarks to person-specific landmarks to preserve the
identity-related facial structure, which helps in the generation of plausible
mouth shapes of the target identity. Finally, we use LSGAN to generate the
facial texture from person-specific facial landmarks, using an attention
mechanism that helps to preserve identity-related texture. An extensive
comparison of our proposed method with the current state-of-the-art methods
demonstrates a significant improvement in terms of lip synchronization
accuracy, image reconstruction quality, sharpness, and identity-preservation. A
user study also reveals improved realism of our animation results over the
state-of-the-art methods. To the best of our knowledge, this is the first work
in speech-driven 2D facial animation that simultaneously addresses all the
above-mentioned attributes of a realistic speech-driven face animation.
Comment: Accepted in IJCNN 202
Left Ventricle Segmentation and Quantification from Cardiac Cine MR Images via Multi-task Learning
Segmentation of the left ventricle and quantification of various cardiac
contractile functions is crucial for the timely diagnosis and treatment of
cardiovascular diseases. Traditionally, the two tasks have been tackled
independently. Here we propose a convolutional neural network based multi-task
learning approach to perform both tasks simultaneously, such that the network
learns a better representation of the data with improved generalization
performance. A probabilistic formulation of the problem enables learning the task
uncertainties during the training, which are used to automatically compute the
weights for the tasks. We performed a five fold cross-validation of the
myocardium segmentation obtained from the proposed multi-task network on 97
patient 4-dimensional cardiac cine-MRI datasets available through the STACOM LV
segmentation challenge against the provided gold-standard myocardium
segmentation, obtaining a Dice overlap of and mean surface
distance of mm, while simultaneously estimating the
myocardial area with mean absolute difference error of mm.
Comment: STACOM 2018 Workshop, MICCAI 201
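The automatic task weighting from learned uncertainties described above can be sketched with a Kendall-style log-variance loss; the exact form and all names here are illustrative assumptions, not necessarily the authors' formulation:

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses using one learnable log-variance per
    task. Higher uncertainty (log_var) down-weights a task's loss,
    while the +log_var term penalizes ignoring the task entirely."""
    total = 0.0
    for loss, log_var in zip(task_losses, log_vars):
        total += math.exp(-log_var) * loss + log_var
    return total
```

In training, the `log_vars` would be optimized jointly with the network weights, so the task weights are computed automatically rather than tuned by hand, as the abstract states.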
Topic Identification for Speech without ASR
Modern topic identification (topic ID) systems for speech use automatic
speech recognition (ASR) to produce speech transcripts, and perform supervised
classification on such ASR outputs. However, under resource-limited conditions,
the manually transcribed speech required to develop standard ASR systems can be
severely limited or unavailable. In this paper, we investigate alternative
unsupervised solutions to obtaining tokenizations of speech in terms of a
vocabulary of automatically discovered word-like or phoneme-like units, without
depending on the supervised training of ASR systems. Moreover, using automatic
phoneme-like tokenizations, we demonstrate that a convolutional neural network
based framework for learning spoken document representations provides
competitive performance compared to a standard bag-of-words representation, as
evidenced by comprehensive topic ID evaluations on both single-label and
multi-label classification tasks.
Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
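The bag-of-words baseline over automatically discovered units, against which the CNN representations are compared, can be sketched as a simple count vector; the function and vocabulary handling are illustrative assumptions:

```python
def bag_of_words(doc_tokens, vocab):
    """Count-based spoken-document representation over a vocabulary of
    automatically discovered word-like or phoneme-like units: one count
    per vocabulary entry, out-of-vocabulary tokens ignored."""
    index = {unit: i for i, unit in enumerate(vocab)}
    vec = [0] * len(vocab)
    for tok in doc_tokens:
        if tok in index:
            vec[index[tok]] += 1
    return vec
```

A topic-ID classifier can then be trained on such vectors without any supervised ASR, which is the resource-limited setting the abstract targets.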
The application of deep convolutional neural networks to ultrasound for modelling of dynamic states within human skeletal muscle
This paper concerns the fully automatic direct in vivo measurement of active
and passive dynamic skeletal muscle states using ultrasound imaging. Despite
the long-standing medical need (myopathies, neuropathies, pain, injury,
ageing), current technology (electromyography, dynamometry, shear-wave
imaging) provides no general, non-invasive method for online estimation of
skeletal intramuscular states. Ultrasound provides a technology in which static
and dynamic muscle states can be observed non-invasively, yet current
computational image understanding approaches are inadequate. We propose a new
approach in which deep learning methods are used for understanding the content
of ultrasound images of muscle in terms of its measured state. Ultrasound data
synchronized with electromyography of the calf muscles, with measures of joint
torque/angle were recorded from 19 healthy participants (6 female, ages: 30 ±
7.7). A segmentation algorithm previously developed by our group was applied to
extract a region of interest of the medial gastrocnemius. Then a deep
convolutional neural network was trained to predict the measured states (joint
angle/torque, electromyography) directly from the segmented images. Results
revealed for the first time that active and passive muscle states can be
measured directly from standard b-mode ultrasound images, accurately predicting
for a held out test participant changes in the joint angle, electromyography,
and torque with errors as low as 0.022°, 0.0001 V, and 0.256 Nm (root mean
square error), respectively.
Comment: paper in preparation for submission to IEEE TM
Point Cloud Oversegmentation with Graph-Structured Deep Metric Learning
We propose a new supervised learning framework for oversegmenting 3D point
clouds into superpoints. We cast this problem as learning deep embeddings of
the local geometry and radiometry of 3D points, such that the border of objects
presents high contrasts. The embeddings are computed using a lightweight neural
network operating on the points' local neighborhood. Finally, we formulate
point cloud oversegmentation as a graph partition problem with respect to the
learned embeddings.
This new approach allows us to set a new state-of-the-art in point cloud
oversegmentation by a significant margin, on a dense indoor dataset (S3DIS) and
a sparse outdoor one (vKITTI). Our best solution requires over five times fewer
superpoints to reach performance similar to that of previously published methods
on S3DIS. Furthermore, we show that our framework can be used to improve
superpoint-based semantic segmentation algorithms, setting a new
state-of-the-art for this task as well.
Comment: CVPR201
A Gentle Introduction to Deep Learning in Medical Image Processing
This paper tries to give a gentle introduction to deep learning in medical
image processing, proceeding from theoretical foundations to applications. We
first discuss general reasons for the popularity of deep learning, including
several major breakthroughs in computer science. Next, we start reviewing the
fundamental basics of the perceptron and neural networks, along with some
fundamental theory that is often omitted. Doing so allows us to understand the
reasons for the rise of deep learning in many application domains. Medical
image processing is, of course, one of the areas that has been strongly affected
by this rapid progress, in particular in image detection and recognition, image
segmentation, image registration, and computer-aided diagnosis. There are also
recent trends in physical simulation, modelling, and reconstruction that have
led to astonishing results. Yet, some of these approaches neglect prior
knowledge and hence bear the risk of producing implausible results. These
apparent weaknesses highlight current limitations of deep learning. However, we
also briefly discuss promising approaches that might be able to resolve these
problems in the future.
Comment: Accepted by Journal of Medical Physics; Final Version after review