High frame-rate cardiac ultrasound imaging with deep learning
Cardiac ultrasound imaging requires a high frame rate in order to capture
rapid motion. This can be achieved by multi-line acquisition (MLA), where
several narrow-focused received lines are obtained from each wide-focused
transmitted line. This shortens the acquisition time at the expense of
introducing block artifacts. In this paper, we propose a data-driven
learning-based approach to improve the MLA image quality. We train an
end-to-end convolutional neural network on pairs of real ultrasound cardiac
data, acquired through MLA and the corresponding single-line acquisition (SLA).
The network achieves a significant improvement in image quality for both of the
MLA factors evaluated, yielding a decorrelation measure similar to that of SLA
while retaining the frame rate of MLA.
Comment: To appear in the Proceedings of MICCAI, 201
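The abstract does not define its decorrelation measure; one plausible instantiation, Pearson correlation between adjacent lateral image lines (the function name `adjacent_line_correlation` and this exact metric are illustrative assumptions, not the paper's definition), can be sketched as follows. Block artifacts from MLA would show up as correlation drops at block boundaries, whereas SLA gives a smoother profile.

```python
import numpy as np

def adjacent_line_correlation(frame: np.ndarray) -> np.ndarray:
    """Pearson correlation between each pair of adjacent lateral lines.

    frame: 2-D array (depth x lines). Returns one correlation value per
    adjacent line pair; MLA block boundaries appear as dips in this profile.
    """
    lines = frame.T  # one row per lateral line
    corrs = []
    for a, b in zip(lines[:-1], lines[1:]):
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
        corrs.append(float((a * b).sum() / denom) if denom else 0.0)
    return np.array(corrs)

# Sanity check: a frame whose lines are all identical is perfectly correlated.
rng = np.random.default_rng(0)
base = rng.standard_normal(64)
smooth = np.stack([base] * 8, axis=1)  # 8 identical lateral lines
print(adjacent_line_correlation(smooth).round(3))  # all ones
```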
Improving the Performance of Online Neural Transducer Models
Having a sequence-to-sequence model which can operate in an online fashion is
important for streaming applications such as Voice Search. Neural transducer is
a streaming sequence-to-sequence model, but has shown a significant degradation
in performance compared to non-streaming models such as Listen, Attend and
Spell (LAS). In this paper, we present several improvements to the neural
transducer (NT).
Specifically, we look at increasing the window over which NT computes
attention, mainly by looking backwards in time so the model still remains
online. In addition, we explore initializing a NT model from a LAS-trained
model so that it is guided by a better alignment. Finally, we explore
incorporating stronger language modeling, such as using wordpiece models and
applying an external LM during the beam search. On a Voice Search task, we
find that with these improvements NT can match the performance of LAS.
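Applying an external LM during beam search is commonly realized as shallow fusion: a weighted LM log-probability is added to the model's score for each hypothesis. The abstract does not give the exact fusion rule, so the sketch below (including `fused_score`, `rescore_beam`, and the toy hypotheses) is an illustrative assumption:

```python
import math

def fused_score(asr_logp: float, lm_logp: float, lm_weight: float = 0.3) -> float:
    """Shallow fusion: interpolate model and external-LM log-probabilities."""
    return asr_logp + lm_weight * lm_logp

def rescore_beam(hyps, lm, lm_weight=0.3):
    """hyps: list of (token_tuple, asr_logp); lm: token_tuple -> LM log-prob."""
    scored = [(toks, fused_score(lp, lm.get(toks, math.log(1e-6)), lm_weight))
              for toks, lp in hyps]
    return sorted(scored, key=lambda x: -x[1])  # best fused score first

# The LM rescues the acoustically slightly-worse but far more likely hypothesis.
hyps = [(("play", "music"), -1.2), (("pay", "music"), -1.1)]
lm = {("play", "music"): math.log(0.2), ("pay", "music"): math.log(0.001)}
best = rescore_beam(hyps, lm, lm_weight=0.5)[0][0]
print(best)  # ('play', 'music')
```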
Exact Hard Monotonic Attention for Character-Level Transduction
Many common character-level, string-to-string transduction tasks, e.g.,
grapheme-to-phoneme conversion and morphological inflection, consist almost
exclusively of monotonic transductions. However, neural sequence-to-sequence
models that use non-monotonic soft attention often outperform popular monotonic
models. In this work, we ask the following question: Is monotonicity really a
helpful inductive bias for these tasks? We develop a hard attention
sequence-to-sequence model that enforces strict monotonicity and learns a
latent alignment jointly while learning to transduce. With the help of dynamic
programming, we are able to compute the exact marginalization over all
monotonic alignments. Our models achieve state-of-the-art performance on
morphological inflection. Furthermore, we find strong performance on two other
character-level transduction tasks. Code is available at
https://github.com/shijie-wu/neural-transducer.
Comment: ACL 201
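The exact marginalization over all monotonic alignments can be computed with a forward-style dynamic program. The sketch below is a simplified variant under assumed uniform (unnormalized) transition weights, not the paper's full model: with `emit[t, j]` the probability of emitting target symbol `t` while attending to source position `j`, a cumulative sum collapses the sum over all earlier alignment points at each step.

```python
import numpy as np

def monotonic_marginal(emit: np.ndarray) -> float:
    """Sum emission products over all monotone non-decreasing alignments.

    emit[t, j] = p(y_t | attend to source position j).  Transition weights
    are taken as 1 for clarity; a real model would include learned
    transition probabilities inside the recursion.
    """
    T, S = emit.shape
    alpha = emit[0].copy()            # first target symbol may align anywhere
    for t in range(1, T):
        prefix = np.cumsum(alpha)     # sum over all previous positions <= j
        alpha = emit[t] * prefix      # extend every monotone alignment
    return float(alpha.sum())

emit = np.array([[0.5, 0.5],
                 [0.5, 0.5]])
# Monotone alignments for T=2, S=2: (0,0), (0,1), (1,1) -> 3 * 0.25 = 0.75
print(monotonic_marginal(emit))  # 0.75
```

The cumulative sum is what makes the marginalization exact yet cheap: each step costs O(S) instead of summing over all alignment pairs.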
Efficient Training of Neural Transducer for Speech Recognition
As one of the most popular sequence-to-sequence modeling approaches for
speech recognition, the RNN-Transducer has achieved steadily improving
performance with increasingly sophisticated neural network models of growing
size and longer training schedules. While strong computation resources seem to
be a prerequisite for training superior models, we try to overcome this
limitation by carefully
designing a more efficient training pipeline. In this work, we propose an
efficient 3-stage progressive training pipeline to build highly-performing
neural transducer models from scratch with very limited computation resources
in a reasonably short time. The effectiveness of each stage is
experimentally verified on both Librispeech and Switchboard corpora. The
proposed pipeline is able to train transducer models approaching
state-of-the-art performance with a single GPU in just 2-3 weeks. Our best
conformer transducer achieves 4.1% WER on Librispeech test-other with only 35
epochs of training.
Comment: accepted at Interspeech 202
Improved Noisy Student Training for Automatic Speech Recognition
Recently, a semi-supervised learning method known as "noisy student training"
has been shown to improve image classification performance of deep networks
significantly. Noisy student training is an iterative self-training method that
leverages augmentation to improve network performance. In this work, we adapt
and improve noisy student training for automatic speech recognition, employing
(adaptive) SpecAugment as the augmentation method. We find effective methods to
filter, balance and augment the data generated in between self-training
iterations. By doing so, we are able to obtain word error rates (WERs) of
4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h
subset of LibriSpeech as the supervised set and the rest (860h) as the
unlabeled set. Furthermore, we are able to achieve WERs of 1.7%/3.4% on the
clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight
as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the
previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h
(4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
Comment: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added
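The iterative self-training loop described above (a teacher pseudo-labels the unlabeled audio, the labels are filtered, and a new student is trained) can be sketched with stubbed components. Everything here (`train`, `transcribe`, `keep`, the toy data) is an illustrative stand-in, and SpecAugment-style augmentation plus data balancing are omitted for brevity:

```python
def noisy_student_asr(labeled, unlabeled, train, transcribe, keep, rounds=2):
    """Noisy-student-style self-training loop (toy sketch).

    labeled:   list of (audio, transcript) pairs
    unlabeled: list of audio examples
    train:     callable building a model from (audio, transcript) pairs
    transcribe: callable producing a pseudo-transcript with the current model
    keep:      filter rejecting low-quality pseudo-labeled pairs
    """
    model = train(labeled)                              # initial teacher
    for _ in range(rounds):
        pseudo = [(x, transcribe(model, x)) for x in unlabeled]
        pseudo = [pair for pair in pseudo if keep(pair)]  # filter pseudo-labels
        model = train(labeled + pseudo)                 # train next student
    return model

# Toy instantiation: the "model" is just the size of its training set.
train = lambda data: len(data)
transcribe = lambda model, x: x.upper()
keep = lambda pair: len(pair[1]) > 2   # drop very short pseudo-transcripts
model = noisy_student_asr([("a", "A")], ["hello", "hi"], train, transcribe, keep)
print(model)  # 2: one labeled pair plus the one surviving pseudo-labeled pair
```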