Deep Burst Denoising
Noise is an inherent issue of low-light image capture, one which is
exacerbated on mobile devices due to their narrow apertures and small sensors.
One strategy for mitigating noise in a low-light situation is to increase the
shutter time of the camera, thus allowing each photosite to integrate more
light and decrease noise variance. However, there are two downsides of long
exposures: (a) bright regions can exceed the sensor range, and (b) camera and
scene motion will result in blurred images. Another way of gathering more light
is to capture multiple short (thus noisy) frames in a "burst" and intelligently
integrate the content, thus avoiding the above downsides. In this paper, we use
the burst-capture strategy and implement the intelligent integration via a
recurrent fully convolutional deep neural net (CNN). We build our novel,
multiframe architecture to be a simple addition to any single-frame denoising
model, and design it to handle an arbitrary number of noisy input frames. We
show that it achieves state-of-the-art denoising results on our burst dataset,
improving on the best published multi-frame techniques, such as VBM4D and
FlexISP. Finally, we explore other applications of image enhancement by
integrating content from multiple frames and demonstrate that our DNN
architecture generalizes well to image super-resolution.
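The core idea behind burst integration can be illustrated without any learned model: recurrently blending each incoming noisy frame into a running estimate shrinks the noise variance roughly in proportion to the number of frames. The sketch below is a hypothetical stand-in (a plain running mean, not the paper's recurrent CNN) that shows why accumulating short exposures denoises:

```python
import random

def integrate_burst(frames):
    """Recurrently fold each noisy frame into a running estimate.
    This is a simple running mean, a toy stand-in for the learned
    recurrent integration; it handles any number of input frames."""
    estimate = None
    for i, frame in enumerate(frames, start=1):
        if estimate is None:
            estimate = list(frame)
        else:
            # Running-mean update: estimate += (frame - estimate) / i
            estimate = [e + (f - e) / i for e, f in zip(estimate, frame)]
    return estimate

# Toy demo: a constant 64-"pixel" scene observed through additive Gaussian noise.
random.seed(0)
scene = [0.5] * 64
def noisy():
    return [p + random.gauss(0, 0.1) for p in scene]

single = noisy()                                  # one short, noisy exposure
burst = integrate_burst([noisy() for _ in range(8)])  # an 8-frame burst
mse = lambda x: sum((a - b) ** 2 for a, b in zip(x, scene)) / len(scene)
```

Averaging 8 independent frames cuts the noise variance roughly 8-fold, which is the statistical benefit the learned model exploits while also handling motion and saturation that a plain mean cannot.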
Massive End-to-end Models for Short Search Queries
In this work, we investigate two popular end-to-end automatic speech
recognition (ASR) models, namely Connectionist Temporal Classification (CTC)
and RNN-Transducer (RNN-T), for offline recognition of voice search queries,
with up to 2B model parameters. The encoders of our models use the neural
architecture of Google's universal speech model (USM), with additional funnel
pooling layers to significantly reduce the frame rate and speed up training and
inference. We perform extensive studies on vocabulary size, time-reduction
strategy, and generalization performance on long-form test sets. Despite the
speculation that, as model size increases, CTC can be as good as RNN-T, which
builds label dependency into its predictions, we observe that a 900M RNN-T
clearly outperforms a 1.8B CTC and is more tolerant of severe time reduction,
although the WER gap can be largely removed by LM shallow fusion.
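The frame-rate reduction described above can be sketched with a toy pooling function. This is an illustrative assumption, not Google's USM funnel-pooling implementation: each stage averages windows of consecutive encoder frames, so stacking stages compounds the reduction and shrinks the sequence the decoder must attend over:

```python
def funnel_pool(frames, stride=2):
    """Reduce the frame rate by averaging each window of `stride`
    consecutive feature frames (a simple stand-in for a funnel
    pooling layer). Halving the frame rate roughly halves the
    downstream attention and decoding cost."""
    pooled = []
    for i in range(0, len(frames) - stride + 1, stride):
        window = frames[i:i + stride]
        # Element-wise mean across the window's feature vectors.
        pooled.append([sum(vals) / stride for vals in zip(*window)])
    return pooled

# 8 frames of 4-dim features -> 4 frames after one 2x stage,
# -> 2 frames after a second stage (a 4x total time reduction).
frames = [[float(t)] * 4 for t in range(8)]
once = funnel_pool(frames)
twice = funnel_pool(once)
```

The paper's observation that RNN-T is "more tolerant of severe time reduction" corresponds to pushing this kind of reduction to larger total factors without losing accuracy.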
End-to-end neural segmental models for speech recognition
Segmental models are an alternative to frame-based models for sequence
prediction, where hypothesized path weights are based on entire segment scores
rather than a single frame at a time. Neural segmental models are segmental
models that use neural network-based weight functions. Neural segmental models
have achieved competitive results for speech recognition, and their end-to-end
training has been explored in several studies. In this work, we review neural
segmental models, which can be viewed as consisting of a neural network-based
acoustic encoder and a finite-state transducer decoder. We study end-to-end
segmental models with different weight functions, including ones based on
frame-level neural classifiers and on segmental recurrent neural networks. We
study how reducing the search space size impacts performance under different
weight functions. We also compare several loss functions for end-to-end
training. Finally, we explore training approaches, including multi-stage vs.
end-to-end training and multitask training that combines segmental and
frame-level losses.
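The segment-vs-frame distinction can be made concrete with a small decoder. The sketch below is a minimal assumption-laden illustration, not the paper's models: it scores each hypothesized segment from its entire span of frames (here, summed frame scores minus a fixed segment cost, standing in for a learned segmental weight function), and a `max_seg_len` cap plays the role of the search-space reduction studied in the paper:

```python
import math

def segmental_decode(frame_scores, max_seg_len=3, seg_cost=0.5):
    """Find the best segmentation and labeling by dynamic programming.
    Each candidate segment (s, t, label) is weighted over its whole
    span of frames; `max_seg_len` prunes the search space and
    `seg_cost` is a hypothetical per-segment penalty favoring
    longer segments."""
    T = len(frame_scores)
    labels = range(len(frame_scores[0]))
    best = [(-math.inf, None)] * (T + 1)
    best[0] = (0.0, None)
    for t in range(1, T + 1):
        for d in range(1, min(max_seg_len, t) + 1):
            s = t - d  # segment covers frames s..t-1
            for lab in labels:
                # Segment weight: sum of frame-level scores minus a cost.
                w = sum(frame_scores[u][lab] for u in range(s, t)) - seg_cost
                cand = best[s][0] + w
                if cand > best[t][0]:
                    best[t] = (cand, (s, lab))
    # Backtrace the winning segmentation.
    segs, t = [], T
    while t > 0:
        s, lab = best[t][1]
        segs.append((s, t, lab))
        t = s
    return list(reversed(segs))

# Toy input: 4 frames, 2 labels; label 0 dominates frames 0-1, label 1 frames 2-3.
scores = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
```

Running `segmental_decode(scores)` recovers two segments, one per label, rather than four single-frame ones, which is exactly the behavior a per-frame decoder cannot express directly.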
- …