End-to-End Speech Recognition From the Raw Waveform
State-of-the-art speech recognition systems rely on fixed, hand-crafted
features such as mel-filterbanks to preprocess the waveform before the training
pipeline. In this paper, we study end-to-end systems trained directly from the
raw waveform, building on two alternatives for trainable replacements of
mel-filterbanks that use a convolutional architecture. The first one is
inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015),
and the second one by the scattering transform (Zeghidour et al., 2017). We
propose two modifications to these architectures and systematically compare
them to mel-filterbanks, on the Wall Street Journal dataset. The first
modification is the addition of an instance normalization layer, which greatly
improves on the gammatone-based trainable filterbanks and speeds up the
training of the scattering-based filterbanks. The second one relates to the
low-pass filter used in these approaches. These modifications consistently
improve performance for both approaches, and remove the need for a careful
initialization in scattering-based trainable filterbanks. In particular, we
show a consistent improvement in word error rate of the trainable filterbanks
relative to comparable mel-filterbanks. This is the first time end-to-end
models trained from the raw signal significantly outperform mel-filterbanks on
a large vocabulary task under clean recording conditions.
Comment: Accepted for presentation at Interspeech 201
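The instance normalization layer described above can be sketched in a few lines: each filterbank channel is normalized to zero mean and unit variance over the time axis, independently per utterance. This is a minimal numpy illustration of the operation, not the paper's trainable front-end; the array sizes and the gamma-distributed toy "energies" are hypothetical.

```python
import numpy as np

def instance_norm(features, eps=1e-6):
    """Normalize each filterbank channel to zero mean and unit
    variance over the time axis (instance normalization).

    features: (channels, frames) array of filterbank outputs.
    """
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    return (features - mean) / (std + eps)

# Toy example: 40 channels, 100 frames of non-negative "energies"
# (hypothetical stand-in for a learned filterbank's output).
rng = np.random.default_rng(0)
feats = rng.gamma(shape=2.0, scale=1.0, size=(40, 100))
normed = instance_norm(feats)
```

Because the statistics are computed per utterance and per channel, the layer removes channel-dependent offset and scale, which is one plausible reason it stabilizes training of the filterbank parameters.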
Learning to detect dysarthria from raw speech
Speech classifiers of paralinguistic traits traditionally learn from diverse
hand-crafted low-level features, by selecting the relevant information for the
task at hand. We explore an alternative to this selection by jointly learning
the classifier and the feature extraction. Recent work on speech recognition
has shown improved performance over speech features by learning from the
waveform. We extend this approach to paralinguistic classification and propose
a neural network that can learn a filterbank, a normalization factor and a
compression power from the raw speech, jointly with the rest of the
architecture. We apply this model to dysarthria detection from sentence-level
audio recordings. Starting from a strong attention-based baseline on which
mel-filterbanks outperform standard low-level descriptors, we show that
learning the filters or the normalization and compression improves over fixed
features by 10% absolute accuracy. We also observe a gain over OpenSmile
features by jointly learning the feature extraction, the normalization, and the
compression factor with the architecture. This constitutes a first attempt at
jointly learning all these operations from raw audio for a speech
classification task.
Comment: 5 pages, 3 figures, submitted to ICASS
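The per-channel normalization factor and compression power mentioned above can be illustrated with a small numpy sketch. In the paper these parameters are learned jointly with the classifier; here they are fixed arrays with hypothetical values, applied to non-negative filterbank energies as a scaled power compression.

```python
import numpy as np

def normalize_and_compress(features, scale, power, eps=1e-6):
    """Apply a per-channel normalization factor and a compression
    power to non-negative filterbank energies.

    features: (channels, frames); scale, power: (channels, 1).
    """
    return np.power(features * scale + eps, power)

# Hypothetical values, for illustration only; in the paper both
# arrays are trained with the rest of the network.
rng = np.random.default_rng(1)
energies = rng.gamma(shape=2.0, scale=1.0, size=(40, 100))
scale = np.full((40, 1), 0.5)   # normalization factors
power = np.full((40, 1), 0.3)   # compression exponents
compressed = normalize_and_compress(energies, scale, power)
```

With an exponent below 1, the transform compresses the dynamic range of the energies, which is the same role the fixed log or cube-root compression plays in hand-crafted front-ends.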
RawNet: Fast End-to-End Neural Vocoder
Neural network based vocoders have recently demonstrated a powerful
ability to synthesize high-quality speech. These models usually generate
samples by conditioning on spectral features, such as the Mel-spectrum.
However, these features are extracted by a speech analysis module that
includes processing based on human knowledge. In this work, we propose RawNet,
a truly end-to-end neural vocoder, which uses a coder network to learn a
higher-level representation of the signal, and an autoregressive voder network
to generate speech sample by sample. The coder and voder together act like an
auto-encoder network, and can be jointly trained directly on the raw waveform
without any human-designed features. Experiments on copy-synthesis
tasks show that RawNet achieves synthesized speech quality comparable to that
of LPCNet, with a smaller model architecture and faster speech generation at
inference time.
Comment: Submitted to Interspeech 2019, Graz, Austri
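The coder/voder split described above can be sketched schematically: the coder maps a waveform frame to a compact code, and the voder regenerates the frame one sample at a time, conditioned on the code and the previous sample. This is a toy numpy sketch with random, untrained linear layers; all sizes and weights are hypothetical, whereas RawNet itself uses convolutional and autoregressive neural networks trained end to end.

```python
import numpy as np

rng = np.random.default_rng(2)
frame_len, code_dim = 64, 8

W_enc = rng.normal(size=(code_dim, frame_len)) * 0.1  # coder weights (random)
W_dec = rng.normal(size=(code_dim + 1,)) * 0.1        # voder weights (random)

frame = rng.normal(size=frame_len)   # one raw-waveform frame
code = np.tanh(W_enc @ frame)        # coder: frame -> compact code

# Voder: autoregressive, sample-by-sample generation conditioned
# on the code and the previously generated sample.
out = np.zeros(frame_len)
prev = 0.0
for t in range(frame_len):
    inp = np.concatenate([code, [prev]])
    prev = float(np.tanh(inp @ W_dec))
    out[t] = prev
```

Training would minimize a reconstruction loss between `frame` and `out`, which is what makes the coder/voder pair behave like an auto-encoder on the raw waveform.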