1,005 research outputs found
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
Unsupervised crosslingual adaptation of tokenisers for spoken language recognition
Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary
phonetic information. We present a study on the use of deep neural
network tokenisers. Unsupervised crosslingual adaptation was performed to
adapt the baseline tokeniser trained on English conversational telephone speech
data to different languages. Two training and adaptation approaches, namely
cross-entropy adaptation and state-level minimum Bayes risk adaptation, were
tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems
using the tokenisers adapted to different languages were combined using score
fusion, giving a 7-18% reduction in minimum detection cost function (minDCF)
compared with the baseline configurations without adapted tokenisers. Analysis
of results showed that the ensemble tokenisers gave diverse representations of
phonemes, thus bringing complementary effects when SLR systems with different
tokenisers were combined. SLR performance was also shown to be related
to the quality of the adapted tokenisers.
Using closely-related language to build an ASR for a very under-resourced language: Iban
This paper describes our work on an automatic speech recognition (ASR) system for an under-resourced language, Iban, which is mainly spoken in Sarawak, Malaysia. We collected 8 hours of data to begin this study because no ASR resources existed. We employed bootstrapping techniques involving a closely related language to rapidly build and improve an Iban system. First, we used existing data from Malay, a locally dominant language in Malaysia, to bootstrap a grapheme-to-phoneme (G2P) system for the target language. We also built various types of G2P systems, including a grapheme-based one and an English G2P, to produce different versions of the dictionaries. We tested all of the dictionaries on the Iban ASR system to identify the best version. Second, we improved the word error rate (WER) of the baseline GMM system by using subspace Gaussian mixture models (SGMMs). For testing, we set two levels of data sparseness on the Iban data: 7 hours and 1 hour of transcribed speech. We investigated cross-lingual SGMMs in which the shared parameters were obtained in either a monolingual or a multilingual fashion and then applied to the target language for training. Experiments with out-of-language data, English and Malay, as source languages resulted in lower WERs when Iban data was very limited.
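The abstract above reports results as word error rate (WER) without defining it. As a minimal sketch, assuming the common definition (word-level edit distance between reference and hypothesis, divided by the number of reference words), it could be computed as follows. The function name `word_error_rate` and the placeholder transcripts are illustrative assumptions, not taken from the paper, whose scoring tool is not stated.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper for illustration; not the paper's scoring tool.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (one row of the Levenshtein table).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder transcripts: one substitution and one deletion over four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted, WER under this definition can exceed 1.0 when the hypothesis is much longer than the reference.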
- …