Learning HMM State Sequences from Phonemes for Speech Synthesis
This paper presents a technique for learning hidden Markov model (HMM) state sequences from phonemes which, combined with the modified discrete cosine transform (MDCT), is useful for speech synthesis. Mel-cepstral spectral parameters, currently adopted in conventional methods as features for HMM acoustic modeling, do not ensure direct reconstruction of speech waveforms. In contrast to these approaches, we use an analysis/synthesis technique based on the MDCT that guarantees perfect reconstruction of the signal frame feature vectors and allows for a 50% overlap between frames without increasing the data rate. Experimental results show that the spectrograms obtained with the suggested technique closely match the original spectrograms, and the quality of the synthesized speech is conveniently evaluated using the well-known Itakura-Saito measure.
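The perfect-reconstruction claim above can be illustrated with a minimal pure-Python sketch of MDCT analysis and overlap-add synthesis. It is not the paper's implementation: the sine window, the zero padding at the signal edges, and the function names `sine_window`, `mdct`, `imdct`, `analyze`, and `synthesize` are all assumptions made for illustration. Each frame spans 2N samples and overlaps its neighbor by 50%, yet yields only N coefficients, so the data rate does not grow.

```python
import math


def sine_window(n_coeffs):
    """Sine window of length 2N; it satisfies w[n]^2 + w[n+N]^2 = 1."""
    length = 2 * n_coeffs
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def _basis(n, k, n_coeffs):
    # Cosine kernel shared by the forward and inverse transforms.
    return math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2) * (k + 0.5))


def mdct(frame, window):
    """Map a windowed frame of 2N samples to N coefficients."""
    n_coeffs = len(frame) // 2
    return [
        sum(window[n] * frame[n] * _basis(n, k, n_coeffs)
            for n in range(2 * n_coeffs))
        for k in range(n_coeffs)
    ]


def imdct(coeffs, window):
    """Map N coefficients back to 2N windowed, time-aliased samples."""
    n_coeffs = len(coeffs)
    scale = 2.0 / n_coeffs  # normalization for the windowed case
    return [
        window[n] * scale * sum(coeffs[k] * _basis(n, k, n_coeffs)
                                for k in range(n_coeffs))
        for n in range(2 * n_coeffs)
    ]


def analyze(signal, n_coeffs):
    """Split the signal into 50%-overlapping frames and transform each."""
    window = sine_window(n_coeffs)
    padded_len = -(-len(signal) // n_coeffs) * n_coeffs  # round up to N
    tail = padded_len - len(signal) + n_coeffs
    padded = [0.0] * n_coeffs + list(signal) + [0.0] * tail
    return [
        mdct(padded[start:start + 2 * n_coeffs], window)
        for start in range(0, len(padded) - n_coeffs, n_coeffs)
    ]


def synthesize(frames, n_coeffs, length):
    """Overlap-add the inverse transforms; the aliasing terms cancel."""
    window = sine_window(n_coeffs)
    out = [0.0] * ((len(frames) + 1) * n_coeffs)
    for i, coeffs in enumerate(frames):
        for n, value in enumerate(imdct(coeffs, window)):
            out[i * n_coeffs + n] += value
    return out[n_coeffs:n_coeffs + length]
```

Calling `synthesize(analyze(signal, 8), 8, len(signal))` should return the input up to floating-point rounding, because the aliasing introduced by each frame is cancelled by its neighbors during overlap-add.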
Recognizing Speech in a Novel Accent: The Motor Theory of Speech Perception Reframed
The motor theory of speech perception holds that we perceive the speech of
another in terms of a motor representation of that speech. However, when we
have learned to recognize a foreign accent, it seems plausible that recognition
of a word rarely involves reconstruction of the speech gestures of the speaker
rather than the listener. To better assess the motor theory and this
observation, we proceed in three stages. Part 1 places the motor theory of
speech perception in a larger framework based on our earlier models of the
adaptive formation of mirror neurons for grasping, and for viewing extensions
of that mirror system as part of a larger system for neuro-linguistic
processing, augmented by the present consideration of recognizing speech in a
novel accent. Part 2 then offers a novel computational model of how a listener
comes to understand the speech of someone speaking the listener's native
language with a foreign accent. The core tenet of the model is that the
listener uses hypotheses about the word the speaker is currently uttering to
update probabilities linking the sound produced by the speaker to phonemes in
the native language repertoire of the listener. This, on average, improves the
recognition of later words. This model is neutral regarding the nature of the
representations it uses (motor vs. auditory). It serves as a reference point for
the discussion in Part 3, which proposes a dual-stream neuro-linguistic
architecture to revisit claims for and against the motor theory of speech
perception and the relevance of mirror neurons, and extracts some implications
for the reframing of the motor theory.
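The core tenet described above (word hypotheses update the probabilities linking heard sounds to native phonemes) can be sketched as a toy count-based learner. This is not the authors' model: the class `AccentAdapter`, its pseudo-count parameters `prior` and `self_bias`, and the simplification that heard sounds and native phonemes share one symbol set are all hypothetical choices made for illustration.

```python
class AccentAdapter:
    """Toy table of P(native phoneme | heard sound), learned from counts."""

    def __init__(self, phonemes, prior=1.0, self_bias=4.0):
        # Assumption: before adaptation, a sound most likely maps to itself.
        self.phonemes = list(phonemes)
        self.counts = {
            sound: {
                phoneme: prior + (self_bias if sound == phoneme else 0.0)
                for phoneme in self.phonemes
            }
            for sound in self.phonemes
        }

    def prob(self, sound, phoneme):
        row = self.counts[sound]
        return row[phoneme] / sum(row.values())

    def score(self, sounds, word):
        """Likelihood that the heard sounds realize the given word."""
        if len(sounds) != len(word):
            return 0.0
        result = 1.0
        for sound, phoneme in zip(sounds, word):
            result *= self.prob(sound, phoneme)
        return result

    def recognize(self, sounds, lexicon):
        """Pick the lexicon word with the highest score."""
        return max(lexicon, key=lambda word: self.score(sounds, word))

    def update(self, sounds, hypothesized_word, weight=1.0):
        """Strengthen the sound-to-phoneme links implied by a word hypothesis."""
        for sound, phoneme in zip(sounds, hypothesized_word):
            self.counts[sound][phoneme] += weight
```

In this sketch, a listener who repeatedly hypothesizes that a heard "z" stands for native "th" will, after a few calls to `update`, start preferring words containing "th" when the speaker produces "z", which mirrors the claim that recognition of later words improves on average.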
End-to-End Attention-based Large Vocabulary Speech Recognition
Many of the current state-of-the-art Large Vocabulary Continuous Speech
Recognition Systems (LVCSR) are hybrids of neural networks and Hidden Markov
Models (HMMs). Most of these systems contain separate components that deal with
the acoustic modelling, language modelling and sequence decoding. We
investigate a more direct approach in which the HMM is replaced with a
Recurrent Neural Network (RNN) that performs sequence prediction directly at
the character level. Alignment between the input features and the desired
character sequence is learned automatically by an attention mechanism built
into the RNN. For each predicted character, the attention mechanism scans the
input sequence and chooses relevant frames. We propose two methods to speed up
this operation: limiting the scan to a subset of most promising frames and
pooling over time the information contained in neighboring frames, thereby
reducing source sequence length. Integrating an n-gram language model into the
decoding process yields recognition accuracies similar to other HMM-free
RNN-based approaches.
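The two speed-ups mentioned above, restricting the attention scan to a subset of frames and pooling neighboring frames over time, can be sketched as follows. This is a simplified illustration rather than the paper's architecture: it assumes plain dot-product scoring instead of a learned RNN attention, picks the "most promising" frames as a fixed window around a given center, and uses the hypothetical names `pool_frames` and `windowed_attention`.

```python
import math


def pool_frames(frames, width=2):
    """Average each group of `width` neighboring frames to shorten the input."""
    pooled = []
    for start in range(0, len(frames), width):
        group = frames[start:start + width]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def windowed_attention(query, frames, center, half_width):
    """Softmax attention over frames within `half_width` of `center` only.

    `center` is assumed to be a valid frame index. Frames outside the
    window get weight 0.
    """
    lo = max(0, center - half_width)
    hi = min(len(frames), center + half_width + 1)
    scores = [sum(q * x for q, x in zip(query, frames[t])) for t in range(lo, hi)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [0.0] * len(frames)
    for offset, e in enumerate(exps):
        weights[lo + offset] = e / total
    dim = len(frames[0])
    context = [sum(weights[t] * frames[t][d] for t in range(lo, hi)) for d in range(dim)]
    return weights, context
```

Pooling with `width=2` halves the number of frames each attention step must score, and the window bounds the per-character cost by `2 * half_width + 1` frames regardless of utterance length.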
LSTM Deep Neural Networks Postfiltering for Improving the Quality of Synthetic Voices
Recent developments in speech synthesis have produced systems capable of
producing intelligible speech, but now researchers strive to create models that
more accurately mimic human voices. One such development is the incorporation
of multiple linguistic styles in various languages and accents.
HMM-based Speech Synthesis is of great interest to many researchers, due to
its ability to produce sophisticated features with small footprint. Despite
such progress, its quality has not yet reached the level of the predominant
unit-selection approaches that choose and concatenate recordings of real
speech. Recent efforts have been made in the direction of improving these
systems.
In this paper we present the application of Long Short-Term Memory Deep
Neural Networks as a postfiltering step of HMM-based speech synthesis, in order
to obtain spectral characteristics closer to those of natural speech. The
results show how HMM-voices could be improved using this approach.
Comment: 5 pages, 5 figures