18,909 research outputs found
Syllable classification using static matrices and prosodic features
In this paper we explore the usefulness of prosodic features for
syllable classification. In order to do this, we represent the
syllable as a static analysis unit such that its acoustic-temporal
dynamics could be merged into a set of features that the SVM
classifier will consider as a whole. In the first part of our
experiment we used MFCC as features for classification,
obtaining a maximum accuracy of 86.66%. The second part of
our study tests whether the prosodic information is
complementary to the cepstral information for syllable
classification. The results obtained show that combining the
two types of information does improve the classification, but
further analysis is necessary for a more successful
combination of the two types of features
Intermediate features are not useful for tone perception
Many theories assume that speech perception is done by first extracting features like the distinctive features, tonal features or articulatory gestures before recognizing phonetic units such as segments and tones. But it is unclear how exactly extracted features can lead to effective phonetic recognition. In this study we explore this issue by using support vector machine (SVM), a supervised machine learning model, to simulate the recognition of Mandarin tones from F0 in continuous speech. We tested how well a five-level system or a binary distinctive features system can identify Mandarin tones by training the SVM model with F0 trajectories with reduced temporal and frequency resolutions. At full resolution, the recognition rates were 97% and 86% based on the semitone and Hertz scales, respectively. At reduced temporal resolution, there was no clear decline in recognition rate until two points per syllable. At reduced frequency resolution, the recognition rate dropped rapidly: by the level with 5 bands, the accuracy was around 40% based on both Hertz and semitone scales. These results suggest that intermediate featural representations provide no benefit for tone recognition, and are unlikely to be critical for tone perception
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR
Phoneme Recognition Using Acoustic Events
This paper presents a new approach to phoneme recognition using nonsequential
sub--phoneme units. These units are called acoustic events and are
phonologically meaningful as well as recognizable from speech signals. Acoustic
events form a phonologically incomplete representation as compared to
distinctive features. This problem may partly be overcome by incorporating
phonological constraints. Currently, 24 binary events describing manner and
place of articulation, vowel quality and voicing are used to recognize all
German phonemes. Phoneme recognition in this paradigm consists of two steps:
After the acoustic events have been determined from the speech signal, a
phonological parser is used to generate syllable and phoneme hypotheses from
the event lattice. Results obtained on a speaker--dependent corpus are
presented.Comment: 4 pages, to appear at ICSLP'94, PostScript version (compressed and
uuencoded
Learning Latent Representations for Speech Generation and Transformation
An ability to model a generative process and learn a latent representation
for speech in an unsupervised fashion will be crucial to process vast
quantities of unlabelled speech data. Recently, deep probabilistic generative
models such as Variational Autoencoders (VAEs) have achieved tremendous success
in modeling natural images. In this paper, we apply a convolutional VAE to
model the generative process of natural speech. We derive latent space
arithmetic operations to disentangle learned latent representations. We
demonstrate the capability of our model to modify the phonetic content or the
speaker identity for speech segments using the derived operations, without the
need for parallel supervisory data.Comment: Accepted to Interspeech 201
Recommended from our members
Parallels in the sequential organization of birdsong and human speech.
Human speech possesses a rich hierarchical structure that allows for meaning to be altered by words spaced far apart in time. Conversely, the sequential structure of nonhuman communication is thought to follow non-hierarchical Markovian dynamics operating over only short distances. Here, we show that human speech and birdsong share a similar sequential structure indicative of both hierarchical and Markovian organization. We analyze the sequential dynamics of song from multiple songbird species and speech from multiple languages by modeling the information content of signals as a function of the sequential distance between vocal elements. Across short sequence-distances, an exponential decay dominates the information in speech and birdsong, consistent with underlying Markovian processes. At longer sequence-distances, the decay in information follows a power law, consistent with underlying hierarchical processes. Thus, the sequential organization of acoustic elements in two learned vocal communication signals (speech and birdsong) shows functionally equivalent dynamics, governed by similar processes
- …