Syllable classification using static matrices and prosodic features

Author: Cutugno Francesco
Ludusan B.
Origlia Antonio
Publication venue: Chicago
Publication date: 01/01/2010

In this paper we explore the usefulness of prosodic features for syllable classification. To do this, we represent the syllable as a static analysis unit, so that its acoustic-temporal dynamics can be merged into a set of features that the SVM classifier considers as a whole. In the first part of our experiment we used MFCCs as features for classification, obtaining a maximum accuracy of 86.66%. The second part of our study tests whether prosodic information is complementary to cepstral information for syllable classification. The results show that combining the two types of information does improve classification, but further analysis is necessary for a more successful combination of the two types of features.
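The abstract does not say how the static representation is built. As a minimal sketch of one common approach, assuming each syllable arrives as a variable-length list of per-frame feature vectors, the frames can be split into a fixed number of segments and averaged, which gives every syllable a vector of the same length for a classifier such as an SVM. The function name `static_vector` and the `n_segments` parameter are hypothetical.

```python
def static_vector(frames, n_segments=3):
    """Collapse a variable-length frame sequence into one fixed-length vector.

    frames: list of equal-length feature vectors (for example MFCC frames).
    The sequence is cut into n_segments contiguous chunks; each chunk is
    averaged per dimension and the averages are concatenated.
    This is an illustrative assumption, not the authors' stated method.
    """
    n = len(frames)
    if n < n_segments:
        raise ValueError("need at least one frame per segment")
    dim = len(frames[0])
    out = []
    for s in range(n_segments):
        start = s * n // n_segments
        end = (s + 1) * n // n_segments
        chunk = frames[start:end]
        for i in range(dim):
            out.append(sum(f[i] for f in chunk) / len(chunk))
    return out


# Two syllables of different durations map to vectors of equal length.
short_syllable = [[1, 10], [3, 30], [5, 50], [7, 70]]
long_syllable = [[float(t), 2.0 * t] for t in range(9)]
```

Both calls below return four values (two segments times two dimensions), regardless of how many frames the syllable had: `static_vector(short_syllable, 2)` and `static_vector(long_syllable, 2)`.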

Archivio della ricerca - Università degli studi di Napoli Federico II

Intermediate features are not useful for tone perception

Author: Chen Y
Xu Y
Publication venue: 'The International Fiscal Association of Korea'
Publication date: 28/05/2020

Many theories assume that speech perception works by first extracting features, such as distinctive features, tonal features or articulatory gestures, before recognizing phonetic units such as segments and tones. But it is unclear exactly how extracted features can lead to effective phonetic recognition. In this study we explore this issue by using a support vector machine (SVM), a supervised machine learning model, to simulate the recognition of Mandarin tones from F0 in continuous speech. We tested how well a five-level system or a binary distinctive feature system can identify Mandarin tones by training the SVM model with F0 trajectories at reduced temporal and frequency resolutions. At full resolution, the recognition rates were 97% and 86% based on the semitone and Hertz scales, respectively. At reduced temporal resolution, there was no clear decline in recognition rate until two points per syllable. At reduced frequency resolution, the recognition rate dropped rapidly: by the level with 5 bands, the accuracy was around 40% based on both Hertz and semitone scales. These results suggest that intermediate featural representations provide no benefit for tone recognition, and are unlikely to be critical for tone perception.
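The three manipulations named in the abstract (semitone scaling, reduced temporal resolution, reduced frequency resolution) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the 100 Hz reference, the linear interpolation, the equal-width bands, and all function names are hypothetical choices.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 values in Hz to semitones relative to ref_hz (assumed reference)."""
    return [12.0 * math.log2(f / ref_hz) for f in f0_hz]


def resample(contour, n_points):
    """Reduce temporal resolution: n_points evenly spaced, linearly interpolated samples."""
    if n_points == 1:
        return [contour[len(contour) // 2]]
    out = []
    last = len(contour) - 1
    for i in range(n_points):
        pos = i * last / (n_points - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def quantize(contour, n_bands, low, high):
    """Reduce frequency resolution: index of the equal-width band in [low, high]."""
    width = (high - low) / n_bands
    return [min(n_bands - 1, max(0, int((v - low) / width))) for v in contour]


# A rising contour over one octave, reduced to 3 time points and 5 bands.
rising_hz = [100.0, 120.0, 140.0, 170.0, 200.0]
rising_st = hz_to_semitones(rising_hz)
coarse = quantize(resample(rising_st, 3), 5, 0.0, 12.0)
```

With five bands, `coarse` keeps only which fifth of the range each sample falls in, which is the kind of five-level representation the abstract reports as performing poorly.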

UCL Discovery

Spoken content retrieval: A survey of techniques and technologies

Author: Ani Nenkova
C A. Nenkova
K. Mckeown
Kathleen Mckeown
Publication venue: 'Now Publishers'
Publication date: 01/01/2012

Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

CiteSeerX

Crossref

Irish Universities

DCU Online Research Access Service

Phoneme Recognition Using Acoustic Events

Author: Carson-Berndsen Julie
Huebener Kai
Publication date: 01/01/1994

This paper presents a new approach to phoneme recognition using nonsequential sub-phoneme units. These units are called acoustic events and are phonologically meaningful as well as recognizable from speech signals. Acoustic events form a phonologically incomplete representation as compared to distinctive features. This problem may partly be overcome by incorporating phonological constraints. Currently, 24 binary events describing manner and place of articulation, vowel quality and voicing are used to recognize all German phonemes. Phoneme recognition in this paradigm consists of two steps: after the acoustic events have been determined from the speech signal, a phonological parser is used to generate syllable and phoneme hypotheses from the event lattice. Results obtained on a speaker-dependent corpus are presented. Comment: 4 pages, to appear at ICSLP'94, PostScript version (compressed and uuencoded)

arXiv.org e-Print Archive

CiteSeerX

Universaar

Learning Latent Representations for Speech Generation and Transformation

Author: Glass James
Hsu Wei-Ning
Zhang Yu
Publication date: 22/09/2017

An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representations. We demonstrate the capability of our model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data. Comment: Accepted to Interspeech 201
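The abstract does not spell out the derived operations. A minimal sketch of the general idea of latent space arithmetic, assuming latent codes are plain vectors already produced by some encoder, is to estimate an attribute shift as the difference between the mean latent of a target group and the mean latent of a source group, then add that shift to a segment's code before decoding. The function names are hypothetical, and the encoder and decoder are deliberately left out.

```python
def mean_vector(vectors):
    """Per-dimension mean of a list of equal-length latent vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def attribute_shift(source_latents, target_latents):
    """Shift that moves the source group mean onto the target group mean."""
    src = mean_vector(source_latents)
    tgt = mean_vector(target_latents)
    return [t - s for s, t in zip(src, tgt)]


def transform(z, shift):
    """Apply the shift to one latent code; a decoder would then synthesize from it."""
    return [a + b for a, b in zip(z, shift)]


# Toy 2-D latents for segments from two speakers (made-up numbers).
speaker_a = [[0.0, 0.0], [2.0, 2.0]]
speaker_b = [[3.0, 1.0], [5.0, 3.0]]
shift = attribute_shift(speaker_a, speaker_b)
moved = transform([1.0, 1.0], shift)
```

Because the shift is computed from group means, no paired (parallel) recordings of the two speakers saying the same content are needed, which matches the claim in the abstract.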

arXiv.org e-Print Archive

Crossref

Parallels in the sequential organization of birdsong and human speech.

Author: Gentner Timothy Q
Sainburg Tim
Theilman Brad
Thielk Marvin
Publication venue: eScholarship, University of California
Publication date: 01/08/2019

Human speech possesses a rich hierarchical structure that allows for meaning to be altered by words spaced far apart in time. Conversely, the sequential structure of nonhuman communication is thought to follow non-hierarchical Markovian dynamics operating over only short distances. Here, we show that human speech and birdsong share a similar sequential structure indicative of both hierarchical and Markovian organization. We analyze the sequential dynamics of song from multiple songbird species and speech from multiple languages by modeling the information content of signals as a function of the sequential distance between vocal elements. Across short sequence-distances, an exponential decay dominates the information in speech and birdsong, consistent with underlying Markovian processes. At longer sequence-distances, the decay in information follows a power law, consistent with underlying hierarchical processes. Thus, the sequential organization of acoustic elements in two learned vocal communication signals (speech and birdsong) shows functionally equivalent dynamics, governed by similar processes.
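The distinction between the two decay regimes can be illustrated with a simple diagnostic: an exponential decay I(d) = a·exp(-b·d) is a straight line when log I is plotted against d, while a power law I(d) = a·d^(-c) is a straight line when log I is plotted against log d. The sketch below compares the two straight-line fits on synthetic curves. It is a simplified stand-in, not the model-fitting procedure of the paper, and the function names and parameter values are assumptions.

```python
import math


def linear_fit(xs, ys):
    """Least-squares line through (xs, ys); returns slope, intercept, squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    error = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, error


def best_decay_model(distances, information):
    """Label a decay curve by whichever log-space line fits it with less error."""
    log_info = [math.log(v) for v in information]
    log_dist = [math.log(d) for d in distances]
    exp_error = linear_fit(distances, log_info)[2]  # log I linear in d
    pow_error = linear_fit(log_dist, log_info)[2]   # log I linear in log d
    return "exponential" if exp_error < pow_error else "power law"


# Synthetic information-versus-distance curves (made-up parameters).
distances = list(range(1, 21))
exp_curve = [2.0 * math.exp(-0.5 * d) for d in distances]
pow_curve = [2.0 * d ** -1.5 for d in distances]
```

Calling `best_decay_model(distances, exp_curve)` and `best_decay_model(distances, pow_curve)` labels each synthetic curve by its generating process. Real data that mixes both regimes, as the abstract describes, would need a composite model rather than this either-or comparison.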

eScholarship - University of California