Revisiting the Status of Speech Rhythm
Text-to-Speech synthesis offers an interesting way of synthesising various knowledge components related to speech production. To a certain extent, it provides a new way of testing the coherence of our understanding of speech production in a highly systematic manner. For example, speech rhythm and the temporal organisation of speech have to be captured well in order to mimic a speaker correctly.
The simulation approach used in our laboratory for two languages supports our original hypothesis of multidimensionality and non-linearity in the production of speech rhythm. This paper presents an overview of our approach to this issue as it has been developed over recent years.
We conceive of the production of speech rhythm as a multidimensional task, and of the temporal organisation of speech as a key component of this task (i.e., the establishment of temporal boundaries and durations). As a result of this multidimensionality, text-to-speech systems have to accommodate a number of systematic transformations and computations at various levels. Our model of the temporal organisation of read speech in French and German emerges from a combination of quantitative and qualitative parameters, organised according to psycholinguistic and linguistic structures. (An ideal speech synthesiser would also take into account subphonemic as well as pragmatic parameters; however, such systems are not yet available.)
Segmental Spatiotemporal CNNs for Fine-grained Action Segmentation
Joint segmentation and classification of fine-grained actions is important
for applications of human-robot interaction, video surveillance, and human
skill evaluation. However, despite substantial recent progress in large-scale
action classification, the performance of state-of-the-art fine-grained action
recognition approaches remains low. We propose a model for action segmentation
which combines low-level spatiotemporal features with a high-level segmental
classifier. Our spatiotemporal CNN is comprised of a spatial component that
uses convolutional filters to capture information about objects and their
relationships, and a temporal component that uses large 1D convolutional
filters to capture information about how object relationships change across
time. These features are used in tandem with a semi-Markov model that models
transitions from one action to another. We introduce an efficient constrained
segmental inference algorithm for this model that is orders of magnitude faster
than the current approach. We highlight the effectiveness of our Segmental
Spatiotemporal CNN on cooking and surgical action datasets for which we observe
substantially improved performance relative to recent baseline methods.
Comment: Updated from the ECCV 2016 version. We fixed an important mathematical error and made the section on segmental inference clearer.
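The abstract above does not give the authors' efficient constrained inference algorithm, but the semi-Markov decoding it builds on can be illustrated with a naive exact decoder: each candidate segment is scored by the sum of its per-frame class scores plus a transition score between consecutive action labels, and dynamic programming over segment endpoints finds the best segmentation. The function below is a minimal sketch under those assumptions (the `seg_bias` term, a per-segment score discouraging over-segmentation, is our illustrative addition, not from the paper).

```python
def segmental_viterbi(frame_scores, trans, max_len, seg_bias=0.0):
    """Exact decoding for a segmental (semi-Markov) model.
    frame_scores[t][c]: score of class c at frame t.
    trans[a][b]: score for a segment of class b following one of class a.
    seg_bias: per-segment score (negative values discourage over-segmentation).
    Returns (best_score, segments) with segments as (start, end, label)."""
    T, C = len(frame_scores), len(frame_scores[0])
    # best[t][label] = (score, (segment_start, previous_label))
    best = [dict() for _ in range(T + 1)]
    best[0][None] = (0.0, None)          # empty prefix, no previous label
    for t in range(1, T + 1):
        for length in range(1, min(max_len, t) + 1):
            s = t - length               # candidate segment covers frames s..t-1
            for prev, (pscore, _) in best[s].items():
                for c in range(C):
                    seg = sum(frame_scores[i][c] for i in range(s, t))
                    step = trans[prev][c] if prev is not None else 0.0
                    score = pscore + seg + seg_bias + step
                    if c not in best[t] or score > best[t][c][0]:
                        best[t][c] = (score, (s, prev))
    # backtrack from the best final label
    label = max(best[T], key=lambda c: best[T][c][0])
    final = best[T][label][0]
    segments, t = [], T
    while t > 0:
        _, (s, prev) = best[t][label]
        segments.append((s, t, label))
        t, label = s, prev
    return final, segments[::-1]
```

This brute-force decoder is cubic in the sequence length and segment bound; the point of the paper's constrained inference is precisely to avoid that cost.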
The Resonant Dynamics of Speech Perception: Interword Integration and Duration-Dependent Backward Effects
How do listeners integrate temporally distributed phonemic information into coherent representations of syllables and words? During fluent speech perception, variations in the durations of speech sounds and silent pauses can produce different perceived groupings. For example, increasing the silence interval between the words "gray chip" may result in the percept "great chip", whereas increasing the duration of fricative noise in "chip" may alter the percept to "great ship" (Repp et al., 1978). The ARTWORD neural model quantitatively simulates such context-sensitive speech data. In ARTWORD, sequential activation and storage of phonemic items in working memory provides bottom-up input to unitized representations, or list chunks, that group together sequences of items of variable length. The list chunks compete with each other as they dynamically integrate this bottom-up information. The winning groupings feed back to provide top-down support to their phonemic items. Feedback establishes a resonance which temporarily boosts the activation levels of selected items and chunks, thereby creating an emergent conscious percept. Because the resonance evolves more slowly than working memory activation, it can be influenced by information presented after relatively long intervening silence intervals. The same phonemic input can hereby yield different groupings depending on its arrival time. Processes of resonant transfer and competitive teaming help determine which groupings win the competition. Habituating levels of neurotransmitter along the pathways that sustain the resonant feedback lead to a resonant collapse that permits the formation of subsequent resonances.
Air Force Office of Scientific Research (F49620-92-J-0225); Defense Advanced Research Projects Agency and Office of Naval Research (N00014-95-1-0409); National Science Foundation (IRI-97-20333); Office of Naval Research (N00014-92-J-1309, N00014-95-1-0657)
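The competition between list chunks described above can be caricatured with a tiny shunting on-center off-surround network. The sketch below is our illustration, not the published ARTWORD equations: each "chunk" is excited by its bottom-up input and its own activity, inhibited by its rivals, and Euler-integrated to equilibrium, so the more strongly supported chunk settles at a higher activation (all gains and parameter values are illustrative assumptions).

```python
def compete(inputs, steps=2000, dt=0.01, decay=1.0, ceiling=1.0):
    """Euler-integrate a small shunting competition network.
    inputs[i]: bottom-up support for chunk i. Returns final activations."""
    x = [0.0] * len(inputs)
    for _ in range(steps):
        nxt = []
        for i, xi in enumerate(x):
            excite = inputs[i] + xi          # bottom-up drive plus self-excitation
            inhibit = sum(x) - xi            # off-surround inhibition from rivals
            dxi = -decay * xi + (ceiling - xi) * excite - xi * inhibit
            nxt.append(xi + dt * dxi)
        x = nxt
    return x
```

The shunting terms `(ceiling - xi)` and `-xi * inhibit` keep activations bounded in [0, ceiling], a standard property of this class of networks; the full model adds resonant top-down feedback and transmitter habituation on top of such dynamics.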
Use of phoneme dedicated artificial neural networks to predict segmental durations
The results of two alternative models to predict segmental durations in speech synthesis, both based on Artificial Neural Networks (ANNs), are discussed. The ANN model consists of just one ANN trained to predict the segmental durations of all phonemes. The phoneme-dedicated ANN model consists of a set of ANNs, each one dedicated to predicting the segmental duration of a specific phoneme. Both models are compared with the same input information extracted from one European Portuguese database. Objective and subjective measurements of the performance of both approaches are compared. A slight preference was observed for the phoneme-dedicated ANN model.
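The structural idea of the phoneme-dedicated model, one predictor per phoneme, with each segment routed to its own model, can be sketched as follows. This is not the paper's ANN implementation: a per-phoneme mean duration stands in for each dedicated network, and the class and method names are our own.

```python
from collections import defaultdict

class PhonemeDedicatedModel:
    """One duration predictor per phoneme; a stand-in for one ANN per phoneme."""

    def __init__(self):
        self.models = {}                       # phoneme -> fitted predictor

    def fit(self, data):
        """data: list of (phoneme, features, duration_ms) examples."""
        grouped = defaultdict(list)
        for phoneme, feats, dur in data:
            grouped[phoneme].append((feats, dur))
        for phoneme, examples in grouped.items():
            # stand-in "training": the mean duration of this phoneme's examples;
            # the paper would instead fit a separate ANN on this subset
            durs = [d for _, d in examples]
            self.models[phoneme] = sum(durs) / len(durs)

    def predict(self, phoneme, feats):
        """Route the segment to the predictor dedicated to its phoneme."""
        return self.models[phoneme]
```

The point of the architecture is the routing: each predictor only ever sees training data for its own phoneme, so it cannot be pulled off target by segments of a very different type.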
A Dynamic Approach to Rhythm in Language: Toward a Temporal Phonology
It is proposed that the theory of dynamical systems offers appropriate tools
to model many phonological aspects of both speech production and perception. A
dynamic account of speech rhythm is shown to be useful for description of both
Japanese mora timing and English timing in a phrase repetition task. This
orientation contrasts fundamentally with the more familiar symbolic approach to
phonology, in which time is modeled only with sequentially arrayed symbols. It
is proposed that an adaptive oscillator offers a useful model for perceptual
entrainment (or `locking in') to the temporal patterns of speech production.
This helps to explain why speech is often perceived to be more regular than
experimental measurements seem to justify. Because dynamic models deal with
real time, they also help us understand how languages can differ in their
temporal detail---contributing to foreign accents, for example. The fact that
languages differ greatly in their temporal detail suggests that these effects
are not mere motor universals, but that dynamical models are intrinsic
components of the phonological characterization of language.
Comment: 31 pages; compressed, uuencoded Postscript
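The adaptive oscillator proposed above for perceptual entrainment can be sketched with a simple two-process error-correction loop: on each input onset the oscillator nudges the phase of its next predicted beat toward the onset and adapts its period toward the observed inter-onset interval. This is our minimal illustration, not a model from the paper, and the gain values are illustrative assumptions.

```python
def entrain(onsets, period, phase_gain=1.0, period_gain=0.2):
    """Adapt an internal period to a sequence of onset times (seconds).
    Returns the adapted period and the per-onset timing errors."""
    expected = onsets[0] + period          # first predicted beat
    errors = []
    for onset in onsets[1:]:
        error = onset - expected           # negative: onset early; positive: late
        errors.append(error)
        period += period_gain * error              # period correction
        expected += phase_gain * error + period    # phase correction, then next beat
    return period, errors
```

With a regular input the timing errors shrink geometrically and the internal period "locks in" to the stimulus rate, which is the mechanism the abstract invokes to explain why speech is perceived as more regular than measurements suggest.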
Automatic Measurement of Pre-aspiration
Pre-aspiration is defined as the period of glottal friction occurring in
sequences of vocalic/consonantal sonorants and phonetically voiceless
obstruents. We propose two machine learning methods for automatic measurement
of pre-aspiration duration: a feedforward neural network, which works at the
frame level; and a structured prediction model, which relies on manually
designed feature functions, and works at the segment level. The input for both
algorithms is a speech signal of an arbitrary length containing a single
obstruent, and the output is a pair of times which constitutes the
pre-aspiration boundaries. We train both models on a set of manually annotated
examples. Results suggest that the structured model is superior to the
frame-based model as it yields higher accuracy in predicting the boundaries and
generalizes to new speakers and new languages. Finally, we demonstrate the
applicability of our structured prediction algorithm by replicating linguistic
analysis of pre-aspiration in Aberystwyth English with high correlation.
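The segment-level structured prediction described above can be sketched as an argmax over candidate boundary pairs: enumerate every (onset, offset) pair within a length bound, score each with a weighted sum of hand-designed feature functions over the enclosed frames, and return the best-scoring pair. This is our illustration of the general scheme, not the paper's model or feature set.

```python
def predict_boundaries(frames, feature_fns, weights, min_len=1, max_len=50):
    """frames: per-frame acoustic features; feature_fns: functions mapping
    (frames, onset, offset) -> float. Returns the best (onset, offset) pair,
    i.e. the predicted pre-aspiration boundaries in frame indices."""
    best, best_score = None, float("-inf")
    T = len(frames)
    for onset in range(T):
        for offset in range(onset + min_len, min(T, onset + max_len) + 1):
            # weighted sum of segment-level feature functions
            score = sum(w * f(frames, onset, offset)
                        for w, f in zip(weights, feature_fns))
            if score > best_score:
                best, best_score = (onset, offset), score
    return best
```

In the paper the weights are learned from the manually annotated examples; here they would simply be supplied, e.g. rewarding frication-like energy inside the candidate segment and penalizing it outside.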
Phoneme dedicated ANN improves segmental duration model
The Phoneme Dedicated Artificial Neural Network (PDANN) segmental duration model consists of a set of ANNs, each trained specifically for one phoneme segment in order to avoid the confounding influence of different types of phoneme segments. Each ANN is therefore dedicated to predicting the duration of a specific phoneme segment. Objective and subjective measurements of the performance of the PDANN model were compared with those of a typical ANN model using the same input features and database. The results indicate a slight but clear perceptual preference for the PDANN model.