2,798 research outputs found
On the Assessment of Stability and Patterning of Speech Movements
Speech requires the control of complex movements of orofacial structures to produce dynamic variations in the vocal tract transfer function. The nature of the underlying motor control processes has traditionally been investigated by employing measures of articulatory movements, including movement amplitude, velocity, and duration, at selected points in time. An alternative approach, first used in the study of limb motion, is to examine the entire movement trajectory over time. A new approach to speech movement trajectory analysis was introduced in earlier work from this laboratory. In this method, trajectories from multiple movement sequences are time- and amplitude-normalized, and the spatiotemporal index (STI) is computed to capture the degree of convergence of a set of trajectories onto a single, underlying movement template. This research note describes the rationale for this analysis and provides a detailed description of the signal processing involved. Alternative interpolation procedures for time-normalization of kinematic data are also considered.
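The STI computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes linear interpolation for time-normalization, z-scoring for amplitude-normalization, and 50 normalized time points; all function names are illustrative.

```python
import math

def time_normalize(traj, n_points=50):
    """Linearly interpolate a trajectory onto n_points equally spaced samples."""
    m = len(traj)
    out = []
    for i in range(n_points):
        pos = i * (m - 1) / (n_points - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        out.append(traj[lo] * (1 - frac) + traj[hi] * frac)
    return out

def amplitude_normalize(traj):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mean = sum(traj) / len(traj)
    sd = math.sqrt(sum((x - mean) ** 2 for x in traj) / len(traj))
    return [(x - mean) / sd for x in traj]

def spatiotemporal_index(trajectories, n_points=50):
    """Sum of across-trajectory standard deviations at each normalized time point."""
    norm = [amplitude_normalize(time_normalize(t, n_points)) for t in trajectories]
    sti = 0.0
    for i in range(n_points):
        vals = [t[i] for t in norm]
        mean = sum(vals) / len(vals)
        sti += math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return sti
```

A set of identical trajectories converges onto a single template and yields an index of zero; greater trial-to-trial variability yields a larger sum of standard deviations.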
I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating condition in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database, featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech; the database is made publicly available for research purposes. We begin by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We then propose automatic classification based both on brute-forced low-level acoustic features and on higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of the eating condition (i.e., eating or not eating) can be solved easily, independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food as well as not eating. Early fusion of the intelligibility-related features with the brute-forced acoustic feature set improves performance on read speech, reaching a 66.4% average recall on the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to a 56.2% coefficient of determination.
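The leave-one-speaker-out protocol with unweighted average recall used above generalises beyond this paper. The sketch below substitutes a simple nearest-centroid classifier for the paper's SVM so the example stays self-contained; all names and the toy feature layout are illustrative assumptions.

```python
def train_centroids(samples):
    """samples: list of (feature_vector, label). Returns label -> mean vector."""
    sums, counts = {}, {}
    for x, y in samples:
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Assign the label whose centroid is nearest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def loso_uar(data):
    """data: list of (features, label, speaker). Leave-one-speaker-out evaluation;
    returns unweighted average recall (mean of per-class recalls)."""
    hits, totals = {}, {}
    speakers = {s for _, _, s in data}
    for held_out in speakers:
        # Train on every speaker except the held-out one.
        train = [(x, y) for x, y, s in data if s != held_out]
        centroids = train_centroids(train)
        # Score only the held-out speaker's utterances.
        for x, y, s in data:
            if s == held_out:
                totals[y] = totals.get(y, 0) + 1
                if predict(centroids, x) == y:
                    hits[y] = hits.get(y, 0) + 1
    recalls = [hits.get(y, 0) / totals[y] for y in totals]
    return sum(recalls) / len(recalls)
```

Averaging per-class recalls rather than raw accuracy keeps a majority class (here, "not eating") from dominating the score, which is why average recall is the usual metric for such class-imbalanced paralinguistic tasks.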
Paralinguistic vocal control of interactive media: how untapped elements of voice might enhance the role of non-speech voice input in the user's experience of multimedia.
Much interactive media development, especially commercial development, implies the dominance of the visual modality, with sound as a limited supporting channel. The development of multimedia technologies such as augmented reality and virtual reality has further revealed a distinct partiality to visual media. Sound, however, and particularly voice, have many aspects which have yet to be adequately investigated. Exploration of these aspects may show that sound can, in some respects, be superior to graphics in creating immersive and expressive interactive experiences. With this in mind, this thesis investigates the use of non-speech voice characteristics as a complementary input mechanism in controlling multimedia applications. It presents a number of projects that employ the paralinguistic elements of voice as input to interactive media including both screen-based and physical systems. These projects are used as a means of exploring the factors that seem likely to affect users’ preferences and interaction patterns during non-speech voice control. This exploration forms the basis for an examination of potential roles for paralinguistic voice input. The research includes the conceptual and practical development of the projects and a set of evaluative studies. The work submitted for Ph.D. comprises practical projects (50 percent) and a written dissertation (50 percent). The thesis aims to advance understanding of how voice can be used both on its own and in combination with other input mechanisms in controlling multimedia applications. It offers a step forward in the attempts to integrate the paralinguistic components of voice as a complementary input mode to speech input applications in order to create a synergistic combination that might let the strengths of each mode overcome the weaknesses of the other
Speaking Rate Effects on Normal Aspects of Articulation: Outcomes and Issues
The articulatory effects of speaking rate have been a point of focus for a substantial literature in speech science. The normal aspects of speaking rate variation have influenced theories and models of speech production and perception in the literature pertaining to both normal and disordered speech. While the body of literature pertaining to the articulatory effects of speaking rate change is reasonably large, few speaker-general outcomes have emerged. The purpose of this paper is to review outcomes of the existing literature and address problems related to the study of speaking rate that may be germane to the recurring theme that speaking rate effects are largely idiosyncratic
Speech intelligibility and prosody production in children with cochlear implants
Objectives—The purpose of the current study was to examine the relation between speech intelligibility and prosody production in children who use cochlear implants. Methods—The Beginner's Intelligibility Test (BIT) and Prosodic Utterance Production (PUP) task were administered to 15 children who use cochlear implants and 10 children with normal hearing. Adult listeners with normal hearing judged the intelligibility of the words in the BIT sentences, identified the PUP sentences as one of four grammatical or emotional moods (i.e., declarative, interrogative, happy, or sad), and rated the PUP sentences according to how well they thought the child conveyed the designated mood. Results—Percent correct scores were higher for intelligibility than for prosody and higher for children with normal hearing than for children with cochlear implants. Declarative sentences were most readily identified and received the highest ratings by adult listeners; interrogative sentences were least readily identified and received the lowest ratings. Correlations between intelligibility and all mood identification and rating scores except declarative were not significant. Discussion—The findings suggest that the development of speech intelligibility progresses ahead of prosody in both children with cochlear implants and children with normal hearing; however, children with normal hearing still perform better than children with cochlear implants on measures of intelligibility and prosody even after accounting for hearing age. Problems with interrogative intonation may be related to more general restrictions on rising intonation, and th
Infant prosodic expressions in mother-infant communication
Prosody, generally defined as any perceivable modulation of duration, pitch or loudness in the voice that conveys meaning, has been identified as part of the linguistic system, or compared with the sound system of Western classical music. This thesis proposes a different conception, namely that prosody is a phenomenon of human expression that precedes, and to a certain extent determines, the form and function of utterances in any particular language or music system. Findings from studies of phylogenesis and ontogenesis are presented in favour of this definition. Consequently, the prosody of infant vocal expressions, which are made by individuals who have not yet developed either language or musical skills, is investigated as a phenomenon in itself, with its own rules.
Recognising theoretical and methodological deficiencies in the linguistic and the Piagetian approaches to the development of infant prosodic expressions, this thesis supports the view that the origins of language are to be sought in the expressive dialogues between the mother and her prelinguistic child that are generated by intuitive motives for communication. Furthermore, infant vocalisations are considered as part of a system of communication constituted by all expressive modalities. Thus, the aim is to investigate the role of infant prosodic expressions in conveying emotions and communicative functions in relation to the accompanying non-vocal behaviours.
A cross-sectional Pilot Study involving 16 infants aged 26 to 56 weeks and their mothers was undertaken to help in the design of the Main Study. The Main Study became a case description of two first-born infants and their mothers: a boy (Robin) and a girl (Julie), both aged 30 weeks at the beginning of the study. The infants were filmed in their home every fortnight for five months in a structured naturalistic setting which included the following conditions: mother-infant free play with their own toys, mother-infant play without using objects, the infant playing alone, mother-infant play with objects provided by the researcher, a 'car task' for eliciting cooperative play, and the mother staying unresponsive. Each filming session lasted approximately thirty minutes. In order to gain insight into the infants' 'meaning potential' expressed in their vocalisations, the mothers were asked to visit the department sometime in the interval between two filming sessions and, while watching the most recent video, to report what they felt their infant was conveying, if anything, in each vocalisation.
Three types of analysis were carried out:
a) An Analysis of Prosody - An attempt was made to obtain an objective, not linguistically based, account of infant prosodic features. First, measurements were obtained of the duration and the fundamental frequency curve of each vocalisation by means of a computer programme for sound analysis. The values of fundamental frequency were then logarithmically transformed onto a semitone scale in order to obtain measurements more sensitive to the mother's perception.
b) A Functional Micro-Analysis of Non-Vocal Behaviours from Videos - The non-vocal behaviours of mother and infant related to each vocalisation were codified without sound, to examine to what extent the mothers relied for their interpretations on the non-vocal behaviours accompanying the vocalisations.
c) An Analysis of the Mothers' Interpretations - The infants' messages were defined as perceived by their mothers.
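The logarithmic transformation onto a semitone scale in analysis (a) is the standard 12 log2 mapping of frequency ratios, under which equal pitch intervals, as perceived, become equal numeric steps. A one-line sketch (the 100 Hz reference frequency is an illustrative assumption, not the thesis's choice):

```python
import math

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert a fundamental-frequency value (Hz) to semitones relative to a
    reference frequency: 12 * log2(f / ref). One octave (a doubling) = 12 st."""
    return 12.0 * math.log2(f0_hz / ref_hz)
```

On this scale an infant's rise from 200 Hz to 400 Hz and an adult's rise from 100 Hz to 200 Hz both span 12 semitones, which is what makes the measure closer to the mother's perception than raw Hz values.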
The corpus comprised 713 vocalisations (322 for the boy and 391 for the girl), selected from a corpus of 864, and 143 minutes of video recording (64 for the boy and 79 for the girl). Correlations between the above three assessments were specified through statistical analysis.
The findings from both infants indicate that between seven and eleven months prosodic patterns are not related one-to-one with particular messages. Rather, prosody distinguishes between groups of messages conveying features of psychological motivation, such as 'emotional', 'interpersonal', 'referential', 'assertive' or 'receptive'. Individual messages belonging to the same message group according to the analysis of prosody are distinguished on the basis of the accompanying non-vocal behaviours. Before nine months, 'interpersonal' vocalisations display more 'alerting' prosodic patterns than 'referential' vocalisations. After nine months, prosodic patterns in Robin's vocalisations differentiate between 'assertive' and 'receptive' messages, the former being expressed by more 'alerting' prosodic patterns than the latter. This distinction reflects a better Self-Other awareness. On the other hand, Julie's vocalisations occurring in situations of 'Joint Interest' display different prosodic patterns from her vocalisations uttered in situations of 'Converging Interest'. These changes in the role of infant prosody reflect developments in the infants' motivational organisation which will lead to a more efficient control of intersubjective orientation and shared attention to the environment. Moreover, it was demonstrated that new forms of prosodic expression occur in psychologically mature situations, while psychologically novel situations are expressed by mature prosodic forms.
The above results suggest that at the threshold to language, prosody does not primarily serve identifiable linguistic functions. Rather, in spite of individual differences in the form of their vocalisations, both infants use prosody in combination with other modalities as part of an expressive system that conveys information about their motives. In this way prosody facilitates intersubjective and, later, cooperative communication, on which language development is built. To what extent such prelinguistic prosodic patterns are similar in form to those of the target language is a crucial issue for further investigation.