From image to text to speech : The effects of speech prosody on information sequencing in audio description
Given the extensive body of research in audio description – the verbal-vocal description of visual or audiovisual content for visually impaired audiences – it is striking how little attention has been paid thus far to the spoken dimension of audio description and its paralinguistic, prosodic aspects. This article complements previous research into how audio description speech is received by partially sighted audiences by analyzing how it is performed vocally. We study the audio description of pictorial art and examine one aspect of prosody in detail: pitch, and the segmentation of information in relation to it. We analyze this relation in a corpus of audio-described pictorial art in Finnish by combining phonetic measurements of pitch with discourse analysis of information segmentation. Previous studies have already shown that a sentence-initial high pitch acts as a discourse-structuring device in interpreting. Our study shows that the same applies to audio description. In addition, our study suggests that there is a relationship between the size of the pitch rise and the size of the topical transition. That is, when the topical transition is clear, the rise in pitch level between the beginnings of two consecutive spoken sentences is large; analogously, when the topical transition is small, the change in sentence-initial pitch level is also rather small.
Peer reviewed
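The kind of measurement the abstract describes – comparing the pitch level at the beginnings of consecutive sentences – can be illustrated with a toy computation. This is a hypothetical sketch, not the study's actual analysis pipeline: the sentence-initial F0 values are invented, and the 3-semitone threshold for a "large" reset is an illustrative choice, not a figure from the article.

```python
# Illustrative sketch (not the study's pipeline): given sentence-initial
# pitch values in Hz, compute the pitch change between consecutive
# sentences in semitones, a common scale for comparing pitch intervals.
import math

def semitone_change(f0_prev, f0_next):
    """Pitch change between two F0 values, in semitones."""
    return 12 * math.log2(f0_next / f0_prev)

# Hypothetical sentence-initial F0 values (Hz) for four spoken sentences.
sentence_initial_f0 = [190, 175, 230, 225]

for prev, nxt in zip(sentence_initial_f0, sentence_initial_f0[1:]):
    change = semitone_change(prev, nxt)
    # On the study's account, a large upward reset would coincide with
    # a clear topical transition; the 3 st threshold is illustrative.
    label = "large reset" if change > 3 else "small change"
    print(f"{prev} Hz -> {nxt} Hz: {change:+.2f} st ({label})")
```

Here the jump from 175 Hz to 230 Hz would register as a large sentence-initial pitch reset, the pattern the study associates with a clear topic shift.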
Computational Approaches to Modeling Speaker State in the Medical Domain
Recently, researchers in computer science and engineering have begun to explore the possibility of finding speech-based correlates of various medical conditions using automatic, computational methods. If such language cues can be identified and quantified automatically, this information can be used to support the diagnosis and treatment of medical conditions in clinical settings and to further fundamental research in understanding cognition. This chapter reviews computational approaches that explore communicative patterns of patients who suffer from medical conditions such as depression, autism spectrum disorders, schizophrenia, and cancer. Two main approaches are discussed: research that explores features extracted from the acoustic signal, and research that focuses on lexical and semantic features. We also present some applied research that uses computational methods to develop assistive technologies. In the final sections we discuss open issues and the future of this emerging field of research.
Directional adposition use in English, Swedish and Finnish
Directional adpositions such as to the left of describe where a Figure is in relation to a Ground. English and Swedish directional adpositions refer to the location of a Figure in relation to a Ground, whether both are static or in motion. In contrast, the Finnish directional adpositions edellä (in front of) and jäljessä (behind) solely describe the location of a moving Figure in relation to a moving Ground (Nikanne, 2003).
When using directional adpositions, a frame of reference must be assumed to interpret their meaning. For example, the meaning of to the left of in English can be based on a relative (speaker- or listener-based) reference frame or an intrinsic (object-based) reference frame (Levinson, 1996). When a Figure and a Ground are both in motion, it is possible for the Figure to be described as being behind or in front of the Ground even if neither has intrinsic features. As shown by Walker (in preparation), there are good reasons to assume that in the latter case a motion-based reference frame is involved. This means that if Finnish speakers use edellä (in front of) and jäljessä (behind) more frequently in situations where both the Figure and the Ground are in motion, a difference in reference frame use between Finnish on the one hand and English and Swedish on the other can be expected.
We asked native English, Swedish and Finnish speakers to select adpositions from a language-specific list to describe the location of a Figure relative to a Ground when both were shown to be moving on a computer screen. We were interested in any differences between Finnish, English and Swedish speakers.
All languages showed a predominant use of directional spatial adpositions referring to the lexical concepts TO THE LEFT OF, TO THE RIGHT OF, ABOVE and BELOW. There were no differences between the languages in directional adposition use or reference frame use, including reference frame use based on motion.
We conclude that despite differences in the grammars of the languages involved, and potential differences in reference frame system use, the three languages investigated encode Figure location in relation to Ground location in a similar way when both are in motion.
Levinson, S. C. (1996). Frames of reference and Molyneux's question: Crosslinguistic evidence. In P. Bloom, M. A. Peterson, L. Nadel & M. F. Garrett (Eds.), Language and Space (pp. 109–170). Cambridge, MA: MIT Press.
Nikanne, U. (2003). How Finnish postpositions see the axis system. In E. van der Zee & J. Slack (Eds.), Representing direction in language and space. Oxford, UK: Oxford University Press.
Walker, C. (in preparation). Motion encoding in language: The use of spatial locatives in a motion context. Unpublished doctoral dissertation, University of Lincoln, Lincoln, United Kingdom.
Creation of the Estonian Emotional Speech Corpus and the Perception of Emotions
The electronic version of the dissertation does not include the publications. The aim of the dissertation was to create a theoretical basis for the Estonian Emotional Speech Corpus and to verify, on the basis of the created corpus material, the validity of the theoretical positions. The study showed how important it is to plan a corpus carefully before creating it and to analyse the result. The knowledge gained can be applied both by emotion researchers and by developers of speech corpora.
What makes the Estonian corpus unique among other emotional speech corpora is that the emotion of each sentence is labelled according to whether the emotion is carried by the sound of the sentence or whether its recognition from the voice is influenced by the verbal content of the sentence. This division makes it possible to study emotions both in speech and in writing.
The Estonian Emotional Speech Corpus is one of the few speech corpora of elicited, moderately expressed emotions that is documented and publicly available free of charge. The texts recorded for the corpus were read aloud by a so-called ordinary person, who was not told with which emotion the texts should be read.
Since the emotions of the sentences in the Estonian Emotional Speech Corpus were determined by listeners through tests, questions related to the perception of emotions are central to the dissertation. The dissertation confirmed that listeners can reliably recognise moderately expressed emotions from the voice of a non-professional reader. The results support the decision to have the emotions of the corpus sentences determined by native Estonian-speaking adults over 30 years of age, as they decode the emotion of a message better than younger listeners. The results also showed that the understanding of emotions is culture-dependent. The results did not confirm a significant role of empathy in recognising emotions from the voice, but they did show a difference between men and women in emotion recognition.
The corpus exists as it was theoretically designed and currently contains sentences from one female voice, classified as anger, joy, sadness and neutral (see http://peeter.eki.ee:5000). As the Estonian Emotional Speech Corpus is easily extendable, it will be developed further in line with new research directions.
The aim of the thesis was to develop a theoretical base for the Estonian Emotional Speech Corpus and to test the validity of the theoretical starting-points on the Corpus material.
The Corpus is now ready as designed (see http://peeter.eki.ee:5000). The results of the research reveal the importance of detailed planning and of the design elements of the Corpus. The theoretical starting-points of the study are relevant and applicable in real situations. Therefore these results could be taken into consideration in the creation of other emotional speech corpora.
What makes this Corpus unique among the other corpora of its kind is the fact that its sentences have different labels according to whether their emotion is carried just by the sound of the sentence or whether the recognition of their emotion from vocal expression may be influenced by the verbal-semantic content. This classification enables the research of emotions both in speech as well as in writing.
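The two-dimensional labelling described above – an emotion label plus a flag for whether the emotion is carried by the sound alone or possibly influenced by the verbal content – can be sketched as a simple record layout. This is a hypothetical illustration of the classification scheme, not the Corpus's actual data format or field names.

```python
# Hypothetical record layout for sentences labelled along the two
# dimensions the abstract describes; NOT the actual schema of the
# Estonian Emotional Speech Corpus.
from dataclasses import dataclass

EMOTIONS = ("anger", "joy", "sadness", "neutral")

@dataclass
class CorpusSentence:
    text: str
    emotion: str            # one of EMOTIONS, as determined by listener tests
    carried_by_sound: bool  # True: emotion carried by the sound alone;
                            # False: recognition may be influenced by
                            # the verbal-semantic content

    def __post_init__(self):
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {self.emotion}")

# One sentence whose emotion listeners recognized from the voice alone,
# and one whose wording may have influenced recognition.
s1 = CorpusSentence("…", "sadness", carried_by_sound=True)
s2 = CorpusSentence("…", "joy", carried_by_sound=False)

# The split enables studying emotion in speech (sound-based labels)
# separately from emotion in writing (content-influenced labels).
speech_only = [s for s in (s1, s2) if s.carried_by_sound]
print(len(speech_only))
```

The flag is what makes the dual research use possible: filtering on it separates sentences suitable for studying vocal emotion from those suitable for studying emotion in written content.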
The Estonian Emotional Speech Corpus is one of the few freely available, documented corpora that contains moderately expressed emotions. The Corpus abandoned acted emotions because of their possible stereotypicality and overactedness. The sentences recorded for the Corpus were read out by a so-called ordinary person, who was not told what emotion to use while reading.
The Corpus contains 1,234 Estonian sentences that have passed both reading and listening tests. Test takers identified 908 sentences that expressed anger, joy, sadness, or were neutral.
As the emotions of the sentences contained in the Corpus were determined by listeners, some issues of emotion perception came to the fore: 1) Is sentence emotion identifiable purely from vocal cues, without the speaker being seen? 2) Can age affect the identification of emotion? 3) Is the identification of emotion culturally bound? 4) Does identification depend on the listeners' empathy?
For the first question, asking whether the emotion of a sentence can be identified from non-acted vocal expression without the speaker being seen, the results confirmed the supposition that listeners can recognize the moderate expression of non-acted emotions from the voice of a non-professional reader. The results also support the decision that the emotions of the sentences in the Estonian Emotional Speech Corpus should be determined by Estonian adults aged over 30 who speak Estonian as their native language, because they are more likely to have acquired the skills for decoding the culture-specific expression of emotions. Furthermore, the results imply that the understanding of emotions depends on cultural factors and social interactions, including the social norms specific to one culture. The interpretation of emotional messages is therefore learned in the course of social interactions. Research has shown that in the recognition of emotion from vocal cues, empathy is less important than clinical results would suggest. In conducting emotion studies for speech-technological purposes, it is therefore unnecessary to exclude non-empathic people from the testers on the grounds that they may not recognize the emotions expressed, as long as their low empathy level is not due to mental or developmental disorders.
The Corpus continues to be developed according to the requirements of new research directions. As the Corpus is publicly available and accessible for free, its data can be used for tackling different research challenges.
Fine-tuning SI Quality Criteria: Could Speech Act Theory be of any Use?
This chapter looks at political rhetoric in the European Parliament, focusing on speech acts and the way they are conveyed by interpreters. Discourse in the European Parliament is a specific genre, with speech acts constituting an integral rhetorical element of that genre. Following an analysis of an authentic corpus comprising more than 100 speeches in four languages delivered in the European Parliament, the theoretical framework of the present chapter focuses on speech act theory and the way it can be used to complement translation and interpreting theories in a close analysis of SI performances. The aim of the analysis has been to use authentic data to obtain specific information that could be applied to interpreter training, as well as to suggest an approach for interpreter quality assessment.
A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications
Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required.
This thesis proposes a methodology for monitoring accommodation during a human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modeling of the behaviour, in a way which is meaningful for implementation in spoken dialogue system (SDS) environments.
In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS 'turn-taking' behaviour.
Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
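The core of the TAMA procedure – smoothing each speaker's frame-level prosodic feature with a moving average over time-aligned frames, then relating the smoothed tracks across speakers – can be sketched as follows. This is a minimal illustration under stated assumptions: the frame values, the window size, and the use of a plain Pearson correlation as the cross-speaker comparison are all illustrative choices, not the thesis's exact parameters or its time-series models.

```python
# Minimal sketch of a Time Aligned Moving Average (TAMA) style analysis:
# smooth each speaker's per-frame prosodic feature with a moving average,
# then correlate the smoothed tracks across speakers.  All values and
# parameters here are illustrative, not taken from the thesis.

def moving_average(values, window):
    """Moving average over sequential frames (edges use shorter windows)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical time-aligned mean-pitch values (Hz), one per dialogue
# frame, for two interlocutors.
speaker_a = [210, 214, 220, 226, 231, 229, 224, 219]
speaker_b = [118, 121, 125, 130, 134, 132, 128, 124]

smooth_a = moving_average(speaker_a, window=3)
smooth_b = moving_average(speaker_b, window=3)

# A high positive correlation between the smoothed tracks is the kind of
# pattern a TAMA-style analysis would read as prosodic accommodation.
print(round(pearson(smooth_a, smooth_b), 3))
```

The time alignment of the frames is what makes the per-speaker tracks directly comparable; without it, the two feature sequences would describe different stretches of the dialogue and their correlation would be meaningless.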
New Approach to Teaching Japanese Pronunciation in the Digital Era - Challenges and Practices
Pronunciation has been a black hole in the L2 Japanese classroom on account of a lack of class time, teacher confidence, and awareness of the need to teach pronunciation, among other reasons. The absence of pronunciation instruction is reported to result in fossilized pronunciation errors, communication problems, and learner frustration. With the intention of contributing to the improvement of this situation, this paper aims at three goals. First, it discusses the importance, necessity, and effectiveness of teaching prosodic aspects of Japanese pronunciation from an early stage of acquisition. Second, it shows that Japanese prosody is challenging because of its typological rareness, regardless of the L1 backgrounds of learners. Third and finally, it introduces a new approach to teaching L2 pronunciation that aims to develop L2 comprehensibility by focusing on essential prosodic features, followed by discussions of key issues concerning how to implement the new approach both inside and outside the classroom in the digital era.
Sound-Action Symbolism
Recent evidence has shown linkages between actions and segmental elements of speech. For instance, close-front vowels are sound-symbolically associated with the precision grip, and front vowels are associated with forward-directed limb movements. The current review article presents a variety of such sound-action effects and proposes that they compose a category of sound symbolism that is based on grounding conceptual knowledge of a referent in articulatory and manual action representations. In addition, the article proposes that even some widely known sound symbolism phenomena, such as sound-magnitude symbolism, can be partially based on similar sensorimotor grounding. It is also discussed that the meaning of suprasegmental speech elements is in many instances similarly grounded in body actions. Sound symbolism, prosody, and body gestures might originate from the same embodied mechanisms that enable a vivid and iconic expression of the meaning of a referent to the recipient.
Peer reviewed