
    Measuring prosodic alignment in cooperative task-based conversations


    Phonetic accommodation during conversational interactions: an overview

    During conversational interactions such as tutoring, instruction-giving tasks, verbal negotiations, or just talking with friends, interlocutors’ behaviors experience a series of changes due to the characteristics of their counterpart and to the interaction itself. These changes are pervasively present in every social interaction, and most of them occur in the sounds and rhythms of our speech, which is known as acoustic-prosodic accommodation, or simply phonetic accommodation. The consequences, linguistic and social constraints, and underlying cognitive mechanisms of phonetic accommodation have been studied for at least 50 years, due to the importance of the phenomenon to several disciplines such as linguistics, psychology, and sociology. Based on the analysis and synthesis of the existing empirical research literature, in this paper we present a structured and comprehensive review of the qualities, functions, onto- and phylogenetic development, and modalities of phonetic accommodation.

    Conversational Alignment: A Study of Neural Coherence and Speech Entrainment

    Conversational alignment refers to the tendency for communication partners to adjust their verbal and non-verbal behaviors to become more like one another during the course of human interaction. This alignment phenomenon has been observed in neural patterns, specifically in the prefrontal areas of the brain (Holper et al., 2013; Cui et al., 2012; Dommer et al., 2012; Holper et al., 2012; Funane et al., 2011; Jiang et al., 2012); verbal behaviors such as acoustic speech features (e.g., Borrie & Liss, 2014; Borrie et al., 2015; Lubold & Pon-Barry, 2014), phonological features (e.g., Babel, 2012; Pardo, 2006), lexical selection (e.g., Brennan & Clark, 1996; Garrod & Anderson, 1989), and syntactic structure (e.g., Branigan, Pickering, & Cleland, 2000; Reitter, Moore, & Keller, 2006); and motor behaviors including body posture, facial expressions, and breathing rate (e.g., Furuyama, Hayashi, & Mishima, 2005; Louwerse, Dale, Bard, & Jeuniaux, 2012; Richardson, Marsh, & Schmidt, 2005; Shockley, Santana, & Fowler, 2003; McFarland, 2001). While conversational alignment is, in itself, a largely physical phenomenon, it has been linked to significant functional value in both the cognitive and social domains. Cognitively, conversational alignment facilitates spoken message comprehension, enabling listeners to share mental models (Garrod & Pickering, 2004) and generate temporal predictions about upcoming aspects of speech. From a social perspective, behavioral alignment has been linked with establishing turn-taking behaviors, and with increased feelings of rapport, empathy, and intimacy between conversational pairs (e.g., Lee et al., 2010; Nind & Macrae, 2009; Smith, 2008; Bailenson & Yee, 2005; Chartrand & Bargh, 1999; Miles, Putman & Street, 1984; Street & Giles, 1982). Benus (2014), for example, observed that individuals who align their speech features are perceived as more socially attractive and likeable, and have interactions that are more successful. These cognitive and social benefits associated with conversational alignment have been observed in both linguistic and neural data (e.g., Holper et al., 2012; 2013; Cui et al., 2012; Jiang et al., 2012; Egetemeir et al., 2011; Stephens et al., 2010). The purpose of the current study was to examine conversational alignment as a multi-level communication phenomenon by examining the relationship between neural and speech behaviors. To assess neural alignment, we used Near-Infrared Spectroscopy (NIRS), a non-invasive neuroimaging technology that detects cortical increases and decreases in the concentration of oxygenated and deoxygenated hemoglobin at multiple measurement sites to determine the rate at which oxygen is being released and absorbed (Ferrari & Quaresima, 2012). While still considered a relatively new neuroimaging technique, NIRS is well established as an effective data collection approach, particularly appropriate for social interaction research (e.g., Holper et al., 2013; Jiang et al., 2012; Holper et al., 2012; Suda et al., 2010). We utilized hyperscanning, a technique that allows the simultaneous measurement of two signals, allowing us to document neural alignment between two individuals (Babiloni & Astolfi, 2012). Recent studies have revealed neural alignment between two persons in cooperative states, including alignment in the right superior frontal cortices and medial prefrontal regions (Cui et al., 2012; Dommer et al., 2012; Funane et al., 2011).
This increased prefrontal interbrain alignment has also been observed in other social interactions, including joint attention tasks (Dommer et al., 2012), imitation tasks (Holper et al., 2012), competitive games (Cheng et al., 2015; Duan et al., 2013), teaching-learning interactions (Holper et al., 2013), face-to-face communication (Jiang et al., 2012), mother-child interactions (Hirata et al., 2014), and cooperative singing tasks (Osaka et al., 2015). Interestingly, Jiang et al. (2012) showed that increased neural alignment occurred between conversational participants only when they were speaking face-to-face, not when participants had their backs to one another. The authors speculated that multi-sensory information, for example motor behaviors such as gestures, was required for neural alignment to occur.
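
As a rough illustration of what such hyperscanning analyses compute, the sketch below correlates two participants' oxygenated-hemoglobin time series in successive windows as a simple proxy for interbrain alignment. This is a simplified stand-in: the studies cited above typically use more robust measures such as wavelet transform coherence, and the sampling rate and window length are illustrative assumptions.

```python
# Rough sketch (not the cited studies' pipeline): windowed Pearson
# correlation between two participants' oxygenated-hemoglobin (HbO) time
# series as a crude proxy for interbrain alignment. Published NIRS
# hyperscanning work typically uses wavelet transform coherence; the
# sampling rate and window length below are illustrative assumptions.
import numpy as np

def windowed_interbrain_correlation(hbo_a, hbo_b, fs=10.0, win_s=30.0):
    """Correlate one HbO channel per participant in successive windows."""
    win = int(fs * win_s)
    n_windows = min(len(hbo_a), len(hbo_b)) // win
    corrs = []
    for i in range(n_windows):
        a = hbo_a[i * win:(i + 1) * win]
        b = hbo_b[i * win:(i + 1) * win]
        corrs.append(np.corrcoef(a, b)[0, 1])
    return np.array(corrs)

# Synthetic example: a shared slow component mimics aligned activity.
rng = np.random.default_rng(0)
shared = np.sin(np.linspace(0.0, 20.0 * np.pi, 6000))
hbo_1 = shared + 0.8 * rng.standard_normal(6000)
hbo_2 = shared + 0.8 * rng.standard_normal(6000)
print(windowed_interbrain_correlation(hbo_1, hbo_2).mean())
```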

    A Study of Prosodic Entrainment and Social Factors in Mandarin Conversations

    In conversations, interlocutors usually adapt their prosody to that of their partner, becoming more similar in prosodic production for successful communication. This phenomenon of prosodic entrainment is related to complex factors. This study aims to explore the relationship between prosodic entrainment and social factors. Two analyses are carried out: an analysis of prosodic entrainment and gender, and an analysis of prosodic entrainment and role. In terms of prosodic entrainment and gender, it is found that the largest number of prosodic features are entrained in female-male conversations, and the fewest in male-male conversations. In terms of prosodic entrainment and role, it is found that different roles influence the degree of entrainment, and that information givers entrain more to information followers in conversation.
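
The paper's exact measure is not reproduced here, but a common proximity-style measure from the entrainment literature (cf. Levitan & Hirschberg, 2011) conveys the idea: a speaker entrains on a feature if their value is closer to their partner's than to speakers from other conversations. A minimal sketch with hypothetical mean-pitch values:

```python
# Illustrative sketch, not the paper's exact method: a common way to
# quantify entrainment (cf. Levitan & Hirschberg, 2011) is partner
# proximity: how much closer a speaker's feature value is to their
# partner's than to non-partners'. The mean-pitch values below are
# hypothetical.
import numpy as np

def partner_proximity(speaker_val, partner_val, non_partner_vals):
    """Positive values: the speaker is closer to the partner than to the
    average non-partner, i.e., evidence of entrainment."""
    diff_partner = abs(speaker_val - partner_val)
    diff_others = np.mean([abs(speaker_val - v) for v in non_partner_vals])
    return diff_others - diff_partner

# Hypothetical mean pitch (Hz) for one speaker, their partner, and two
# speakers from other conversations.
print(partner_proximity(210.0, 205.0, [180.0, 250.0]))  # > 0: entrained
```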

    LINGUISTIC ENTRAINMENT IN MULTI-PARTY SPOKEN DIALOGUES

    Entrainment is the propensity of speakers to begin behaving like one another in conversations. Evidence of entrainment has been found in multiple aspects of speech, including acoustic-prosodic and lexical features. More interestingly, the strength of entrainment has been shown to be associated with numerous conversational qualities, such as social variables. These two characteristics make entrainment an interesting research area for multiple disciplines, such as natural language processing and psychology. To date, mainly simple methods such as unweighted averaging have been used to move from pairs to groups, and the focus of prior multi-party work has been on text rather than speech (e.g., Wikipedia, Twitter, online forums, and corporate emails). The focus of this research, unlike previous studies, is multi-party spoken dialogues. The goal of this work is to develop, validate, and evaluate multi-party entrainment measures that incorporate characteristics of multi-party interactions and are associated with measures of team outcomes. In this thesis, I first explore the relation between entrainment on acoustic-prosodic and lexical features and show that they correlate. In addition, I show that a multi-modal model using entrainment features from both of these modalities outperforms the uni-modal model at predicting team outcomes. Moreover, I present enhanced multi-party entrainment measures that utilize the dynamics of entrainment in groups, in both global and local settings. For global entrainment, I present a weighted convergence measure based on group dynamics. As a first step toward the development of local multi-party measures, I investigate whether local entrainment occurs within a time lag in groups using a temporal-window approach. Next, I propose a novel approach to learning a vector representation of multi-party local entrainment by encoding the structure of the presented multi-party entrainment graphs. The positive results in both the global and local settings indicate the importance of incorporating entrainment dynamics in groups. Finally, I propose a novel approach to incorporating a team-level factor of gender composition to enhance multi-party entrainment measures. All of the proposed work is in the direction of enhancing multi-party entrainment measures with a focus on spoken dialogues, although the measures can also be employed on text-based communication.
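
As a hedged illustration of the move from pairwise to group-level entrainment, the sketch below averages pairwise convergence scores over all speaker pairs, either unweighted (the simple method prior work used) or weighted. Weighting by joint speaking time is an assumption for illustration, not the thesis's actual group-dynamics weighting.

```python
# Hedged sketch of moving from pairwise to group-level convergence. The
# unweighted average mirrors the simple method prior work used; the
# weighted variant stands in for the thesis's group-dynamics weighting,
# with joint speaking time as an illustrative assumption for the weights.
import itertools
import numpy as np

def pair_convergence(feat_a, feat_b):
    """Convergence for one pair: feature difference in the first half of
    the dialogue minus the difference in the second half (positive means
    the pair is converging)."""
    half = min(len(feat_a), len(feat_b)) // 2
    d_first = abs(np.mean(feat_a[:half]) - np.mean(feat_b[:half]))
    d_second = abs(np.mean(feat_a[half:]) - np.mean(feat_b[half:]))
    return d_first - d_second

def group_convergence(features, weights=None):
    """features: {speaker: per-turn feature values}; weights: optional
    {(spk_i, spk_j): weight}, e.g., joint speaking time (assumed)."""
    pairs = list(itertools.combinations(sorted(features), 2))
    scores = [pair_convergence(features[i], features[j]) for i, j in pairs]
    if weights is None:                    # simple unweighted averaging
        return float(np.mean(scores))
    w = [weights[p] for p in pairs]
    return float(np.average(scores, weights=w))

# Toy example with three speakers' per-turn pitch values (Hz).
feats = {"A": [220, 215, 210, 205],
         "B": [180, 190, 200, 205],
         "C": [240, 238, 236, 235]}
print(group_convergence(feats))
```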

    A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology for monitoring accommodation during a human-human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modelling of the behaviour, in a way which is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both the TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS “turn-taking” behaviour. Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
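
A minimal sketch of the TAMA idea follows, assuming fixed 20-second frames, a 3-frame moving average, and pitch as the prosodic feature; the thesis's actual frame length, overlap, and smoothing parameters may differ.

```python
# Minimal sketch of the Time Aligned Moving Average (TAMA) idea: average a
# prosodic feature within fixed-length frames that are time-aligned across
# both speakers, smooth each speaker's frame series with a moving average,
# and correlate the two smoothed series. Frame and smoothing lengths are
# illustrative assumptions.
import numpy as np

def tama_series(times, values, frame_s=20.0, n_frames=30, smooth=3):
    """Per-frame means of (time, value) prosodic measurements for one
    speaker, smoothed with a moving average."""
    frames = np.full(n_frames, np.nan)
    for k in range(n_frames):
        mask = (times >= k * frame_s) & (times < (k + 1) * frame_s)
        if mask.any():
            frames[k] = values[mask].mean()
    kernel = np.ones(smooth) / smooth
    return np.convolve(frames, kernel, mode="valid")

def tama_correlation(times_a, vals_a, times_b, vals_b, **kw):
    """Correlate the two speakers' time-aligned smoothed series."""
    s_a = tama_series(times_a, vals_a, **kw)
    s_b = tama_series(times_b, vals_b, **kw)
    ok = ~np.isnan(s_a) & ~np.isnan(s_b)   # frames where both spoke
    return np.corrcoef(s_a[ok], s_b[ok])[0, 1]

# Synthetic example: two pitch tracks sharing a slow drift over 10 minutes.
rng = np.random.default_rng(2)
t_a = np.sort(rng.uniform(0.0, 600.0, 400))
t_b = np.sort(rng.uniform(0.0, 600.0, 400))
v_a = 200.0 + 10.0 * np.sin(t_a / 60.0) + rng.normal(0.0, 2.0, 400)
v_b = 150.0 + 8.0 * np.sin(t_b / 60.0) + rng.normal(0.0, 2.0, 400)
print(tama_correlation(t_a, v_a, t_b, v_b))
```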

    Vocal accommodation in human-computer interaction: modeling and integration into spoken dialogue systems

    With the rapidly increasing usage of voice-activated devices worldwide, verbal communication with computers is steadily becoming more common. Although speech is the principal natural manner of human communication, it is still challenging for computers, and users have grown accustomed to adjusting their speaking style for computers. Such adjustments occur naturally, and typically unconsciously, in humans during an exchange to control the social distance between the interlocutors and improve the conversation’s efficiency. This phenomenon is called accommodation, and it occurs in various modalities of human communication, like hand gestures, facial expressions, eye gaze, lexical and grammatical choices, and others. Vocal accommodation deals with phonetic-level changes occurring in segmental and suprasegmental features. A decrease in the difference between the speakers’ feature realizations results in convergence, while an increasing distance leads to divergence. The lack of such mutual adjustments, made naturally by humans, in computers’ speech creates a gap between human-human and human-computer interactions. Moreover, voice-activated systems currently speak in exactly the same manner to all users, regardless of their speech characteristics or realizations of specific features. Detecting phonetic variations and generating adaptive speech output would enhance user personalization, offer more human-like communication, and ultimately should improve the overall interaction experience. Thus, investigating these aspects of accommodation will help to understand and improve human-computer interaction. This thesis provides a comprehensive overview of the building blocks required for a roadmap toward the integration of accommodation capabilities into spoken dialogue systems. These include conducting human-human and human-computer interaction experiments to examine the differences in vocal behaviors, approaches for modeling these empirical findings, methods for introducing phonetic variations in synthesized speech, and a way to combine all these components into an accommodative system. While each component is a wide research field by itself, they depend on each other and hence should be jointly considered. The overarching goal of this thesis is therefore not only to show how each of the aspects can be further developed, but also to demonstrate and motivate the connections between them. A special emphasis is put throughout the thesis on the importance of the temporal aspect of accommodation. Humans constantly change their speech over the course of a conversation. Therefore, accommodation processes should be treated as continuous, dynamic phenomena. Measuring differences at a few discrete points, e.g., the beginning and end of an interaction, may leave many accommodation events undiscovered or overly smoothed. To justify the effort of introducing accommodation in computers, it should first be proven that humans show phonetic adjustments when talking to a computer as they do with a human being. As there is no definitive metric for measuring accommodation and evaluating its quality, it is important to empirically study human productions to later use as references for possible behaviors. In this work, this investigation encompasses different experimental configurations to achieve a better picture of accommodation effects. First, vocal accommodation was inspected where it naturally occurs, namely in spontaneous human-human conversations.
For this purpose, a corpus of real-world sales conversations, each with a different representative-prospect pair, was collected and analyzed. These conversations offer a glance into accommodation effects in authentic, unscripted interactions with the common goal of negotiating a deal on the one hand, but with each side individually trying to get the best terms on the other. The conversations were analyzed using cross-correlation and time series techniques to capture the change dynamics over time. It was found that successful conversations are distinguishable from failed ones by multiple measures. Furthermore, the sales representatives proved to be better at leading the vocal changes, i.e., making the prospect follow their speech styles rather than the other way around. They also showed a stronger tendency to take that lead at an earlier stage, all the more so in successful conversations. The fact that trained speakers accommodate more, and that this improves their performance, fits the anecdotal best practices of sales experts, which are now also supported scientifically. Following these results, the next experiment came closer to the final goal of this work and investigated vocal accommodation effects in human-computer interaction. This was done via a shadowing experiment, which offers a controlled setting for examining phonetic variations. As spoken dialogue systems with such accommodation capabilities (like this work aims to achieve) do not exist yet, a simulated system was used to introduce these changes to the participants, who believed they were helping to test a language learning tutoring system. After determining their preference concerning three segmental phonetic features, participants listened to either natural or synthesized voices of male and female speakers, which produced the participants’ dispreferred variant of the aforementioned features. Accommodation occurred in all cases, but the natural voices triggered stronger effects. Nevertheless, it can be concluded that participants accommodated toward synthetic voices as well, which means that humans apply social mechanisms also when speaking with computer-based interlocutors. The shadowing paradigm was also utilized to test whether accommodation is a phenomenon associated only with speech or with other vocal productions as well. To that end, accommodation in the singing of familiar and novel music was examined. Interestingly, accommodation was found in both cases, though in different ways. While participants seemed to use the familiar piece merely as a reference for singing more accurately, the novel piece became the target of complete replication. For example, one difference was that mostly pitch corrections were introduced in the former case, while in the latter case key and rhythmic patterns were adopted as well. Some of these findings were expected, and they show that people’s more salient features are also harder to modify using external auditory influence. Lastly, a multiparty experiment with spontaneous human-human-computer interactions was carried out to compare accommodation in human-directed and computer-directed speech. The participants solved tasks for which they needed to talk both with a confederate and with an agent. This allows a direct comparison of their speech based on the addressee within the same conversation, which has not been done so far.
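
As a concrete illustration of how a cross-correlation analysis can show who leads vocal changes, the sketch below (a generic lead-lag analysis, not the thesis's exact procedure) finds the lag at which two speakers' feature tracks are maximally correlated; a positive peak lag means the first speaker leads.

```python
# Illustrative lead-lag sketch: if the normalized cross-correlation of two
# speakers' feature tracks peaks at a positive lag, the first speaker's
# changes precede the second's, i.e., the first speaker leads. The lag
# range and the synthetic tracks below are assumptions for illustration.
import numpy as np

def lead_lag(series_a, series_b, max_lag=10):
    """Return the lag (in frames) maximizing the normalized
    cross-correlation; positive means series_a leads series_b."""
    a = (series_a - series_a.mean()) / series_a.std()
    b = (series_b - series_b.mean()) / series_b.std()

    def xcorr(lag):
        if lag >= 0:
            return np.mean(a[:len(a) - lag] * b[lag:])
        return np.mean(a[-lag:] * b[:len(b) + lag])

    return max(range(-max_lag, max_lag + 1), key=xcorr)

# Synthetic example: the prospect copies the representative 3 frames later.
rng = np.random.default_rng(1)
rep = rng.standard_normal(200).cumsum()
prospect = np.roll(rep, 3) + 0.5 * rng.standard_normal(200)
print(lead_lag(rep, prospect))  # expected: about +3
```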
Results show that some participants’ vocal behavior changed similarly when talking to the confederate and the agent, while others’ speech varied only with the confederate. Further analysis found that the strongest factor behind this difference was the order in which the participants talked with the interlocutors. Apparently, those who first talked to the agent alone saw it more as a social actor in the conversation, while those who interacted with it after talking to the confederate treated it more as a means to achieve a goal, and thus behaved differently with it. In the latter case, the variations in the human-directed speech were much more prominent. Differences were also found between the analyzed features, but the task type did not influence the degree of accommodation effects. The results of these experiments lead to the conclusion that vocal accommodation does occur in human-computer interactions, even if often to lesser degrees. Having answered the question of whether people accommodate to computer-based interlocutors at all, the next step would be to describe accommodative behaviors in a computer-processable manner. Two approaches are proposed here: computational and statistical. The computational model aims to capture the presumed cognitive process associated with accommodation in humans. This comprises various steps, such as detecting the variable feature’s sound, adding instances of it to the feature’s mental memory, and determining how much the sound will change while taking into account both its current representation and the external input. Due to its sequential nature, this model was implemented as a pipeline. Each of the pipeline’s five steps corresponds to a specific part of the cognitive process and can have one or more parameters to control its output (e.g., the size of the feature’s memory or the accommodation pace). Using these parameters, precise accommodative behaviors can be crafted while applying expert knowledge to motivate the chosen parameter values. These advantages make this approach suitable for experimentation with pre-defined, deterministic behaviors where each step can be changed individually. Ultimately, this approach makes a system vocally responsive to users’ speech input. The second approach grants more evolved behaviors by defining different core behaviors and adding non-deterministic variations on top of them. This resembles human behavioral patterns, as each person has a base way of accommodating (or not accommodating), which may arbitrarily change based on the specific circumstances. This approach offers a data-driven statistical way to extract accommodation behaviors from a given collection of interactions. First, the target feature’s values for each speaker in an interaction are converted into continuous interpolated lines by drawing one sample from the posterior distribution of a Gaussian process conditioned on the given values. Then, the gradients of these lines, which represent rates of mutual change, are used to define discrete levels of change based on their distribution. Finally, each level is assigned a symbol, which ultimately creates a symbol-sequence representation for each interaction. The sequences are clustered so that each cluster stands for a type of behavior. The sequences of a cluster can then be used to calculate n-gram probabilities that enable the generation of new sequences of the captured behavior. The specific output value is sampled from the range corresponding to the generated symbol.
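
To make the statistical approach concrete, here is a hedged numpy-only sketch of its first steps: sampling a Gaussian-process posterior through a speaker's feature values, taking the gradient of the sampled line, and quantizing the gradients into symbols. The RBF kernel, length scale, and three-level quantization are illustrative assumptions, and the clustering and n-gram steps are omitted.

```python
# Hedged sketch of the statistical behavior-extraction approach described
# above. Kernel choice, length scale, and the three symbol levels are
# assumptions; the thesis derives its levels from the gradient
# distribution and adds clustering and n-gram generation on top.
import numpy as np

def rbf(x1, x2, length_scale=2.0):
    """Radial basis function kernel matrix between two 1-D inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length_scale ** 2)

def gp_posterior_sample(t_obs, y_obs, t_query, noise=1e-4, seed=0):
    """Draw one sample from a GP posterior conditioned on (t_obs, y_obs)."""
    K = rbf(t_obs, t_obs) + noise * np.eye(len(t_obs))
    Ks = rbf(t_query, t_obs)
    K_inv = np.linalg.inv(K)
    mean = Ks @ K_inv @ y_obs
    cov = rbf(t_query, t_query) - Ks @ K_inv @ Ks.T
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(t_query)))

def to_symbols(line, t_query):
    """Quantize the line's gradient into falling (-), steady (=), rising (+)."""
    grad = np.gradient(line, t_query)
    thresh = np.std(grad) / 2          # assumed level boundary
    return "".join("-" if g < -thresh else "+" if g > thresh else "="
                   for g in grad)

t = np.array([0.0, 2.0, 5.0, 7.0, 10.0])            # utterance times (s)
f0 = np.array([190.0, 200.0, 220.0, 215.0, 230.0])  # mean pitch per utterance (Hz)
t_query = np.linspace(0.0, 10.0, 50)
print(to_symbols(gp_posterior_sample(t, f0, t_query), t_query))
```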
With this approach, accommodation behaviors are extracted directly from data, as opposed to manually crafting them. However, it is harder to describe what exactly these behaviors represent and to motivate the use of one over another. To bridge the gap between these two approaches, it is also discussed how they can be combined to benefit from the advantages of both. Furthermore, to generate more structured behaviors, a hierarchy of accommodation complexity levels is suggested here, from a direct adoption of users’ realizations, via specified responsiveness, up to independent core behaviors with non-deterministic variational productions. Besides a way to track and represent vocal changes, an accommodative system also needs a text-to-speech component that is able to realize those changes in the system’s speech output. Speech synthesis models are typically trained once on data with certain characteristics and do not change afterward. This prevents such models from introducing any variation in specific sounds and other phonetic features. Two methods for directly modifying such features are explored here. The first is based on signal modifications applied to the output signal after it has been generated by the system. The processing is done between the timestamps of the target features and uses pre-defined scripts that modify the signal to achieve the desired values. This method is more suitable for continuous features like vowel quality, especially in the case of subtle changes that do not necessarily lead to a categorical sound change. The second method aims to capture phonetic variations in the training data. To that end, a training corpus with phonemic representations is used, as opposed to the regular graphemic representations. This way, the model can learn more direct relations between phonemes and sound, instead of between surface forms and sound, which, depending on the language, might be more complex and depend on surrounding letters. The target variations themselves do not necessarily need to be explicitly present in the training data, as long as the different sounds are naturally distinguishable. At generation time, the current state of the target feature determines the phoneme to use for generating the desired sound. This method is suitable for categorical changes, especially for contrasts that naturally exist in the language. While both methods have certain limitations, they provide a proof of concept for the idea that spoken dialogue systems may phonetically adapt their speech output in real time and without re-training their text-to-speech models. To combine the behavior definitions and the speech manipulations, a system is required that connects these elements to create a complete accommodation capability. The architecture suggested here extends the standard spoken dialogue system with an additional module, which receives the transcribed speech signal from the speech recognition component without influencing the input to the language understanding component. While the language understanding component uses only the textual transcription to determine the user’s intention, the added component processes the raw signal along with its phonetic transcription. In this extended architecture, the accommodation model is activated in the added module, and the information required for speech manipulation is sent to the text-to-speech component. However, the text-to-speech component now has two inputs, viz.
the content of the system’s response coming from the language generation component, and the states of the defined target features from the added component. An implementation of a web-based system with this architecture is introduced here, and its functionality is showcased by demonstrating how it can be used to conduct a shadowing experiment automatically. This has two main advantages. First, since the system recognizes the participants’ phonetic variations and automatically selects the appropriate variation to use in its response, the experimenter saves time and avoids manual annotation errors. The experimenter also automatically gains additional information, like exact timestamps of utterances, real-time visualization of the interlocutors’ productions, and the possibility to replay and analyze the interaction after the experiment is finished. The second advantage is scalability. Multiple instances of the system can run on a server and be accessed by multiple clients at the same time. This not only saves the time and logistics of bringing participants into a lab, but also allows running the experiment with different configurations (e.g., other parameter values or target features) in a controlled and reproducible way. This completes a full cycle from examining human behaviors to integrating accommodation capabilities. Though each part of it can undoubtedly be further investigated, the emphasis here is on how the parts depend on and connect to each other. Measuring changes in features without showing how they can be modeled, or achieving flexible speech synthesis without considering the desired final output, might not lead to the final goal of introducing accommodation capabilities into computers. Treating accommodation in human-computer interaction as one large process, rather than as isolated sub-problems, lays the ground for more comprehensive and complete solutions in the future.
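
A minimal, assumption-laden sketch of this extended architecture in Python: the ASR output feeds the language understanding component unchanged, while an added module receives the raw signal with its phonetic transcription and passes target-feature states to the text-to-speech component alongside the response text. All names are invented for illustration, and the accommodation logic is a placeholder.

```python
# Sketch of the extended spoken dialogue system architecture described
# above, under the assumption that components can be modeled as plain
# callables. The AccommodationModule stands in for the added module; its
# update rule here is only a placeholder, not the thesis's model.
from dataclasses import dataclass, field

@dataclass
class AccommodationModule:
    feature_states: dict = field(default_factory=dict)

    def process(self, raw_signal, phonetic_transcription):
        # Placeholder update: record the user's latest realization of each
        # target feature (a real model would apply its accommodation rule).
        for feature, value in phonetic_transcription.items():
            self.feature_states[feature] = value
        return self.feature_states

def dialogue_turn(asr, nlu, nlg, tts, accommodation, audio_in):
    text, raw, phones = asr(audio_in)            # transcription + raw signal
    intent = nlu(text)                           # NLU sees only the text
    states = accommodation.process(raw, phones)  # added module, in parallel
    response = nlg(intent)
    return tts(response, states)                 # TTS now takes two inputs

# Toy usage with stub components:
module = AccommodationModule()
print(dialogue_turn(
    asr=lambda audio: ("hello", audio, {"vowel_a_f1": 700.0}),
    nlu=lambda text: {"intent": "greet"},
    nlg=lambda intent: "Hi there!",
    tts=lambda text, states: (text, dict(states)),
    accommodation=module,
    audio_in=b"\x00\x01",
))
```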