479 research outputs found

    A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementing this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation; a quantitative description of inter-speaker accommodation is therefore required. This thesis proposes a methodology for monitoring accommodation during a human or human-computer dialogue, which applies a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modelling of the behaviour in a way that is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides a point of view complementary to that of TAMA for monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both the TAMA and turn-distribution metrics indicate that the correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS turn-taking behaviour.
Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. This thesis therefore constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
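The core of the TAMA methodology, a moving average of a prosodic feature computed over overlapping frames on a fixed time grid, can be sketched as follows. This is an illustrative reconstruction rather than the thesis's implementation; the `tama` function name and the frame length and step values are assumptions.

```python
import numpy as np

def tama(times, values, frame_len=20.0, step=10.0):
    """Moving-average filter over sequential, overlapping frames.

    times  : 1-D array of measurement timestamps (seconds)
    values : 1-D array of a prosodic feature (e.g. pitch) at those times
    Returns (frame_centers, frame_averages). Frame length and step
    are illustrative, not the thesis's values.
    """
    start, end = times.min(), times.max()
    centers, averages = [], []
    t = start
    while t + frame_len <= end:
        # Average every measurement falling inside the current frame.
        mask = (times >= t) & (times < t + frame_len)
        if mask.any():
            centers.append(t + frame_len / 2)
            averages.append(values[mask].mean())
        t += step
    return np.array(centers), np.array(averages)
```

In the thesis's setting, each speaker's feature series would be computed on the same frame grid (the time alignment) and the two resulting series then compared frame by frame, e.g. via correlation, to monitor accommodation over the course of the dialogue.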

    Addressing the automatic measurement of online audience experience

    Final-year project for the Double Degree in Computer Science and Mathematics, Facultad de Informática UCM, Departamento de Ingeniería del Software e Inteligencia Artificial, academic year 2020/2021. The availability of automatic and personalized feedback is a large advantage when facing an audience. An effective way to give such feedback is to analyze the audience experience, which provides valuable information about the quality of a speech or performance. In this document, we present the design and implementation of a computer vision system to automatically measure audience experience. This includes the definition of a theoretical and practical framework, grounded in a theatrical perspective, to quantify this concept; the development of an artificial intelligence system which serves as a proof of concept of our approach; and the creation of a dataset to train our system. To facilitate the data collection step, we have also created a custom video conferencing tool. Additionally, we present the evaluation of our artificial intelligence system and the final conclusions.

    A Multimodal Approach to Sarcasm Detection on Social Media

    In recent times, a major share of human communication takes place online, mainly because of the ease of communication on social networking sites (SNSs). Due to the variety and large number of users, SNSs have drawn the attention of the computer science (CS) community, particularly the affective computing (also known as emotional AI), information retrieval, natural language processing, and data mining groups. Researchers are trying to make computers understand the nuances of human communication, including sentiment and sarcasm. Emotion or sentiment detection requires more insight into the communication than factual information retrieval does. Sarcasm detection is particularly more difficult than categorizing sentiment, because in sarcasm the meaning intended by the user is the opposite of the literal meaning. Because of its complex nature, it is often difficult even for humans to detect sarcasm without proper context. However, people on social media succeed in detecting sarcasm despite interacting with strangers across the world. That motivates us to investigate the human process of detecting sarcasm on social media, where abundant context information is often unavailable and the users communicating with each other are rarely well acquainted. We conducted a qualitative study to examine the patterns with which users convey sarcasm on social media. Whereas most sarcasm detection systems work on a word-by-word basis to accomplish their goal, we focused on the holistic sentiment conveyed by the post. We argue that relying on word-level information limits a system's performance to the domain of the dataset used to train it, and that such a system might not perform well for non-English languages. As an endeavour to make our system less dependent on text data, we proposed a multimodal approach to sarcasm detection, and showed the applicability of images and reaction emoticons as additional sources of hints about the sentiment of a post.
Our research showed superior results for the multimodal approach compared to a unimodal approach. Multimodal sarcasm detection systems such as the one presented in this research, with the inclusion of more modes or sources of data, might lead to better sarcasm detection models.

    Time- and value-continuous explainable affect estimation in-the-wild

    Today, the relevance of Affective Computing, i.e., of making computers recognise and simulate human emotions, cannot be overstated. All technology giants (from manufacturers of laptops to mobile phones to smart speakers) are in a fierce competition to make their devices understand not only what is being said, but also how it is being said, in order to recognise the user's emotions. The goals have evolved from predicting the basic emotions (e.g., happy, sad) to predicting more nuanced affective states (e.g., relaxed, bored) in real time. The databases used in such research have evolved too, from featuring acted behaviours to featuring spontaneous ones. Lately there has been a more powerful shift, called in-the-wild affect recognition, i.e., taking the research out of the laboratory and into the uncontrolled real world. This thesis discusses, for the very first time, affect recognition for two unique in-the-wild audiovisual databases, GRAS2 and SEWA. GRAS2 is the only database to date with time- and value-continuous affect annotations for Labov-effect-free affective behaviours, i.e., recorded without the participants' awareness of being recorded (which is otherwise known to affect the naturalness of one's affective behaviour). SEWA features participants from six different cultural backgrounds conversing over a video-calling platform; its in-the-wild recordings are thus further corrupted by unpredictable artifacts, such as network-induced delays, frame freezing and echoes. The two databases present a unique opportunity to study time- and value-continuous affect estimation that is truly in-the-wild. A novel 'Evaluator Weighted Estimation' formulation is proposed to generate a gold-standard sequence from several annotations.
An illustration is presented demonstrating that the moving bag-of-words (BoW) representation better preserves the temporal context of the features while remaining more robust against outliers than other statistical summaries, e.g., the moving average. A novel, data-independent randomised codebook is proposed for the BoW representation; it is especially useful for cross-corpus model-generalisation testing when the feature spaces of the databases differ drastically. Various deep learning models and support vector regressors are used to predict affect dimensions time- and value-continuously. The better generalisability of the models trained on GRAS2, despite the smaller training size, makes a strong case for the collection and use of Labov-effect-free data. A further foundational contribution is the discovery of the missing many-to-many mapping between the mean square error (MSE) and the concordance correlation coefficient (CCC), i.e., between two of the most popular utility functions to date. The newly introduced cost function |MSE_{XY}/σ_{XY}| is evaluated in experiments aimed at demystifying the inner workings of a well-performing, simple, low-cost neural network that effectively utilises the BoW text features. Also proposed herein is the shallowest-possible convolutional neural network (CNN) that uses facial action unit (FAU) features. The CNN exploits sequential context but, unlike RNNs, also inherently allows data and process parallelism. Interestingly, for the most part, these white-box AI models have been shown to utilise the provided features in a manner consistent with the human perception of emotion expression.
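The MSE and CCC mentioned above are standard quantities whose definitions make the link between them explicit: writing σ_XY for the covariance, MSE = σ_X² + σ_Y² + (μ_X − μ_Y)² − 2σ_XY, so the two differ exactly through the covariance term. A minimal sketch follows, with the abstract's cost |MSE_XY/σ_XY| reproduced as stated; the function names are assumptions.

```python
import numpy as np

def mse(x, y):
    """Mean square error between prediction x and gold standard y."""
    return np.mean((x - y) ** 2)

def ccc(x, y):
    """Concordance correlation coefficient (Lin's CCC)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # population (biased) variances
    sxy = np.mean((x - mx) * (y - my))   # covariance sigma_XY
    return 2 * sxy / (vx + vy + (mx - my) ** 2)

def mse_over_cov(x, y):
    """The cost |MSE_XY / sigma_XY| from the abstract."""
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    return abs(mse(x, y) / sxy)
```

Expanding MSE gives MSE = σ_X² + σ_Y² + (μ_X − μ_Y)² − 2σ_XY, hence CCC = 2σ_XY / (MSE + 2σ_XY): any given MSE is compatible with many CCC values and vice versa, which is the many-to-many mapping the abstract refers to.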

    Automated Semantic Understanding of Human Emotions in Writing and Speech

    Affective Human Computer Interaction (A-HCI) will be critical for the success of new technologies that will be prevalent in the 21st century. If cell phones and the internet are any indication, there will be continued rapid development of automated assistive systems that help humans to live better, more productive lives. These will not be just passive systems such as cell phones, but active assistive systems such as robot aides in use in hospitals, homes, entertainment rooms, offices, and other work environments. Such systems will need to be able to properly deduce a human's emotional state before they determine how best to interact with people. This dissertation explores and extends the body of knowledge related to Affective HCI. New semantic methodologies are developed and studied for the reliable and accurate detection of human emotional states and magnitudes in written and spoken speech, and for mapping emotional states and magnitudes to 3-D facial expression outputs. The automatic detection of affect in language is based on natural language processing and machine learning approaches. Two affect corpora were developed to perform this analysis. Emotion classification is performed at the sentence level using a step-wise approach which incorporates sentiment flow and sentiment composition features. For emotion magnitude estimation, a regression model was developed to predict the evolving emotional magnitude of actors. Emotional magnitudes at any point during a story or conversation are determined by 1) the previous emotional state magnitude; 2) new text and speech inputs that might act upon that state; and 3) information about the context the actors are in. Acoustic features are also used to capture additional information from the speech signal. The automatic understanding of affect is evaluated by testing the model on a test subset of the newly extended corpus.
To visualize actor emotions as perceived by the system, a methodology was also developed to map predicted emotion-class magnitudes to 3-D facial parameters using vertex-level mesh morphing. The developed sentence-level emotion state detection approach achieved classification accuracies as high as 71% for the neutral vs. emotion classification task on a test corpus of children's stories. After class re-sampling, the step-wise classification methodology achieved accuracies in the 56% to 84% range for each emotion class and polarity on a test subset of a medical drama corpus. For emotion magnitude prediction, the developed recurrent (prior-state feedback) regression model using both text-based and acoustic features achieved correlation coefficients in the range of 0.69 to 0.80. This prediction function was modelled using a non-linear approach based on Support Vector Regression (SVR) and performed better than approaches based on Linear Regression or Artificial Neural Networks.
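The recurrent (prior-state feedback) scheme described above, in which the previously predicted magnitude is fed back as an input for the next step, can be sketched generically. This is a schematic of the feedback loop only, with a stand-in callable in place of the trained SVR; the `predict_sequence` name and the neutral starting magnitude of 0.0 are assumptions.

```python
import numpy as np

def predict_sequence(model, features):
    """Run a regressor over a sequence with prior-state feedback.

    model    : callable taking one feature vector, returning a magnitude
    features : array of shape (n_steps, n_features)
    The previous prediction is prepended to each step's features,
    so the emotional state evolves from one utterance to the next.
    """
    prev = 0.0  # assumed neutral starting magnitude
    out = []
    for f in features:
        x = np.concatenate(([prev], f))  # feed prior state back in
        prev = float(model(x))
        out.append(prev)
    return np.array(out)
```

With a trained regressor in place of `model`, identical inputs can yield different outputs at different points in a story, because the accumulated prior state differs: this is the behaviour the dissertation's recurrent formulation is designed to capture.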

    Expanding Data Imaginaries in Urban Planning: Foregrounding lived experience and community voices in studies of cities with participatory and digital visual methods

    “Expanding Data Imaginaries in Urban Planning” synthesizes more than three years of industrial research conducted within Gehl and the Techno–Anthropology Lab at Aalborg University. Through practical experiments with social media images, digital photovoice, and participatory map-making, the project explores how visual materials created by citizens can be used within a digital and participatory methodology to reconfigure the empirical ground of data-driven urbanism. Drawing on a data-feminist framework, the project uses visual research to elevate community voices and situate urban issues in lived experience. As a Science and Technology Studies project, the PhD also uses its industrial position as an opportunity to study Gehl’s practices up close, unpacking collectively held narratives and visions that form a particular “data imaginary” and contribute to the production and perpetuation of the role of data in urban planning. The dissertation identifies seven epistemological commitments that shape the data imaginary at Gehl and act as discursive closures within their practice. To illustrate how planners might expand on these, the dissertation uses its own data experiments as speculative demonstrations of how participatory and digital visual methods can make alternative modes of knowing cities possible.

    Syntactic difficulties in translation

    Even though machine translation (MT) systems such as Google Translate and DeepL have improved significantly in recent years, the continuous rise in globalisation and linguistic diversity requires increasing amounts of professional, error-free translation. One can imagine, for instance, that mistakes in medical leaflets can lead to disastrous consequences. Less catastrophic, but equally significant, is MT systems' lack of a consistent and creative style in literary genres. In such cases, a human translation is preferred. Translating a text is a complex procedure that involves a variety of mental processes, such as understanding the original message and its context, finding a fitting translation, and verifying that the translation is grammatical, contextually sound, and generally adequate and acceptable. From an educational perspective, it would be helpful if the translation difficulty of a given text could be predicted, for instance to ensure that texts of objectively appropriate difficulty levels are used in exams and assignments for translators. It may also prove useful in the translation industry, for example to direct more difficult texts to more experienced translators. During this PhD project, my coauthors and I investigated which linguistic properties contribute to such difficulties. Specifically, we focused on syntactic differences between a source text and its translation, that is to say, their (dis)similarities in terms of linguistic structure. To this end we developed new measures that can quantify such differences, and made the implementation publicly available for other researchers to use. These metrics include word (group) movement (how the order in the original text differs from that in a given translation), changes in the linguistic properties of words, and a comparison of the underlying abstract structure of a sentence and its translation. Translation difficulty cannot be measured directly, but process information can help: keystroke logging and eye-tracking data can be recorded during translation and used as a proxy for the required cognitive effort. For example, the longer a translator looks at a word, the more time and effort they likely need to process it. We investigated the effect that specific measures of syntactic similarity have on these behavioural processing features to get an indication of their effect on translation difficulty. In short: how does the syntactic (dis)similarity between a source text and a possible translation impact the translation difficulty? In our experiments, we show that different syntactic properties indeed have an effect, and that differences in syntax between a source text and its translation affect the cognitive effort required to translate that text. These effects are not identical between syntactic properties, though, suggesting that individual syntactic properties affect the translation process in different ways and that not all syntactic dissimilarities contribute to translation difficulty equally.
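One simple way to quantify the word (group) movement described in the abstract above is to count order inversions between aligned source and target positions, in the spirit of a Kendall tau distance. This is an illustrative simplification rather than the thesis's actual metrics; the `crossing_pairs` helper and the dictionary-based alignment format are assumptions.

```python
def crossing_pairs(alignment):
    """Count pairs of aligned words whose relative order is inverted
    in the translation -- a simple word-movement metric.

    alignment : dict mapping source word positions to target positions
    Returns the number of crossing alignment links: 0 for a fully
    monotone translation, n*(n-1)/2 for a fully reversed one.
    """
    # Target positions visited in source order.
    positions = [alignment[i] for i in sorted(alignment)]
    return sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]  # a later source word ends up earlier
    )
```

For example, an English-to-German translation that moves a clause-final verb earlier in the source order would produce crossing links, and under the hypothesis investigated in the thesis, such reordering should correlate with longer gaze durations and more keystroke pauses.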

    Improving Hybrid Brainstorming Outcomes with Scripting and Group Awareness Support

    Previous research has shown that hybrid brainstorming, which combines individual and group methods, generates more ideas than either approach alone; however, the quality of these ideas remains similar across the different methods. This study, guided by the dual-pathway to creativity model, tested two computer-supported scaffolds, scripting and group awareness support, for enhancing idea quality in hybrid brainstorming. 94 higher-education students, grouped into triads, were tasked with generating ideas in three conditions. The Control condition used standard hybrid brainstorming without extra support. In the Experimental 1 condition, students received scripting support during individual brainstorming; in the Experimental 2 condition, students additionally received group awareness support during the group phase. While the quantity of ideas was similar across all conditions, the Experimental 2 condition produced ideas of higher quality, and the Experimental 1 condition also showed improved idea quality in the individual phase compared to the Control condition.