66 research outputs found

    ESTIMATING THE DOMINANT PERSON IN MULTI-PARTY CONVERSATIONS USING SPEAKER DIARIZATION STRATEGIES

    Get PDF
    In this paper, we apply speaker diarization strategies from a single source to the task of estimating the dominant person in a group meeting. Previous work has shown that speaking length is strongly correlated with perceived dominance. Here we investigate this in more depth by considering two dominance tasks where there is full and majority agreement amongst ground-truth annotators. In addition, we investigate how 24 different speed-up and algorithmic strategies, and source types lead to interesting outcomes when applied to dominance estimation. We obtained the best performance of 77%77\% using our slowest scheme and a single distant microphone (SDM). Within the top 3 out of 24 performing experiments in both dominance tasks, we show that we can use the furthest SDM, with no prior knowledge of the number of speakers and the fastest diarization scheme, which performs 1.31.3 times faster than real-time

    Identifying Dominant People in Meetings from Audio-Visual Sensors

    Get PDF
    This paper provides an overview of the area of automated dominance estimation in group meetings. We describe research in social psychology and use this to explain the motivations behind suggested automated systems. With the growth in availability of conversational data captured in meeting rooms, it is possible to investigate how multi-sensor data allows us to characterize non-verbal behaviors that contribute towards dominance. We use an overview of our own work to address the challenges and opportunities in this area of research

    SALSA: A Novel Dataset for Multimodal Group Behavior Analysis

    Get PDF
    Studying free-standing conversational groups (FCGs) in unstructured social settings (e.g., cocktail party ) is gratifying due to the wealth of information available at the group (mining social networks) and individual (recognizing native behavioral and personality traits) levels. However, analyzing social scenes involving FCGs is also highly challenging due to the difficulty in extracting behavioral cues such as target locations, their speaking activity and head/body pose due to crowdedness and presence of extreme occlusions. To this end, we propose SALSA, a novel dataset facilitating multimodal and Synergetic sociAL Scene Analysis, and make two main contributions to research on automated social interaction analysis: (1) SALSA records social interactions among 18 participants in a natural, indoor environment for over 60 minutes, under the poster presentation and cocktail party contexts presenting difficulties in the form of low-resolution images, lighting variations, numerous occlusions, reverberations and interfering sound sources; (2) To alleviate these problems we facilitate multimodal analysis by recording the social interplay using four static surveillance cameras and sociometric badges worn by each participant, comprising the microphone, accelerometer, bluetooth and infrared sensors. In addition to raw data, we also provide annotations concerning individuals' personality as well as their position, head, body orientation and F-formation information over the entire event duration. Through extensive experiments with state-of-the-art approaches, we show (a) the limitations of current methods and (b) how the recorded multiple cues synergetically aid automatic analysis of social interactions. SALSA is available at http://tev.fbk.eu/salsa.Comment: 14 pages, 11 figure

    Non-Verbal Communication Analysis in Victim-Offender Mediations

    Full text link
    In this paper we present a non-invasive ambient intelligence framework for the semi-automatic analysis of non-verbal communication applied to the restorative justice field. In particular, we propose the use of computer vision and social signal processing technologies in real scenarios of Victim-Offender Mediations, applying feature extraction techniques to multi-modal audio-RGB-depth data. We compute a set of behavioral indicators that define communicative cues from the fields of psychology and observational methodology. We test our methodology on data captured in real world Victim-Offender Mediation sessions in Catalonia in collaboration with the regional government. We define the ground truth based on expert opinions when annotating the observed social responses. Using different state-of-the-art binary classification approaches, our system achieves recognition accuracies of 86% when predicting satisfaction, and 79% when predicting both agreement and receptivity. Applying a regression strategy, we obtain a mean deviation for the predictions between 0.5 and 0.7 in the range [1-5] for the computed social signals.Comment: Please, find the supplementary video material at: http://sunai.uoc.edu/~vponcel/video/VOMSessionSample.mp

    Social Network Analysis for Automatic Role Recognition

    Get PDF
    The computing community has shown a significant interest for the analysis of social interactions in the last decade. Different aspects of social interactions have been studied such as dominance, emotions, conflicts, etc. However, the recognition of roles has been neglected whereas these are a key aspect of social interactions. In fact, sociologists have shown not only that people play roles each time they interact, but also that roles shape behavior and expectations of interacting participants. The aim of this thesis is to fill this gap by investigating the problem of automatic role recognition in a wide range of interaction settings, including production environments, e.g. news and talk-shows, and spontaneous exchanges, e.g. meetings. The proposed role recognition approach includes two main steps. The first step aims at representing the individuals involved in an interaction with feature vectors accounting for their relationships with others. This step includes three main stages, namely segmentation of audio into turns (i.e. time intervals during which only one person talks), conversion of the sequence of turns into a social network, and use of the social network as a tool to extract features for each person. The second step uses machine learning methods to map the feature vectors into roles. The experiments have been carried out over roughly 90 hours of material. This is not only one of the largest databases ever used in literature on role recognition, but also the only one, to the best of our knowledge, including different interaction settings. In the experiments, the accuracy of the percentage of data correctly labeled in terms of roles is roughly 80% in production environments and 70% in spontaneous exchanges (lexical features have been added in the latter case). The importance of roles has been assessed in an application scenario as well. In particular, the thesis shows that roles help to segment talk-shows into stories, i.e. time intervals during which a single topic is discussed, with satisfactory performance. The main contributions of this thesis are as follows: To the best of our knowledge, this is the first work where social network analysis is applied to automatic analysis of conversation recordings. This thesis provides the first quantitative measure of how much roles constrain conversations, and a large corpus of recordings annotated in terms of roles. The results of this work have been published in one journal paper, and in five conference articles

    Privacy-Sensitive Audio Features for Conversational Speech Processing

    Get PDF
    The work described in this thesis takes place in the context of capturing real-life audio for the analysis of spontaneous social interactions. Towards this goal, we wish to capture conversational and ambient sounds using portable audio recorders. Analysis of conversations can then proceed by modeling the speaker turns and durations produced by speaker diarization. However, a key factor against the ubiquitous capture of real-life audio is privacy. Particularly, recording and storing raw audio would breach the privacy of people whose consent has not been explicitly obtained. In this thesis, we study audio features instead – for recording and storage – that can respect privacy by minimizing the amount of linguistic information, while achieving state-of-the-art performance in conversational speech processing tasks. Indeed, the main contributions of this thesis are the achievement of state-of-the-art performances in speech/nonspeech detection and speaker diarization tasks using such features, which we refer to, as privacy-sensitive. Besides this, we provide a comprehensive analysis of these features for the two tasks in a variety of conditions, such as indoor (predominantly) and outdoor audio. To objectively evaluate the notion of privacy, we propose the use of human and automatic speech recognition tests, with higher accuracy in either being interpreted as yielding lower privacy. For the speech/nonspeech detection (SND) task, this thesis investigates three different approaches to privacy-sensitive features. These approaches are based on simple, instantaneous, feature extraction methods, excitation source information based methods, and feature obfuscation methods. These approaches are benchmarked against Perceptual Linear Prediction (PLP) features under many conditions on a large meeting dataset of nearly 450 hours. Additionally, automatic speech (phoneme) recognition studies on TIMIT showed that the proposed features yield low phoneme recognition accuracies, implying higher privacy. For the speaker diarization task, we interpret the extraction of privacy-sensitive features as an objective that maximizes the mutual information (MI) with speakers while minimizing the MI with phonemes. The source-filter model arises naturally out of this formulation. We then investigate two different approaches for extracting excitation source based features, namely Linear Prediction (LP) residual and deep neural networks. Diarization experiments on the single and multiple distant microphone scenarios from the NIST rich text evaluation datasets show that these features yield a performance close to the Mel Frequency Cepstral coefficients (MFCC) features. Furthermore, listening tests support the proposed approaches in terms of yielding low intelligibility in comparison with MFCC features. The last part of the thesis studies the application of our methods to SND and diarization in outdoor settings. While our diarization study was more preliminary in nature, our study on SND brings about the conclusion that privacy-sensitive features trained on outdoor audio yield performance comparable to that of PLP features trained on outdoor audio. Lastly, we explored the suitability of using SND models trained on indoor conditions for the outdoor audio. Such an acoustic mismatch caused a large drop in performance, which could not be compensated even by combining indoor models

    Suivi Multi-Locuteurs avec des Informations Audio-Visuelles pour la Perception des Robots

    Get PDF
    Robot perception plays a crucial role in human-robot interaction (HRI). Perception system provides the robot information of the surroundings and enables the robot to give feedbacks. In a conversational scenario, a group of people may chat in front of the robot and move freely. In such situations, robots are expected to understand where are the people, who are speaking, or what are they talking about. This thesis concentrates on answering the first two questions, namely speaker tracking and diarization. We use different modalities of the robot’s perception system to achieve the goal. Like seeing and hearing for a human-being, audio and visual information are the critical cues for a robot in a conversational scenario. The advancement of computer vision and audio processing of the last decade has revolutionized the robot perception abilities. In this thesis, we have the following contributions: we first develop a variational Bayesian framework for tracking multiple objects. The variational Bayesian framework gives closed-form tractable problem solutions, which makes the tracking process efficient. The framework is first applied to visual multiple-person tracking. Birth and death process are built jointly with the framework to deal with the varying number of the people in the scene. Furthermore, we exploit the complementarity of vision and robot motorinformation. On the one hand, the robot’s active motion can be integrated into the visual tracking system to stabilize the tracking. On the other hand, visual information can be used to perform motor servoing. Moreover, audio and visual information are then combined in the variational framework, to estimate the smooth trajectories of speaking people, and to infer the acoustic status of a person- speaking or silent. In addition, we employ the model to acoustic-only speaker localization and tracking. Online dereverberation techniques are first applied then followed by the tracking system. Finally, a variant of the acoustic speaker tracking model based on von-Mises distribution is proposed, which is specifically adapted to directional data. All the proposed methods are validated on datasets according to applications.La perception des robots joue un rôle crucial dans l’interaction homme-robot (HRI). Le système de perception fournit les informations au robot sur l’environnement, ce qui permet au robot de réagir en consequence. Dans un scénario de conversation, un groupe de personnes peut discuter devant le robot et se déplacer librement. Dans de telles situations, les robots sont censés comprendre où sont les gens, ceux qui parlent et de quoi ils parlent. Cette thèse se concentre sur les deux premières questions, à savoir le suivi et la diarisation des locuteurs. Nous utilisons différentes modalités du système de perception du robot pour remplir cet objectif. Comme pour l’humain, l’ouie et la vue sont essentielles pour un robot dans un scénario de conversation. Les progrès de la vision par ordinateur et du traitement audio de la dernière décennie ont révolutionné les capacités de perception des robots. Dans cette thèse, nous développons les contributions suivantes : nous développons d’abord un cadre variationnel bayésien pour suivre plusieurs objets. Le cadre bayésien variationnel fournit des solutions explicites, rendant le processus de suivi très efficace. Cette approche est d’abord appliqué au suivi visuel de plusieurs personnes. Les processus de créations et de destructions sont en adéquation avecle modèle probabiliste proposé pour traiter un nombre variable de personnes. De plus, nous exploitons la complémentarité de la vision et des informations du moteur du robot : d’une part, le mouvement actif du robot peut être intégré au système de suivi visuel pour le stabiliser ; d’autre part, les informations visuelles peuvent être utilisées pour effectuer l’asservissement du moteur. Par la suite, les informations audio et visuelles sont combinées dans le modèle variationnel, pour lisser les trajectoires et déduire le statut acoustique d’une personne : parlant ou silencieux. Pour experimenter un scenario où l’informationvisuelle est absente, nous essayons le modèle pour la localisation et le suivi des locuteurs basé sur l’information acoustique uniquement. Les techniques de déréverbération sont d’abord appliquées, dont le résultat est fourni au système de suivi. Enfin, une variante du modèle de suivi des locuteurs basée sur la distribution de von-Mises est proposée, celle-ci étant plus adaptée aux données directionnelles. Toutes les méthodes proposées sont validées sur des bases de données specifiques à chaque application

    Dominant speaker detection in multipoint video communication using Markov chain with non-linear weights and dynamic transition window

    Get PDF
    This paper proposes an enhanced discrete-time Markov chain algorithm in predicting dominant speaker(s) for multipoint video communication system in the presence of transient speech. The proposed algorithm exploits statistical properties of the past speech patterns to accurately predict the dominant speaker for the next time state. Non-linear weights-based coefficients are employed in the enhanced Markov chain for both the initial state vector and transition probability matrix. These weights significantly improve the time taken to predict a new dominant speaker during a conference session. In addition, a mechanism to dynamically modify the size of the transition probability matrix window/container is introduced to improve the adaptability of the Markov chain towards the variability of speech characteristics. Simulation results indicate that for an 11 conference participants test scenario, the enhanced Markov chain prediction algorithm registered an 85% accuracy in predicting a dominant speaker when compared to an ideal case where there is no transient speech. Misclassification of dominant speakers due to transient speech was also reduced by 87%

    New insights into hierarchical clustering and linguistic normalization for speaker diarization

    Get PDF
    Face au volume croissant de données audio et multimédia, les technologies liées à l'indexation de données et à l'analyse de contenu ont suscité beaucoup d'intérêt dans la communauté scientifique. Parmi celles-ci, la segmentation et le regroupement en locuteurs, répondant ainsi à la question 'Qui parle quand ?' a émergé comme une technique de pointe dans la communauté de traitement de la parole. D'importants progrès ont été réalisés dans le domaine ces dernières années principalement menés par les évaluations internationales du NIST. Tout au long de ces évaluations, deux approches se sont démarquées : l'une est bottom-up et l'autre top-down. L'ensemble des systèmes les plus performants ces dernières années furent essentiellement des systèmes types bottom-up, cependant nous expliquons dans cette thèse que l'approche top-down comporte elle aussi certains avantages. En effet, dans un premier temps, nous montrons qu'après avoir introduit une nouvelle composante de purification des clusters dans l'approche top-down, nous obtenons des performances comparables à celles de l'approche bottom-up. De plus, en étudiant en détails les deux types d'approches nous montrons que celles-ci se comportent différemment face à la discrimination des locuteurs et la robustesse face à la composante lexicale. Ces différences sont alors exploitées au travers d'un nouveau système combinant les deux approches. Enfin, nous présentons une nouvelle technologie capable de limiter l'influence de la composante lexicale, source potentielle d'artefacts dans le regroupement et la segmentation en locuteurs. Notre nouvelle approche se nomme Phone Adaptive Training par analogie au Speaker Adaptive TrainingThe ever-expanding volume of available audio and multimedia data has elevated technologies related to content indexing and structuring to the forefront of research. Speaker diarization, commonly referred to as the who spoke when?' task, is one such example and has emerged as a prominent, core enabling technology in the wider speech processing research community. Speaker diarization involves the detection of speaker turns within an audio document (segmentation) and the grouping together of all same-speaker segments (clustering). Much progress has been made in the field over recent years partly spearheaded by the NIST Rich Transcription evaluations focus on meeting domain, in the proceedings of which are found two general approaches: top-down and bottom-up. Even though the best performing systems over recent years have all been bottom-up approaches we show in this thesis that the top-down approach is not without significant merit. Indeed we first introduce a new purification component leading to competitive performance to the bottom-up approach. Moreover, while investigating the two diarization approaches more thoroughly we show that they behave differently in discriminating between individual speakers and in normalizing unwanted acoustic variation, i.e.\ that which does not pertain to different speakers. This difference of behaviours leads to a new top-down/bottom-up system combination outperforming the respective baseline system. Finally, we introduce a new technology able to limit the influence of linguistic effects, responsible for biasing the convergence of the diarization system. Our novel approach is referred to as Phone Adaptive Training (PAT).PARIS-Télécom ParisTech (751132302) / SudocSudocFranceF
    • …
    corecore