A sticky HDP-HMM with application to speaker diarization
We consider the problem of speaker diarization, the problem of segmenting an
audio recording of a meeting into temporal segments corresponding to individual
speakers. The problem is rendered particularly difficult by the fact that we
are not allowed to assume knowledge of the number of people participating in
the meeting. To address this problem, we take a Bayesian nonparametric approach
to speaker diarization that builds on the hierarchical Dirichlet process hidden
Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006)
1566--1581]. Although the basic HDP-HMM tends to over-segment the audio
data---creating redundant states and rapidly switching among them---we describe
an augmented HDP-HMM that provides effective control over the switching rate.
We also show that this augmentation makes it possible to treat emission
distributions nonparametrically. To scale the resulting architecture to
realistic diarization problems, we develop a sampling algorithm that employs a
truncated approximation of the Dirichlet process to jointly resample the full
state sequence, greatly improving mixing rates. Working with a benchmark NIST
data set, we show that our Bayesian nonparametric architecture yields
state-of-the-art speaker diarization results.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS395 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
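As a rough illustration of how a self-transition bias can control the switching rate in a truncated model, the sketch below samples one row of a transition matrix from a Dirichlet distribution whose concentration is boosted on the diagonal entry. The abstract does not spell out the construction, so the parameter names `alpha` (concentration), `kappa` (self-transition bias), and `beta` (global state weights), as well as both function names, are assumptions made for this example rather than the authors' notation or code.

```python
import random

def sticky_transition_row(beta, j, alpha, kappa, rng):
    """Sample row j of a truncated transition matrix with a self-transition bias.

    Assumed form: row_j ~ Dirichlet(alpha * beta + kappa * e_j),
    drawn by normalizing independent Gamma variates.
    """
    conc = [alpha * b + (kappa if k == j else 0.0) for k, b in enumerate(beta)]
    draws = [rng.gammavariate(c, 1.0) for c in conc]
    total = sum(draws)
    return [d / total for d in draws]

def expected_self_transition(beta, j, alpha, kappa):
    """Mean self-transition probability under the Dirichlet above (beta sums to 1)."""
    return (alpha * beta[j] + kappa) / (alpha + kappa)

rng = random.Random(0)
beta = [0.25, 0.25, 0.25, 0.25]  # hypothetical global state weights
row = sticky_transition_row(beta, j=0, alpha=4.0, kappa=16.0, rng=rng)

print(row)
print(expected_self_transition(beta, 0, alpha=4.0, kappa=0.0))   # no bias
print(expected_self_transition(beta, 0, alpha=4.0, kappa=16.0))  # with bias
```

With `kappa = 0` the expected self-transition probability equals the global weight of the state (0.25 here); raising `kappa` to 16 moves it to 0.85, which is the sense in which such a bias discourages rapid switching.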
The non-Verbal Structure of Patient Case Discussions in Multidisciplinary Medical Team Meetings
Meeting analysis has a long theoretical tradition in social psychology, with established practical ramifications in computer science, especially in computer supported cooperative work. More recently, a good deal of research has focused on the issues of indexing and browsing multimedia records of meetings. Most research in this area, however, is still based on data collected in laboratories, under somewhat artificial conditions. This paper presents an analysis of the discourse structure and spontaneous interactions at real-life multidisciplinary medical team meetings held as part of the work routine in a major hospital. It is hypothesised that the conversational structure of these meetings, as indicated by sequencing and duration of vocalisations, enables segmentation into individual patient case discussions. The task of segmenting audio-visual records of multidisciplinary medical team meetings is described as a topic segmentation task, and a method for automatic segmentation is proposed. An empirical evaluation based on hand-labelled data is presented which determines the optimal length of vocalisation sequences for segmentation, and establishes the competitiveness of the method with approaches based on more complex knowledge sources. The effectiveness of Bayesian classification as a segmentation method, and its applicability to meeting segmentation in other domains, are discussed.
Social Network Analysis for Automatic Role Recognition
The computing community has shown a significant interest in the analysis of social interactions in the last decade. Different aspects of social interactions have been studied, such as dominance, emotions, conflicts, etc. However, the recognition of roles has been neglected, even though roles are a key aspect of social interactions. In fact, sociologists have shown not only that people play roles each time they interact, but also that roles shape behavior and expectations of interacting participants. The aim of this thesis is to fill this gap by investigating the problem of automatic role recognition in a wide range of interaction settings, including production environments, e.g. news and talk-shows, and spontaneous exchanges, e.g. meetings. The proposed role recognition approach includes two main steps. The first step aims at representing the individuals involved in an interaction with feature vectors accounting for their relationships with others. This step includes three main stages, namely segmentation of audio into turns (i.e. time intervals during which only one person talks), conversion of the sequence of turns into a social network, and use of the social network as a tool to extract features for each person. The second step uses machine learning methods to map the feature vectors into roles. The experiments have been carried out over roughly 90 hours of material. This is not only one of the largest databases ever used in the literature on role recognition, but also the only one, to the best of our knowledge, including different interaction settings. In the experiments, the accuracy (the percentage of data correctly labeled in terms of roles) is roughly 80% in production environments and 70% in spontaneous exchanges (lexical features have been added in the latter case). The importance of roles has been assessed in an application scenario as well. In particular, the thesis shows that roles help to segment talk-shows into stories, i.e. time intervals during which a single topic is discussed, with satisfactory performance. The main contributions of this thesis are as follows: To the best of our knowledge, this is the first work where social network analysis is applied to automatic analysis of conversation recordings. This thesis provides the first quantitative measure of how much roles constrain conversations, and a large corpus of recordings annotated in terms of roles. The results of this work have been published in one journal paper and in five conference articles.
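The first step described above (turns, then a social network, then per-person features) can be sketched roughly as follows. The abstract does not say how the network is built or which features are extracted, so this example assumes one simple possibility: a directed edge is counted whenever one speaker's turn is immediately followed by another's, and each person is described by turn count, outgoing weight, and incoming weight. The function names and speaker labels are hypothetical.

```python
from collections import Counter

def turns_to_network(turns):
    """Count directed speaker-to-speaker transitions between adjacent turns."""
    edges = Counter()
    for prev, nxt in zip(turns, turns[1:]):
        if prev != nxt:  # ignore consecutive turns by the same speaker
            edges[(prev, nxt)] += 1
    return edges

def speaker_features(turns):
    """Build one small feature dictionary per speaker from the turn network."""
    edges = turns_to_network(turns)
    feats = {s: {"turns": 0, "out": 0, "in": 0} for s in set(turns)}
    for s in turns:
        feats[s]["turns"] += 1
    for (src, dst), weight in edges.items():
        feats[src]["out"] += weight  # times this speaker was followed by another
        feats[dst]["in"] += weight   # times this speaker followed another
    return feats

# Hypothetical turn sequence, e.g. the speaker labels produced by a diarization step.
turns = ["anchor", "guest", "anchor", "reporter", "anchor"]
features = speaker_features(turns)
for speaker in sorted(features):
    print(speaker, features[speaker])
```

Feature vectors of this kind could then be passed to any standard classifier for the second step; the choice of classifier is left open here because the abstract does not name one.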
Automatic social role recognition and its application in structuring multiparty interactions
Automatic processing of multiparty interactions is a research domain with important applications in content browsing, summarization and information retrieval. In recent years, several works have been devoted to finding regular patterns that speakers exhibit in a multiparty interaction, also known as social roles. Most of the research in the literature has focused on recognition of scenario-specific formal roles. More recently, role coding schemes based on informal social roles have been proposed in the literature, defining roles based on the behavior speakers have in the functioning of a small group interaction. Informal social roles represent a flexible classification scheme that can generalize across different scenarios of multiparty interaction. In this thesis, we focus on automatic recognition of informal social roles and exploit the influence of informal social roles on speaker behavior for structuring multiparty interactions. To model speaker behavior, we systematically explore various verbal and nonverbal cues extracted from turn taking patterns, vocal expression and linguistic style. The influence of social roles on the behavior cues exhibited by a speaker is modeled using a discriminative approach based on conditional random fields. Experiments performed on several hours of meeting data reveal that classification using conditional random fields improves the role recognition performance. We demonstrate the effectiveness of our approach by evaluating it on previously unseen scenarios of multiparty interaction. Furthermore, we also consider whether formal roles and informal roles can be automatically predicted by the same verbal and nonverbal features. We exploit the influence of social roles on turn taking patterns to improve speaker diarization under distant-microphone conditions.
Our work extends the hidden Markov model (HMM)/Gaussian mixture model (GMM) speaker diarization system, and is based on jointly estimating both the speaker segmentation and social roles in an audio recording. We modify the minimum duration constraint in the HMM-GMM diarization system by using role information to model the expected duration of a speaker's turn. We also use social role n-grams as prior information to model speaker interaction patterns. Finally, we demonstrate the application of social roles for the problem of topic segmentation in meetings. We exploit our findings that social roles can dynamically change in conversations and use this information to predict topic changes in meetings. We also present an unsupervised method for topic segmentation which combines social roles and lexical cohesion. Experimental results show that social roles improve performance of both speaker diarization and topic segmentation.
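To make the idea of social role n-grams as prior information concrete, here is a minimal sketch that estimates role bigram probabilities from annotated role sequences. It is not the thesis's implementation: the additive smoothing, the function name `role_bigram_probs`, and the role labels are all assumptions chosen for illustration.

```python
from collections import Counter

def role_bigram_probs(role_sequences, smoothing=1.0):
    """Estimate P(next role | current role) with additive smoothing."""
    roles = sorted({r for seq in role_sequences for r in seq})
    counts = Counter()
    for seq in role_sequences:
        counts.update(zip(seq, seq[1:]))  # adjacent role pairs
    probs = {}
    for a in roles:
        total = sum(counts[(a, b)] for b in roles) + smoothing * len(roles)
        probs[a] = {b: (counts[(a, b)] + smoothing) / total for b in roles}
    return probs

# Hypothetical role labels, one per speaker turn, for two short meetings.
sequences = [
    ["protagonist", "supporter", "protagonist", "neutral"],
    ["protagonist", "supporter"],
]
probs = role_bigram_probs(sequences)
print(probs["protagonist"])
```

A table like `probs` could serve as a transition prior when scoring candidate speaker sequences, so that interaction patterns seen often in training are favored over rare ones.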
Multimodal News Summarization, Tracking and Annotation Incorporating Tensor Analysis of Memes
We demonstrate four novel multimodal methods for efficient video summarization and comprehensive cross-cultural news video understanding.
First, for quick video browsing, we demonstrate a multimedia event recounting system. Based on nine people-oriented design principles, it summarizes YouTube-like videos into short visual segments (8-12 sec) and textual words (fewer than 10 terms). In the 2013 TRECVID Multimedia Event Recounting competition, this system placed first in recognition time efficiency, while remaining above average in description accuracy.
Secondly, we demonstrate the summarization of large amounts of online international news videos. In order to understand an international event such as Ebola virus, AirAsia Flight 8501, or Zika virus comprehensively, we present a novel and efficient constrained tensor factorization algorithm that first represents a video archive of multimedia news stories concerning a news event as a sparse tensor of order 4. The dimensions correspond to extracted visual memes, verbal tags, time periods, and cultures. The iterative algorithm approximately but accurately extracts coherent quad-clusters, each of which represents a significant summary of an important independent aspect of the news event. We give examples of quad-clusters extracted from tensors with at least 10^8 entries derived from international news coverage. We show the method is fast, can be tuned to give preference to any subset of its four dimensions, and exceeds three existing methods in performance.
Thirdly, noting that the co-occurrence of visual memes and tags in our summarization result is sparse, we show how to model cross-cultural visual meme influence based on normalized PageRank, which more accurately captures the rates at which visual memes are reposted in a specified time period in a specified culture.
Lastly, we establish the correspondences of videos and text descriptions in different cultures by reliable visual cues, detect culture-specific tags for visual memes, and then annotate videos in a cultural setting. Starting with any video with little or no text in one culture (say, US), we select candidate annotations in the text of another culture (say, China) to annotate the US video. By analyzing the similarity of images annotated by those candidates, we can derive a set of proper tags from the viewpoint of another culture (China). We illustrate culture-based annotation examples using segments of international news. We evaluate the generated tags by cross-cultural tag frequency, tag precision, and user studies.
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-tanks and workshops, this document reports on the state of the art related to multimedia content search from a technical and a socio-economic perspective.
The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.