
    Location Based Speaker Segmentation

    This paper proposes a technique that segments audio into speaker turns based on speaker location, essentially implementing a discrete source tracking system. In many multi-party conversations, such as meetings or teleconferences, the location of participants is restricted to a small number of regions, such as seats around a table. In such cases, segmentation according to these discrete regions would be a reliable means of determining speaker turns. We propose a system that uses microphone pair time delays as features to represent speaker locations. A GMM/HMM framework is used to determine an optimal segmentation of the audio according to these locations. We also demonstrate how this approach is easily extended to more complex cases, such as the presence of two simultaneous speakers. Experiments testing the system on real recordings from a meeting room show that the proposed location features can provide greater discrimination than standard cepstral features, and also demonstrate the success of the extension to handle dual-speaker overlap.
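
    As an illustration of the idea in this abstract, the following is a minimal sketch of how microphone-pair time delays could be extracted with GCC-PHAT and clustered into discrete location regions. The function names, frame sizes, and the use of scikit-learn's GaussianMixture are assumptions for the sketch; the paper itself uses a GMM/HMM framework over such delay features, and the HMM temporal smoothing is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gcc_phat_delay(x, y, fs, max_tau=None):
    """Estimate the time delay between two microphone signals with GCC-PHAT."""
    n = x.size + y.size
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

def delay_features(channels, fs, frame_len=4096, hop=2048):
    """Per-frame delay features: one delay per microphone pair."""
    n_mics, n_samples = channels.shape
    pairs = [(i, j) for i in range(n_mics) for j in range(i + 1, n_mics)]
    feats = []
    for start in range(0, n_samples - frame_len, hop):
        frame = channels[:, start:start + frame_len]
        feats.append([gcc_phat_delay(frame[i], frame[j], fs) for i, j in pairs])
    return np.asarray(feats)

# Cluster the delay features into discrete location regions (e.g. seats).
# The paper adds HMM temporal smoothing on top; a plain GMM is the bare sketch.
# feats = delay_features(multichannel_audio, fs=16000)
# gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(feats)
# location_labels = gmm.predict(feats)
```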

    Time delay estimation of reverberant meeting speech: on the use of multichannel linear prediction

    Effective and efficient access to multiparty meeting recordings requires techniques for meeting analysis and indexing. Since meeting participants are generally stationary, speaker location information may be used to identify meeting events, e.g., to detect speaker changes. Time-delay estimation (TDE) using cross-correlation of multichannel speech recordings is a common approach for deriving speech source location information. Previous research improved TDE by computing it from linear prediction (LP) residual signals obtained from LP analysis of each individual speech channel. This paper investigates the use of LP residuals for speech TDE, where the residuals are obtained by jointly modeling the multiple speech channels. Experiments conducted with a simulated reverberant room and real room recordings show that the jointly modeled LP predicts the LP coefficients better than LP applied to individual channels. Both the individually and jointly modeled LP exhibit similar TDE performance, and both outperform TDE on the raw speech, especially with the real recordings.
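
    A minimal sketch of the baseline the abstract refers to: per-channel LP residuals obtained via the autocorrelation method, followed by cross-correlation time-delay estimation on the residuals. The joint multichannel LP model investigated in the paper is not reproduced; the LP order, function names, and the plain (non-PHAT) cross-correlation are illustrative assumptions.

```python
import numpy as np

def lp_residual(x, order=12):
    """Whiten a speech frame with linear prediction: return x minus its LP prediction."""
    # Autocorrelation method: solve the normal equations for the LP coefficients.
    r = np.correlate(x, x, mode="full")[x.size - 1:x.size + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    pred = np.zeros_like(x)
    for k, ak in enumerate(a, start=1):
        pred[k:] += ak * x[:-k]
    return x - pred

def tde_cross_correlation(x, y, fs):
    """Time-delay estimate from the peak of the cross-correlation of two signals."""
    cc = np.correlate(x, y, mode="full")
    lag = np.argmax(cc) - (y.size - 1)
    return lag / fs

# TDE on LP residuals instead of raw speech (per-channel LP, not the joint
# multichannel model studied in the paper):
# tau = tde_cross_correlation(lp_residual(ch1), lp_residual(ch2), fs=16000)
```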

    A simulated annealing approach to speaker segmentation in audio databases

    In this paper we present a novel approach to the problem of speaker segmentation, a necessary step prior to audio indexing. Mutual information is used to evaluate the accuracy of a segmentation, serving as the function to be maximized by a simulated annealing (SA) algorithm. We introduce a novel mutation operator for the SA, the Consecutive Bits Mutation operator, which improves the performance of the SA on this problem. We also use the so-called Compaction Factor, which allows the SA to operate in a reduced search space. Our algorithm has been tested on the segmentation of real audio databases and compared with several existing speaker segmentation algorithms, obtaining very good results on the test problems considered.
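
    A minimal skeleton of the kind of simulated annealing search the abstract describes, assuming a binary per-frame boundary encoding. The consecutive-bits mutation and the cooling schedule below are a simplified reading, not the paper's exact operators; the mutual-information fitness is left as a caller-supplied function (one illustrative form appears after the genetic-algorithm abstract further down).

```python
import numpy as np

def consecutive_bits_mutation(boundaries, rng, max_run=5):
    """Flip a short run of consecutive bits (a simplified take on the paper's operator)."""
    s = boundaries.copy()
    start = int(rng.integers(0, s.size))
    run = int(rng.integers(1, max_run + 1))
    s[start:start + run] ^= 1
    return s

def simulated_annealing(fitness, n_frames, n_iter=20000, t0=1.0, alpha=0.9995, seed=0):
    """Maximize fitness(boundaries) over binary boundary vectors by annealing."""
    rng = np.random.default_rng(seed)
    current = (rng.random(n_frames) < 0.01).astype(np.uint8)  # sparse initial boundaries
    cur_f = fitness(current)
    best, best_f = current.copy(), cur_f
    t = t0
    for _ in range(n_iter):
        cand = consecutive_bits_mutation(current, rng)
        f = fitness(cand)
        # Always accept improvements; accept worse moves with temperature-dependent probability.
        if f > cur_f or rng.random() < np.exp((f - cur_f) / t):
            current, cur_f = cand, f
            if f > best_f:
                best, best_f = cand.copy(), f
        t *= alpha
    return best, best_f
```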

    Unsupervised Location-Based Segmentation of Multi-Party Speech

    Accurate detection and segmentation of spontaneous multi-party speech is crucial for a variety of applications, including speech acquisition and recognition, as well as higher-level event recognition. However, the highly sporadic nature of spontaneous speech makes this task difficult. Moreover, multi-party speech contains many overlaps. We propose to attack this problem as a tracking task, using location cues only. To best deal with the high sporadicity, we propose a novel, generic, short-term clustering algorithm that can track multiple objects at low computational cost. The proposed approach is online, fully deterministic and can run in real time. In an application to real meeting data, the algorithm produces high-precision speech segmentation.
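
    To make the short-term clustering idea concrete, here is a hedged sketch of a greedy online scheme: each per-frame location estimate is assigned to the nearest recently active track within a distance gate, otherwise it opens a new track. The gate, the memory span, and the exponential centroid update are illustrative assumptions, not the algorithm published in the paper.

```python
import numpy as np

def online_short_term_clustering(observations, gate=0.3, max_gap=10):
    """Greedy online clustering of sporadic location observations into tracks.

    observations: iterable of (frame_index, location_vector or None) pairs.
    Returns one label per observation (-1 when there is no location estimate).
    """
    tracks = []   # each track: {"id", "centroid", "last_frame"}
    labels = []
    next_id = 0
    for frame, loc in observations:
        if loc is None:                       # silent frame: no location estimate
            labels.append(-1)
            continue
        # Keep only tracks seen recently (short-term memory).
        tracks = [t for t in tracks if frame - t["last_frame"] <= max_gap]
        # Nearest recent track within the distance gate, else a new track.
        best = min(tracks, key=lambda t: np.linalg.norm(loc - t["centroid"]), default=None)
        if best is not None and np.linalg.norm(loc - best["centroid"]) <= gate:
            best["centroid"] = 0.9 * best["centroid"] + 0.1 * np.asarray(loc)
            best["last_frame"] = frame
            labels.append(best["id"])
        else:
            tracks.append({"id": next_id, "centroid": np.asarray(loc, dtype=float),
                           "last_frame": frame})
            labels.append(next_id)
            next_id += 1
    return labels
```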

    Offline speaker segmentation using genetic algorithms and mutual information

    We present an evolutionary approach to speaker segmentation, an activity that is especially important prior to speaker recognition and audio content analysis tasks. Our approach consists of a genetic algorithm (GA), which encodes possible segmentations of an audio record, and a measure of mutual information between the audio data and possible segmentations, which is used as the fitness function for the GA. We introduce a compact encoding of the problem into the GA which reduces the length of the GA individuals and improves the GA convergence properties. Our algorithm has been tested on the segmentation of real audio data, and its performance has been compared with several existing algorithms for speaker segmentation, obtaining very good results in all test problems. This work was supported in part by the Universidad de Alcalá under Project UAH PI2005/078.
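
    The fitness driving the GA is a mutual-information measure between the audio data and a candidate segmentation. The sketch below shows one illustrative form of such a measure, computed between vector-quantized frame features and the segment labels implied by a boundary vector; it is an assumption about the formulation, not the paper's exact definition.

```python
import numpy as np

def segment_labels(boundaries):
    """Turn a binary boundary vector into a per-frame segment label."""
    return np.cumsum(boundaries)

def mutual_information_fitness(feature_symbols, boundaries):
    """I(X; S) between quantized frame features X and segment labels S.

    feature_symbols: integer codebook index per frame (e.g. from vector quantization).
    """
    s = segment_labels(boundaries)
    joint = np.zeros((int(feature_symbols.max()) + 1, int(s.max()) + 1))
    for x, seg in zip(feature_symbols, s):
        joint[int(x), int(seg)] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    ps = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ ps)[nz])))

# Candidate segmentations with higher I(X; S) are preferred; the same function
# could serve as the fitness in the simulated annealing sketch above.
```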

    Short-Term Spatio-Temporal Clustering of Sporadic and Concurrent Events

    Accurate detection and segmentation of spontaneous multi-party speech is crucial for a variety of applications, including speech acquisition and recognition, as well as higher-level event recognition. However, the highly sporadic nature of spontaneous speech makes this task difficult. Moreover, multi-party speech contains many overlaps. We propose to attack this problem as a multitarget tracking task, using location cues only. To best deal with the high sporadicity, we propose a novel, generic, short-term clustering algorithm that can track multiple objects at low computational cost. The proposed approach is online, fully deterministic and can run in real time. In an application to real meeting data, the algorithm produces high-precision speech segmentation.

    An Information Theoretic Combination of MFCC and TDOA Features for Speaker Diarization


    Browsing Recorded Meetings with Ferret

    Browsing for elements of interest within a recorded meeting is time-consuming. We describe work in progress on a meeting browser, which aims to support this process by displaying many types of data. These include media, transcripts and processing results, such as speaker segmentations. Users interact with these visualizations to observe and control synchronized playback of the recorded meeting.

    Audio Spatio-Temporal Fingerprints for Cloudless Real-Time Hands-Free Diarization on Mobile Devices

    In this paper, we propose a new low bit rate representation of a sound field and a corresponding method for cloudless, low-delay, hands-free diarization suitable for low-performance mobile devices, e.g. mobile phones. The proposed audio spatio-temporal fingerprint representation yields a low bit rate (500 bytes/second) while retaining complete information for continuous audio tracking of multiple acoustic sources in an open, unconstrained environment. The core of the algorithm is based on simultaneous processing of multiple data streams, using the audio spatio-temporal fingerprint representation to capture higher-level events relevant for diarization, e.g. turns, interruptions, crosstalk, and speech and non-speech segments. Performance levels achieved to date on 5 hours of hand-labelled data demonstrate the feasibility of the approach, with a 7.58% CPU load on a single-core ultra-low-power mobile processor running at 1 GHz and a low algorithmic delay of 112 ms.

    Integration of TDOA Features in Information Bottleneck Framework for Fast Speaker Diarization

    In this paper we address the combination of multiple feature streams in a fast speaker diarization system for meeting recordings. Whenever Multiple Distant Microphones (MDM) are used, it is possible to estimate the Time Delay of Arrival (TDOA) between channels. In previous work \cite{xavi_comb}, it was shown that TDOA can be used as additional features together with conventional spectral features to improve speaker diarization. We investigate here the combination of TDOA and spectral features in a fast diarization system based on the Information Bottleneck principle. We evaluate the algorithm on the NIST RT06 diarization task. Adding TDOA features to spectral features reduces the speaker error by 3% absolute. Results are comparable to those of conventional HMM/GMM based systems, with a consistent reduction in computational complexity.
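
    The paper combines the two streams inside the Information Bottleneck framework, which is not reproduced here. As a rough, hedged illustration of multistream combination in diarization, the sketch below shows only a generic weighted combination of per-cluster MFCC and TDOA log-likelihoods; the 0.9/0.1 weights, GMM sizes, and function names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def combined_stream_loglik(mfcc, tdoa, labels, w_mfcc=0.9, w_tdoa=0.1):
    """Weighted per-frame log-likelihood combination of two feature streams.

    A separate GMM is trained per cluster and per stream; frame scores are
    combined as w_mfcc * logL_mfcc + w_tdoa * logL_tdoa.
    """
    n_clusters = int(labels.max()) + 1
    scores = np.zeros((mfcc.shape[0], n_clusters))
    for c in range(n_clusters):
        idx = labels == c
        gm_m = GaussianMixture(n_components=4, covariance_type="diag").fit(mfcc[idx])
        gm_t = GaussianMixture(n_components=1, covariance_type="diag").fit(tdoa[idx])
        scores[:, c] = w_mfcc * gm_m.score_samples(mfcc) + w_tdoa * gm_t.score_samples(tdoa)
    return scores

# scores[i, c] can then drive reassignment of frame i to cluster c during
# iterative resegmentation.
```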