43 research outputs found

    Overlapped Speech Detection in Multi-Party Meetings

    Detecting simultaneous speech in meeting recordings is difficult because of both the complexity of the meeting itself and the environment surrounding it. The proposed system uses gammatone-like spectrogram-based linear predictor coefficients, computed on distant microphone channel data, for overlap detection. Model performance was assessed on the Augmented Multiparty Interaction (AMI) conference corpus. The proposed system improves on baseline feature set models for classification.

    On the improvement of speaker diarization by detecting overlapped speech

    Simultaneous speech in meeting environments is responsible for a certain amount of the errors made by standard speaker diarization systems. We present an overlap detection system for far-field data based on spectral and spatial features, where the spatial features obtained from different microphone pairs are fused by means of principal component analysis. Detected overlap segments are used in speaker diarization to increase the purity of speaker clusters and to recover missed speech by assigning multiple speaker labels. An investigation of the relationship between overlap detection properties and diarization improvement revealed very distinct behaviour for overlap exclusion and overlap labeling.
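
    The fusion step described above can be illustrated with a minimal sketch. It assumes each frame carries one spatial feature per microphone pair and keeps only the leading principal component, found by power iteration. The function names `first_principal_component` and `fuse` are hypothetical, and the abstract does not say how many components the authors retain.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (mean, unit direction) of the leading PCA axis via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the per-pair features.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v

def fuse(rows, mean, direction):
    """Project each frame's per-pair features onto the leading axis."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]
```

    Each row holds the features of one frame, so `fuse` returns one fused value per frame that a downstream overlap classifier could consume.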

    Optimized Speaker Diarization System using Discrete Wavelet Transform and Pyknogram

    This paper presents an optimized speaker diarization system that efficiently detects speaker change points in multi-speaker speech data. Speaker diarization is the process of detecting speaker turns and grouping together the segments uttered by the same speaker. It can be used in speaker recognition, audio information retrieval, audio transcription, audio clustering, and the indexing and captioning of TV shows and movies. In the proposed technique, the Daubechies 40 wavelet transform is used to compress the audio stream in the ratio of 1:4; features are then extracted with an enhanced spectrogram called a pyknogram, based on the Teager-Kaiser Energy Operator (TKEO). This method relies on the resonances (formants) and harmonic structure of speech, which are enhanced by decomposing the spectral sub-bands into amplitude and frequency components. The weighted average of the instantaneous frequency components is used to derive a short-time estimate of the dominant frequency in each sub-band over a fixed period of 0.12 msec. Sudden changes in the dominant frequency correspond to speaker change points and are detected using the traditional delta Bayesian Information Criterion (ΔBIC). This technique does not use a voice activity detection (VAD) process. For re-segmentation, the Information Change Rate (ICR) is used. Finally, a hierarchical clustering algorithm groups homogeneous segments, which are plotted with the dendrogram function in MATLAB. The results are evaluated by F-measure and diarization error rate. They show that the proposed method is faster and more accurate than the traditional method based on Mel frequency cepstral coefficients (MFCC) and the Bayesian Information Criterion (BIC).
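
    Two building blocks named in this abstract, the discrete TKEO and the ΔBIC change-point test, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the track is one-dimensional, each segment is modelled by a single Gaussian, and the penalty weight `lam` and the helper names are hypothetical.

```python
import math

def tkeo(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def _var(seg):
    """Maximum-likelihood variance, floored to keep the logarithm finite."""
    m = sum(seg) / len(seg)
    return max(sum((s - m) ** 2 for s in seg) / len(seg), 1e-12)

def delta_bic(track, i, lam=1.0):
    """Delta-BIC for splitting a 1-D track at index i (assumed 1-D Gaussian model).

    A positive value favours two separate Gaussians, i.e. a change point at i.
    """
    n, n1, n2 = len(track), i, len(track) - i
    gain = 0.5 * (n * math.log(_var(track))
                  - n1 * math.log(_var(track[:i]))
                  - n2 * math.log(_var(track[i:])))
    # Splitting adds two parameters (mean, variance): 0.5 * 2 * log(n).
    penalty = lam * math.log(n)
    return gain - penalty
```

    Applied to a dominant-frequency track, a change point can be hypothesised wherever `delta_bic` peaks above zero. For a pure sinusoid of amplitude A and digital frequency ω, `tkeo` returns the constant A² sin²(ω), which is why the operator is useful for tracking amplitude and frequency.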

    Detection and handling of overlapping speech for speaker diarization

    This thesis concerns the detection of overlapping speech segments and its further application to improving speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis or by an approach involving a multilayer perceptron. In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset of candidate prosodic features is determined by a two-step mRMR feature selection algorithm. For segments that include detected overlapping speech, the speaker diarization system picks a second speaker label, and such segments are also discarded from model training. The proposed overlap labeling technique is integrated into the Viterbi-decoding part of the diarization algorithm.

    Predicting continuous conflict perception with Bayesian Gaussian processes

    Conflict is one of the most important phenomena of social life, but it is still largely neglected by the computing community. This work proposes an approach that detects common conversational social signals (loudness, overlapping speech, etc.) and predicts the conflict level perceived by human observers in continuous, non-categorical terms. The proposed regression approach is fully Bayesian and adopts Automatic Relevance Determination to identify the social signals that most influence the outcome of the prediction. The experiments are performed on the SSPNet Conflict Corpus, a publicly available collection of 1430 clips extracted from televised political debates (roughly 12 hours of material for 138 subjects in total). The results show that it is possible to achieve a correlation close to 0.8 between actual and predicted conflict perception.
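
    Automatic Relevance Determination is commonly realised through a kernel with one length scale per input feature. The sketch below assumes a squared-exponential kernel, which the abstract does not confirm; the function name `ard_rbf` and its parameters are hypothetical.

```python
import math

def ard_rbf(x, y, length_scales, signal_var=1.0):
    """Squared-exponential kernel with one length scale per input dimension.

    A very large length scale makes the kernel insensitive to that feature,
    which is how ARD marks a feature as irrelevant.
    """
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, length_scales))
    return signal_var * math.exp(-0.5 * sq)
```

    After the length scales are fitted, features with small values are the ones that most influence the prediction.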

    Modeling Overlapping Speech using Vector Taylor Series

    Current speaker diarization systems typically fail to successfully assign multiple speakers speaking simultaneously. According to previous studies, overlapping errors account for a large proportion of the total errors in multi-party speech diarization. In this work, we propose a new approach using Vector Taylor Series (VTS) to obtain overlapping speech models, assuming individual speaker models are available, e.g. from the diarization output. We extend the VTS framework to use multiple acoustic classes to account for the non-stationarity of the corrupting speaker's speech. We propose a system using multi-class VTS to detect single-speaker and two-speaker overlapping speech as well as the speakers involved. We show the effectiveness of the approach on distant microphone meeting data, with the multi-class approach in particular performing at the state of the art.
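
    The core idea of VTS can be sketched for a single log-spectral bin. The sketch assumes two sources add in the linear power domain and ignores channel and phase terms. The abstract does not give the paper's exact feature domain or its multi-class formulation, so `combine` and `vts_linearize` are hypothetical names for a generic first-order expansion.

```python
import math

def combine(x, n):
    """Log-spectral energy of two additive sources: log(exp(x) + exp(n))."""
    m = max(x, n)  # subtract the maximum for numerical stability
    return m + math.log(math.exp(x - m) + math.exp(n - m))

def vts_linearize(mu_x, mu_n):
    """First-order Taylor expansion of combine() around (mu_x, mu_n)."""
    # Partial derivative with respect to x; the one for n is (1 - g).
    g = 1.0 / (1.0 + math.exp(mu_n - mu_x))
    y0 = combine(mu_x, mu_n)

    def approx(x, n):
        return y0 + g * (x - mu_x) + (1.0 - g) * (n - mu_n)

    return approx
```

    Because the expansion is linear in both inputs, Gaussian models of two individual speakers map to an approximately Gaussian model of their overlap, with `g` weighting each speaker's contribution.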