10 research outputs found

    Jitter and Shimmer measurements for speaker diarization

    Jitter and shimmer voice quality features have been successfully used to characterize speaker voice traits and detect voice pathologies. Jitter and shimmer measure variations in the fundamental frequency and amplitude of a speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features in the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is addressed in the framework of the Augmented Multiparty Interaction (AMI) corpus, a multi-party and spontaneous speech set of recordings. Both sets of features are independently modeled using mixtures of Gaussians and fused together at the score likelihood level. The experiments carried out on the AMI corpus show that adding jitter and shimmer measurements to the baseline spectral features decreases the diarization error rate in most of the recordings.
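    As a rough illustration of what these two measurements capture, the sketch below computes the commonly used "local" jitter and shimmer: the mean absolute difference between consecutive pitch periods (or peak amplitudes), divided by their mean. The abstract does not say which jitter and shimmer variants the paper uses, so this definition, the function names, and the sample values are assumptions for illustration only.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean.

    Assumed 'local' definition; the paper may use other variants.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def local_jitter(periods):
    """Cycle-to-cycle variation of the pitch period (in seconds)."""
    return local_perturbation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation of the peak amplitude."""
    return local_perturbation(amplitudes)


# Hypothetical glottal cycles: periods in seconds, peak amplitudes unitless.
periods = [0.010, 0.011, 0.010, 0.011]
amplitudes = [0.50, 0.52, 0.49, 0.51]
print(f"jitter:  {local_jitter(periods):.4f}")
print(f"shimmer: {local_shimmer(amplitudes):.4f}")
```

    A perfectly periodic voice with constant amplitude would give zero for both values; larger values indicate more cycle-to-cycle irregularity.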

    Detection and handling of overlapping speech for speaker diarization

    This thesis concerns the detection of overlapping speech segments and its further application for the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis or by an approach involving a multilayer perceptron. In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset of candidate prosodic features is determined by a two-step mRMR feature selection algorithm. For segments including detected overlapping speech the speaker diarization system picks a second speaker label, and such segments are also discarded from the model training. The proposed overlap labeling technique is integrated in the Viterbi-decoding part of the diarization algorithm.

    UPC system for the 2015 MediaEval multimodal person discovery in broadcast TV task

    This paper describes a system to identify people in broadcast TV shows in a purely unsupervised manner. The system outputs the identity of people that appear, talk and can be identified by using information appearing in the show (in our case, text with person names). Three types of monomodal technologies are used: speech diarization, video diarization and text detection / named entity recognition. These technologies are combined using a linear programming approach where some restrictions are imposed.

    Speaker diarization of overlapping speech based on silence distribution in meeting recordings

    Speaker diarization of meetings can be significantly improved by overlap handling. Several previous works have explored the use of different features, such as spectral, spatial and energy features, for overlap detection. This paper proposes a method to estimate probabilities of speech and overlap classes at the segment level, which are later incorporated into an HMM/GMM baseline system. The estimation is motivated by the observation that a significant portion of overlaps in spontaneous conversations take place where there is less silence, e.g., during speaker changes. Experiments on the AMI corpus reveal that the probability of occurrence of overlap in a segment is inversely proportional to the amount of silence in it. When this information is combined with acoustic information from MFCC features in an HMM/GMM overlap detector, improvements are verified in terms of F-measure. Furthermore, the paper investigates the use of exclusion and labelling strategies based on such a detector for handling overlap in diarization, reporting F-measure improvements from 0.29 to 0.43 in the case of exclusion and from 0.15 to 0.22 in the case of labelling. Consequently, speaker diarization error is reduced by 8% relative compared to the baseline based solely on acoustic information.
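    One simple way to picture the segment-level estimate is a frequency count: group labelled segments by how much silence they contain and measure how often each group contains overlap. The sketch below does this with equal-width bins. The binning scheme, the function name, and the toy segments are assumptions made here for illustration; the abstract does not describe the paper's actual estimator.

```python
def overlap_prob_by_silence(segments, n_bins=4):
    """Estimate P(overlap) per silence bin by relative frequency.

    segments: iterable of (silence_fraction in [0, 1], has_overlap bool).
    Returns a list with one probability per bin (None for empty bins).
    """
    counts = [0] * n_bins
    overlaps = [0] * n_bins
    for silence_fraction, has_overlap in segments:
        # Map the silence fraction to a bin index; 1.0 goes to the last bin.
        b = min(int(silence_fraction * n_bins), n_bins - 1)
        counts[b] += 1
        overlaps[b] += int(has_overlap)
    return [o / c if c else None for o, c in zip(overlaps, counts)]


# Hypothetical labelled segments: (fraction of silence, contains overlap?).
segments = [
    (0.05, True), (0.10, True), (0.20, False),
    (0.30, True), (0.40, False),
    (0.60, False),
    (0.90, False),
]
for i, p in enumerate(overlap_prob_by_silence(segments)):
    print(f"silence bin {i}: P(overlap) = {p}")
```

    In this toy data the estimated overlap probability falls as the silence fraction rises, which is the trend the abstract reports for the AMI corpus.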

    The use of long-term features for GMM- and i-vector-based speaker diarization systems

    Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems. The other factors include the techniques employed to perform both segmentation and clustering. While the static mel frequency cepstral coefficients are the most widely used features in speech-related tasks including speaker diarization, several studies have shown the benefits of augmenting regular speech features with the static ones. In this work, we have proposed and assessed the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with the state-of-the-art short-term cepstral and long-term prosodic features. Additionally, the use of delta dynamic features is also explored separately both for segmentation and bottom-up clustering sub-tasks. The combination of the different feature sets is carried out at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are independently modeled and fused at the score likelihood level. Various feature combinations have been applied both for Gaussian mixture modeling and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction meeting corpus. The best result, in terms of diarization error rate, is reported by using i-vector-based cosine-distance clustering together with a signal parameterization consisting of a combination of static cepstral coefficients, delta, voice-quality, and prosodic features. The best result shows about 24% relative diarization error rate improvement compared to the baseline system, which is based on Gaussian mixture modeling and short-term static cepstral coefficients.
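    Score-level fusion of two independently modeled feature streams is often written as a weighted sum of their log-likelihoods. The sketch below shows that form for choosing the best-matching speaker cluster. The weight value, the function names, and the example scores are hypothetical; the abstract does not give the fusion weights or the exact fusion rule used in the paper.

```python
def fuse_scores(loglik_short, loglik_long, weight_short=0.9):
    """Weighted sum of two log-likelihoods (assumed linear fusion rule)."""
    return weight_short * loglik_short + (1.0 - weight_short) * loglik_long


def best_cluster(short_scores, long_scores, weight_short=0.9):
    """Return the cluster label with the highest fused log-likelihood.

    short_scores / long_scores: dicts mapping cluster label to the
    log-likelihood of a segment under that cluster's model for the
    short-term and long-term feature streams respectively.
    """
    fused = {
        label: fuse_scores(short_scores[label], long_scores[label], weight_short)
        for label in short_scores
    }
    return max(fused, key=fused.get)


# Hypothetical per-cluster log-likelihoods for one segment.
short_scores = {"spk1": -10.0, "spk2": -10.5}
long_scores = {"spk1": -30.0, "spk2": -20.0}
print(best_cluster(short_scores, long_scores))        # fused decision
print(best_cluster(short_scores, long_scores, 1.0))   # short-term stream only
```

    Feature-level combination, by contrast, would concatenate the feature vectors before modeling, so a single model scores the stacked vector and no fusion weight is needed.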

    Overlapping speech detection using long-term conversational features for speaker diarization in meeting room conversations

    Overlapping speech has been identified as one of the main sources of errors in diarization of meeting room conversations. Therefore, overlap detection has become an important step prior to speaker diarization. Studies on conversational analysis have shown that overlapping speech is more likely to occur at specific parts of a conversation. They have also shown that overlap occurrence is correlated with various conversational features such as speech, silence patterns and speaker turn changes. We use features capturing this higher-level information from the structure of a conversation, such as silence and speaker change statistics, to improve an acoustic feature-based classifier of overlapping and single-speaker speech classes. The silence and speaker change statistics are computed over a long-term window (around 3-4 seconds) and are used to predict the probability of overlap in the window. These estimates are then incorporated into an acoustic feature-based classifier as prior probabilities of the classes. Experiments conducted on three corpora (AMI, NIST-RT and ICSI) have shown that the proposed method improves the performance of the acoustic feature-based overlap detector on all the corpora. They also reveal that the model based on long-term conversational features, which is used to estimate the probability of overlap and is learned from the AMI corpus, generalizes to meetings from other corpora (NIST-RT and ICSI). Moreover, experiments on the ICSI corpus reveal that the proposed method also improves laughter overlap detection. Consequently, applying overlap handling techniques to speaker diarization using the detected overlap results in a reduction of diarization error rate (DER) on all three corpora.
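    Using an estimated overlap probability as a class prior amounts to adding log-priors to the acoustic log-likelihoods before comparing classes. The sketch below shows this for a single window with two classes. It is a per-window decision rule written for illustration, not the paper's detector; the function name, the clipping constant, and the example numbers are assumptions.

```python
import math


def classify_window(loglik_single, loglik_overlap, p_overlap):
    """Pick the class with the highest log-likelihood plus log-prior.

    loglik_single / loglik_overlap: acoustic log-likelihoods of the window.
    p_overlap: prior probability of overlap for this window, e.g. predicted
    from long-term silence and speaker-change statistics.
    """
    eps = 1e-12  # keep the prior away from 0 and 1 so the logs stay finite
    p = min(max(p_overlap, eps), 1.0 - eps)
    score_single = loglik_single + math.log(1.0 - p)
    score_overlap = loglik_overlap + math.log(p)
    return "overlap" if score_overlap > score_single else "single"


# Same acoustic evidence, different conversational priors (hypothetical values).
print(classify_window(-10.0, -10.5, p_overlap=0.8))
print(classify_window(-10.0, -10.5, p_overlap=0.1))
```

    With identical acoustic scores, a high prior tips the decision toward overlap and a low prior toward single-speaker speech, which is how the long-term conversational estimate can correct a purely acoustic classifier.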

    The detection of overlapping speech with prosodic features for speaker diarization

    Overlapping speech is responsible for a certain amount of errors produced by standard speaker diarization systems in meeting environments. We are investigating a set of prosody-based long-term features as a potential complement to our overlap detection system relying on short-term spectral parameters. The most relevant features are selected in a two-step process. They are first evaluated and sorted according to the mRMR criterion, and then the optimal number is determined by an iterative wrapper approach. We show that the addition of prosodic features decreased overlap detection error. Detected overlap segments are used in speaker diarization to recover missed speech by assigning multiple speaker labels and to increase the purity of speaker clusters.
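    The two-step selection can be sketched as a greedy mRMR ranking followed by a wrapper that tries growing prefixes of that ranking. The code below assumes the relevance of each feature and the pairwise redundancy between features are already computed (in mRMR these are typically mutual-information estimates) and uses the difference form of the criterion. The function names, the toy numbers, and the stand-in `evaluate` function are hypothetical and are not taken from the paper.

```python
def mrmr_score(i, selected, relevance, redundancy):
    """Relevance of feature i minus its mean redundancy with selected ones."""
    if not selected:
        return relevance[i]
    mean_red = sum(redundancy[i][j] for j in selected) / len(selected)
    return relevance[i] - mean_red


def mrmr_rank(relevance, redundancy):
    """Step 1: greedily order all features by the mRMR criterion."""
    selected = []
    remaining = list(range(len(relevance)))
    while remaining:
        best = max(
            remaining,
            key=lambda i: mrmr_score(i, selected, relevance, redundancy),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


def best_prefix_size(ranking, evaluate):
    """Step 2: wrapper that keeps the top-k prefix with the lowest error."""
    best_k, best_err = 0, float("inf")
    for k in range(1, len(ranking) + 1):
        err = evaluate(ranking[:k])  # e.g. overlap detection error on dev data
        if err < best_err:
            best_k, best_err = k, err
    return best_k


# Toy inputs: features 0 and 1 are both relevant but highly redundant.
relevance = [0.9, 0.8, 0.3]
redundancy = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
ranking = mrmr_rank(relevance, redundancy)

# Stand-in for training and scoring a detector on each candidate subset.
dev_error = {1: 0.30, 2: 0.25, 3: 0.27}
k = best_prefix_size(ranking, lambda feats: dev_error[len(feats)])
print(ranking, ranking[:k])
```

    The ranking step is cheap because it never trains a detector; only the second step does, and it evaluates at most one subset per feature.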

    Detection and handling of overlapping speech for speaker diarization

    For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings, compared to other domains, exhibit an increased complexity due to the spontaneity of speech, reverberation effects, and also due to the presence of overlapping speech. Overlapping speech refers to situations when two or more speakers are speaking simultaneously. In meeting data, a substantial portion of errors of the conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can eventually lead to corrupt single-speaker models and thus to a worse segmentation. This thesis concerns the detection of overlapping speech segments and its further application for the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or by a multi-layer perceptron. In addition, we also investigate the possibility of employing long-term prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps. Firstly, a ranking according to mRMR criterion is obtained, and then, a standard hill-climbing wrapper approach is applied in order to determine the optimal number of features. The novel spatial as well as prosodic parameters are used in combination with spectral-based features suggested previously in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site.
In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments are also discarded from the model training. The proposed overlap labeling technique is integrated in Viterbi decoding, a part of the diarization algorithm. During the system development it was discovered that it is favorable to do an independent optimization of overlap exclusion and labeling with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT '09 meeting recordings as well. The addition of beamforming and a TDOA feature stream to the baseline diarization system, which was aimed at improving the clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the overlap exclusion behavior reveals large contrasts in improvement between individual meeting recordings as well as between various settings of the overlap detection operating point. However, high performance variability across different recordings is also typical of the baseline diarization system without any overlap handling.
