88,603 research outputs found
Recommended from our members
Systematic comparison of BIC-based speaker segmentation systems
Unsupervised speaker change detection is addressed in this paper. Three speaker segmentation systems are examined. The first system investigates the AudioSpectrumCentroid and the AudioWaveformEnvelope features, implements a dynamic fusion scheme, and applies the Bayesian Information Criterion (BIC). The second system consists of three modules. In the first module, a second-order statistic-measure is extracted; the Euclidean distance and the T2 Hotelling statistic are applied sequentially in the second module; and BIC is utilized in the third module. The third system, first uses a metric-based approach, in order to detect potential speaker change points, and then the BIC criterion is applied to validate the previously detected change points. Experiments are carried out on a dataset, which is created by concatenating speakers from the TIMIT database. A systematic performance comparison among the three systems is carried out by means of one-way ANOVA method and post hoc Tukey’s method
A Novel Method For Speech Segmentation Based On Speakers' Characteristics
Speech Segmentation is the process change point detection for partitioning an
input audio stream into regions each of which corresponds to only one audio
source or one speaker. One application of this system is in Speaker Diarization
systems. There are several methods for speaker segmentation; however, most of
the Speaker Diarization Systems use BIC-based Segmentation methods. The main
goal of this paper is to propose a new method for speaker segmentation with
higher speed than the current methods - e.g. BIC - and acceptable accuracy. Our
proposed method is based on the pitch frequency of the speech. The accuracy of
this method is similar to the accuracy of common speaker segmentation methods.
However, its computation cost is much less than theirs. We show that our method
is about 2.4 times faster than the BIC-based method, while the average accuracy
of pitch-based method is slightly higher than that of the BIC-based method.Comment: 14 pages, 8 figure
Speaker change detection using BIC: a comparison on two datasets
Abstract — This paper addresses the problem of unsupervised speaker change detection. We assume that there is no prior knowledge on the number of speakers or their identities. Two methods are tested. The first method uses the Bayesian Information Criterion (BIC), investigates the AudioSpectrumCentroid and AudioWaveformEnvelope features, and implements a dynamic thresholding followed by a fusion scheme. The second method is a real-time one that uses a metric-based approach employing line spectral pairs (LSP) and the BIC criterion to validate a potential change point. The experiments are carried out on two different datasets. The first set was created by concatenating speakers from the TIMIT database and is referred to as the TIMIT data set. The second set was created by using recordings from the MPEG-7 test set CD1 and broadcast news and is referred to as the INESC dataset. I
Encoder-decoder multimodal speaker change detection
The task of speaker change detection (SCD), which detects points where
speakers change in an input, is essential for several applications. Several
studies solved the SCD task using audio inputs only and have shown limited
performance. Recently, multimodal SCD (MMSCD) models, which utilise text
modality in addition to audio, have shown improved performance. In this study,
the proposed model are built upon two main proposals, a novel mechanism for
modality fusion and the adoption of a encoder-decoder architecture. Different
to previous MMSCD works that extract speaker embeddings from extremely short
audio segments, aligned to a single word, we use a speaker embedding extracted
from 1.5s. A transformer decoder layer further improves the performance of an
encoder-only MMSCD model. The proposed model achieves state-of-the-art results
among studies that report SCD performance and is also on par with recent work
that combines SCD with automatic speech recognition via human transcription.Comment: 5 pages, accepted for presentation at INTERSPEECH 202
Echo Cancellation - A Likelihood Ratio Test for Double-talk Versus Channel Change
Echo cancellers are in wide use in both electrical (four wire to two wire mismatch) and acoustic (speaker-microphone coupling) applications. One of the main design problems is the control logic for adaptation. Basically, the algorithm weights should be frozen in the presence of double-talk and adapt quickly in the absence of double-talk. The control logic can be quite complicated since it is often not easy to discriminate between the echo signal and the near-end speaker. This paper derives a log likelihood ratio test (LRT) for deciding between double-talk (freeze weights) and a channel change (adapt quickly) using a stationary Gaussian
stochastic input signal model. The probability density function of a sufficient statistic under each hypothesis is obtained and the performance of the test is evaluated as a function of the system parameters. The receiver operating characteristics (ROCs) indicate that it is difficult to correctly decide between double-talk and a channel change based upon a single look. However, post-detection integration of approximately one hundred sufficient statistic samples yields a detection probability close to unity (0.99) with a small false alarm probability (0.01)
Automatic speaker segmentation using multiple features and distance measures: a comparison of three approaches
This paper addresses the problem of unsupervised speaker change detection. Three systems based on the Bayesian Information Criterion (BIC) are tested. The first system investigates the AudioSpectrumCentroid and the AudioWaveformEnvelope features, implements a dynamic thresholding followed by a fusion scheme, and finally applies BIC. The second method is a real-time one that uses a metric-based approach employing the line spectral pairs and the BIC to validate a potential speaker change point. The third method consists of three modules. In the first module, a measure based on second-order statistics is used; in the second module, the Euclidean distance and T2 Hotelling statistic are applied; and in the third module, the BIC is utilized. The experiments are carried out on a dataset created by concatenating speakers from the TIMIT database, that is referred to as the TIMIT data set. A comparison between the performance of the three systems is made based on t-statistics
Hierarchical RNN with Static Sentence-Level Attention for Text-Based Speaker Change Detection
Speaker change detection (SCD) is an important task in dialog modeling. Our
paper addresses the problem of text-based SCD, which differs from existing
audio-based studies and is useful in various scenarios, for example, processing
dialog transcripts where speaker identities are missing (e.g., OpenSubtitle),
and enhancing audio SCD with textual information. We formulate text-based SCD
as a matching problem of utterances before and after a certain decision point;
we propose a hierarchical recurrent neural network (RNN) with static
sentence-level attention. Experimental results show that neural networks
consistently achieve better performance than feature-based approaches, and that
our attention-based model significantly outperforms non-attention neural
networks.Comment: In Proceedings of the ACM on Conference on Information and Knowledge
Management (CIKM), 201
Сегментація мовних голосових сигналів за ознакою зміни диктора
Запропоновано підхід до сегментації голосових мовних сигналів за ознакою зміни диктора та способи
визначення позицій зміни диктора в голосовому мовному сигналі. Позиції зміни диктора визначаються за
допомогою аналізу множин характеристичних векторів в околі паузи на основі Байєсівського інформаційного
критерію. Покращення якості характеристичних векторів досягається за допомогою використання сегментів
з рівнем енергії не нижче певного порогу. Також пропонується адаптивний підхід для автоматичного
визначення пауз у мовному сигналі.An approach to the segmentation of speech signals based on the speaker change, as well as to the detection of
the speaker change positions in a speech signal is suggested. Speaker change positions are determined by
analyzing the sets of characteristic vectors at the pause within the signal based on the Bayesian information
criterion. Improvement in quality of the characteristic vectors is achieved by taking into account only the
segments with the log energy above the given threshold. It is also suggested an approach for adaptive
automatic pause detection in speech signal
Unsupervised Speaker Change Detection for Broadcast News Segmentation
This paper presents a speaker change detection system for news broadcast segmentation based on a vector quantization (VQ) approach. The system does not make any assumption about the number of speakers or speaker identity. The system uses mel frequency cepstral coefficients and change detection is done using the VQ distortion measure and is evaluated against two other statistics, namely the symmetric Kullback-Leibler (KL2) distance and the so-called ‘divergence shape distance’. First level alarms are further tested using the VQ distortion. We find that the false alarm rate can be reduced without significant losses in the detection of correct changes. We furthermore evaluate the generalizability of the approach by testing the complete system on an independent set of broadcasts, including a channel not present in the training set. 1
- …