Automatic Segmentation of Broadcast News Audio using Self Similarity Matrix
Audio news broadcasts on radio are generally composed of music, commercials, reports from correspondents, and recorded statements, in addition to the actual news read by the newsreader. When news transcripts are available, automatic segmentation of the audio news broadcast to time-align the audio with the text transcription is essential for building frugal speech corpora. We address the problem of identifying the segments of the audio news broadcast corresponding to the news read by the newsreader, so that they can be mapped to the text transcripts. Existing techniques produce sub-optimal solutions when used to extract newsreader-read segments. In this paper, we propose a new technique that reliably identifies acoustic change points using an acoustic Self Similarity Matrix (SSM). We describe the two-pass technique in detail and verify its performance on real audio news broadcasts of All India Radio in different languages. Comment: 4 pages, 5 images
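The core idea behind SSM-based change detection can be pictured with a small sketch: compute a cosine self-similarity matrix over frame-level features and slide a checkerboard kernel along its diagonal, so that peaks in the resulting novelty curve mark acoustic change points. This is a generic Foote-style illustration on synthetic features, not the two-pass algorithm proposed in the paper; the kernel size and feature dimensions are assumptions.

```python
# Illustrative sketch: acoustic change-point detection with a self-similarity
# matrix (SSM). Synthetic features and Foote-style checkerboard novelty; NOT
# the paper's two-pass algorithm, only the underlying idea.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic frame-level features: two acoustically different regions
# (e.g. newsreader vs. correspondent), 200 frames of 13-dim "MFCCs" each.
feats = np.vstack([rng.normal(0.0, 1.0, (200, 13)),
                   rng.normal(2.0, 1.0, (200, 13))])

# Cosine self-similarity matrix.
norm = feats / np.linalg.norm(feats, axis=1, keepdims=True)
ssm = norm @ norm.T                       # (400, 400)

# Checkerboard kernel slid along the diagonal: high response where the
# block structure of the SSM changes, i.e. at acoustic change points.
L = 32                                    # half-width of the kernel (assumed)
kernel = np.block([[ np.ones((L, L)), -np.ones((L, L))],
                   [-np.ones((L, L)),  np.ones((L, L))]])

novelty = np.zeros(len(feats))
for i in range(L, len(feats) - L):
    novelty[i] = np.sum(ssm[i - L:i + L, i - L:i + L] * kernel)

print("estimated change frame:", int(np.argmax(novelty)))   # ~200
```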
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering.
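As a concrete picture of the metric-based segmentation family surveyed here, the sketch below implements a ΔBIC test between two adjacent analysis windows: a positive ΔBIC suggests the windows come from different speakers. It is a textbook formulation for illustration only, not a specific algorithm reviewed in the survey; the penalty weight and the synthetic Gaussian features are assumptions.

```python
# Minimal sketch of a metric-based (delta-BIC) speaker change test between two
# adjacent analysis windows; a textbook formulation, not a specific system
# from the survey.
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Delta-BIC for the hypothesis that x1 and x2 come from different speakers.
    x1, x2: (n_frames, dim) feature matrices. Positive value => change point."""
    x = np.vstack([x1, x2])
    n, d = x.shape
    n1, n2 = len(x1), len(x2)

    def logdet_cov(a):
        cov = np.cov(a, rowvar=False) + 1e-6 * np.eye(d)   # regularize
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(x)
            - 0.5 * n1 * logdet_cov(x1)
            - 0.5 * n2 * logdet_cov(x2)
            - penalty)

rng = np.random.default_rng(1)
same = delta_bic(rng.normal(0, 1, (150, 12)), rng.normal(0, 1, (150, 12)))
diff = delta_bic(rng.normal(0, 1, (150, 12)), rng.normal(3, 1, (150, 12)))
print(f"same speaker: {same:.1f}  different speakers: {diff:.1f}")
```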
Spot the conversation: speaker diarisation in the wild
The goal of this paper is speaker diarisation of videos collected 'in the
wild'. We make three key contributions. First, we propose an automatic
audio-visual diarisation method for YouTube videos. Our method consists of
active speaker detection using audio-visual methods and speaker verification
using self-enrolled speaker models. Second, we integrate our method into a
semi-automatic dataset creation pipeline which significantly reduces the number
of hours required to annotate videos with diarisation labels. Finally, we use
this pipeline to create a large-scale diarisation dataset called VoxConverse,
collected from 'in the wild' videos, which we will release publicly to the
research community. Our dataset consists of overlapping speech, a large and
diverse speaker pool, and challenging background conditions. Comment: The dataset will be available for download from
http://www.robots.ox.ac.uk/~vgg/data/voxceleb/voxconverse.html . The
development set will be released in July 2020, and the test set will be
released in October 2020.
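The speaker-verification step of such a pipeline can be pictured as matching each detected speech segment to the closest self-enrolled speaker model by cosine similarity, leaving low-scoring segments for manual review. The sketch below uses random vectors in place of real audio-visual embeddings and an assumed threshold of 0.5; it is not the authors' implementation.

```python
# Toy sketch of a verification-style assignment step: each speech-segment
# embedding is matched to the closest self-enrolled speaker model by cosine
# similarity. Random vectors stand in for real embeddings; this is not the
# authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Self-enrolled speaker models (e.g. averaged embeddings from segments where
# active speaker detection was confident).
enrolled = {"spk_A": rng.normal(size=256), "spk_B": rng.normal(size=256)}

# Segment embeddings to be labelled; here perturbed copies of the models.
segments = [("0.0-2.1s", enrolled["spk_A"] + 0.1 * rng.normal(size=256)),
            ("2.1-5.4s", enrolled["spk_B"] + 0.1 * rng.normal(size=256))]

threshold = 0.5   # below this, leave the segment unlabelled for manual review
for name, emb in segments:
    scores = {spk: cosine(emb, model) for spk, model in enrolled.items()}
    best = max(scores, key=scores.get)
    label = best if scores[best] >= threshold else "unknown"
    print(name, label, round(scores[best], 3))
```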
TSUP Speaker Diarization System for Conversational Short-phrase Speaker Diarization Challenge
This paper describes the TSUP team's submission to the ISCSLP 2022
conversational short-phrase speaker diarization (CSSD) challenge, which
particularly focuses on short-phrase conversations with a new evaluation metric
called conversational diarization error rate (CDER). In this challenge, we
explore three kinds of typical speaker diarization systems: spectral clustering (SC) based diarization, target-speaker voice activity detection (TS-VAD), and end-to-end neural diarization (EEND). Our major findings are summarized as follows. First, the SC approach is favored over the other two under the new CDER metric. Second, hyperparameter tuning is essential to CDER for all three types of speaker diarization systems; in particular, CDER becomes smaller as the sub-segment length is set longer. Finally, multi-system fusion through DOVER-LAP worsens the CDER metric on the challenge data. Our submitted SC system eventually ranked third in the challenge.
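For readers unfamiliar with the SC back-end, a generic spectral-clustering diarization recipe looks roughly like this: build a cosine affinity matrix over segment embeddings, estimate the number of speakers from the eigengap of the graph Laplacian, and run k-means on the leading eigenvectors. The sketch below follows that standard recipe on synthetic embeddings; it is not the TSUP system, and the max_speakers cap is an assumption.

```python
# Generic spectral-clustering back-end for diarization: cosine affinity over
# segment embeddings, eigengap of the graph Laplacian to estimate the number
# of speakers, k-means on the spectral embedding. A sketch of the standard
# recipe, not the TSUP submission itself.
import numpy as np
from sklearn.cluster import KMeans

def spectral_diarize(emb, max_speakers=8):
    # Cosine affinity, clipped to be non-negative.
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    affinity = np.clip(norm @ norm.T, 0.0, None)
    np.fill_diagonal(affinity, 0.0)

    # Unnormalized graph Laplacian and its eigendecomposition.
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals, eigvecs = np.linalg.eigh(laplacian)

    # Eigengap heuristic: number of speakers = index of the largest gap
    # among the smallest eigenvalues.
    gaps = np.diff(eigvals[:max_speakers + 1])
    n_spk = int(np.argmax(gaps)) + 1

    # Cluster the rows of the first n_spk eigenvectors.
    return KMeans(n_clusters=n_spk, n_init=10, random_state=0).fit_predict(
        eigvecs[:, :n_spk])

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 64)) * 3.0                 # 3 synthetic speakers
emb = np.vstack([c + rng.normal(0, 0.3, (20, 64)) for c in centers])
labels = spectral_diarize(emb)
print("estimated speakers:", len(set(labels)))
```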
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization
The performance of most speaker diarization systems with x-vector embeddings
is vulnerable to noisy environments and lacks domain robustness. Earlier
work on speaker diarization using a generative adversarial network (GAN) with an
encoder network (ClusterGAN) to project input x-vectors into a latent space has
shown promising performance on meeting data. In this paper, we extend the
ClusterGAN network to improve diarization robustness and enable rapid
generalization across various challenging domains. To this end, we fetch the
pre-trained encoder from the ClusterGAN and fine-tune it by using prototypical
loss (meta-ClusterGAN or MCGAN) under the meta-learning paradigm. Experiments
are conducted on CALLHOME telephonic conversations, AMI meeting data, the
DIHARD II dev set (a challenging multi-domain corpus), and two
child-clinician interaction corpora (ADOS, BOSCC) related to the autism
spectrum disorder domain. Extensive analyses of the experimental data are done
to investigate the effectiveness of the proposed ClusterGAN and MCGAN
embeddings over x-vectors. The results show that the proposed embeddings with
normalized maximum eigengap spectral clustering (NME-SC) back-end consistently
outperform the Kaldi state-of-the-art x-vector diarization system. Finally, we
employ embedding fusion with x-vectors to provide further improvement in
diarization performance. We achieve a relative diarization error rate (DER)
improvement of 6.67% to 53.93% on the aforementioned datasets using the
proposed fused embeddings over x-vectors. In addition, the MCGAN embeddings provide better performance than x-vectors and ClusterGAN in estimating the number of speakers and in diarizing short speech segments on telephonic data. Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
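The prototypical loss used for the meta-learning fine-tuning can be sketched as follows: per-speaker prototypes are computed as support-set means, query embeddings are scored by negative squared distance to each prototype, and a softmax cross-entropy over those scores is minimized. The sketch below is the standard prototypical-network formulation on synthetic embeddings, not the MCGAN training code.

```python
# Generic prototypical-loss computation for meta-learning style fine-tuning:
# prototypes are support-set means per speaker, queries are scored by negative
# squared distance, loss is cross-entropy. Standard prototypical-network
# formulation, not the MCGAN training code.
import numpy as np

def prototypical_loss(support, support_lbl, query, query_lbl):
    """support/query: (n, dim) embeddings; *_lbl: integer speaker labels."""
    classes = np.unique(support_lbl)
    prototypes = np.stack([support[support_lbl == c].mean(axis=0)
                           for c in classes])               # (n_cls, dim)

    # Negative squared Euclidean distance as logits.
    d2 = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2

    # Softmax cross-entropy against the true speaker of each query.
    logits -= logits.max(axis=1, keepdims=True)              # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.searchsorted(classes, query_lbl)
    return -log_prob[np.arange(len(query)), idx].mean()

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 32)) * 2.0                     # 4 "speakers"
support = np.vstack([c + 0.2 * rng.normal(size=(5, 32)) for c in centers])
query   = np.vstack([c + 0.2 * rng.normal(size=(3, 32)) for c in centers])
s_lbl = np.repeat(np.arange(4), 5)
q_lbl = np.repeat(np.arange(4), 3)
print("prototypical loss:", round(prototypical_loss(support, s_lbl, query, q_lbl), 4))
```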