    DNN-based speaker clustering for speaker diarisation

    Speaker diarisation, the task of answering "who spoke when?", is often considered to consist of three independent stages: speech activity detection, speaker segmentation and speaker clustering. These stages separate speech from nonspeech, split the speech into speaker-homogeneous segments, and then group together segments belonging to the same speaker. This paper is concerned with speaker clustering, which is typically performed by bottom-up clustering using the Bayesian information criterion (BIC). We present a novel semi-supervised method of speaker clustering based on a deep neural network (DNN) model. A speaker separation DNN trained on independent data is used to iteratively relabel the test data set. This is achieved by reconfiguring the output layer, combined with fine-tuning in each iteration. A stopping criterion using posteriors as confidence scores is investigated. Results are shown on a meeting task (RT07) for single distant microphones and compared with standard diarisation approaches. The new method achieves a diarisation error rate (DER) of 14.8%, compared to a baseline of 19.9%.
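
    The bottom-up baseline mentioned above merges the closest pair of clusters according to a ΔBIC test and stops when no merge improves the criterion. A minimal Python sketch of that test follows; the penalty weight and helper names are assumptions for illustration, not the paper's code:

        import numpy as np

        def delta_bic(x, y, penalty_weight=1.0):
            """Delta-BIC for merging two segments, given as (frames x dims) arrays.

            Under this sign convention, negative values favour merging: a
            single Gaussian explains both segments well enough to offset
            the complexity penalty of keeping two separate models.
            """
            z = np.vstack([x, y])
            n_x, n_y, n_z = len(x), len(y), len(z)
            dim = z.shape[1]

            def logdet(frames):
                # Log-determinant of the full covariance of the frames.
                return np.linalg.slogdet(np.cov(frames, rowvar=False))[1]

            ratio = 0.5 * (n_z * logdet(z) - n_x * logdet(x) - n_y * logdet(y))
            penalty = 0.5 * penalty_weight * (dim + 0.5 * dim * (dim + 1)) * np.log(n_z)
            return ratio - penalty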

    The 2015 Sheffield System for Longitudinal Diarisation of Broadcast Media

    Speaker diarisation is the task of answering "who spoke when" within a multi-speaker audio recording. Diarisation of broadcast media typically operates on individual television shows and is a particularly difficult task, due to a high number of speakers and challenging background conditions. Using prior knowledge, such as that from previous shows in a series, can improve performance. Longitudinal diarisation uses knowledge from previous audio files to improve performance, but requires matching speakers across consecutive files. This paper describes the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge. The challenge required longitudinal diarisation of data from BBC archives under very constrained resource settings. Our system consists of three main stages: speech activity detection using DNNs with novel adaptation and decoding methods; speaker segmentation and clustering, with adaptation of the DNN-based clustering models; and finally speaker linking to match speakers across shows. The final result on the development set of 19 shows from five different television series was a Diarisation Error Rate of 50.77% on the diarisation and linking task.
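
    The speaker linking stage has to decide which clusters in a new show correspond to speakers already seen in earlier shows. Below is a minimal sketch of one way to do this, greedy cosine matching of per-speaker mean embeddings; the threshold and all names here are assumptions for illustration, not the Sheffield system:

        import numpy as np

        def link_speakers(prev_embeddings, new_embeddings, threshold=0.5):
            """Map new cluster labels to previous speaker labels.

            Both arguments are dicts of label -> mean embedding vector.
            A new label links to its most similar previous speaker when
            the cosine similarity clears the threshold; otherwise it is
            kept as a new speaker.
            """
            def cosine(a, b):
                return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

            links, taken = {}, set()
            for new_label, emb in new_embeddings.items():
                candidates = sorted(
                    ((cosine(emb, prev_emb), prev_label)
                     for prev_label, prev_emb in prev_embeddings.items()
                     if prev_label not in taken),
                    reverse=True)
                if candidates and candidates[0][0] >= threshold:
                    links[new_label] = candidates[0][1]
                    taken.add(candidates[0][1])
                else:
                    links[new_label] = new_label  # unseen speaker in this show
            return links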

    DNN approach to speaker diarisation using speaker channels

    Speaker diarisation addresses the question of 'who speaks when' in audio recordings, and has been studied extensively in the context of tasks such as broadcast news and meetings. Performing diarisation on individual headset microphone (IHM) channels is sometimes assumed to yield the desired output of speaker-labelled segments with timing information easily. However, it is shown that with imperfect data, such as speaker channels with heavy crosstalk and overlapping speech, this is not the case. Deep neural networks (DNNs) can be trained on features derived from the concatenation of speaker channel features to detect the correct channel for each frame. Crosstalk features can be calculated, and DNNs trained with or without overlapping speech, to combat problematic data. A simple frame decision metric based on counting label occurrences is investigated, along with a bias against selecting nonspeech for a frame. Finally, two different scoring setups are applied to both datasets. The stricter SHEF setup yields diarisation error rates (DER) of 9.2% on TBL and 23.2% on RT07, while the NIST setup achieves 5.7% and 15.1% respectively.
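
    The frame decision metric of counting occurrences lends itself to a short sketch. The window size and the bias value below are assumed for illustration, not taken from the paper:

        from collections import Counter

        def decide_channels(frame_labels, window=25, nonspeech="NS", bias=0.8):
            """Smooth per-frame channel labels by majority vote in a window.

            frame_labels holds one label per frame, e.g. "ch1", "ch2" or
            "NS" for nonspeech. The nonspeech count is down-weighted by
            `bias`, so speech channels win close votes.
            """
            half = window // 2
            smoothed = []
            for i in range(len(frame_labels)):
                counts = Counter(frame_labels[max(0, i - half):i + half + 1])
                if nonspeech in counts:
                    counts[nonspeech] = int(counts[nonspeech] * bias)
                smoothed.append(counts.most_common(1)[0][0])
            return smoothed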

    The Kriston AI System for the VoxCeleb Speaker Recognition Challenge 2022

    This technical report describes our system for tracks 1, 2 and 4 of the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22). By combining several ResNet variants, our submission for track 1 attained a minDCF of 0.090 with an EER of 1.401%. By further incorporating three fine-tuned pre-trained models, our submission for track 2 achieved a minDCF of 0.072 with an EER of 1.119%. For track 4, our system consisted of voice activity detection (VAD), speaker embedding extraction, agglomerative hierarchical clustering (AHC) followed by a re-clustering step based on a Bayesian hidden Markov model, and overlapped speech detection and handling. Our submission for track 4 achieved a diarisation error rate (DER) of 4.86%. The submissions all ranked second in their corresponding tracks.
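
    The AHC stage of such a pipeline can be sketched with standard tools. The snippet below is a minimal illustration using SciPy; the distance threshold is an assumption, and the Bayesian HMM re-clustering and overlap handling described above are not shown:

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage
        from scipy.spatial.distance import pdist

        def ahc_labels(embeddings, threshold=0.7):
            """Cluster (n_segments x dim) speaker embeddings with AHC.

            Average-linkage clustering on cosine distances, cutting the
            dendrogram at the given distance threshold; returns integer
            cluster labels, one per segment.
            """
            distances = pdist(np.asarray(embeddings), metric="cosine")
            tree = linkage(distances, method="average")
            return fcluster(tree, t=threshold, criterion="distance")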

    Automatic assessment of motivational interview with diabetes patients

    Diabetes costs the UK NHS £10 billion each year, and the cost pressure is projected to worsen. Motivational Interviewing (MI) is a goal-driven clinical conversation that seeks to reduce this cost by encouraging patients to take ownership of day-to-day monitoring and medication; its effectiveness is commonly evaluated against the Motivational Interviewing Treatment Integrity (MITI) manual. Unfortunately, measuring clinicians' MI performance is costly, requiring expert human instructors to ensure adherence to the MITI manual. Although it is desirable to assess MI in an automated fashion, many challenges remain due to its complexity. In this thesis, an automatic system to assess clinicians' adherence to the MITI criteria using different spoken language techniques was developed. The system tackled the challenges using automatic speech recognition (ASR), speaker diarisation, topic modelling and clinicians' behaviour code identification. For ASR, only 8 hours of in-domain MI data were available for training. Experiments with different open-source datasets, for example WSJCAM0 and AMI, are presented. I explored adaptive training of the ASR system, as well as the choice of training criterion and neural network structure. Over 45 minutes of MI testing data, the best ASR system achieves a word error rate of 43.59%. The i-vector based diarisation system achieves an F-measure of 0.822. The MITI behaviour code classification system with manual transcriptions achieves an accuracy of 78% for Non-Question/Question classification, 80% for Open Question/Closed Question classification and 78% for MI Adherence/MI Non-Adherence classification. Topic modelling was applied to track whether conversation segments were related to 'diabetes', on manual transcriptions as well as ASR outputs. The full automatic assessment system achieves an Assessment Error Rate of 22.54%. This is the first system to target the full automation of MI assessment with reasonable performance. In addition, the error analysis from each step can guide future research in this area towards further improvement and optimisation.
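
    The ASR stage is scored by word error rate (WER). For reference, the standard computation, word-level Levenshtein distance normalised by the reference length, can be sketched as follows (the textbook definition, not the thesis's scoring tool):

        def wer(reference, hypothesis):
            """Word error rate between two transcript strings."""
            ref, hyp = reference.split(), hypothesis.split()
            # dp[i][j] is the edit distance between ref[:i] and hyp[:j].
            dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                dp[i][0] = i
            for j in range(len(hyp) + 1):
                dp[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                    dp[i][j] = min(substitution,       # substitute or match
                                   dp[i - 1][j] + 1,   # deletion
                                   dp[i][j - 1] + 1)   # insertion
            return dp[-1][-1] / max(len(ref), 1)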

    Using Deep Neural Networks for Speaker Diarisation

    Speaker diarisation answers the question "who spoke when?" in an audio recording. The input may vary, but a system is required to output speaker-labelled segments in time. Typical stages are Speech Activity Detection (SAD), speaker segmentation and speaker clustering. Early research focussed on the Conversational Telephone Speech (CTS) and Broadcast News (BN) domains before the direction shifted to meetings and, more recently, broadcast media. The British Broadcasting Corporation (BBC) supplied data through the Multi-Genre Broadcast (MGB) Challenge in 2015, which showed the difficulties speaker diarisation systems have on broadcast media data. Diarisation is typically an unsupervised task which does not use auxiliary data or information to enhance a system. However, methods which do involve supplementary data have shown promise. Five semi-supervised methods are investigated which use a combination of inputs: different channel types and transcripts. The methods involve Deep Neural Networks (DNNs) for SAD, DNNs trained for channel detection, transcript alignment, and combinations of these approaches. However, these methods are only applicable when datasets contain the required inputs. Therefore, a method involving a pretrained Speaker Separation Deep Neural Network (ssDNN) is investigated, which is applicable to every dataset. This technique performs speaker clustering and speaker segmentation using DNNs, successfully for meeting data and with mixed results for broadcast media. The task of diarisation focuses on two aspects: accurate segments and accurate speaker labels. The Diarisation Error Rate (DER) does not evaluate segmentation quality, as it does not measure the number of correctly detected segments. Other metrics exist, such as boundary and purity measures, but these also mask the segmentation quality. An alternative metric is therefore presented based on the F-measure, which considers the number of hypothesis segments correctly matched to reference segments. A deeper insight into segment quality is gained through this metric.
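
    A minimal sketch of a segment-matching F-measure of the kind described above; the 50% mutual-overlap matching criterion here is an assumption for illustration, not necessarily the thesis's exact definition:

        def segment_f_measure(reference, hypothesis, min_overlap=0.5):
            """F-measure over (start, end) segment lists in seconds."""

            def matched(segment, others):
                # A segment matches if it shares at least min_overlap of
                # both its own duration and the other segment's duration.
                start, end = segment
                for o_start, o_end in others:
                    overlap = min(end, o_end) - max(start, o_start)
                    if (overlap >= min_overlap * (end - start)
                            and overlap >= min_overlap * (o_end - o_start)):
                        return True
                return False

            hits_hyp = sum(matched(h, reference) for h in hypothesis)
            hits_ref = sum(matched(r, hypothesis) for r in reference)
            precision = hits_hyp / len(hypothesis) if hypothesis else 0.0
            recall = hits_ref / len(reference) if reference else 0.0
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)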