394 research outputs found

    Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

    Full text link
    In this paper, we explore the encoding/pooling layer and loss function in the end-to-end speaker and language recognition system. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variable-length input and produces an utterance level result. In the end-to-end system, the encoding layer plays a role in aggregating the variable-length input sequence into an utterance level representation. Besides the basic temporal average pooling, we introduce a self-attentive pooling layer and a learnable dictionary encoding layer to get the utterance level representation. In terms of loss function for open-set speaker verification, to get more discriminative speaker embedding, center loss and angular softmax loss is introduced in the end-to-end system. Experimental results on Voxceleb and NIST LRE 07 datasets show that the performance of end-to-end learning system could be significantly improved by the proposed encoding layer and loss function.Comment: Accepted for Speaker Odyssey 201

    Language recognition using phonotactic-based shifted delta coefficients and multiple phone recognizers

    Get PDF
    A new language recognition technique based on the application of the philosophy of the Shifted Delta Coefficients (SDC) to phone log-likelihood ratio features (PLLR) is described. The new methodology allows the incorporation of long-span phonetic information at a frame-by-frame level while dealing with the temporal length of each phone unit. The proposed features are used to train an i-vector based system and tested on the Albayzin LRE 2012 dataset. The results show a relative improvement of 33.3% in Cavg in comparison with different state-of-the-art acoustic i-vector based systems. On the other hand, the integration of parallel phone ASR systems where each one is used to generate multiple PLLR coefficients which are stacked together and then projected into a reduced dimension are also presented. Finally, the paper shows how the incorporation of state information from the phone ASR contributes to provide additional improvements and how the fusion with the other acoustic and phonotactic systems provides an important improvement of 25.8% over the system presented during the competition

    Extended phone log-likelihood ratio features and acoustic-based I-vectors for language recognition

    Get PDF
    This paper presents new techniques with relevant improvements added to the primary system presented by our group to the Albayzin 2012 LRE competition, where the use of any additional corpora for training or optimizing the models was forbidden. In this work, we present the incorporation of an additional phonotactic subsystem based on the use of phone log-likelihood ratio features (PLLR) extracted from different phonotactic recognizers that contributes to improve the accuracy of the system in a 21.4% in terms of Cavg (we also present results for the official metric during the evaluation, Fact). We will present how using these features at the phone state level provides significant improvements, when used together with dimensionality reduction techniques, especially PCA. We have also experimented with applying alternative SDC-like configurations on these PLLR features with additional improvements. Also, we will describe some modifications to the MFCC-based acoustic i-vector system which have also contributed to additional improvements. The final fused system outperformed the baseline in 27.4% in Cavg

    Wavelet-based techniques for speech recognition

    Get PDF
    In this thesis, new wavelet-based techniques have been developed for the extraction of features from speech signals for the purpose of automatic speech recognition (ASR). One of the advantages of the wavelet transform over the short time Fourier transform (STFT) is its capability to process non-stationary signals. Since speech signals are not strictly stationary the wavelet transform is a better choice for time-frequency transformation of these signals. In addition it has compactly supported basis functions, thereby reducing the amount of computation as opposed to STFT where an overlapping window is needed. [Continues.

    Unsupervised crosslingual adaptation of tokenisers for spoken language recognition

    Get PDF
    Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary phonetic information. We present a study on the use of deep neural network tokenisers. Unsupervised crosslingual adaptation was performed to adapt the baseline tokeniser trained on English conversational telephone speech data to different languages. Two training and adaptation approaches, namely cross-entropy adaptation and state-level minimum Bayes risk adaptation, were tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems using the tokenisers adapted to different languages were combined using score fusion, giving 7-18% reduction in minimum detection cost function (minDCF) compared with the baseline configurations without adapted tokenisers. Analysis of results showed that the ensemble tokenisers gave diverse representation of phonemes, thus bringing complementary effects when SLR systems with different tokenisers were combined. SLR performance was also shown to be related to the quality of the adapted tokenisers

    Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion

    Get PDF
    Recently, audio segmentation has attracted research interest because of its usefulness in several applications like audio indexing and retrieval, subtitling, monitoring of acoustic scenes, etc. Moreover, a previous audio segmentation stage may be useful to improve the robustness of speech technologies like automatic speech recognition and speaker diarization. In this article, we present the evaluation of broadcast news audio segmentation systems carried out in the context of the Albayzín-2010 evaluation campaign. That evaluation consisted of segmenting audio from the 3/24 Catalan TV channel into five acoustic classes: music, speech, speech over music, speech over noise, and the other. The evaluation results displayed the difficulty of this segmentation task. In this article, after presenting the database and metric, as well as the feature extraction methods and segmentation techniques used by the submitted systems, the experimental results are analyzed and compared, with the aim of gaining an insight into the proposed solutions, and looking for directions which are promising.Peer ReviewedPostprint (published version

    Unsupervised crosslingual adaptation of tokenisers for spoken language recognition

    Get PDF
    Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary phonetic information. We present a study on the use of deep neural network tokenisers. Unsupervised crosslingual adaptation was performed to adapt the baseline tokeniser trained on English conversational telephone speech data to different languages. Two training and adaptation approaches, namely cross-entropy adaptation and state-level minimum Bayes risk adaptation, were tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems using the tokenisers adapted to different languages were combined using score fusion, giving 7-18% reduction in minimum detection cost function (minDCF) compared with the baseline configurations without adapted tokenisers. Analysis of results showed that the ensemble tokenisers gave diverse representation of phonemes, thus bringing complementary effects when SLR systems with different tokenisers were combined. SLR performance was also shown to be related to the quality of the adapted tokenisers

    Exploring I-vector based speaker age estimation

    Get PDF

    Detection and handling of overlapping speech for speaker diarization

    Get PDF
    For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings, compared to other domains, exhibit an increased complexity due to the spontaneity of speech, reverberation effects, and also due to the presence of overlapping speech. Overlapping speech refers to situations when two or more speakers are speaking simultaneously. In meeting data, a substantial portion of errors of the conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can eventually lead to corrupt single-speaker models and thus to a worse segmentation. This thesis concerns the detection of overlapping speech segments and its further application for the improvement of speaker diarization performance. We propose the use of three spatial cross-correlationbased parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or by a multi-layer perceptron. In addition, we also investigate the possibility of employing longterm prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps. Firstly, a ranking according to mRMR criterion is obtained, and then, a standard hill-climbing wrapper approach is applied in order to determine the optimal number of features. The novel spatial as well as prosodic parameters are used in combination with spectral-based features suggested previously in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site. In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments are also discarded from the model training. The proposed overlap labeling technique is integrated in Viterbi decoding, a part of the diarization algorithm. During the system development it was discovered that it is favorable to do an independent optimization of overlap exclusion and labeling with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT ¿09 meeting recordings as well. The addition of beamforming and TDOA feature stream into the baseline diarization system, which was aimed at improving the clustering process, results in a bit higher effectiveness of the overlap labeling algorithm. A more detailed analysis on the overlap exclusion behavior reveals big improvement contrasts between individual meeting recordings as well as between various settings of the overlap detection operation point. However, a high performance variability across different recordings is also typical of the baseline diarization system, without any overlap handling

    Arabic Isolated Word Speaker Dependent Recognition System

    Get PDF
    In this thesis we designed a new Arabic isolated word speaker dependent recognition system based on a combination of several features extraction and classifications techniques. Where, the system combines the methods outputs using a voting rule. The system is implemented with a graphic user interface under Matlab using G62 Core I3/2.26 Ghz processor laptop. The dataset used in this system include 40 Arabic words recorded in a calm environment with 5 different speakers using laptop microphone. Each speaker will read each word 8 times. 5 of them are used in training and the remaining are used in the test phase. First in the preprocessing step we used an endpoint detection technique based on energy and zero crossing rates to identify the start and the end of each word and remove silences then we used a discrete wavelet transform to remove noise from signal. In order to accelerate the system and reduce the execution time we make the system first to recognize the speaker and load only the reference model of that user. We compared 5 different methods which are pairwise Euclidean distance with MelFrequency cepstral coefficients (MFCC), Dynamic Time Warping (DTW) with Formants features, Gaussian Mixture Model (GMM) with MFCC, MFCC+DTW and Itakura distance with Linear Predictive Coding features (LPC) and we got a recognition rate of 85.23%, 57% , 87%, 90%, 83% respectively. In order to improve the accuracy of the system, we tested several combinations of these 5 methods. We find that the best combination is MFCC | Euclidean + Formant | DTW + MFCC | DTW + LPC | Itakura with an accuracy of 94.39% but with large computation time of 2.9 seconds. In order to reduce the computation time of this hybrid, we compare several subcombination of it and find that the best performance in trade off computation time is by first combining MFCC | Euclidean + LPC | Itakura and only when the two methods do not match the system will add Formant | DTW + MFCC | DTW methods to the combination, where the average computation time is reduced to the half to 1.56 seconds and the system accuracy is improved to 94.56%. Finally, the proposed system is good and competitive compared with other previous researches
    corecore