80 research outputs found
Towards a complete Binary Key System for the Speaker Diarization Task
Speaker diarization is the task of partitioning an audio stream into homogeneous segments according to speaker identity. State-of-the-art speaker diarization systems now achieve very competitive performance. However, any small improvement in Diarization Error Rate (DER) usually comes at the cost of very long processing times (real-time factor above one), which makes such systems unsuitable for some time-critical, real-life applications. Recently, a novel fast speaker diarization technique based on speaker modeling using binary keys was presented. The technique runs up to ten times faster than real time with little increase in DER. Although the approach shows great potential, the presented results are still preliminary. The goal of this paper is to investigate this technique further, in order to move towards a complete binary-key-based system for the speaker diarization task. Preliminary experiments in Speech Activity Detection (SAD) based on binary keys show the feasibility of the binary key modeling approach for this task. Furthermore, the system has been tested on two different kinds of test data: meeting audio recordings and TV shows. The experiments carried out on the NIST RT05 and REPERE databases show promising results and indicate that there is still room for further improvement.
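The two quantities this abstract trades off can be written down in a few lines. The sketch below uses the standard definitions (real-time factor as processing time divided by audio duration, DER as the summed error time divided by total reference speech time); the function names and the example numbers are illustrative and do not come from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1 means faster than real time."""
    return processing_seconds / audio_seconds


def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Missed speech + false alarm + speaker confusion time, over total speech time."""
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical numbers: a 600 s recording processed in 60 s runs
# ten times faster than real time (real-time factor 0.1).
rtf = real_time_factor(60.0, 600.0)

# Hypothetical error breakdown, in seconds, over 300 s of reference speech.
der = diarization_error_rate(missed=10.0, false_alarm=5.0, confusion=15.0,
                             total_speech=300.0)
```

A system "ten times faster than real time" therefore corresponds to a real-time factor of about 0.1, whereas the slow systems the abstract mentions have a factor above one.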
Typicality extraction in a Speaker Binary Keys model
In the field of speaker recognition, the recently proposed notion of "Speaker Binary Key" provides a representation of each acoustic frame in a discriminant binary space. This approach relies on a unique acoustic model composed of a large set of speaker-specific local likelihood peaks (called specificities). The model provides a spatial coverage in which each frame is characterized in terms of its neighborhood. The most frequent specificities, selected to represent the whole utterance, generate a binary key vector. The flexibility of this modeling makes it possible to capture non-parametric behaviors. In this paper, we introduce a concept of "typicality" between binary keys, with a discriminant goal. We describe an algorithm able to extract such typicalities, which involves a singular value decomposition in a binary space. The theoretical aspects of this decomposition, as well as its potential for future developments, are presented. All the propositions are also experimentally validated using the NIST SRE 2008 framework.
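The key-building step described above (each frame selects its nearest specificities, and the most frequently selected ones become the set bits of the utterance key) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `binary_key` and `key_similarity`, the parameters `per_frame_top` and `key_ones`, the toy scores, and the intersection-over-union similarity are all hypothetical choices made for the example.

```python
from collections import Counter


def binary_key(frame_scores, per_frame_top=2, key_ones=2):
    """Build a binary key from per-frame scores (one score per specificity).

    Each frame votes for its `per_frame_top` best-scoring specificities;
    the `key_ones` most voted specificities become the 1-bits of the key.
    """
    n_components = len(frame_scores[0])
    votes = Counter()
    for scores in frame_scores:
        # Indices of the best-scoring specificities for this frame.
        best = sorted(range(n_components),
                      key=lambda i: scores[i], reverse=True)[:per_frame_top]
        votes.update(best)
    # Rank by vote count; ties are broken by the lower index.
    ranked = sorted(range(n_components), key=lambda i: (-votes[i], i))
    selected = set(ranked[:key_ones])
    return [1 if i in selected else 0 for i in range(n_components)]


def key_similarity(a, b):
    """Shared 1-bits divided by 1-bits in either key (an assumed measure)."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


# Toy scores: 3 frames, 4 specificities (made-up values).
frames = [
    [0.9, 0.1, 0.8, 0.0],
    [0.7, 0.2, 0.9, 0.1],
    [0.1, 0.6, 0.8, 0.3],
]
key = binary_key(frames)  # specificities 2 and 0 collect the most votes
```

Comparing two utterances then reduces to cheap bitwise operations on their keys, which is consistent with the speed advantage claimed for binary-key systems.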
Speaker Diarization
The thesis focuses on the topic of speaker diarization, a speech processing task that is commonly characterized by the question "Who speaks when?". It also addresses the related task of overlapping speech detection, which is very relevant for diarization.
The theoretical part of the thesis provides an overview of existing diarization approaches, both offline and online, and discusses some of the problematic areas that were identified in the early stages of the author's research. The thesis also includes an extensive comparison of existing diarization systems, with a focus on their reported performance. One chapter is dedicated to the topic of overlapping speech and the methods for detecting it.
The experimental part of the thesis then presents the work done on speaker diarization, which focused mostly on a GMM-based online diarization system and an i-vector-based system with both offline and online variants. The final section also details a newly proposed approach for detecting overlapping speech using a convolutional neural network.
Data-Driven Representation Learning in Multimodal Feature Fusion
Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is central to achieving improved model robustness and inference performance. This dissertation focuses on representation learning approaches as the fusion strategy. Specifically, the objective is to learn a shared latent representation that jointly exploits the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction.
We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described to support both multiple sensors and multiple descriptors for activity recognition. Designed to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems, for example in computer vision. Building on the MKL formulation, we next describe an auto-context algorithm for learning image context via fusion with low-level descriptors. Furthermore, a principled fusion algorithm that uses deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems.
In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently a special design of the learning architecture is needed. To improve temporal modeling for multivariate sequences, we developed two architectures centered around attention models. A novel clinical time series analysis model is proposed for several critical problems in healthcare. Another model, coupled with a triplet ranking loss as the metric learning framework, is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance while having a lower computational complexity. Finally, to perform community detection on multilayer graphs, a fusion algorithm is described that derives node embeddings from word embedding techniques and also exploits the complementary relational information contained in each layer of the graph.
Dissertation/Thesis, Doctoral Dissertation, Electrical Engineering 201
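The triplet ranking loss mentioned in this abstract can be illustrated with its common hinge form: the loss is zero once an anchor embedding is closer to a same-speaker (positive) embedding than to a different-speaker (negative) embedding by at least a margin. The sketch below assumes Euclidean distance, a margin of 1.0, and made-up two-dimensional embeddings; the dissertation's actual distance, margin, and network are not given in the abstract.

```python
import math


def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    Euclidean distance and the default margin are assumptions
    made for this illustration.
    """
    d_pos = math.dist(anchor, positive)  # distance to same-speaker example
    d_neg = math.dist(anchor, negative)  # distance to other-speaker example
    return max(0.0, d_pos - d_neg + margin)


# Made-up embeddings for illustration.
anchor = [0.0, 0.0]
same_speaker = [0.0, 1.0]

# Negative far away: the margin is satisfied, so the loss is zero.
easy = triplet_ranking_loss(anchor, same_speaker, [3.0, 4.0])

# Negative almost as close as the positive: the margin is violated.
hard = triplet_ranking_loss(anchor, same_speaker, [0.0, 1.5])
```

Minimizing this quantity over many triplets pulls segments of the same speaker together in the embedding space and pushes different speakers apart, which is what makes the learned metric usable for clustering in diarization.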