2,206 research outputs found

    On adaptive decision rules and decision parameter adaptation for automatic speech recognition

    Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori (MAP) decision rule: the forms of the acoustic and language model distributions are specified, and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevalent training method. However, due to unknown speech distributions, sparse training data, high spectral and temporal variability in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with changing speakers and speaking conditions in real operational settings for high-performance speech recognition, such paradigms incorporate a small amount of speaker- and environment-specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine the prior knowledge in an existing collection of general models with a new set of condition-specific adaptation data. In this paper, the mathematical framework for Bayesian adaptation of acoustic and language model parameters is first described. Maximum a posteriori point estimation is then developed for hidden Markov models and for a number of useful parameter densities commonly used in automatic speech recognition and natural language processing.
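    For reference, a minimal sketch of the plug-in MAP decision rule and MAP point estimation the abstract refers to, in standard textbook notation (not taken verbatim from the paper; here X denotes the acoustic observations and W a candidate word sequence):

```latex
% Plug-in MAP decision rule: decode the word sequence W that maximizes the
% posterior of W given the observations X, with the true distributions
% replaced by parametric acoustic and language models whose parameters
% \hat{\theta} and \hat{\gamma} are estimated from training corpora.
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \; p_{\hat{\theta}}(X \mid W)\, P_{\hat{\gamma}}(W)

% MAP (Bayesian) point estimation of the parameters themselves combines the
% likelihood of the training/adaptation data X with a prior density g(\theta):
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; p(X \mid \theta)\, g(\theta)
```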

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other speech processing applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Online adaptive learning of continuous-density hidden Markov models based on multiple-stream prior evolution and posterior pooling

    We introduce a new adaptive Bayesian learning framework, called multiple-stream prior evolution and posterior pooling, for online adaptation of continuous-density hidden Markov model (CDHMM) parameters. Among the three architectures we propose for this framework, we study in detail a specific two-stream system in which linear transformations are applied to the mean vectors of the CDHMMs to control the evolution of their prior distribution. This new stream of prior distribution can be combined with another stream of prior distribution that evolves without any constraints. In a series of speaker adaptation experiments on continuous Mandarin speech recognition, we show that the new adaptation algorithm achieves fast-adaptation performance similar to that of incremental maximum likelihood linear regression (MLLR) when the amount of adaptation data is small, while maintaining the good asymptotic convergence of our previously proposed quasi-Bayes adaptation algorithms.
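    A minimal sketch of the MLLR-style mean transformation the abstract builds on, i.e., applying a shared affine transform to the Gaussian mean vectors of a CDHMM (illustrative only, with hypothetical values; this is not the paper's implementation):

```python
import numpy as np

def mllr_transform_means(means, A, b):
    """Apply a global affine transform mu' = A @ mu + b to CDHMM means.

    means: (num_gaussians, dim) array of Gaussian mean vectors
    A:     (dim, dim) transform matrix estimated from adaptation data
    b:     (dim,) bias vector
    A single (A, b) pair moves many parameters at once, which is why a
    small amount of adaptation data suffices.
    """
    return means @ A.T + b

# Toy usage (dim = 3, two Gaussians; A and b stand in for estimated values):
means = np.array([[0.0, 1.0, -1.0],
                  [2.0, 0.5, 0.0]])
A = np.eye(3) * 1.05
b = np.array([0.1, -0.2, 0.0])
print(mllr_transform_means(means, A, b))
```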

    ๋น„ํ™”์ž ์š”์†Œ์— ๊ฐ•์ธํ•œ ํ™”์ž ์ธ์‹์„ ์œ„ํ•œ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์„ฑ๋ฌธ ์ถ”์ถœ

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2021. Advisor: Nam Soo Kim.

    In recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as with most classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples recorded under different conditions (e.g., recording devices, emotional states). Also, unlike classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), deep learning-based embedding systems are trained in a fully supervised manner, so they cannot exploit unlabeled datasets during training. In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding together with a representation of the uncertainty within the input speech distribution. Unlike conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the use of unlabeled datasets. Furthermore, to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both the VAE- and ALI-based embedding techniques show strong performance on short-duration speaker verification, outperforming the conventional i-vector framework. Additionally, we present a fully supervised training method for disentangling non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts speaker and nuisance attribute (e.g., recording channel, emotion) embeddings and trains each to carry maximum information about its main task while ensuring maximum uncertainty about its sub-task. Since the proposed method does not require any heuristic training strategy, unlike conventional disentanglement techniques (e.g., adversarial learning, gradient reversal), optimizing the embedding network is comparatively stable. The proposed scheme achieves state-of-the-art performance on the RSR2015 Part 3 dataset and efficiently disentangles recording-device and emotion information from the speaker embedding.

    Contents:
    1. Introduction
    2. Conventional embedding techniques for speaker recognition
       2.1. i-vector framework
       2.2. Deep learning-based speaker embedding
            2.2.1. Deep embedding network
            2.2.2. Conventional disentanglement methods
    3. Unsupervised learning of total variability embedding for speaker verification with random digit strings
       3.1. Introduction
       3.2. Variational autoencoder
       3.3. Variational inference model for non-linear total variability embedding
            3.3.1. Maximum likelihood training
            3.3.2. Non-linear feature extraction and speaker verification
       3.4. Experiments
            3.4.1. Databases
            3.4.2. Experimental setup
            3.4.3. Effect of the duration on the latent variable
            3.4.4. Experiments with VAEs
            3.4.5. Feature-level fusion of i-vector and latent variable
            3.4.6. Score-level fusion of i-vector and latent variable
       3.5. Summary
    4. Adversarially learned total variability embedding for speaker recognition with random digit strings
       4.1. Introduction
       4.2. Adversarially learned inference
       4.3. Adversarially learned feature extraction
            4.3.1. Maximum likelihood criterion
            4.3.2. Adversarially learned inference for non-linear i-vector extraction
            4.3.3. Relationship to the VAE-based feature extractor
       4.4. Experiments
            4.4.1. Databases
            4.4.2. Experimental setup
            4.4.3. Effect of the duration on the latent variable
            4.4.4. Speaker verification and identification with different utterance-level features
       4.5. Summary
    5. Disentangled speaker and nuisance attribute embedding for robust speaker verification
       5.1. Introduction
       5.2. Joint factor embedding
            5.2.1. Joint factor embedding network architecture
            5.2.2. Training for joint factor embedding
       5.3. Experiments
            5.3.1. Channel disentanglement experiments
            5.3.2. Emotion disentanglement
            5.3.3. Noise disentanglement
       5.4. Summary
    6. Conclusion
    Bibliography
    Abstract (Korean)

    Self-supervised learning of a facial attribute embedding from video

    We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that a curriculum learning regime improves the learned embedding; finally, we demonstrate that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks, and facial expression, i.e., facial attributes, without having been supervised with any labelled data. Our approach is comparable or superior to state-of-the-art self-supervised methods on these tasks and approaches the performance of supervised methods. (To appear in BMVC 2018. Supplementary material: http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.htm)
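    A minimal sketch of the confidence-weighted aggregation idea, i.e., combining per-frame embeddings from one face-track using predicted confidence scores so informative source frames dominate. This is an illustrative simplification, not the authors' code; the function and variable names are hypothetical.

```python
import numpy as np

def aggregate_embeddings(frame_embeddings, confidences):
    """frame_embeddings: (num_frames, dim); confidences: (num_frames,).

    Softmax the confidence scores over frames, then take the weighted
    sum of the frame embeddings.
    """
    weights = np.exp(confidences - confidences.max())  # stable softmax
    weights /= weights.sum()
    return (weights[:, None] * frame_embeddings).sum(axis=0)

# Toy usage: five frames of hypothetical 256-D embeddings.
emb = np.random.randn(5, 256)
conf = np.array([0.1, 2.0, 0.5, -1.0, 0.3])
print(aggregate_embeddings(emb, conf).shape)  # (256,)
```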

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic in automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
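    The additive and convolutional degradation mentioned above is conventionally modeled as y[t] = (x * h)[t] + n[t], where h is a channel/room impulse response and n is additive noise. A minimal sketch of that standard model (illustrative, with toy values; not from the paper):

```python
import numpy as np

def degrade(x, h, noise, snr_db):
    """Convolve clean speech x with impulse response h, then add noise
    scaled to the requested signal-to-noise ratio (in dB)."""
    reverberant = np.convolve(x, h)[: len(x)]   # convolutional distortion
    sig_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise[: len(x)] ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + scale * noise[: len(x)]

# Toy usage: a 1 s, 440 Hz tone at 16 kHz, a short impulse response,
# and white noise at 10 dB SNR.
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
h = np.array([1.0, 0.0, 0.6, 0.0, 0.3])
n = np.random.randn(16000)
y = degrade(x, h, n, snr_db=10)
```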

    Recent Advances in Signal Processing

    Signal processing is a critical issue in the majority of new technological inventions and challenges, across a variety of applications in both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian, and have always favored closed-form tractability over real-world accuracy; these constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five areas depending on the application at hand, ordered to address image processing, speech processing, communication systems, time-series analysis, and educational packages, respectively. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.
    • โ€ฆ
    corecore