244 research outputs found

    Apprentissage automatique pour le codage cognitif de la parole

    Get PDF
    Depuis les années 80, les codecs vocaux reposent sur des stratégies de codage à court terme qui fonctionnent au niveau de la sous-trame ou de la trame (généralement 5 à 20 ms). Les chercheurs ont essentiellement ajusté et combiné un nombre limité de technologies disponibles (transformation, prédiction linéaire, quantification) et de stratégies (suivi de forme d'onde, mise en forme du bruit) pour construire des architectures de codage de plus en plus complexes. Dans cette thèse, plutôt que de s'appuyer sur des stratégies de codage à court terme, nous développons un cadre alternatif pour la compression de la parole en codant les attributs de la parole qui sont des caractéristiques perceptuellement importantes des signaux vocaux. Afin d'atteindre cet objectif, nous résolvons trois problèmes de complexité croissante, à savoir la classification, la prédiction et l'apprentissage des représentations. La classification est un élément courant dans les conceptions de codecs modernes. Dans un premier temps, nous concevons un classifieur pour identifier les émotions, qui sont parmi les attributs à long terme les plus complexes de la parole. Dans une deuxième étape, nous concevons un prédicteur d'échantillon de parole, qui est un autre élément commun dans les conceptions de codecs modernes, pour mettre en évidence les avantages du traitement du signal de parole à long terme et non linéaire. Ensuite, nous explorons les variables latentes, un espace de représentations de la parole, pour coder les attributs de la parole à court et à long terme. Enfin, nous proposons un réseau décodeur pour synthétiser les signaux de parole à partir de ces représentations, ce qui constitue notre dernière étape vers la construction d'une méthode complète de compression de la parole basée sur l'apprentissage automatique de bout en bout. Bien que chaque étape de développement proposée dans cette thèse puisse faire partie d'un codec à elle seule, chaque étape fournit également des informations et une base pour la prochaine étape de développement jusqu'à ce qu'un codec entièrement basé sur l'apprentissage automatique soit atteint. Les deux premières étapes, la classification et la prédiction, fournissent de nouveaux outils qui pourraient remplacer et améliorer des éléments des codecs existants. Dans la première étape, nous utilisons une combinaison de modèle source-filtre et de machine à état liquide (LSM), pour démontrer que les caractéristiques liées aux émotions peuvent être facilement extraites et classées à l'aide d'un simple classificateur. Dans la deuxième étape, un seul réseau de bout en bout utilisant une longue mémoire à court terme (LSTM) est utilisé pour produire des trames vocales avec une qualité subjective élevée pour les applications de masquage de perte de paquets (PLC). Dans les dernières étapes, nous nous appuyons sur les résultats des étapes précédentes pour concevoir un codec entièrement basé sur l'apprentissage automatique. un réseau d'encodage, formulé à l'aide d'un réseau neuronal profond (DNN) et entraîné sur plusieurs bases de données publiques, extrait et encode les représentations de la parole en utilisant la prédiction dans un espace latent. Une approche d'apprentissage non supervisé basée sur plusieurs principes de cognition est proposée pour extraire des représentations à partir de trames de parole courtes et longues en utilisant l'information mutuelle et la perte contrastive. La capacité de ces représentations apprises à capturer divers attributs de la parole à court et à long terme est démontrée. Enfin, une structure de décodage est proposée pour synthétiser des signaux de parole à partir de ces représentations. L'entraînement contradictoire est utilisé comme une approximation des mesures subjectives de la qualité de la parole afin de synthétiser des échantillons de parole à consonance naturelle. La haute qualité perceptuelle de la parole synthétisée ainsi obtenue prouve que les représentations extraites sont efficaces pour préserver toutes sortes d'attributs de la parole et donc qu'une méthode de compression complète est démontrée avec l'approche proposée.Abstract: Since the 80s, speech codecs have relied on short-term coding strategies that operate at the subframe or frame level (typically 5 to 20ms). Researchers essentially adjusted and combined a limited number of available technologies (transform, linear prediction, quantization) and strategies (waveform matching, noise shaping) to build increasingly complex coding architectures. In this thesis, rather than relying on short-term coding strategies, we develop an alternative framework for speech compression by encoding speech attributes that are perceptually important characteristics of speech signals. In order to achieve this objective, we solve three problems of increasing complexity, namely classification, prediction and representation learning. Classification is a common element in modern codec designs. In a first step, we design a classifier to identify emotions, which are among the most complex long-term speech attributes. In a second step, we design a speech sample predictor, which is another common element in modern codec designs, to highlight the benefits of long-term and non-linear speech signal processing. Then, we explore latent variables, a space of speech representations, to encode both short-term and long-term speech attributes. Lastly, we propose a decoder network to synthesize speech signals from these representations, which constitutes our final step towards building a complete, end-to-end machine-learning based speech compression method. The first two steps, classification and prediction, provide new tools that could replace and improve elements of existing codecs. In the first step, we use a combination of source-filter model and liquid state machine (LSM), to demonstrate that features related to emotions can be easily extracted and classified using a simple classifier. In the second step, a single end-to-end network using long short-term memory (LSTM) is shown to produce speech frames with high subjective quality for packet loss concealment (PLC) applications. In the last steps, we build upon the results of previous steps to design a fully machine learning-based codec. An encoder network, formulated using a deep neural network (DNN) and trained on multiple public databases, extracts and encodes speech representations using prediction in a latent space. An unsupervised learning approach based on several principles of cognition is proposed to extract representations from both short and long frames of data using mutual information and contrastive loss. The ability of these learned representations to capture various short- and long-term speech attributes is demonstrated. Finally, a decoder structure is proposed to synthesize speech signals from these representations. Adversarial training is used as an approximation to subjective speech quality measures in order to synthesize natural-sounding speech samples. The high perceptual quality of synthesized speech thus achieved proves that the extracted representations are efficient at preserving all sorts of speech attributes and therefore that a complete compression method is demonstrated with the proposed approach

    Tree-Structured Nonlinear Adaptive Signal Processing

    Get PDF
    In communication systems, nonlinear adaptive filtering has become increasingly popular in a variety of applications such as channel equalization, echo cancellation and speech coding. However, existing nonlinear adaptive filters such as polynomial (truncated Volterra series) filters and multilayer perceptrons suffer from a number of problems. First, although high Order polynomials can approximate complex nonlinearities, they also train very slowly. Second, there is no systematic and efficient way to select their structure. As for multilayer perceptrons, they have a very complicated structure and train extremely slowly Motivated by the success of classification and regression trees on difficult nonlinear and nonparametfic problems, we propose the idea of a tree-structured piecewise linear adaptive filter. In the proposed method each node in a tree is associated with a linear filter restricted to a polygonal domain, and this is done in such a way that each pruned subtree is associated with a piecewise linear filter. A training sequence is used to adaptively update the filter coefficients and domains at each node, and to select the best pruned subtree and the corresponding piecewise linear filter. The tree structured approach offers several advantages. First, it makes use of standard linear adaptive filtering techniques at each node to find the corresponding Conditional linear filter. Second, it allows for efficient selection of the subtree and the corresponding piecewise linear filter of appropriate complexity. Overall, the approach is computationally efficient and conceptually simple. The tree-structured piecewise linear adaptive filter bears some similarity to classification and regression trees. But it is actually quite different from a classification and regression tree. Here the terminal nodes are not just assigned a region and a class label or a regression value, but rather represent: a linear filter with restricted domain, It is also different in that classification and regression trees are determined in a batch mode offline, whereas the tree-structured adaptive filter is determined recursively in real-time. We first develop the specific structure of a tree-structured piecewise linear adaptive filter and derive a stochastic gradient-based training algorithm. We then carry out a rigorous convergence analysis of the proposed training algorithm for the tree-structured filter. Here we show the mean-square convergence of the adaptively trained tree-structured piecewise linear filter to the optimal tree-structured piecewise linear filter. Same new techniques are developed for analyzing stochastic gradient algorithms with fixed gains and (nonstandard) dependent data. Finally, numerical experiments are performed to show the computational and performance advantages of the tree-structured piecewise linear filter over linear and polynomial filters for equalization of high frequency channels with severe intersymbol interference, echo cancellation in telephone networks and predictive coding of speech signals

    Perceptual techniques in audio quality assessment

    Get PDF

    Reduced Fuzzy Recursive Least-Squares Algorithm For Real Time Estimation

    Get PDF
    Adaptive filtering is an online signal processing application that is capable of estimating parameters for the characterization of a real time system. The estimation is done based on the mean-square error (MSE) minimization of the difference between some desired output and the output of the adaptive filter. The unknown transfer function is assumed to be a known structure and is realized by three different structures which are commonly used in conventional adaptive filtering: (i) the finite-duration impulse response (FIR) filter, (ii) the infinite-duration impulse response (IIR) filter, and (iii) the nonlinear filter

    An investigation into glottal waveform based speech coding

    Get PDF
    Coding of voiced speech by extraction of the glottal waveform has shown promise in improving the efficiency of speech coding systems. This thesis describes an investigation into the performance of such a system. The effect of reverberation on the radiation impedance at the lips is shown to be negligible under normal conditions. Also, the accuracy of the Image Method for adding artificial reverberation to anechoic speech recordings is established. A new algorithm, Pre-emphasised Maximum Likelihood Epoch Detection (PMLED), for Glottal Closure Instant detection is proposed. The algorithm is tested on natural speech and is shown to be both accurate and robust. Two techniques for giottai waveform estimation, Closed Phase Inverse Filtering (CPIF) and Iterative Adaptive Inverse Filtering (IAIF), are compared. In tandem with an LF model fitting procedure, both techniques display a high degree of accuracy However, IAIF is found to be slightly more robust. Based on these results, a Glottal Excited Linear Predictive (GELP) coding system for voiced speech is proposed and tested. Using a differential LF parameter quantisation scheme, the system achieves speech quality similar to that of U S Federal Standard 1016 CELP at a lower mean bit rate while incurring no extra delay

    <strong>Non-Gaussian, Non-stationary and Nonlinear Signal Processing Methods - with Applications to Speech Processing and Channel Estimation</strong>

    Get PDF

    Machine Learning in Digital Signal Processing for Optical Transmission Systems

    Get PDF
    The future demand for digital information will exceed the capabilities of current optical communication systems, which are approaching their limits due to component and fiber intrinsic non-linear effects. Machine learning methods are promising to find new ways of leverage the available resources and to explore new solutions. Although, some of the machine learning methods such as adaptive non-linear filtering and probabilistic modeling are not novel in the field of telecommunication, enhanced powerful architecture designs together with increasing computing power make it possible to tackle more complex problems today. The methods presented in this work apply machine learning on optical communication systems with two main contributions. First, an unsupervised learning algorithm with embedded additive white Gaussian noise (AWGN) channel and appropriate power constraint is trained end-to-end, learning a geometric constellation shape for lowest bit-error rates over amplified and unamplified links. Second, supervised machine learning methods, especially deep neural networks with and without internal cyclical connections, are investigated to combat linear and non-linear inter-symbol interference (ISI) as well as colored noise effects introduced by the components and the fiber. On high-bandwidth coherent optical transmission setups their performances and complexities are experimentally evaluated and benchmarked against conventional digital signal processing (DSP) approaches. This thesis shows how machine learning can be applied to optical communication systems. In particular, it is demonstrated that machine learning is a viable designing and DSP tool to increase the capabilities of optical communication systems

    A non-linear polynomial approximation filter for robust speaker verification

    Get PDF
    Bibliography: leaves 101-109
    corecore