38 research outputs found
Spectral Envelope Modelling for Full-Band Speech Coding
Speech coding, historically considered narrow-band, has improved significantly in recent years through widening of the coded audio bandwidth. However, existing speech coders still employ a limited-band source-filter model extended by parametric coding of the higher band. This thesis considers a full-band source-filter model, focusing on the modelling of its spectral magnitude envelope.
To match the full-band operating mode, we modified, tuned and compared two methods: Linear Predictive Coding (LPC) and Distribution Quantization (DQ). LPC uses autoregressive modelling, while DQ quantifies the energy ratios between parts of the spectrum. The parameters of both methods were quantized with multi-stage vector quantization. Objective and subjective evaluations indicate that the two methods, used in a full-band source-filter coding scheme, perform in the same range and are competitive with conventional speech coders that require an extra bandwidth extension.
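As an illustration of the autoregressive modelling that LPC relies on, the following is a minimal sketch of the textbook autocorrelation method with the Levinson-Durbin recursion. It is not the thesis's implementation: the function names `autocorrelation` and `levinson_durbin`, the biased autocorrelation estimate and the sign convention A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p are assumptions made for this example.

```python
def autocorrelation(frame, order):
    """Biased autocorrelation r[0..order] of one frame (hypothetical helper)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for autocorrelation values r[0..order].

    Returns (a, err): a[0] == 1.0 and a[1..order] are the coefficients of
    the prediction-error filter A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p;
    err is the final prediction-error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for model order m.
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err
        prev = a[:]
        # Update the lower-order coefficients, then append the new one.
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m
    return a, err


# Autocorrelation of an ideal first-order AR process with pole 0.5.
coeffs, residual_energy = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this first-order input the recursion assigns the whole prediction to the first coefficient and leaves the second at zero, which is the expected behaviour when the model order exceeds the order of the underlying process.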
New algebraic vector quantization techniques based on Voronoi coding: application to AMR-WB+ coding
This thesis studies lattice (vector) quantization and its application to the multi-mode ACELP/TCX audio coding model. The ACELP/TCX model is one possible solution to the problem of universal audio coding, where universal coding means a unified, good-quality representation of speech and music signals at various bit rates and sampling frequencies. The applications considered here are the quantization of linear prediction coefficients and, above all, transform coding within the TCX model; the application to TCX coding is of strong practical interest, because the TCX model largely determines the universal character of ACELP/TCX coding. Lattice quantization is a constrained quantization technique that exploits the linear structure of regular lattices. Compared with unstructured vector quantization, it has always been regarded as a promising technique because of its reduced complexity (in storage and amount of computation). We show here that it has other important advantages: it makes it possible to build efficient codes in relatively high dimension and at arbitrarily high bit rate, suited to multi-rate coding (transform-based or otherwise); moreover, it allows the distortion to be reduced to the granular error alone, at the cost of variable-rate coding. Several lattice quantization techniques are presented in this thesis, all of them built on Voronoi coding. Quasi-ellipsoidal Voronoi coding is suited to coding a Gaussian vector source in the context of parametric coding of linear prediction coefficients using a Gaussian mixture model. Multi-rate vector quantization by Voronoi extension or by Voronoi coding with adaptive truncation is suited to multi-rate transform audio coding.
The application of multi-rate vector quantization to TCX coding is studied in particular. A new algebraic coding technique for the TCX target is thus designed based on the principle of bit allocation by reverse water-filling.
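The reverse water-filling principle named in this abstract can be illustrated with the textbook rate-distortion allocation for independent Gaussian components. The sketch below is that generic formulation, not the TCX-specific technique of the thesis: the function name `reverse_water_filling`, the bisection search and the example variances are assumptions made for illustration.

```python
import math


def reverse_water_filling(variances, target_distortion, iters=100):
    """Textbook reverse water-filling for independent Gaussian components.

    Finds the water level theta such that sum(min(theta, v)) equals the
    target distortion, then allocates max(0, 0.5 * log2(v / theta)) bits
    to each component. Components whose variance lies below the water
    level receive no bits.
    """
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    if not 0 < target_distortion <= sum(variances):
        raise ValueError("target distortion must lie in (0, sum(variances)]")
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        # The total distortion grows with theta, so plain bisection works.
        if sum(min(theta, v) for v in variances) < target_distortion:
            lo = theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return theta, rates


# Three components with decreasing variance and a total distortion budget.
theta, rates = reverse_water_filling([4.0, 1.0, 0.25], target_distortion=1.25)
```

In this example the weakest component falls below the water level and gets no bits, while the two stronger components share the budget in proportion to the logarithm of their variance.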
Speech coding
Speech is the predominant means of communication between human beings, and since the invention of the telephone by Alexander Graham Bell in 1876, speech services have remained the core service in almost all telecommunication systems. Original analog methods of telephony had the disadvantage that the speech signal was corrupted by noise, cross-talk and distortion. Long-haul transmissions, which use repeaters to compensate for the loss in signal strength on transmission links, also increase the associated noise and distortion. Digital transmission, on the other hand, is relatively immune to noise, cross-talk and distortion, primarily because the digital signal can be faithfully regenerated at each repeater purely on the basis of a binary decision. Hence the end-to-end performance of a digital link becomes essentially independent of the length and operating frequency bands of the link, and from a transmission point of view digital transmission has been the preferred approach because of its higher immunity to noise. The need to carry digital speech became extremely important from a service provision point of view as well. Modern requirements have introduced the need for robust, flexible and secure services that can carry a multitude of signal types (such as voice, data and video) without a fundamental change in infrastructure. Such a requirement could not have been easily met without the advent of digital transmission systems, thereby requiring speech to be coded digitally. The term speech coding often refers to techniques that represent or code speech signals either directly as a waveform or as a set of parameters obtained by analyzing the speech signal. In either case, the codes are transmitted to the distant end, where speech is reconstructed or synthesized using the received set of codes. A more generic term that is often used interchangeably with speech coding is voice coding.
This term is more generic in the sense that the coding techniques are equally applicable to any voice signal, whether or not it carries any intelligible information, as the term speech implies. Other commonly used terms are speech compression and voice compression, since the fundamental idea behind speech coding is to reduce (compress) the transmission rate (or, equivalently, the bandwidth) and/or the storage requirements. In this document the terms speech and voice are used interchangeably.
Machine learning for cognitive speech coding
Since the 1980s, speech codecs have relied on short-term coding strategies that operate at the subframe or frame level (typically 5 to 20 ms). Researchers have essentially adjusted and combined a limited number of available technologies (transform, linear prediction, quantization) and strategies (waveform matching, noise shaping) to build increasingly complex coding architectures.
In this thesis, rather than relying on short-term coding strategies, we develop an alternative framework for speech compression by encoding speech attributes, which are perceptually important characteristics of speech signals. To achieve this objective, we solve three problems of increasing complexity, namely classification, prediction and representation learning. Classification is a common element in modern codec designs. In a first step, we design a classifier to identify emotions, which are among the most complex long-term speech attributes. In a second step, we design a speech sample predictor, another common element in modern codec designs, to highlight the benefits of long-term and non-linear speech signal processing. Then, we explore latent variables, a space of speech representations, to encode both short-term and long-term speech attributes. Lastly, we propose a decoder network to synthesize speech signals from these representations, which constitutes our final step towards building a complete, end-to-end machine-learning-based speech compression method.
Although each development step proposed in this thesis could be part of a codec on its own, each step also provides insights and a basis for the next one, until a fully machine-learning-based codec is reached.
The first two steps, classification and prediction, provide new tools that could replace and improve elements of existing codecs. In the first step, we use a combination of a source-filter model and a liquid state machine (LSM) to demonstrate that features related to emotions can be easily extracted and classified using a simple classifier. In the second step, a single end-to-end network using long short-term memory (LSTM) is shown to produce speech frames with high subjective quality for packet loss concealment (PLC) applications.
In the last steps, we build upon the results of the previous steps to design a fully machine-learning-based codec. An encoder network, formulated using a deep neural network (DNN) and trained on multiple public databases, extracts and encodes speech representations using prediction in a latent space. An unsupervised learning approach based on several principles of cognition is proposed to extract representations from both short and long frames of speech using mutual information and contrastive loss. The ability of these learned representations to capture various short- and long-term speech attributes is demonstrated.
Finally, a decoder structure is proposed to synthesize speech signals from these representations. Adversarial training is used as an approximation to subjective speech quality measures in order to synthesize natural-sounding speech samples. The high perceptual quality of the synthesized speech thus achieved proves that the extracted representations are efficient at preserving all sorts of speech attributes, and therefore that a complete compression method is demonstrated with the proposed approach.
Multi-rate excitation using tree-structured vector quantization for the CS-ACELP coder with application to VoIP
Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico. Programa de Pós-Graduação em Engenharia Elétrica.
This work presents a study of multi-rate coding built on the CS-ACELP (Conjugate-Structure Algebraic-Code-Excited Linear-Prediction) algorithm and the G.729 standard. Its purpose is to propose a variable-rate codec for VoIP (Voice-over-IP) applications by searching for the best fixed excitation with a tree-structured codebook. The progressive move of voice transport from circuit-switched networks to IP (Internet Protocol) networks, despite its many positive aspects, has exposed some intrinsic deficiencies of the latter, which are better suited to best-effort traffic than to traffic with timing requirements. This proposal belongs to the set of sender-side initiatives that seek to minimize the harmful effects of the network on the quality of the reconstructed voice. The proposed codebook has a binary tree structure, conceived from a heuristic in which the algebraic CS-ACELP vectors are ordered by decreasing value. In addition, a particular strategy for storing the tree nodes, involving centroid simplification, differential coding and automatic generation of the last two levels of the tree, reduces the storage space from 640 to just 7 kwords. This model yields 13 coding rates, from 5.6 to 8.0 kbit/s in steps of 0.2 kbit/s. At 5.6 kbit/s the signal-to-noise ratio is 1.5 dB below the same measure for the G.729 standard, and at 8.0 kbit/s it is just 0.6 dB below. Subjective tests showed quite acceptable quality at the minimum rate and quality virtually indistinguishable from the original codec at the maximum rate. In addition, the search for the best fixed excitation is 2.4 times faster than in the G.729 codec, and the coder can be fully compatible with G.729 if the rate is fixed at 8.0 kbit/s.
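To make the idea of a binary tree-structured codebook search concrete, here is a minimal sketch of a generic tree-structured vector quantizer. It does not reproduce the codebook of the dissertation: the heap-style node layout, the function names `sq_dist` and `tree_search`, the toy one-dimensional centroids and the link between search depth and bit rate are all assumptions made for this example.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def tree_search(centroids, x, depth):
    """Greedy descent through a heap-ordered binary tree of centroids.

    Node i has children 2*i + 1 and 2*i + 2 (an assumed layout). At each
    level the closer child is chosen and one bit is emitted, so stopping
    at a shallower depth gives a shorter index and a coarser match.
    Returns (bits, node_index).
    """
    node = 0
    bits = []
    for _ in range(depth):
        left, right = 2 * node + 1, 2 * node + 2
        go_right = sq_dist(x, centroids[right]) < sq_dist(x, centroids[left])
        bits.append(int(go_right))
        node = right if go_right else left
    return bits, node


# Toy one-dimensional tree with two levels below the root (illustrative only).
toy_tree = [[0.0], [-1.0], [1.0], [-1.5], [-0.5], [0.5], [1.5]]
coarse_bits, coarse_node = tree_search(toy_tree, [0.6], depth=1)
fine_bits, fine_node = tree_search(toy_tree, [0.6], depth=2)
```

A descent of this kind compares only two candidates per level instead of scanning every codevector, which is the usual reason tree-structured codebooks are faster to search than flat ones.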
Codec Detection from Speech
This thesis deals with codec detection from compressed speech signals. The primary goal was to identify which features distinguish the selected codecs, and then to create an environment that facilitates experiments with various types of classifiers and their configurations. Support vector machines and, above all, neural networks built with the Keras library were used. The main contribution of this work is the experimental part, which analyzes the effect of various neural network parameters. After the most suitable combination of parameters was found, the network achieved a classification accuracy of over 98% on a test set comprising data from six different codecs.
Frequency Domain Methods for Coding the Linear Predictive Residual of Speech Signals
The most frequently used speech coding paradigm is ACELP, well known because it encodes speech with high quality while consuming little bandwidth. ACELP performs linear prediction filtering to remove the effect of the spectral envelope from the signal. The noise-like excitation is then encoded using algebraic codebooks. The search of this codebook, however, cannot be performed optimally with conventional encoders because of the correlation between the samples, so more complex algorithms are required to maintain quality. Four different transformation algorithms (DCT, DFT, eigenvalue decomposition and Vandermonde decomposition) have been implemented to decorrelate the samples of the innovative excitation in ACELP. These transformations have been integrated into the ACELP mode of the EVS codec. The transformed innovative excitation is coded using the envelope-based arithmetic coder. Objective and subjective tests have been carried out to evaluate the quality of the encoding, the degree of decorrelation achieved by the transformations and the computational complexity of the algorithms.