51 research outputs found

Gaussian Mixture Model-based Quantization of Line Spectral Frequencies for Adaptive Multirate Speech Codec

Author: Davor Petrinović
Tihomir Tadić
Publication venue: 'University of Zagreb - University Computing Centre'
Publication date: 01/01/2011

In this paper, we investigate the use of a Gaussian Mixture Model (GMM)-based quantizer for quantization of the Line Spectral Frequencies (LSFs) in the Adaptive Multi-Rate (AMR) speech codec. We estimate the parametric GMM of the probability density function (pdf) of the prediction error (residual) of the mean-removed LSF parameters that the AMR codec uses to represent the speech spectral envelope. The studied GMM-based quantizer is based on transform coding using the Karhunen-Loeve transform (KLT) and transform-domain scalar quantizers (SQ) individually designed for each Gaussian mixture component. We have investigated the applicability of such a quantization scheme in the existing AMR codec by replacing only the AMR LSF quantization algorithm segment. The main novelty in this paper lies in applying and adapting entropy-constrained (EC) coding for fixed-rate scalar quantization of transformed residuals, thereby allowing for better adaptation to the local statistics of the source. We study and evaluate the compression efficiency, computational complexity and memory requirements of the proposed algorithm. Experimental results show that the GMM-based EC quantizer provides better rate/distortion performance than the quantization schemes used in the reference AMR codec, saving up to 7.32 bits/frame at much lower rate-independent computational complexity and memory requirements.
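The transform-coding step named in this abstract (KLT followed by scalar quantization) can be illustrated with a minimal sketch. It is not the authors' quantizer: it assumes a single Gaussian component instead of a full GMM, works on 2-D vectors so the KLT reduces to one rotation angle, and uses a fixed uniform step in place of entropy-constrained design. The function names, the training data and the step size are hypothetical.

```python
import math

def fit_klt_2d(training):
    """Estimate the mean and the KLT rotation angle of 2-D training vectors."""
    n = len(training)
    mx = sum(p[0] for p in training) / n
    my = sum(p[1] for p in training) / n
    cxx = sum((p[0] - mx) ** 2 for p in training) / n
    cyy = sum((p[1] - my) ** 2 for p in training) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in training) / n
    # Angle of the principal axis of the 2x2 sample covariance matrix.
    angle = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    return (mx, my), angle

def quantize(vec, mean, angle, step):
    """Remove the mean, rotate into the KLT domain, quantize each coefficient."""
    x, y = vec[0] - mean[0], vec[1] - mean[1]
    c, s = math.cos(angle), math.sin(angle)
    t0 = c * x + s * y       # coefficient along the principal axis
    t1 = -s * x + c * y      # coefficient along the orthogonal axis
    return (round(t0 / step), round(t1 / step))

def reconstruct(indices, mean, angle, step):
    """Dequantize the coefficients, rotate back, and restore the mean."""
    t0, t1 = indices[0] * step, indices[1] * step
    c, s = math.cos(angle), math.sin(angle)
    return (c * t0 - s * t1 + mean[0], s * t0 + c * t1 + mean[1])

# Hypothetical training data lying on the line y = x (perfectly correlated).
training = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
mean, angle = fit_klt_2d(training)
step = 0.1  # assumed uniform step size
indices = quantize((2.3, 1.9), mean, angle, step)
approx = reconstruct(indices, mean, angle, step)
error = math.hypot(approx[0] - 2.3, approx[1] - 1.9)
```

Because the rotation is orthonormal, the reconstruction error equals the quantization error in the transform domain, so it is bounded by `step / sqrt(2)` for two coefficients. A GMM-based scheme would repeat this per mixture component, each with its own mean and transform.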

A Parametric Approach for Efficient Speech Storage, Flexible Synthesis and Voice Conversion

Author: Nurminen Jani
Publication venue: Tampere University of Technology
Publication date: 01/01/2013

During the past decades, many areas of speech processing have benefited from the vast increases in the available memory sizes and processing power. For example, speech recognizers can be trained with enormous speech databases and high-quality speech synthesizers can generate new speech sentences by concatenating speech units retrieved from a large inventory of speech data. However, even in today's world of ever-increasing memory sizes and computational resources, there are still lots of embedded application scenarios for speech processing techniques where the memory capacities and the processor speeds are very limited. Thus, there is still a clear demand for solutions that can operate with limited resources, e.g., on low-end mobile devices. This thesis introduces a new segmental parametric speech codec referred to as the VLBR codec. The novel proprietary sinusoidal speech codec designed for efficient speech storage is capable of achieving relatively good speech quality at compression ratios beyond the ones offered by the standardized speech coding solutions, i.e., at bitrates of approximately 1 kbps and below. The efficiency of the proposed coding approach is based on model simplifications, mode-based segmental processing, and the method of adaptive downsampling and quantization. The coding efficiency is also further improved using a novel flexible multi-mode matrix quantizer structure and enhanced dynamic codebook reordering. The compression is also facilitated using a new perceptual irrelevancy removal method. The VLBR codec is also applied to text-to-speech synthesis. In particular, the codec is utilized for the compression of unit selection databases and for the parametric concatenation of speech units. It is also shown that the efficiency of the database compression can be further enhanced using speaker-specific retraining of the codec. 
Moreover, the computational load is significantly decreased using a new compression-motivated scheme for very fast and memory-efficient calculation of concatenation costs, based on techniques and implementations used in the VLBR codec. Finally, the VLBR codec and the related speech synthesis techniques are complemented with voice conversion methods that allow modifying the perceived speaker identity, which in turn enables, e.g., cost-efficient creation of new text-to-speech voices. The VLBR-based voice conversion system combines compression with the popular Gaussian mixture model based conversion approach. Furthermore, a novel method is proposed for converting the prosodic aspects of speech. The performance of the VLBR-based voice conversion system is also enhanced using a new approach for mode selection and through explicit control of the degree of voicing. The solutions proposed in the thesis together form a complete system that can be utilized in different ways and configurations. The VLBR codec itself can be utilized, e.g., for efficient compression of audio books, and the speech synthesis related methods can be used for reducing the footprint and the computational load of concatenative text-to-speech synthesizers to levels required in some embedded applications. The VLBR-based voice conversion techniques can be used to complement the codec both in storage applications and in connection with speech synthesis. It is also possible to utilize only the voice conversion functionality, e.g., in games or other entertainment applications.

Apparatus And Quality Enhancement Algorithm For Mixed Excitation Linear Predictive (MELP) And Other Speech Coders


A system and method for enhancing the speech quality of the mixed excitation linear predictive (MELP) coder and other low bit-rate speech coders. The system and method employ a plosive analysis/synthesis method, which detects the frame containing a plosive signal, applies a simple model to synthesize the plosive signal, and adds the synthesized plosive to the coded speech. The system and method remain compatible with the existing MELP coder bit stream. Georgia Tech Research Corporation

Energy Based Split Vector Quantizer Employing Signal Representation in Multiple Transform Domains.

Author: Krishnan Venkatesh
Mikhael Wasfy
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 18/12/2007

This invention relates to the representation of one- and multidimensional signal vectors in nonorthogonal domains and the design of Vector Quantizers that can be chosen among these representations. A Vector Quantization technique in multiple nonorthogonal domains is presented for both waveform and model based signal characterization. An iterative codebook accuracy enhancement algorithm, applicable to both waveform and model based Vector Quantization in multiple nonorthogonal domains, which yields further improvement in signal coding performance, is disclosed. Further, Vector Quantization in nonorthogonal domains is applied to speech and exhibits clear improvements in reconstruction quality for the same bit rate compared to existing single domain Vector Quantization techniques. The technique disclosed herein can be easily extended to several other one- and multidimensional signal classes.

Spectral Envelope Modelling for Full-Band Speech Coding

Author: Moradiashour Chamran
Publication venue
Publication date: 12/12/2016

Speech coding, historically considered narrow-band, has been significantly improved in recent years by widening the coded audio bandwidth. However, existing speech coders still employ a limited-band source-filter model extended by parametric coding of the higher band. In this thesis, a full-band source-filter model is considered, in particular its spectral magnitude envelope modelling. To match the full-band operating mode, we modified, tuned and compared two methods, Linear Predictive Coding (LPC) and Distribution Quantization (DQ). LPC uses autoregressive modeling, while DQ quantifies the energy ratios between parts of the spectrum. Parameters of both methods were quantized with multi-stage vector quantization. Objective and subjective evaluations indicate that the two methods, used in a full-band source-filter coding scheme, perform in the same range and are competitive against conventional speech coders that require an extra bandwidth extension.
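The multi-stage vector quantization mentioned in this abstract can be sketched in general terms: each stage quantizes the residual left by the previous stage, and the decoder sums the selected codewords. The sketch below is a generic illustration with a greedy stage-by-stage search, not the thesis's implementation. The codebooks are hypothetical toy values, and practical coders often use a multi-candidate search instead.

```python
def nearest(vec, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])),
    )

def msvq_encode(vec, stages):
    """Greedy multi-stage VQ: quantize, subtract, pass the residual on."""
    residual = list(vec)
    indices = []
    for codebook in stages:
        i = nearest(residual, codebook)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices

def msvq_decode(indices, stages):
    """Sum the selected codeword of every stage."""
    out = [0.0] * len(stages[0][0])
    for i, codebook in zip(indices, stages):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

# Hypothetical two-stage codebooks: a coarse stage and a refinement stage.
stages = [
    [(0.0, 0.0), (4.0, 4.0)],
    [(-1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, -1.0)],
]
indices = msvq_encode((4.8, 4.1), stages)
decoded = msvq_decode(indices, stages)
```

With these toy codebooks, the first stage selects the coarse codeword `(4.0, 4.0)` and the second stage refines the residual with `(1.0, 0.0)`. Only the two indices need to be transmitted, which is why multi-stage structures keep codebook storage and search cost low compared with a single large codebook.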

Error Correction For Automotive Telematics Systems

Author: Zakhem Samer
Publication venue: DigitalCommons@WayneState
Publication date: 01/01/2010

One benefit of data communication over the voice channel of the cellular network is the reliable transmission of real-time, high-priority data in life-critical situations. An important implementation of this use case is the pan-European eCall automotive standard, which has been deployed since 2018. This is the first international standard for mobile emergency calls that was adopted by multiple regions in Europe and the world. Other countries, such as Russia and China, are currently working on deploying similar emergency communication systems. Moreover, many experiments and road tests are conducted yearly to validate and improve the requirements of the system. The results have proven that the requirements are unachievable thus far, with a success rate of emergency data delivery of only 70%. The eCall in-band modem transmits emergency information from the in-vehicle system (IVS) over the voice channel of the circuit-switched real-time communication system to the public safety answering point (PSAP) in case of a collision. The voice channel is characterized by the non-linear vocoder, which is designed to compress speech waveforms. In addition, multipath fading, caused by the surrounding buildings and hills, results in severe signal distortion and causes delays in the transmission of the emergency information. Therefore, to reliably transmit data over the voice channels, the in-band modem modulates the data into speech-like (SL) waveforms and employs a powerful forward error correcting (FEC) code to secure the real-time transmission. In this dissertation, the Turbo-coded performance of the eCall in-band modem is first evaluated through the additive white Gaussian noise (AWGN) channel and the adaptive multi-rate (AMR) voice channel. The modulation used is biorthogonal pulse position modulation (BPPM). Simulations are conducted for both the fast and robust eCall modem.
The results show that the distortion added by the vocoder is significantly large and degrades the system performance. In addition, the robust modem performs better than the fast modem. For instance, to achieve a bit error rate (BER) of 10^{-6} using the AMR compression rate of 7.4 kbps, the signal-to-noise ratio (SNR) required is 5.5 dB for the robust modem, while an SNR of 7.5 dB is required for the fast modem. On the other hand, the fading effect is studied in the eCall channel. It was shown that the fading distribution does not follow a Rayleigh distribution. The performance of the in-band modem is evaluated through the AWGN, AMR and fading channel. The results are compared with a Rayleigh fading channel. The analysis shows that strong fading still exists in the voice channel after power control. The results explain the large delays and failure of the emergency data transmission to the PSAP. Thus, the eCall standard needs to re-evaluate its requirements in order to consider the impact of fading on the transmission of the modulated signals. The results can be directly applied to design real-time emergency communication systems, including modulation and coding.
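As a quick arithmetic check on the SNR figures quoted in this abstract, the 2 dB gap between the fast modem (7.5 dB) and the robust modem (5.5 dB) can be converted to a linear power ratio. The sketch below uses only the standard decibel definition. The helper name is hypothetical and nothing here comes from the dissertation's simulations.

```python
def db_to_power_ratio(db):
    """Convert a decibel value to a linear power ratio: 10^(dB/10)."""
    return 10.0 ** (db / 10.0)

snr_fast_db = 7.5     # SNR quoted for the fast modem at BER 1e-6
snr_robust_db = 5.5   # SNR quoted for the robust modem at BER 1e-6

gap_db = snr_fast_db - snr_robust_db
power_ratio = db_to_power_ratio(gap_db)
print(f"gap = {gap_db:.1f} dB, linear power ratio = {power_ratio:.3f}")
```

Under these figures, the fast modem needs roughly 1.58 times the signal power of the robust modem, relative to the noise, to reach the same bit error rate.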

Directory of Open Access Journals

Digital Commons@Wayne State University

Estimation of Frame Independent and Enhancement Components for Speech Communication over Packet Networks

Author: Christensen Mads Græsbøll
Giacobello Daniele
Jensen Søren Holdt
Moonen Marc
Murthi Manohar N.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010

In this paper, we describe a new approach to cope with packet loss in speech coders. The idea is to split the information present in each speech packet into two components, one to independently decode the given speech frame and one to enhance it by exploiting interframe dependencies. The scheme is based on sparse linear prediction and a redefinition of the analysis-by-synthesis process. We present Mean Opinion Scores for the presented coder with different degrees of packet loss and show that it performs similarly to frame dependent coders for low packet loss probability and similarly to frame independent coders for high packet loss probability. We also present ideas on how to make the coder work synergistically with the channel loss estimate.

University of Miami: Scholarship Miami

VBN

Error Correction For Automotive Telematics Systems

Author: Zakhem Samer
Publication venue: DigitalCommons@WayneState
Publication date: 01/01/2020

Digital Commons@Wayne State University

Speech coding

Author: Ravishankar C., Hughes Network Systems, Germantown, MD
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 08/05/1998

Speech is the predominant means of communication between human beings, and since the invention of the telephone by Alexander Graham Bell in 1876, speech services have remained the core service in almost all telecommunication systems. Original analog methods of telephony had the disadvantage of the speech signal getting corrupted by noise, cross-talk and distortion. Long-haul transmissions, which use repeaters to compensate for the loss in signal strength on transmission links, also increase the associated noise and distortion. On the other hand, digital transmission is relatively immune to noise, cross-talk and distortion, primarily because of the capability to faithfully regenerate the digital signal at each repeater purely based on a binary decision. Hence, the end-to-end performance of the digital link essentially becomes independent of the length and operating frequency bands of the link. From a transmission point of view, digital transmission has therefore been the preferred approach due to its higher immunity to noise. The need to carry digital speech became extremely important from a service provision point of view as well. Modern requirements have introduced the need for robust, flexible and secure services that can carry a multitude of signal types (such as voice, data and video) without a fundamental change in infrastructure. Such a requirement could not have been easily met without the advent of digital transmission systems, thereby requiring speech to be coded digitally. The term speech coding often refers to techniques that represent or code speech signals either directly as a waveform or as a set of parameters obtained by analyzing the speech signal. In either case, the codes are transmitted to the distant end, where speech is reconstructed or synthesized using the received set of codes. A more generic term that is often used interchangeably with speech coding is voice coding.
This term is more generic in the sense that the coding techniques are equally applicable to any voice signal, whether or not it carries any intelligible information, as the term speech implies. Other terms that are commonly used are speech compression and voice compression, since the fundamental idea behind speech coding is to reduce (compress) the transmission rate (or, equivalently, the bandwidth) and/or reduce storage requirements. In this document, the terms speech and voice shall be used interchangeably.

UNT Digital Library

Exploiting deep learning in limited-fronthaul cell-free massive MIMO uplink

Author: Akbari Ali
Bashar Manijeh
Burr Alister G.
Cumanan Kanapathippillai
Debbah Merouane
Kittler Josef
Ngo Hien-Quoc
Xiao Pei
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 19/02/2020

A cell-free massive multiple-input multiple-output (MIMO) uplink is considered, where quantize-and-forward (QF) refers to the case where both the channel estimates and the received signals are quantized at the access points (APs) and forwarded to a central processing unit (CPU), whereas in combine-quantize-and-forward (CQF), the APs send the quantized version of the combined signal to the CPU. To solve the non-convex sum rate maximization problem, a heuristic sub-optimal scheme is exploited to convert the power allocation problem into a standard geometric programme (GP). We exploit the knowledge of the channel statistics to design the power elements. Employing large-scale fading (LSF) with a deep convolutional neural network (DCNN) enables us to determine a mapping from the LSF coefficients to the optimal power obtained by solving the sum rate maximization problem using the quantized channel. Four possible power control schemes are studied, which we refer to as i) small-scale fading (SSF)-based QF; ii) LSF-based CQF; iii) LSF use-and-then-forget (UatF)-based QF; and iv) LSF deep learning (DL)-based QF, according to where channel estimation is performed and exploited and how the optimization problem is solved. Numerical results show that for the same fronthaul rate, the throughput significantly increases thanks to the mapping obtained using DCNN.

Surrey Research Insight