51 research outputs found

    Gaussian Mixture Model-based Quantization of Line Spectral Frequencies for Adaptive Multirate Speech Codec

    In this paper, we investigate the use of a Gaussian Mixture Model (GMM)-based quantizer for quantization of the Line Spectral Frequencies (LSFs) in the Adaptive Multi-Rate (AMR) speech codec. We estimate the parametric GMM of the probability density function (pdf) for the prediction error (residual) of the mean-removed LSF parameters that the AMR codec uses to represent the speech spectral envelope. The studied GMM-based quantizer is based on transform coding using the Karhunen-Loeve transform (KLT) and transform-domain scalar quantizers (SQ) individually designed for each Gaussian mixture component. We have investigated the applicability of such a quantization scheme in the existing AMR codec by replacing only the AMR LSF quantization algorithm segment. The main novelty in this paper lies in applying and adapting entropy-constrained (EC) coding for fixed-rate scalar quantization of transformed residuals, thereby allowing for better adaptation to the local statistics of the source. We study and evaluate the compression efficiency, computational complexity and memory requirements of the proposed algorithm. Experimental results show that the GMM-based EC quantizer provides better rate/distortion performance than the quantization schemes used in the reference AMR codec, saving up to 7.32 bits/frame at much lower rate-independent computational complexity and memory requirements.
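The KLT-plus-scalar-quantizer idea in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes 2-dimensional vectors, uniform scalar quantizers whose step sizes are a hypothetical `scale` times the standard deviation along each KLT axis, and a simple minimum-distortion choice of mixture component. Entropy-constrained coding and bit allocation are left out, and all function names are illustrative.

```python
import math

def klt_2d(cov):
    """Diagonalize a symmetric 2x2 covariance; return (cos, sin, eigenvalues)."""
    (a, b), (_, c) = cov
    theta = 0.5 * math.atan2(2.0 * b, a - c)  # rotation angle of the KLT basis
    cs, sn = math.cos(theta), math.sin(theta)
    lam1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn  # variance along axis 1
    lam2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs  # variance along axis 2
    return cs, sn, (lam1, lam2)

def encode(x, mean, cov, scale=0.25):
    """Remove the mean, rotate into the KLT domain, quantize each coefficient."""
    cs, sn, lams = klt_2d(cov)
    d0, d1 = x[0] - mean[0], x[1] - mean[1]
    y = (cs * d0 + sn * d1, -sn * d0 + cs * d1)
    # Assumed step-size rule: proportional to the per-axis standard deviation.
    steps = tuple(scale * math.sqrt(lam) for lam in lams)
    indices = tuple(round(yk / sk) for yk, sk in zip(y, steps))
    return indices, steps

def decode(indices, steps, mean, cov):
    """Dequantize, rotate back to the original domain, add the mean."""
    cs, sn, _ = klt_2d(cov)
    y0, y1 = indices[0] * steps[0], indices[1] * steps[1]
    return (mean[0] + cs * y0 - sn * y1, mean[1] + sn * y0 + cs * y1)

def quantize_with_gmm(x, components, scale=0.25):
    """Try every mixture component and keep the one with the lowest distortion."""
    best = None
    for m, (mean, cov) in enumerate(components):
        indices, steps = encode(x, mean, cov, scale)
        x_hat = decode(indices, steps, mean, cov)
        err = math.dist(x, x_hat)
        if best is None or err < best[3]:
            best = (m, indices, x_hat, err)
    return best

# Two made-up mixture components: (mean, covariance).
components = [
    ((0.5, -0.2), ((2.0, 0.8), (0.8, 1.0))),
    ((-1.0, 1.0), ((1.0, 0.0), (0.0, 1.0))),
]
component, indices, x_hat, err = quantize_with_gmm((1.3, 0.4), components)
```

Because the KLT is an orthonormal rotation, the reconstruction error in the original domain equals the quantization error in the transform domain. Each scalar quantizer can therefore be designed independently per component.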

    A Parametric Approach for Efficient Speech Storage, Flexible Synthesis and Voice Conversion

    During the past decades, many areas of speech processing have benefited from the vast increases in the available memory sizes and processing power. For example, speech recognizers can be trained with enormous speech databases and high-quality speech synthesizers can generate new speech sentences by concatenating speech units retrieved from a large inventory of speech data. However, even in today's world of ever-increasing memory sizes and computational resources, there are still lots of embedded application scenarios for speech processing techniques where the memory capacities and the processor speeds are very limited. Thus, there is still a clear demand for solutions that can operate with limited resources, e.g., on low-end mobile devices. This thesis introduces a new segmental parametric speech codec referred to as the VLBR codec. The novel proprietary sinusoidal speech codec designed for efficient speech storage is capable of achieving relatively good speech quality at compression ratios beyond the ones offered by the standardized speech coding solutions, i.e., at bitrates of approximately 1 kbps and below. The efficiency of the proposed coding approach is based on model simplifications, mode-based segmental processing, and the method of adaptive downsampling and quantization. The coding efficiency is also further improved using a novel flexible multi-mode matrix quantizer structure and enhanced dynamic codebook reordering. The compression is also facilitated using a new perceptual irrelevancy removal method. The VLBR codec is also applied to text-to-speech synthesis. In particular, the codec is utilized for the compression of unit selection databases and for the parametric concatenation of speech units. It is also shown that the efficiency of the database compression can be further enhanced using speaker-specific retraining of the codec. 
Moreover, the computational load is significantly decreased using a new compression-motivated scheme for very fast and memory-efficient calculation of concatenation costs, based on techniques and implementations used in the VLBR codec. Finally, the VLBR codec and the related speech synthesis techniques are complemented with voice conversion methods that allow modifying the perceived speaker identity which in turn enables, e.g., cost-efficient creation of new text-to-speech voices. The VLBR-based voice conversion system combines compression with the popular Gaussian mixture model based conversion approach. Furthermore, a novel method is proposed for converting the prosodic aspects of speech. The performance of the VLBR-based voice conversion system is also enhanced using a new approach for mode selection and through explicit control of the degree of voicing. The solutions proposed in the thesis together form a complete system that can be utilized in different ways and configurations. The VLBR codec itself can be utilized, e.g., for efficient compression of audio books, and the speech synthesis related methods can be used for reducing the footprint and the computational load of concatenative text-to-speech synthesizers to levels required in some embedded applications. The VLBR-based voice conversion techniques can be used to complement the codec both in storage applications and in connection with speech synthesis. It is also possible to only utilize the voice conversion functionality, e.g., in games or other entertainment applications

    Apparatus And Quality Enhancement Algorithm For Mixed Excitation Linear Predictive (MELP) And Other Speech Coders

    A system and method for enhancing the speech quality of the mixed excitation linear predictive (MELP) coder and other low bit-rate speech coders. The system and method employ a plosive analysis/synthesis method, which detects the frame containing a plosive signal, applies a simple model to synthesize the plosive signal, and adds the synthesized plosive to the coded speech. The system and method remain compatible with the existing MELP coder bit stream. Georgia Tech Research Corporation

    Energy Based Split Vector Quantizer Employing Signal Representation in Multiple Transform Domains.

    This invention relates to the representation of one- and multidimensional signal vectors in nonorthogonal domains and the design of Vector Quantizers that can be chosen among these representations. A Vector Quantization technique in multiple nonorthogonal domains is presented for both waveform and model-based signal characterization. An iterative codebook accuracy enhancement algorithm, applicable to both waveform and model-based Vector Quantization in multiple nonorthogonal domains, is disclosed and yields further improvement in signal coding performance. Further, Vector Quantization in nonorthogonal domains is applied to speech and exhibits clear improvements in reconstruction quality at the same bit rate compared to existing single-domain Vector Quantization techniques. The technique disclosed herein can be easily extended to several other one- and multidimensional signal classes.

    Spectral Envelope Modelling for Full-Band Speech Coding

    Speech coding, historically limited to narrow-band signals, has improved significantly in recent years by widening the coded audio bandwidth. However, existing speech coders still employ a limited-band source-filter model extended by parametric coding of the higher band. In this thesis, a full-band source-filter model is considered, with particular attention to modelling its spectral magnitude envelope. To match the full-band operating mode, we modified, tuned and compared two methods, Linear Predictive Coding (LPC) and Distribution Quantization (DQ). LPC uses autoregressive modeling, while DQ quantifies the energy ratios between parts of the spectrum. Parameters of both methods were quantized with multi-stage vector quantization. Objective and subjective evaluations indicate that the two methods, used in a full-band source-filter coding scheme, perform in the same range and are competitive against conventional speech coders requiring an extra bandwidth extension.
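As background for the autoregressive modeling that LPC relies on, the following sketch shows the textbook Levinson-Durbin recursion, which solves the Yule-Walker equations from an autocorrelation sequence. It is a generic illustration, not the thesis's full-band implementation. The function names and the sign convention (prediction error `e[n] = sum_k a[k] * x[n-k]` with `a[0] = 1`) are assumptions made here.

```python
def autocorrelation(x, order):
    """Biased autocorrelation r[0..order] of a finite sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Return (a, err): prediction-error filter a (a[0] = 1) and residual energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation between the current prediction error and the lag-i sample.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k  # residual energy shrinks at every stage
    return a, err

# Autocorrelation of an ideal first-order AR process with pole at 0.5.
a, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the recursion recovers a first-order predictor: the second coefficient is zero because the autocorrelation is exactly that of an AR(1) process.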

    Error Correction For Automotive Telematics Systems

    One benefit of data communication over the voice channel of the cellular network is the reliable transmission of real-time, high-priority data in life-critical situations. An important implementation of this use case is the pan-European eCall automotive standard, which has been deployed since 2018. This is the first international standard for mobile emergency calls that was adopted by multiple regions in Europe and the world. Other countries, such as Russia and China, are currently working on deploying similar emergency communication systems. Moreover, many experiments and road tests are conducted yearly to validate and improve the requirements of the system. The results have proven that the requirements are unachievable thus far, with a success rate of emergency data delivery of only 70%. The eCall in-band modem transmits emergency information from the in-vehicle system (IVS) over the voice channel of the circuit-switched real-time communication system to the public safety answering point (PSAP) in case of a collision. The voice channel is characterized by the non-linear vocoder, which is designed to compress speech waveforms. In addition, multipath fading, caused by the surrounding buildings and hills, results in severe signal distortion and causes delays in the transmission of the emergency information. Therefore, to reliably transmit data over the voice channel, the in-band modem modulates the data into speech-like (SL) waveforms and employs a powerful forward error correcting (FEC) code to secure the real-time transmission. In this dissertation, the Turbo-coded performance of the eCall in-band modem is first evaluated through the additive white Gaussian noise (AWGN) channel and the adaptive multi-rate (AMR) voice channel. The modulation used is biorthogonal pulse position modulation (BPPM). Simulations are conducted for both the fast and the robust eCall modem. 
The results show that the distortion added by the vocoder is significant and degrades the system performance. In addition, the robust modem performs better than the fast modem. For instance, to achieve a bit error rate (BER) of 10^{-6} using the AMR compression rate of 7.4 kbps, the signal-to-noise ratio (SNR) required is 5.5 dB for the robust modem, while an SNR of 7.5 dB is required for the fast modem. The fading effect in the eCall channel is also studied. It is shown that the fading distribution does not follow a Rayleigh distribution. The performance of the in-band modem is evaluated through the AWGN, AMR and fading channels, and the results are compared with a Rayleigh fading channel. The analysis shows that strong fading still exists in the voice channel after power control. The results explain the large delays and failures of the emergency data transmission to the PSAP. Thus, the eCall standard needs to re-evaluate its requirements in order to consider the impact of fading on the transmission of the modulated signals. The results can be directly applied to the design of real-time emergency communication systems, including modulation and coding.
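To put the quoted SNR figures in perspective, the short sketch below converts the 5.5 dB and 7.5 dB requirements to linear power ratios. Only the two dB values come from the abstract; the helper name `db_to_linear` is illustrative.

```python
def db_to_linear(db):
    """Convert a power ratio expressed in decibels to a linear ratio."""
    return 10.0 ** (db / 10.0)

robust_snr_db = 5.5  # required by the robust modem (from the abstract)
fast_snr_db = 7.5    # required by the fast modem (from the abstract)

# How much more signal power the fast modem needs at the same noise level.
power_ratio = db_to_linear(fast_snr_db) / db_to_linear(robust_snr_db)
```

The 2 dB gap corresponds to a factor of 10^0.2, so the fast modem needs roughly 1.6 times the signal power of the robust modem to reach the same BER.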

    Estimation of Frame Independent and Enhancement Components for Speech Communication over Packet Networks

    In this paper, we describe a new approach to cope with packet loss in speech coders. The idea is to split the information present in each speech packet into two components, one to independently decode the given speech frame and one to enhance it by exploiting interframe dependencies. The scheme is based on sparse linear prediction and a redefinition of the analysis-by-synthesis process. We present Mean Opinion Scores for the presented coder with different degrees of packet loss and show that it performs similarly to frame-dependent coders for low packet loss probability and similarly to frame-independent coders for high packet loss probability. We also present ideas on how to make the coder work synergistically with the channel loss estimate.

    Exploiting deep learning in limited-fronthaul cell-free massive MIMO uplink

    A cell-free massive multiple-input multiple-output (MIMO) uplink is considered, where quantize-and-forward (QF) refers to the case where both the channel estimates and the received signals are quantized at the access points (APs) and forwarded to a central processing unit (CPU), whereas in combine-quantize-and-forward (CQF), the APs send the quantized version of the combined signal to the CPU. To solve the non-convex sum rate maximization problem, a heuristic sub-optimal scheme is exploited to convert the power allocation problem into a standard geometric programme (GP). We exploit the knowledge of the channel statistics to design the power elements. Employing large-scale fading (LSF) with a deep convolutional neural network (DCNN) enables us to determine a mapping from the LSF coefficients to the optimal power obtained by solving the sum rate maximization problem using the quantized channel. Four possible power control schemes are studied, which we refer to as i) small-scale fading (SSF)-based QF; ii) LSF-based CQF; iii) LSF use-and-then-forget (UatF)-based QF; and iv) LSF deep learning (DL)-based QF, according to where channel estimation is performed and exploited and how the optimization problem is solved. Numerical results show that, for the same fronthaul rate, the throughput significantly increases thanks to the mapping obtained using the DCNN.