Deep Vocoder: Low Bit Rate Compression of Speech with Deep Autoencoder
Inspired by the success of deep neural networks (DNNs) in speech processing,
this paper presents Deep Vocoder, a direct end-to-end low bit rate speech
compression method with a deep autoencoder (DAE). In Deep Vocoder, the DAE is
used to extract the latent representing features (LRFs) of speech, which are
then efficiently quantized by an analysis-by-synthesis vector quantization (AbS
VQ) method. AbS VQ aims to minimize the perceptual spectral reconstruction
distortion rather than the distortion of the LRF vector itself. Also, a suboptimal
codebook searching technique is proposed to further reduce the computational
complexity. Experimental results demonstrate that Deep Vocoder yields
substantial improvements in terms of frequency-weighted segmental SNR, STOI and
PESQ score when compared to the output of the conventional SQ- or VQ-based
codec. The PESQ scores on the TIMIT corpus are 3.34 and 3.08 for speech
coding at 2400 bit/s and 1200 bit/s, respectively.
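The analysis-by-synthesis idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the linear `decoder`, codebook size, and perceptual `weights` are made-up stand-ins; the point is only that the winning codeword minimizes distortion in the decoded spectral domain rather than in the latent domain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a linear "decoder" mapping latent features to
# log-spectra, a codebook of latent vectors, and a perceptual weighting.
decoder = rng.standard_normal((64, 8))    # latent dim 8 -> 64 spectral bins
codebook = rng.standard_normal((256, 8))  # 256 candidate latent vectors
weights = np.linspace(1.0, 0.2, 64)       # emphasize lower frequencies

def abs_vq_search(latent):
    """AbS search: pick the codeword whose *decoded* spectrum is
    perceptually closest to the target spectrum, not the codeword
    closest to the latent vector itself."""
    target_spec = decoder @ latent
    cand_specs = codebook @ decoder.T      # (256, 64) decoded spectra
    dist = np.sum(weights * (cand_specs - target_spec) ** 2, axis=1)
    return int(np.argmin(dist))

def naive_vq_search(latent):
    """Conventional VQ: minimize distortion in the latent domain."""
    return int(np.argmin(np.sum((codebook - latent) ** 2, axis=1)))

latent = rng.standard_normal(8)
i_abs, i_naive = abs_vq_search(latent), naive_vq_search(latent)
```

By construction, the AbS choice never has higher weighted spectral distortion than the latent-domain choice, which is exactly the property the paper exploits.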
Spoken Language Identification Using Hybrid Feature Extraction Methods
This paper introduces and motivates the use of a hybrid robust feature
extraction technique for spoken language identification (LID) systems. Speech
recognizers use a parametric form of the signal to capture the most
distinguishing features of the speech signal for the recognition task. In this paper,
Mel-frequency cepstral coefficients (MFCC), Perceptual linear prediction
coefficients (PLP), and two hybrid features are used for language
identification. The two hybrid features, Bark Frequency Cepstral Coefficients
(BFCC) and Revised Perceptual Linear Prediction Coefficients (RPLP), were
obtained from a combination of MFCC and PLP. Two different classifiers, Vector
Quantization (VQ) with Dynamic Time Warping (DTW) and Gaussian Mixture Model
(GMM) were used for classification. The experiment shows better identification
rate using hybrid feature extraction techniques compared to conventional
feature extraction methods. BFCC has shown better performance than MFCC with
both classifiers. RPLP along with GMM has shown the best identification
performance among all feature extraction techniques.
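The cepstral pipeline these features share can be sketched compactly: power spectrum, warped triangular filterbank, log, DCT. This is a generic MFCC-style computation, not the authors' code; the filter counts and toy frame are arbitrary, and BFCC would substitute Bark-scale warping for the mel warping shown.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def cepstral_coeffs(frame, sr=16000, n_filters=26, n_ceps=13):
    """MFCC-style features for one frame (illustrative sketch):
    power spectrum -> triangular filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    n_bins = len(spec)
    # Filter edges equally spaced on the mel scale, mapped to FFT bins
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bin_pts = np.floor(mel_to_hz(mel_pts) / (sr / 2) * (n_bins - 1)).astype(int)
    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, c, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for b in range(lo, hi):
            # Triangular weight rising to the center bin, then falling
            w = (b - lo) / max(c - lo, 1) if b < c else (hi - b) / max(hi - c, 1)
            fbank[i] += w * spec[b]
    log_e = np.log(fbank + 1e-10)
    # Type-II DCT decorrelates the log filterbank energies
    n = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1)
                                           / (2 * n_filters)))
                     for k in range(n_ceps)])

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # toy 440 Hz frame
mfcc = cepstral_coeffs(frame)
```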
A study on speech enhancement using exponent-only floating point quantized neural network (EOFP-QNN)
Numerous studies have investigated the effectiveness of neural network
quantization on pattern classification tasks. The present study, for the first
time, investigated the performance of speech enhancement (a regression task in
speech processing) using a novel exponent-only floating-point quantized neural
network (EOFP-QNN). The proposed EOFP-QNN consists of two stages:
mantissa-quantization and exponent-quantization. In the mantissa-quantization
stage, EOFP-QNN learns how to quantize the mantissa bits of the model
parameters while preserving the regression accuracy using the least mantissa
precision. In the exponent-quantization stage, the exponent part of the
parameters is further quantized without causing any additional performance
degradation. We evaluated the proposed EOFP quantization technique on two types
of neural networks, namely, bidirectional long short-term memory (BLSTM) and
fully convolutional neural network (FCN), on a speech enhancement task.
Experimental results showed that the model sizes can be significantly reduced
(the model sizes of the quantized BLSTM and FCN models were only 18.75% and
21.89%, respectively, compared to those of the original models) while
maintaining satisfactory speech-enhancement performance.
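Mantissa quantization of IEEE-754 float32 parameters can be illustrated directly with bit masking. This sketch mimics only the first stage with a fixed, hypothetical `keep_bits`; the paper instead learns the least mantissa precision that preserves regression accuracy.

```python
import numpy as np

def quantize_mantissa(x, keep_bits):
    """Keep only the top `keep_bits` of the 23-bit float32 mantissa by
    zeroing the low mantissa bits; sign and exponent are untouched.
    (Illustrative bit-level sketch, not the paper's training procedure.)"""
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    low = (1 << (23 - keep_bits)) - 1          # low mantissa bits to drop
    mask = np.uint32(0xFFFFFFFF ^ low)
    return (u & mask).view(np.float32)

w = np.array([0.1234567, -3.14159, 123.456], dtype=np.float32)
w3 = quantize_mantissa(w, 3)   # only 3 mantissa bits survive
```

Because the exponent field is untouched, truncation keeps each weight in its original binade, and the relative error is bounded by 2^-keep_bits.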
A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement
Single-channel speech enhancement with deep neural networks (DNNs) has shown
promising performance and is thus intensively being studied. In this paper,
instead of applying the mean squared error (MSE) as the loss function during
DNN training for speech enhancement, we design a perceptual weighting filter
loss motivated by the weighting filter as it is employed in
analysis-by-synthesis speech coding, e.g., in code-excited linear prediction
(CELP). The experimental results show that the proposed simple loss function
improves the speech enhancement performance compared to a reference DNN with
MSE loss in terms of perceptual quality and noise attenuation. The proposed
loss function can be advantageously applied to an existing DNN-based speech
enhancement system, without modification of the DNN topology for speech
enhancement. The source code for the proposed approach is made available.
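A weighting-filter loss of this kind can be sketched as a weighted MSE: the enhancement error is filtered through a CELP-style weighting W(z) = A(z/γ1)/A(z/γ2) before averaging. Everything below (the toy LPC coefficients, the γ values, the naive IIR loop) is illustrative, not the paper's implementation.

```python
import numpy as np

def weighting_from_lpc(a, gamma1=0.92, gamma2=0.6):
    """Build W(z) = A(z/gamma1) / A(z/gamma2) from LPC coefficients
    a = [1, a1, ..., ap] (gammas are typical CELP-style values)."""
    k = np.arange(len(a))
    return a * gamma1 ** k, a * gamma2 ** k    # numerator, denominator

def iir_filter(num, den, x):
    """Direct-form difference equation (slow but dependency-free)."""
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n - k >= 0)
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y[n] = acc / den[0]
    return y

def weighted_filter_loss(clean, enhanced, lpc):
    """MSE of the *weighted* error signal instead of the raw error."""
    num, den = weighting_from_lpc(np.asarray(lpc, dtype=float))
    err_w = iir_filter(num, den, enhanced - clean)
    return float(np.mean(err_w ** 2))

rng = np.random.default_rng(1)
clean = rng.standard_normal(160)                    # one toy 10 ms frame
enhanced = clean + 0.1 * rng.standard_normal(160)
loss = weighted_filter_loss(clean, enhanced, [1.0, -0.9])
```

In a DNN training loop the same filtering would be applied inside the loss so that residual noise is penalized more where the speech spectrum provides little masking.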
Group-theoretic structure of linear phase multirate filter banks
Unique lifting factorization results for group lifting structures are used to
characterize the group-theoretic structure of two-channel linear phase FIR
perfect reconstruction filter bank groups. For D-invariant, order-increasing
group lifting structures, it is shown that the associated lifting cascade group
C is isomorphic to the free product of the upper and lower triangular lifting
matrix groups. Under the same hypotheses, the associated scaled lifting group S
is the semidirect product of C by the diagonal gain scaling matrix group D.
These results apply to the group lifting structures for the two principal
classes of linear phase perfect reconstruction filter banks, the whole- and
half-sample symmetric classes. Since the unimodular whole-sample symmetric
class forms a group, W, that is in fact equal to its own scaled lifting group,
W=S_W, the results of this paper characterize the group-theoretic structure of
W up to isomorphism. Although the half-sample symmetric class H does not form a
group, it can be partitioned into cosets of its lifting cascade group, C_H, or,
alternatively, into cosets of its scaled lifting group, S_H. Homomorphic
comparisons reveal that scaled lifting groups covered by the results in this
paper have a structure analogous to a "noncommutative vector space."
Comment: 33 pages, 6 figures; to appear in IEEE Transactions on Information
Theory.
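The matrices involved can be written explicitly. In the notation below (which follows common lifting-scheme usage and is not necessarily the paper's), a lifting cascade is a product of upper- and lower-triangular lifting steps, optionally followed by a diagonal gain scaling:

```latex
U(z) = \begin{pmatrix} 1 & u(z) \\ 0 & 1 \end{pmatrix}, \qquad
L(z) = \begin{pmatrix} 1 & 0 \\ \ell(z) & 1 \end{pmatrix}, \qquad
D_\alpha = \begin{pmatrix} \alpha & 0 \\ 0 & 1/\alpha \end{pmatrix},
\qquad
A(z) = D_\alpha\, S_N(z) \cdots S_1(z),
```

where each S_i(z) is an upper or lower lifting matrix. The lifting cascade group C collects the products of the S_i factors, and the scaled lifting group S adjoins the diagonal scalings, matching the free-product and semidirect-product statements in the abstract.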
Secure Transmission of Password Using Speech Watermarking
Internet is one of the most valuable resources for information communication
and retrievals. Most multimedia signals today are in digital formats. The
digital data can be duplicated and edited with great ease which has led to a
need for data integrity and protection of digital data. The security
requirements such as integrity or data authentication can be met by
implementing security measures using digital watermarking techniques. In this
paper, a blind speech watermarking algorithm is used that embeds the watermark
data in a musical host signal by means of frequency masking. A different,
logarithmic approach is proposed: a logarithmic function is first applied to
the watermark data, and the transformed signal is then embedded into the host
signal after the latter is converted by the fast Fourier transform. Finally,
the watermark signal is retrieved using the inverse fast Fourier transform and
an antilogarithmic function.
Comment: 4 pages, 7 figures; International Journal of Computer Science and
Technology (IJCST), Vol. 2, Issue 3, September 201
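The log/FFT/antilog chain described above can be sketched as follows. One caveat: the paper's scheme is blind, whereas this toy extraction uses the original host for simplicity, and `alpha` is a hypothetical embedding strength, so this only illustrates the transform chain, not the blind detector.

```python
import numpy as np

def embed_watermark(host, wm, alpha=0.01):
    """Log-compress the watermark, add its spectrum (scaled by alpha)
    to the host spectrum, and return the inverse FFT."""
    log_wm = np.sign(wm) * np.log1p(np.abs(wm))   # logarithmic transform
    H = np.fft.fft(host)
    H_marked = H + alpha * np.fft.fft(log_wm, n=len(host))
    return np.real(np.fft.ifft(H_marked))

def extract_watermark(marked, host, alpha=0.01):
    """Non-blind inverse of the embedding above (illustration only;
    the paper's extraction does not require the original host)."""
    D = (np.fft.fft(marked) - np.fft.fft(host)) / alpha
    log_wm = np.real(np.fft.ifft(D))
    return np.sign(log_wm) * np.expm1(np.abs(log_wm))   # antilogarithm

rng = np.random.default_rng(5)
host = rng.standard_normal(64)
wm = rng.standard_normal(64)
marked = embed_watermark(host, wm)
```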
Incorporating Symbolic Sequential Modeling for Speech Enhancement
In a noisy environment, a lossy speech signal can be automatically restored
by a listener if he/she knows the language well. That is, with the built-in
knowledge of a "language model", a listener may effectively suppress noise
interference and retrieve the target speech signals. Accordingly, we argue that
familiarity with the underlying linguistic content of spoken utterances
benefits speech enhancement (SE) in noisy environments. In this study, in
addition to the conventional modeling for learning the acoustic noisy-clean
speech mapping, an abstract symbolic sequential modeling is incorporated into
the SE framework. This symbolic sequential modeling can be regarded as a
"linguistic constraint" in learning the acoustic noisy-clean speech mapping
function. In this study, the symbolic sequences for acoustic signals are
obtained as discrete representations with a Vector Quantized Variational
Autoencoder algorithm. The obtained symbols are able to capture high-level
phoneme-like content from speech signals. The experimental results demonstrate
that the proposed framework can obtain notable performance improvement in terms
of perceptual evaluation of speech quality (PESQ) and short-time objective
intelligibility (STOI) on the TIMIT dataset.
Comment: Accepted to Interspeech 201
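The discrete-symbol step can be sketched as standard VQ-VAE quantization: each continuous latent vector is replaced by (the index of) its nearest codebook entry. The sizes below are toy values, not the paper's configuration.

```python
import numpy as np

def vq_quantize(z, codebook):
    """VQ-VAE-style quantization: map each latent vector to the index of
    its nearest codeword (the discrete 'symbol') and to that codeword."""
    # Squared Euclidean distance between every latent and every codeword
    d = np.sum((z[:, None, :] - codebook[None, :, :]) ** 2, axis=2)
    idx = np.argmin(d, axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(2)
codebook = rng.standard_normal((16, 4))   # 16 symbols, dimension 4 (toy)
z = rng.standard_normal((10, 4))          # 10 frames of latent features
symbols, z_q = vq_quantize(z, codebook)
```

The resulting symbol sequence is what the sequential model consumes as a phoneme-like "linguistic constraint."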
Informed Source Separation using Iterative Reconstruction
This paper presents a technique for Informed Source Separation (ISS) of a
single channel mixture, based on the Multiple Input Spectrogram Inversion
method. The reconstruction of the source signals is iterative, alternating
between a time-frequency consistency enforcement and a re-mixing constraint. A
dual-resolution technique is also proposed for sharper reconstruction of
transients. The two algorithms are compared to a state-of-the-art
Wiener-based ISS technique, on a database of fourteen monophonic mixtures, with
standard source separation objective measures. Experimental results show that
the proposed algorithms outperform both this reference technique and the oracle
Wiener filter by up to 3 dB in distortion, at the cost of significantly
heavier computation.
Comment: Submitted to the IEEE Transactions on Audio, Speech, and Language
Processing.
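The alternation the paper describes can be sketched as follows, with a single full-signal FFT standing in for the STFT (which trivializes the consistency step); the iteration count and signals are toy values, and the real method works on overlapping windowed frames.

```python
import numpy as np

def misi_sketch(mix, mags, n_iter=50):
    """Multiple Input Spectrogram Inversion, simplified: given the mixture
    and each source's magnitude spectrum, alternate (1) sharing the remix
    error among the sources and (2) enforcing the known magnitudes."""
    n_src = len(mags)
    # Initialize every source with the mixture's phase
    phase = np.angle(np.fft.rfft(mix))
    specs = [m * np.exp(1j * phase) for m in mags]
    for _ in range(n_iter):
        sigs = [np.fft.irfft(s, n=len(mix)) for s in specs]
        err = mix - np.sum(sigs, axis=0)           # re-mixing constraint
        sigs = [s + err / n_src for s in sigs]     # share the residual
        # Enforce the known magnitudes, keep the updated phases
        specs = [m * np.exp(1j * np.angle(np.fft.rfft(s)))
                 for m, s in zip(mags, sigs)]
    return [np.fft.irfft(s, n=len(mix)) for s in specs]

rng = np.random.default_rng(3)
s1, s2 = rng.standard_normal(256), rng.standard_normal(256)
mix = s1 + s2
mags = [np.abs(np.fft.rfft(s1)), np.abs(np.fft.rfft(s2))]
est = misi_sketch(mix, mags)
```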
High-Quality, Low-Delay Music Coding in the Opus Codec
The IETF recently standardized the Opus codec as RFC6716. Opus targets a wide
range of real-time Internet applications by combining a linear prediction coder
with a transform coder. We describe the transform coder, with particular
attention to the psychoacoustic knowledge built into the format. The result
outperforms existing audio codecs that do not operate under real-time
constraints.
Comment: 10 pages; Proceedings of the 135th AES Convention, October 201
A Full-Bandwidth Audio Codec With Low Complexity And Very Low Delay
We propose an audio codec that addresses the low-delay requirements of some
applications such as network music performance. The codec is based on the
modified discrete cosine transform (MDCT) with very short frames and uses
gain-shape quantization to preserve the spectral envelope. The short frame
sizes required for low delay typically hinder the performance of transform
codecs. However, at 96 kbit/s and with only 4 ms algorithmic delay, the
proposed codec outperforms the ULD codec operating at the same rate. The total
complexity of the codec is small, at only 17 WMOPS for real-time operation at
48 kHz.
Comment: 5 pages; Proceedings of EUSIPCO 200
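Gain-shape quantization of one band can be sketched as below: the band's norm (gain) and its unit-norm direction (shape) are coded separately, so the band energy, and hence the spectral envelope, survives even coarse shape coding. The random codebook and log-gain step are toy choices, not the codec's actual codebooks.

```python
import numpy as np

def gain_shape_quantize(band, shape_codebook, gain_step=1.5):
    """Code a band's energy (gain) and normalized direction (shape)
    separately, then resynthesize (illustrative sketch)."""
    gain = np.linalg.norm(band)
    shape = band / (gain + 1e-12)                 # unit-norm direction
    # Scalar quantization of the gain in the log domain
    g_idx = int(round(np.log(gain + 1e-12) / np.log(gain_step)))
    # Nearest shape codeword on the unit sphere = max inner product
    s_idx = int(np.argmax(shape_codebook @ shape))
    return (gain_step ** g_idx) * shape_codebook[s_idx]

rng = np.random.default_rng(4)
cb = rng.standard_normal((64, 8))
cb /= np.linalg.norm(cb, axis=1, keepdims=True)   # unit-norm codewords
band = rng.standard_normal(8)
band_hat = gain_shape_quantize(band, cb)
```

Whatever the shape codebook does, the reconstructed energy stays within half a log-step of the true band energy, which is the envelope-preservation property the abstract highlights.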