Deep Vocoder: Low Bit Rate Compression of Speech with Deep Autoencoder
Inspired by the success of deep neural networks (DNNs) in speech processing,
this paper presents Deep Vocoder, a direct end-to-end low bit rate speech
compression method with a deep autoencoder (DAE). In Deep Vocoder, the DAE is
used to extract the latent representing features (LRFs) of speech, which are
then efficiently quantized by an analysis-by-synthesis vector quantization (AbS
VQ) method. AbS VQ aims to minimize the perceptual spectral reconstruction
distortion rather than the distortion of the LRF vector itself. Also, a suboptimal
codebook searching technique is proposed to further reduce the computational
complexity. Experimental results demonstrate that Deep Vocoder yields
substantial improvements in terms of frequency-weighted segmental SNR, STOI and
PESQ score when compared to the output of the conventional SQ- or VQ-based
codec. The PESQ scores on the TIMIT corpus are 3.34 and 3.08 for speech
coding at 2400 bit/s and 1200 bit/s, respectively.
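The analysis-by-synthesis idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the linear `decoder`, codebook size, and perceptual `weights` are made-up stand-ins; the point is only that the winning codeword minimizes distortion in the decoded spectral domain rather than in the latent domain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a linear "decoder" mapping latent features to
# log-spectra, a codebook of latent vectors, and a perceptual weighting.
decoder = rng.standard_normal((64, 8))    # latent dim 8 -> 64 spectral bins
codebook = rng.standard_normal((256, 8))  # 256 candidate latent vectors
weights = np.linspace(1.0, 0.2, 64)       # emphasize lower frequencies

def abs_vq_search(latent):
    """AbS search: pick the codeword whose *decoded* spectrum is
    perceptually closest to the target spectrum, not the codeword
    closest to the latent vector itself."""
    target_spec = decoder @ latent
    cand_specs = codebook @ decoder.T      # (256, 64) decoded spectra
    dist = np.sum(weights * (cand_specs - target_spec) ** 2, axis=1)
    return int(np.argmin(dist))

def naive_vq_search(latent):
    """Conventional VQ: minimize distortion in the latent domain."""
    return int(np.argmin(np.sum((codebook - latent) ** 2, axis=1)))

latent = rng.standard_normal(8)
i_abs, i_naive = abs_vq_search(latent), naive_vq_search(latent)
```

By construction, the AbS choice never has higher weighted spectral distortion than the latent-domain choice, which is exactly the property the paper exploits.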
Spoken Language Identification Using Hybrid Feature Extraction Methods
This paper introduces and motivates the use of a hybrid robust feature
extraction technique for spoken language identification (LID) systems. Speech
recognizers use a parametric form of the signal to capture the most
distinguishing features of the speech signal for the recognition task. In this paper,
Mel-frequency cepstral coefficients (MFCC), Perceptual linear prediction
coefficients (PLP), and two hybrid features are used for language
identification. The two hybrid features, Bark Frequency Cepstral Coefficients
(BFCC) and Revised Perceptual Linear Prediction Coefficients (RPLP), were
obtained from a combination of MFCC and PLP. Two different classifiers, Vector
Quantization (VQ) with Dynamic Time Warping (DTW) and Gaussian Mixture Model
(GMM) were used for classification. The experiment shows better identification
rate using hybrid feature extraction techniques compared to conventional
feature extraction methods. BFCC has shown better performance than MFCC with
both classifiers. RPLP along with GMM has shown the best identification
performance among all feature extraction techniques.
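The cepstral pipeline these features share can be sketched compactly: power spectrum, warped triangular filterbank, log, DCT. This is a generic MFCC-style computation, not the authors' code; the filter counts and toy frame are arbitrary, and BFCC would substitute Bark-scale warping for the mel warping shown.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def cepstral_coeffs(frame, sr=16000, n_filters=26, n_ceps=13):
    """MFCC-style features for one frame (illustrative sketch):
    power spectrum -> triangular filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    n_bins = len(spec)
    # Filter edges equally spaced on the mel scale, mapped to FFT bins
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bin_pts = np.floor(mel_to_hz(mel_pts) / (sr / 2) * (n_bins - 1)).astype(int)
    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, c, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for b in range(lo, hi):
            # Triangular weight rising to the center bin, then falling
            w = (b - lo) / max(c - lo, 1) if b < c else (hi - b) / max(hi - c, 1)
            fbank[i] += w * spec[b]
    log_e = np.log(fbank + 1e-10)
    # Type-II DCT decorrelates the log filterbank energies
    n = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1)
                                           / (2 * n_filters)))
                     for k in range(n_ceps)])

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # toy 440 Hz frame
mfcc = cepstral_coeffs(frame)
```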
A study on speech enhancement using exponent-only floating point quantized neural network (EOFP-QNN)
Numerous studies have investigated the effectiveness of neural network
quantization on pattern classification tasks. The present study, for the first
time, investigated the performance of speech enhancement (a regression task in
speech processing) using a novel exponent-only floating-point quantized neural
network (EOFP-QNN). The proposed EOFP-QNN consists of two stages:
mantissa-quantization and exponent-quantization. In the mantissa-quantization
stage, EOFP-QNN learns how to quantize the mantissa bits of the model
parameters while preserving the regression accuracy using the least mantissa
precision. In the exponent-quantization stage, the exponent part of the
parameters is further quantized without causing any additional performance
degradation. We evaluated the proposed EOFP quantization technique on two types
of neural networks, namely, bidirectional long short-term memory (BLSTM) and
fully convolutional neural network (FCN), on a speech enhancement task.
Experimental results showed that the model sizes can be significantly reduced
(the model sizes of the quantized BLSTM and FCN models were only 18.75% and
21.89%, respectively, compared to those of the original models) while
maintaining satisfactory speech-enhancement performance.
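Mantissa quantization of IEEE-754 float32 parameters can be illustrated directly with bit masking. This sketch mimics only the first stage with a fixed, hypothetical `keep_bits`; the paper instead learns the least mantissa precision that preserves regression accuracy.

```python
import numpy as np

def quantize_mantissa(x, keep_bits):
    """Keep only the top `keep_bits` of the 23-bit float32 mantissa by
    zeroing the low mantissa bits; sign and exponent are untouched.
    (Illustrative bit-level sketch, not the paper's training procedure.)"""
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    low = (1 << (23 - keep_bits)) - 1          # low mantissa bits to drop
    mask = np.uint32(0xFFFFFFFF ^ low)
    return (u & mask).view(np.float32)

w = np.array([0.1234567, -3.14159, 123.456], dtype=np.float32)
w3 = quantize_mantissa(w, 3)   # only 3 mantissa bits survive
```

Because the exponent field is untouched, truncation keeps each weight in its original binade, and the relative error is bounded by 2^-keep_bits.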
A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement
Single-channel speech enhancement with deep neural networks (DNNs) has shown
promising performance and is thus intensively being studied. In this paper,
instead of applying the mean squared error (MSE) as the loss function during
DNN training for speech enhancement, we design a perceptual weighting filter
loss motivated by the weighting filter as it is employed in
analysis-by-synthesis speech coding, e.g., in code-excited linear prediction
(CELP). The experimental results show that the proposed simple loss function
improves the speech enhancement performance compared to a reference DNN with
MSE loss in terms of perceptual quality and noise attenuation. The proposed
loss function can be advantageously applied to an existing DNN-based speech
enhancement system, without modification of the DNN topology for speech
enhancement. The source code for the proposed approach is made available.
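A weighting-filter loss of this kind can be sketched as a weighted MSE: the enhancement error is filtered through a CELP-style weighting W(z) = A(z/γ1)/A(z/γ2) before averaging. Everything below (the toy LPC coefficients, the γ values, the naive IIR loop) is illustrative, not the paper's implementation.

```python
import numpy as np

def weighting_from_lpc(a, gamma1=0.92, gamma2=0.6):
    """Build W(z) = A(z/gamma1) / A(z/gamma2) from LPC coefficients
    a = [1, a1, ..., ap] (gammas are typical CELP-style values)."""
    k = np.arange(len(a))
    return a * gamma1 ** k, a * gamma2 ** k    # numerator, denominator

def iir_filter(num, den, x):
    """Direct-form difference equation (slow but dependency-free)."""
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n - k >= 0)
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y[n] = acc / den[0]
    return y

def weighted_filter_loss(clean, enhanced, lpc):
    """MSE of the *weighted* error signal instead of the raw error."""
    num, den = weighting_from_lpc(np.asarray(lpc, dtype=float))
    err_w = iir_filter(num, den, enhanced - clean)
    return float(np.mean(err_w ** 2))

rng = np.random.default_rng(1)
clean = rng.standard_normal(160)                    # one toy 10 ms frame
enhanced = clean + 0.1 * rng.standard_normal(160)
loss = weighted_filter_loss(clean, enhanced, [1.0, -0.9])
```

In a DNN training loop the same filtering would be applied inside the loss so that residual noise is penalized more where the speech spectrum provides little masking.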
Group-theoretic structure of linear phase multirate filter banks
Unique lifting factorization results for group lifting structures are used to
characterize the group-theoretic structure of two-channel linear phase FIR
perfect reconstruction filter bank groups. For D-invariant, order-increasing
group lifting structures, it is shown that the associated lifting cascade group
C is isomorphic to the free product of the upper and lower triangular lifting
matrix groups. Under the same hypotheses, the associated scaled lifting group S
is the semidirect product of C by the diagonal gain scaling matrix group D.
These results apply to the group lifting structures for the two principal
classes of linear phase perfect reconstruction filter banks, the whole- and
half-sample symmetric classes. Since the unimodular whole-sample symmetric
class forms a group, W, that is in fact equal to its own scaled lifting group,
W=S_W, the results of this paper characterize the group-theoretic structure of
W up to isomorphism. Although the half-sample symmetric class H does not form a
group, it can be partitioned into cosets of its lifting cascade group, C_H, or,
alternatively, into cosets of its scaled lifting group, S_H. Homomorphic
comparisons reveal that scaled lifting groups covered by the results in this
paper have a structure analogous to a "noncommutative vector space."
Comment: 33 pages, 6 figures; to appear in IEEE Transactions on Information
Theory.
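The matrices involved can be written explicitly. In the notation below (which follows common lifting-scheme usage and is not necessarily the paper's), a lifting cascade is a product of upper- and lower-triangular lifting steps, optionally followed by a diagonal gain scaling:

```latex
U(z) = \begin{pmatrix} 1 & u(z) \\ 0 & 1 \end{pmatrix}, \qquad
L(z) = \begin{pmatrix} 1 & 0 \\ \ell(z) & 1 \end{pmatrix}, \qquad
D_\alpha = \begin{pmatrix} \alpha & 0 \\ 0 & 1/\alpha \end{pmatrix},
\qquad
A(z) = D_\alpha\, S_N(z) \cdots S_1(z),
```

where each S_i(z) is an upper or lower lifting matrix. The lifting cascade group C collects the products of the S_i factors, and the scaled lifting group S adjoins the diagonal scalings, matching the free-product and semidirect-product statements in the abstract.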
Secure Transmission of Password Using Speech Watermarking
Internet is one of the most valuable resources for information communication
and retrievals. Most multimedia signals today are in digital formats. The
digital data can be duplicated and edited with great ease which has led to a
need for data integrity and protection of digital data. The security
requirements such as integrity or data authentication can be met by
implementing security measures using digital watermarking techniques. In this
paper, a blind speech watermarking algorithm is used that embeds the watermark
data in a musical host signal by means of frequency masking. A different,
logarithmic approach is proposed: a logarithmic function is first applied to
the watermark data, and the transformed signal is then embedded into the host
signal after the latter is converted by the fast Fourier transform. Finally,
the watermark signal is retrieved using the inverse fast Fourier transform and
an antilogarithmic function.
Comment: 4 pages, 7 figures; International Journal of Computer Science and
Technology (IJCST), Vol. 2, Issue 3, September 201
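The log/FFT/antilog chain described above can be sketched as follows. One caveat: the paper's scheme is blind, whereas this toy extraction uses the original host for simplicity, and `alpha` is a hypothetical embedding strength, so this only illustrates the transform chain, not the blind detector.

```python
import numpy as np

def embed_watermark(host, wm, alpha=0.01):
    """Log-compress the watermark, add its spectrum (scaled by alpha)
    to the host spectrum, and return the inverse FFT."""
    log_wm = np.sign(wm) * np.log1p(np.abs(wm))   # logarithmic transform
    H = np.fft.fft(host)
    H_marked = H + alpha * np.fft.fft(log_wm, n=len(host))
    return np.real(np.fft.ifft(H_marked))

def extract_watermark(marked, host, alpha=0.01):
    """Non-blind inverse of the embedding above (illustration only;
    the paper's extraction does not require the original host)."""
    D = (np.fft.fft(marked) - np.fft.fft(host)) / alpha
    log_wm = np.real(np.fft.ifft(D))
    return np.sign(log_wm) * np.expm1(np.abs(log_wm))   # antilogarithm

rng = np.random.default_rng(5)
host = rng.standard_normal(64)
wm = rng.standard_normal(64)
marked = embed_watermark(host, wm)
```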
Incorporating Symbolic Sequential Modeling for Speech Enhancement
In a noisy environment, a lossy speech signal can be automatically restored
by a listener if he/she knows the language well. That is, with the built-in
knowledge of a "language model", a listener may effectively suppress noise
interference and retrieve the target speech signals. Accordingly, we argue that
familiarity with the underlying linguistic content of spoken utterances
benefits speech enhancement (SE) in noisy environments. In this study, in
addition to the conventional modeling for learning the acoustic noisy-clean
speech mapping, an abstract symbolic sequential modeling is incorporated into
the SE framework. This symbolic sequential modeling can be regarded as a
"linguistic constraint" in learning the acoustic noisy-clean speech mapping
function. In this study, the symbolic sequences for acoustic signals are
obtained as discrete representations with a Vector Quantized Variational
Autoencoder algorithm. The obtained symbols are able to capture high-level
phoneme-like content from speech signals. The experimental results demonstrate
that the proposed framework can obtain notable performance improvement in terms
of perceptual evaluation of speech quality (PESQ) and short-time objective
intelligibility (STOI) on the TIMIT dataset.
Comment: Accepted to Interspeech 201
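The discrete-symbol step can be sketched as standard VQ-VAE quantization: each continuous latent vector is replaced by (the index of) its nearest codebook entry. The sizes below are toy values, not the paper's configuration.

```python
import numpy as np

def vq_quantize(z, codebook):
    """VQ-VAE-style quantization: map each latent vector to the index of
    its nearest codeword (the discrete 'symbol') and to that codeword."""
    # Squared Euclidean distance between every latent and every codeword
    d = np.sum((z[:, None, :] - codebook[None, :, :]) ** 2, axis=2)
    idx = np.argmin(d, axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(2)
codebook = rng.standard_normal((16, 4))   # 16 symbols, dimension 4 (toy)
z = rng.standard_normal((10, 4))          # 10 frames of latent features
symbols, z_q = vq_quantize(z, codebook)
```

The resulting symbol sequence is what the sequential model consumes as a phoneme-like "linguistic constraint."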
Informed Source Separation using Iterative Reconstruction
This paper presents a technique for Informed Source Separation (ISS) of a
single channel mixture, based on the Multiple Input Spectrogram Inversion
method. The reconstruction of the source signals is iterative, alternating
between a time-frequency consistency enforcement and a re-mixing constraint. A
dual-resolution technique is also proposed for sharper reconstruction of
transients. The two algorithms are compared to a state-of-the-art
Wiener-based ISS technique, on a database of fourteen monophonic mixtures, with
standard source separation objective measures. Experimental results show that
the proposed algorithms outperform both this reference technique and the oracle
Wiener filter by up to 3 dB in distortion, at the cost of significantly
heavier computation.
Comment: Submitted to the IEEE Transactions on Audio, Speech, and Language
Processing.
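The alternation the paper describes can be sketched as follows, with a single full-signal FFT standing in for the STFT (which trivializes the consistency step); the iteration count and signals are toy values, and the real method works on overlapping windowed frames.

```python
import numpy as np

def misi_sketch(mix, mags, n_iter=50):
    """Multiple Input Spectrogram Inversion, simplified: given the mixture
    and each source's magnitude spectrum, alternate (1) sharing the remix
    error among the sources and (2) enforcing the known magnitudes."""
    n_src = len(mags)
    # Initialize every source with the mixture's phase
    phase = np.angle(np.fft.rfft(mix))
    specs = [m * np.exp(1j * phase) for m in mags]
    for _ in range(n_iter):
        sigs = [np.fft.irfft(s, n=len(mix)) for s in specs]
        err = mix - np.sum(sigs, axis=0)           # re-mixing constraint
        sigs = [s + err / n_src for s in sigs]     # share the residual
        # Enforce the known magnitudes, keep the updated phases
        specs = [m * np.exp(1j * np.angle(np.fft.rfft(s)))
                 for m, s in zip(mags, sigs)]
    return [np.fft.irfft(s, n=len(mix)) for s in specs]

rng = np.random.default_rng(3)
s1, s2 = rng.standard_normal(256), rng.standard_normal(256)
mix = s1 + s2
mags = [np.abs(np.fft.rfft(s1)), np.abs(np.fft.rfft(s2))]
est = misi_sketch(mix, mags)
```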
High-Quality, Low-Delay Music Coding in the Opus Codec
The IETF recently standardized the Opus codec as RFC6716. Opus targets a wide
range of real-time Internet applications by combining a linear prediction coder
with a transform coder. We describe the transform coder, with particular
attention to the psychoacoustic knowledge built into the format. The result
outperforms existing audio codecs that do not operate under real-time
constraints.
Comment: 10 pages; Proceedings of the 135th AES Convention, October 201
A Full-Bandwidth Audio Codec With Low Complexity And Very Low Delay
We propose an audio codec that addresses the low-delay requirements of some
applications such as network music performance. The codec is based on the
modified discrete cosine transform (MDCT) with very short frames and uses
gain-shape quantization to preserve the spectral envelope. The short frame
sizes required for low delay typically hinder the performance of transform
codecs. However, at 96 kbit/s and with only 4 ms algorithmic delay, the
proposed codec outperforms the ULD codec operating at the same rate. The total
complexity of the codec is small, at only 17 WMOPS for real-time operation at
48 kHz.
Comment: 5 pages; Proceedings of EUSIPCO 200
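Gain-shape quantization of one band can be sketched as below: the band's norm (gain) and its unit-norm direction (shape) are coded separately, so the band energy, and hence the spectral envelope, survives even coarse shape coding. The random codebook and log-gain step are toy choices, not the codec's actual codebooks.

```python
import numpy as np

def gain_shape_quantize(band, shape_codebook, gain_step=1.5):
    """Code a band's energy (gain) and normalized direction (shape)
    separately, then resynthesize (illustrative sketch)."""
    gain = np.linalg.norm(band)
    shape = band / (gain + 1e-12)                 # unit-norm direction
    # Scalar quantization of the gain in the log domain
    g_idx = int(round(np.log(gain + 1e-12) / np.log(gain_step)))
    # Nearest shape codeword on the unit sphere = max inner product
    s_idx = int(np.argmax(shape_codebook @ shape))
    return (gain_step ** g_idx) * shape_codebook[s_idx]

rng = np.random.default_rng(4)
cb = rng.standard_normal((64, 8))
cb /= np.linalg.norm(cb, axis=1, keepdims=True)   # unit-norm codewords
band = rng.standard_normal(8)
band_hat = gain_shape_quantize(band, cb)
```

Whatever the shape codebook does, the reconstructed energy stays within half a log-step of the true band energy, which is the envelope-preservation property the abstract highlights.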