A Comparison of Front-Ends for Bitstream-Based ASR over IP
Automatic speech recognition (ASR) is expected to play an important role in the provision of spoken interfaces for IP-based applications. However, because the speech signal travels over these networks, ASR systems face two new challenges: the degradation of speech quality caused by the compression needed to fit the channel capacity, and the inevitable occurrence of packet losses.
In this framework, bitstream-based approaches that obtain the ASR feature vectors directly from the coded bitstream, avoiding the speech decoding process, have been proposed ([S.H. Choi, H.K. Kim, H.S. Lee, Speech recognition using quantized LSP parameters and their transformations in digital communications, Speech Commun. 30 (4) (2000) 223–233. A. Gallardo-Antolín, C. Peláez-Moreno, F. Díaz-de-María, Recognizing GSM digital speech, IEEE Trans. Speech Audio Process., to appear. H.K. Kim, R.V. Cox, R.C. Rose, Performance improvement of a bitstream-based front-end for wireless speech recognition in adverse environments, IEEE Trans. Speech Audio Process. 10 (8) (2002) 591–604. C. Peláez-Moreno, A. Gallardo-Antolín, F. Díaz-de-María, Recognizing voice over IP networks: a robust front-end for speech recognition on the WWW, IEEE Trans. Multimedia 3 (2) (2001) 209–218], among others) to improve the robustness of ASR systems. LSP (Line Spectral Pairs) are the preferred set of parameters for the description of the speech spectral envelope in most modern speech coders. Nevertheless, LSP have proved to be unsuitable for ASR, and they must be transformed into cepstrum-type parameters. In this paper we comparatively evaluate the robustness of the most significant LSP-to-cepstrum transformations in a simulated VoIP (voice over IP) environment which includes two of the most popular codecs used in that network (G.723.1 and G.729) and several network conditions. In particular, we compare "pseudocepstrum" [H.K. Kim, S.H. Choi, H.S. Lee, On approximating Line Spectral Frequencies to LPC cepstral coefficients, IEEE Trans. Speech Audio Process. 8 (2) (2000) 195–199], an approximate but straightforward transformation of LSP into LP cepstral coefficients, with a more computationally demanding but exact one.
Our results show that pseudocepstrum is preferable when network conditions are good or computational resources are low, while the exact procedure is recommended when network conditions become more adverse.
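As a rough illustration of the two kinds of LSP-to-cepstrum transformation compared in the abstract above, the following Python sketch contrasts a pseudocepstrum-style approximation with one possible exact route (LSP to LP coefficients, then the standard LP-to-cepstrum recursion). The function names, the even-order assumption, and the approximation formula (obtained here by assuming A(z)^2 ≈ P(z)Q(z)) are assumptions of this sketch and are not taken from the papers themselves.

```python
import numpy as np

def pseudocepstrum(lsf, n_ceps):
    """Approximate LP cepstrum c_1..c_n directly from LSFs (radians).

    Assumes A(z)^2 ~= P(z)Q(z), which gives
    c_n ~= (1 + (-1)^n) / (2n) + (1/n) * sum_i cos(n * w_i).
    """
    lsf = np.asarray(lsf, dtype=float)
    n = np.arange(1, n_ceps + 1)
    return (1 + (-1.0) ** n) / (2 * n) + np.cos(np.outer(n, lsf)).sum(axis=1) / n

def lsf_to_lpc(lsf):
    """Ascending LSFs (radians, even order p) -> [1, a_1, ..., a_p]."""
    p_poly = np.array([1.0, 1.0])    # P(z) has a fixed root at z = -1
    q_poly = np.array([1.0, -1.0])   # Q(z) has a fixed root at z = +1
    for i, w in enumerate(np.asarray(lsf, dtype=float)):
        factor = np.array([1.0, -2.0 * np.cos(w), 1.0])
        if i % 2 == 0:               # 1st, 3rd, ... LSFs belong to P(z)
            p_poly = np.convolve(p_poly, factor)
        else:                        # 2nd, 4th, ... LSFs belong to Q(z)
            q_poly = np.convolve(q_poly, factor)
    a = 0.5 * (p_poly + q_poly)      # A(z) = (P(z) + Q(z)) / 2
    return a[:-1]                    # the z^-(p+1) term cancels to zero

def lpc_to_cepstrum(a, n_ceps):
    """Exact cepstrum c_1..c_n of 1/A(z), with a = [1, a_1, ..., a_p]."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

# Toy second-order example: A(z) = 1 - 0.9 z^-1 + 0.2 z^-2
lsf = np.arccos([0.85, 0.05])
exact = lpc_to_cepstrum(lsf_to_lpc(lsf), 3)
approx = pseudocepstrum(lsf, 3)
```

The approximation needs only cosines of the LSFs, whereas the exact route rebuilds the LP polynomial first, which is where the extra computation comes from in this sketch.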
Recognizing Voice Over IP: A Robust Front-End for Speech Recognition on the World Wide Web
The Internet Protocol (IP) environment introduces two relevant sources of distortion to the speech recognition problem: lossy speech coding and packet loss. In this paper, we propose a new front-end for speech recognition over IP networks. Specifically, we suggest extracting the recognition feature vectors directly from the encoded speech (i.e., the bit stream) instead of decoding it and subsequently extracting the feature vectors. This approach offers two significant benefits. First, the recognition system is only affected by the quantization distortion of the spectral envelope. Thus, we avoid the influence of other sources of distortion due to the encoding-decoding process. Second, when packet loss occurs, our front-end becomes more effective since it is not constrained by the error handling mechanism of the codec. We have considered the ITU G.723.1 standard codec, which is one of the most widely used coding algorithms in voice over IP (VoIP), and compared the proposed front-end with the conventional approach in two automatic speech recognition (ASR) tasks, namely, speaker-independent isolated digit recognition and speaker-independent continuous speech recognition. In general, our approach outperforms the conventional procedure for a variety of simulated packet loss rates. Furthermore, the improvement grows as network conditions worsen.
Bandwidth extension of narrowband speech
Recently, 4G mobile phone systems have been designed to process wideband speech signals sampled at 16 kHz. However, most of the mobile and fixed telephone network, as well as current 3G mobile phones, still process narrowband speech signals sampled at 8 kHz. In the near future, all these systems will have to coexist. Therefore, a wideband speech signal (with a bandwidth of up to 7.2 kHz) must sometimes be estimated from an available narrowband one (whose frequency band is 300-3400 Hz). In this work, different audio bandwidth extension techniques have been implemented and evaluated. First, a simple non-model-based algorithm (interpolation algorithm) has been implemented. Second, a model-based algorithm (linear mapping) has been designed and evaluated against the first one. Several CMOS (Comparison Mean Opinion Score) [6] listening tests show that the linear mapping algorithm clearly outperforms the interpolation algorithm, and that its results are very close to those obtained with the original wideband speech signal.
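The abstract above does not say how the linear mapping is built. One common reading, sketched below in Python, is a least-squares matrix that maps narrowband envelope features to wideband ones. The feature dimensions, the bias term, the function names, and the synthetic data are all assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def fit_linear_mapping(nb_feats, wb_feats):
    """Least-squares W (last row is a bias) so that [nb_feats, 1] @ W ~= wb_feats."""
    X = np.hstack([nb_feats, np.ones((nb_feats.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, wb_feats, rcond=None)
    return W

def apply_linear_mapping(W, nb_feats):
    """Estimate wideband features from narrowband ones with a fitted W."""
    X = np.hstack([nb_feats, np.ones((nb_feats.shape[0], 1))])
    return X @ W

# Synthetic stand-in data (NOT real speech features): 200 frames,
# 10 narrowband parameters mapped to 16 wideband parameters.
rng = np.random.default_rng(0)
true_W = rng.normal(size=(11, 16))
nb = rng.normal(size=(200, 10))
wb = np.hstack([nb, np.ones((200, 1))]) @ true_W

W = fit_linear_mapping(nb, wb)
wb_estimate = apply_linear_mapping(W, nb[:5])
```

In a real system the two feature sets would be computed from time-aligned narrowband and wideband versions of the same training speech, and the estimated envelope would then drive the synthesis of the missing frequency band.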
Classical sampling theorems in the context of multirate and polyphase digital filter bank structures
The recovery of a signal from so-called generalized samples is a problem of designing appropriate linear filters called reconstruction (or synthesis) filters. This relationship is reviewed and explored. Novel theorems for the subsampling of sequences are derived by direct use of the digital-filter-bank framework. These results are related to the theory of perfect reconstruction in maximally decimated digital-filter-bank systems. One of the theorems pertains to the subsampling of a sequence and its first few differences and its subsequent stable reconstruction at finite cost with no error. The reconstruction filters turn out to be multiplierless and of the FIR (finite impulse response) type. These ideas are extended to the case of two-dimensional signals by use of a Kronecker formalism. The subsampling of bandlimited sequences is also considered. A sequence x(n) whose Fourier transform vanishes for |ω| ≥ Lπ/M, where L and M are integers with L<M, can in principle be represented by reducing the data rate by the amount M/L. The digital polyphase framework is used as a convenient tool for the derivation as well as mechanization of the sampling theorem.
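To make the difference-sampling idea in the abstract above concrete, the Python sketch below keeps every M-th sample of a sequence together with its first M-1 backward differences, then rebuilds the full sequence exactly. It relies on the identity that a unit delay equals 1 - Δ, where Δx(n) = x(n) - x(n-1). The zero initial condition, the choice of sampling instants, and the function names are assumptions of this sketch; the paper's own formulation may differ.

```python
from math import comb
import numpy as np

def backward_differences(x, order):
    """Return [x, Δx, ..., Δ^order x], taking x(n) = 0 for n < 0."""
    diffs = [np.asarray(x, dtype=np.int64)]
    for _ in range(order):
        prev = diffs[-1]
        diffs.append(prev - np.concatenate(([0], prev[:-1])))
    return diffs

def subsample(x, M):
    """Keep x and its first M-1 differences at n = M-1, 2M-1, ... only."""
    return [d[M - 1::M] for d in backward_differences(x, M - 1)]

def reconstruct(channels, M):
    """Rebuild the full sequence from the M subsampled channels."""
    n_blocks = len(channels[0])
    x = np.zeros(n_blocks * M, dtype=np.int64)
    for j in range(M):
        # x(n - j) = sum_i (-1)^i C(j, i) Δ^i x(n), because delay = 1 - Δ
        x[M - 1 - j::M] = sum(
            (-1) ** i * comb(j, i) * channels[i] for i in range(j + 1)
        )
    return x

rng = np.random.default_rng(0)
x = rng.integers(-100, 100, size=12)
x_rec = reconstruct(subsample(x, 3), 3)
```

The total data rate is unchanged (M channels, each at 1/M of the original rate), and reconstruction uses only finitely many terms with small integer binomial weights, which is consistent with the FIR, finite-cost reconstruction the abstract describes.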
Distributed video coding for wireless video sensor networks: a review of the state-of-the-art architectures
Distributed video coding (DVC) is a relatively new video coding architecture that originates from two fundamental theorems, namely Slepian–Wolf and Wyner–Ziv. Recent research developments have made DVC attractive for applications in the emerging domain of wireless video sensor networks (WVSNs). This paper reviews the state-of-the-art DVC architectures with a focus on understanding their opportunities and gaps in addressing the operational requirements and application needs of WVSNs.
A variable rate speech compressor for mobile applications
One of the most promising speech coders at bit rates of 9.6 to 4.8 kbits/s is CELP. Code Excited Linear Prediction (CELP) has dominated the 9.6 to 4.8 kbits/s region during the past 3 to 4 years. Its drawback, however, is its expensive implementation. As an alternative to CELP, the Base-Band CELP (CELP-BB) was developed, which produces good quality speech comparable to CELP with a complexity that can be implemented on a single chip, as reported previously. Its robustness was also improved to tolerate errors up to 1.0 pct. and maintain intelligibility up to 5.0 pct. and more. Although CELP-BB produces good quality speech at around 4.8 kbits/s, it has a fundamental problem when updating the pitch filter memory. A sub-optimal solution is proposed for this problem. Below 4.8 kbits/s, however, CELP-BB suffers from noticeable quantization noise as a result of the large vector dimensions used. Efficient representation of speech below 4.8 kbits/s is reported by introducing Sinusoidal Transform Coding (STC) to represent the LPC excitation, a scheme called Sine Wave Excited LPC (SWELP). In this case, natural sounding, good quality synthetic speech is obtained at around 2.4 kbits/s.
Reducing Audible Spectral Discontinuities
In this paper, a common problem in diphone synthesis is discussed, viz., the occurrence of audible discontinuities at diphone boundaries. Informal observations show that spectral mismatch is most likely the cause of this phenomenon. We first set out to find an objective spectral measure for discontinuity. To this end, several spectral distance measures are related to the results of a listening experiment. Then, we studied the feasibility of extending the diphone database with context-sensitive diphones to reduce the occurrence of audible discontinuities. The number of additional diphones is limited by clustering consonant contexts that have a similar effect on the surrounding vowels, on the basis of the best performing distance measure. A listening experiment has shown that the addition of these context-sensitive diphones significantly reduces the number of audible discontinuities.