82 research outputs found

    A Parametric Approach for Efficient Speech Storage, Flexible Synthesis and Voice Conversion

    Get PDF
    During the past decades, many areas of speech processing have benefited from the vast increases in the available memory sizes and processing power. For example, speech recognizers can be trained with enormous speech databases and high-quality speech synthesizers can generate new speech sentences by concatenating speech units retrieved from a large inventory of speech data. However, even in today's world of ever-increasing memory sizes and computational resources, there are still lots of embedded application scenarios for speech processing techniques where the memory capacities and the processor speeds are very limited. Thus, there is still a clear demand for solutions that can operate with limited resources, e.g., on low-end mobile devices. This thesis introduces a new segmental parametric speech codec referred to as the VLBR codec. The novel proprietary sinusoidal speech codec designed for efficient speech storage is capable of achieving relatively good speech quality at compression ratios beyond the ones offered by the standardized speech coding solutions, i.e., at bitrates of approximately 1 kbps and below. The efficiency of the proposed coding approach is based on model simplifications, mode-based segmental processing, and the method of adaptive downsampling and quantization. The coding efficiency is also further improved using a novel flexible multi-mode matrix quantizer structure and enhanced dynamic codebook reordering. The compression is also facilitated using a new perceptual irrelevancy removal method. The VLBR codec is also applied to text-to-speech synthesis. In particular, the codec is utilized for the compression of unit selection databases and for the parametric concatenation of speech units. It is also shown that the efficiency of the database compression can be further enhanced using speaker-specific retraining of the codec. Moreover, the computational load is significantly decreased using a new compression-motivated scheme for very fast and memory-efficient calculation of concatenation costs, based on techniques and implementations used in the VLBR codec. Finally, the VLBR codec and the related speech synthesis techniques are complemented with voice conversion methods that allow modifying the perceived speaker identity which in turn enables, e.g., cost-efficient creation of new text-to-speech voices. The VLBR-based voice conversion system combines compression with the popular Gaussian mixture model based conversion approach. Furthermore, a novel method is proposed for converting the prosodic aspects of speech. The performance of the VLBR-based voice conversion system is also enhanced using a new approach for mode selection and through explicit control of the degree of voicing. The solutions proposed in the thesis together form a complete system that can be utilized in different ways and configurations. The VLBR codec itself can be utilized, e.g., for efficient compression of audio books, and the speech synthesis related methods can be used for reducing the footprint and the computational load of concatenative text-to-speech synthesizers to levels required in some embedded applications. The VLBR-based voice conversion techniques can be used to complement the codec both in storage applications and in connection with speech synthesis. It is also possible to only utilize the voice conversion functionality, e.g., in games or other entertainment applications

    Differential encoding techniques applied to speech signals

    Get PDF
    The increasing use of digital communication systems has produced a continuous search for efficient methods of speech encoding. This thesis describes investigations of novel differential encoding systems. Initially Linear First Order DPCM systems employing a simple delayed encoding algorithm are examined. The systems detect an overload condition in the encoder, and through a simple algorithm reduce the overload noise at the expense of some increase in the quantization (granular) noise. The signal-to-noise ratio (snr) performance of such d codec has 1 to 2 dB's advantage compared to the First Order Linear DPCM system. In order to obtain a large improvement in snr the high correlation between successive pitch periods as well as the correlation between successive samples in the voiced speech waveform is exploited. A system called "Pitch Synchronous First Order DPCM" (PSFOD) has been developed. Here the difference Sequence formed between the samples of the input sequence in the current pitch period and the samples of the stored decoded sequence from the previous pitch period are encoded. This difference sequence has a smaller dynamic range than the original input speech sequence enabling a quantizer with better resolution to be used for the same transmission bit rate. The snr is increased by 6 dB compared with the peak snr of a First Order DPCM codea. A development of the PSFOD system called a Pitch Synchronous Differential Predictive Encoding system (PSDPE) is next investigated. The principle of its operation is to predict the next sample in the voiced-speech waveform, and form the prediction error which is then subtracted from the corresponding decoded prediction error in the previous pitch period. The difference is then encoded and transmitted. The improvement in snr is approximately 8 dB compared to an ADPCM codea, when the PSDPE system uses an adaptive PCM encoder. The snr of the system increases further when the efficiency of the predictors used improve. However, the performance of a predictor in any differential system is closely related to the quantizer used. The better the quantization the more information is available to the predictor and the better the prediction of the incoming speech samples. This leads automatically to the investigation in techniques of efficient quantization. A novel adaptive quantization technique called Dynamic Ratio quantizer (DRQ) is then considered and its theory presented. The quantizer uses an adaptive non-linear element which transforms the input samples of any amplitude to samples within a defined amplitude range. A fixed uniform quantizer quantizes the transformed signal. The snr for this quantizer is almost constant over a range of input power limited in practice by the dynamia range of the adaptive non-linear element, and it is 2 to 3 dB's better than the snr of a One Word Memory adaptive quantizer. Digital computer simulation techniques have been used widely in the above investigations and provide the necessary experimental flexibility. Their use is described in the text

    Image compression techniques using vector quantization

    Get PDF

    Signal Processing Research Program

    Get PDF
    Contains table of contents for Part III, table of contents for Section 1, an introduction and reports on fourteen research projects.Charles S. Draper Laboratory Contract DL-H-404158U.S. Navy - Office of Naval Research Grant N00014-89-J-1489National Science Foundation Grant MIP 87-14969Battelle LaboratoriesTel-Aviv University, Department of Electronic SystemsU.S. Army Research Office Contract DAAL03-86-D-0001The Federative Republic of Brazil ScholarshipSanders Associates, Inc.Bell Northern Research, Ltd.Amoco Foundation FellowshipGeneral Electric FellowshipNational Science Foundation FellowshipU.S. Air Force - Office of Scientific Research FellowshipU.S. Navy - Office of Naval Research Grant N00014-85-K-0272Natural Science and Engineering Research Council of Canada - Science and Technology Scholarshi

    Efficient compression of motion compensated residuals

    Get PDF
    EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Estimation and Modeling Problems in Parametric Audio Coding

    Get PDF

    Data hiding in images based on fractal modulation and diversity combining

    Get PDF
    The current work provides a new data-embedding infrastructure based on fractal modulation. The embedding problem is tackled from a communications point of view. The data to be embedded becomes the signal to be transmitted through a watermark channel. The channel could be the image itself or some manipulation of the image. The image self noise and noise due to attacks are the two sources of noise in this paradigm. At the receiver, the image self noise has to be suppressed, while noise due to the attacks may sometimes be predicted and inverted. The concepts of fractal modulation and deterministic self-similar signals are extended to 2-dimensional images. These novel techniques are used to build a deterministic bi-homogenous watermark signal that embodies the binary data to be embedded. The binary data to be embedded, is repeated and scaled with different amplitudes at each level and is used as the wavelet decomposition pyramid. The binary data is appended with special marking data, which is used during demodulation, to identify and correct unreliable or distorted blocks of wavelet coefficients. This specially constructed pyramid is inverted using the inverse discrete wavelet transform to obtain the self-similar watermark signal. In the data embedding stage, the well-established linear additive technique is used to add the watermark signal to the cover image, to generate the watermarked (stego) image. Data extraction from a potential stego image is done using diversity combining. Neither the original image nor the original binary sequence (or watermark signal) is required during the extraction. A prediction of the original image is obtained using a cross-shaped window and is used to suppress the image self noise in the potential stego image. The resulting signal is then decomposed using the discrete wavelet transform. The number of levels and the wavelet used are the same as those used in the watermark signal generation stage. A thresholding process similar to wavelet de-noising is used to identify whether a particular coefficient is reliable or not. A decision is made as to whether a block is reliable or not based on the marking data present in each block and sometimes corrections are applied to the blocks. Finally the selected blocks are combined based on the diversity combining strategy to extract the embedded binary data

    Self-similarity and wavelet forms for the compression of still image and video data

    Get PDF
    This thesis is concerned with the methods used to reduce the data volume required to represent still images and video sequences. The number of disparate still image and video coding methods increases almost daily. Recently, two new strategies have emerged and have stimulated widespread research. These are the fractal method and the wavelet transform. In this thesis, it will be argued that the two methods share a common principle: that of self-similarity. The two will be related concretely via an image coding algorithm which combines the two, normally disparate, strategies. The wavelet transform is an orientation selective transform. It will be shown that the selectivity of the conventional transform is not sufficient to allow exploitation of self-similarity while keeping computational cost low. To address this, a new wavelet transform is presented which allows for greater orientation selectivity, while maintaining the orthogonality and data volume of the conventional wavelet transform. Many designs for vector quantizers have been published recently and another is added to the gamut by this work. The tree structured vector quantizer presented here is on-line and self structuring, requiring no distinct training phase. Combining these into a still image data compression system produces results which are among the best that have been published to date. An extension of the two dimensional wavelet transform to encompass the time dimension is straightforward and this work attempts to extrapolate some of its properties into three dimensions. The vector quantizer is then applied to three dimensional image data to produce a video coding system which, while not optimal, produces very encouraging results
    corecore