19 research outputs found
Advanced signal processing techniques for pitch synchronous sinusoidal speech coders
Recent trends in commercial and consumer demand have led to the increasing use of multimedia applications in mobile and Internet telephony. Although audio, video and data communications are becoming more prevalent, a major application is and will remain the transmission of speech. Speech coding techniques suited to these new trends must be developed, not only to provide high quality speech communication but also to minimise the required bandwidth for speech, so as to maximise that available for the new audio, video and data services. The majority of current speech coders employed in mobile and Internet applications employ a Code Excited Linear Prediction (CELP) model. These coders attempt to reproduce the input speech signal and can produce high quality synthetic speech at bit rates above 8 kbps. Sinusoidal speech coders tend to dominate at rates below 6 kbps but due to limitations in the sinusoidal speech coding model, their synthetic speech quality cannot be significantly improved even if their bit rate is increased. Recent developments have seen the emergence and application of Pitch Synchronous (PS) speech coding techniques to these coders in order to remove the limitations of the sinusoidal speech coding model. The aim of the research presented in this thesis is to investigate and eliminate the factors that limit the quality of the synthetic speech produced by PS sinusoidal coders. In order to achieve this innovative signal processing techniques have been developed. New parameter analysis and quantisation techniques have been produced which overcome many of the problems associated with applying PS techniques to sinusoidal coders. In sinusoidal based coders, two of the most important elements are the correct formulation of pitch and voicing values from the' input speech. The techniques introduced here have greatly improved these calculations resulting in a higher quality PS sinusoidal speech coder than was previously available. A new quantisation method which is able to reduce the distortion from quantising speech spectral information has also been developed. When these new techniques are utilised they effectively raise the synthetic speech quality of sinusoidal coders to a level comparable to that produced by CELP based schemes, making PS sinusoidal coders a promising alternative at low to medium bit rates.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
Quantisation mechanisms in multi-protoype waveform coding
Prototype Waveform Coding is one of the most promising methods for speech coding at low bit rates over telecommunications networks. This thesis investigates quantisation mechanisms in Multi-Prototype Waveform (MPW) coding, and two prototype waveform quantisation algorithms for speech coding at bit rates of 2.4kb/s are proposed. Speech coders based on these algorithms have been found to be capable of producing coded speech with equivalent perceptual quality to that generated by the US 1016 Federal Standard CELP-4.8kb/s algorithm. The two proposed prototype waveform quantisation algorithms are based on Prototype Waveform Interpolation (PWI). The first algorithm is in an open loop architecture (Open Loop Quantisation). In this algorithm, the speech residual is represented as a series of prototype waveforms (PWs). The PWs are extracted in both voiced and unvoiced speech, time aligned and quantised and, at the receiver, the excitation is reconstructed by smooth interpolation between them. For low bit rate coding, the PW is decomposed into a slowly evolving waveform (SEW) and a rapidly evolving waveform (REW). The SEW is coded using vector quantisation on both magnitude and phase spectra. The SEW codebook search is based on the best matching of the SEW and the SEW codebook vector. The REW phase spectra is not quantised, but it is recovered using Gaussian noise. The REW magnitude spectra, on the other hand, can be either quantised with a certain update rate or only derived according to SEW behaviours
Computer Models for Musical Instrument Identification
PhDA particular aspect in the perception of sound is concerned with what is commonly
termed as texture or timbre. From a perceptual perspective, timbre is what allows us
to distinguish sounds that have similar pitch and loudness. Indeed most people are
able to discern a piano tone from a violin tone or able to distinguish different voices
or singers.
This thesis deals with timbre modelling. Specifically, the formant theory of timbre
is the main theme throughout. This theory states that acoustic musical instrument
sounds can be characterised by their formant structures. Following this principle, the
central point of our approach is to propose a computer implementation for building
musical instrument identification and classification systems.
Although the main thrust of this thesis is to propose a coherent and unified
approach to the musical instrument identification problem, it is oriented towards the
development of algorithms that can be used in Music Information Retrieval (MIR)
frameworks. Drawing on research in speech processing, a complete supervised system
taking into account both physical and perceptual aspects of timbre is described.
The approach is composed of three distinct processing layers. Parametric models
that allow us to represent signals through mid-level physical and perceptual representations
are considered. Next, the use of the Line Spectrum Frequencies as spectral
envelope and formant descriptors is emphasised. Finally, the use of generative and
discriminative techniques for building instrument and database models is investigated.
Our system is evaluated under realistic recording conditions using databases of isolated
notes and melodic phrases
Progress report of a project in very low bit-rate speech coding
Background work in various levels of speech coding is reviewed, including unconstrained coding and recognition-synthesis approaches that assume the signal is speech. A pilot project in HMM-TTS based speech coding is then described, in which a comparison with harmonic plus noise modelling is also done. Results of the demonstration project including samples of speech under various transmission situations are presented in an accompanying web page. The report concludes by describing and enumerating the shortcomings of the demonstration system that define directions for future work. This work is a deliverable for the armasuisse funded project “RECOD - Low bit-rate speech coding
A Parametric Approach for Efficient Speech Storage, Flexible Synthesis and Voice Conversion
During the past decades, many areas of speech processing have benefited from the vast increases in the available memory sizes and processing power. For example, speech recognizers can be trained with enormous speech databases and high-quality speech synthesizers can generate new speech sentences by concatenating speech units retrieved from a large inventory of speech data. However, even in today's world of ever-increasing memory sizes and computational resources, there are still lots of embedded application scenarios for speech processing techniques where the memory capacities and the processor speeds are very limited. Thus, there is still a clear demand for solutions that can operate with limited resources, e.g., on low-end mobile devices.
This thesis introduces a new segmental parametric speech codec referred to as the VLBR codec. The novel proprietary sinusoidal speech codec designed for efficient speech storage is capable of achieving relatively good speech quality at compression ratios beyond the ones offered by the standardized speech coding solutions, i.e., at bitrates of approximately 1 kbps and below. The efficiency of the proposed coding approach is based on model simplifications, mode-based segmental processing, and the method of adaptive downsampling and quantization. The coding efficiency is also further improved using a novel flexible multi-mode matrix quantizer structure and enhanced dynamic codebook reordering. The compression is also facilitated using a new perceptual irrelevancy removal method.
The VLBR codec is also applied to text-to-speech synthesis. In particular, the codec is utilized for the compression of unit selection databases and for the parametric concatenation of speech units. It is also shown that the efficiency of the database compression can be further enhanced using speaker-specific retraining of the codec. Moreover, the computational load is significantly decreased using a new compression-motivated scheme for very fast and memory-efficient calculation of concatenation costs, based on techniques and implementations used in the VLBR codec.
Finally, the VLBR codec and the related speech synthesis techniques are complemented with voice conversion methods that allow modifying the perceived speaker identity which in turn enables, e.g., cost-efficient creation of new text-to-speech voices. The VLBR-based voice conversion system combines compression with the popular Gaussian mixture model based conversion approach. Furthermore, a novel method is proposed for converting the prosodic aspects of speech. The performance of the VLBR-based voice conversion system is also enhanced using a new approach for mode selection and through explicit control of the degree of voicing.
The solutions proposed in the thesis together form a complete system that can be utilized in different ways and configurations. The VLBR codec itself can be utilized, e.g., for efficient compression of audio books, and the speech synthesis related methods can be used for reducing the footprint and the computational load of concatenative text-to-speech synthesizers to levels required in some embedded applications. The VLBR-based voice conversion techniques can be used to complement the codec both in storage applications and in connection with speech synthesis. It is also possible to only utilize the voice conversion functionality, e.g., in games or other entertainment applications
Recommended from our members
Modelling and extraction of fundamental frequency in speech signals
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.One of the most important parameters of speech is the fundamental frequency of vibration of voiced sounds. The audio sensation of the fundamental frequency is known as the pitch. Depending on the tonal/non-tonal category of language, the fundamental frequency conveys intonation, pragmatics and meaning. In addition the fundamental frequency and intonation carry speaker gender, age, identity, speaking style and emotional state. Accurate estimation of the fundamental frequency is critically important for functioning of speech processing applications such as speech coding, speech recognition, speech synthesis and voice morphing. This thesis makes contributions to the development of accurate pitch estimation research in three distinct ways: (1) an investigation of the impact of the window length on pitch estimation error, (2) an investigation of the use of the higher order moments and (3) an investigation of an analysis-synthesis method for selection of the best pitch value among N proposed candidates. Experimental evaluations show that the length of the speech window has a major impact on the accuracy of pitch estimation. Depending on the similarity criteria and the order of the statistical moment a window length of 37 to 80 ms gives the least error. In order to avoid excessive delay as a consequence of using a longer window, a method is proposed
ii where the current short window is concatenated with the previous frames to form a longer signal window for pitch extraction. The use of second order and higher order moments, and the magnitude difference function, as the similarity criteria were explored and compared. A novel method of calculation of moments is introduced where the signal is split, i.e. rectified, into positive and negative valued samples. The moments for the positive and negative parts of the signal are computed separately and combined. The new method of calculation of moments from positive and negative parts and the higher order criteria provide competitive results. A challenging issue in pitch estimation is the determination of the best candidate from N extrema of the similarity criteria. The analysis-synthesis method proposed in this thesis selects the pitch candidate that provides the best reproduction (synthesis) of the harmonic spectrum of the original speech. The synthesis method must be such that the distortion increases with the increasing error in the estimate of the fundamental frequency. To this end a new method of spectral synthesis is proposed using an estimate of the spectral envelop and harmonically spaced asymmetric Gaussian pulses as excitation. The N-best method provides consistent reduction in pitch estimation error. The methods described in this thesis result in a significant improvement in the pitch accuracy and outperform the benchmark YIN method
Recommended from our members
Time-frequency analysis based on split spectrum applied to audio and ultrasonic signals
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University LondonSignal processing is a large subject with applications integral to a number of technological fields such as communication, audio, Voice over IP (VoIP), pattern recognition, sonar, radar, ultrasound and medical imaging. Techniques exist for the analysis, modelling, extraction, recognition and synthesis of signals of interest. The focus of this thesis is signal processing for acoustics (both sonic and ultrasonic). In the applications examined, signals of interest are usually incomplete, distorted and/or noisy. Therefore, reconstructing the signal, noise reduction and removal of any distortion/interference are the main goals of the signal processing techniques presented. The primary aim is to study and develop an advanced time-frequency signal processing technique for acoustic applications to enhance the quality of the signals. In the first part of the thesis, a technique is presented that models and maintains the correlation between temporal and spectral parameters of audio signals. A novel Packet Loss Concealment (PLC) method is developed with applications to VoIP, audio broadcasting, and streaming. The problem of modelling the time-varying frequency spectrum in the context of PLC is addressed, and a novel solution is proposed for tracking and using the temporal motion of spectral flow to reconstruct the signal. The proposed method utilises a Time-Frequency Motion (TFM) matrix representation of the audio signal, where each frequency is tagged with a motion vector estimate that is assessed by cross-correlation of the movement of spectral energy within sub-bands across time frames. The missing packets are estimated using extrapolation or interpolation algorithms using a TFM matrix and then inverse transformed to the time-domain for reconstruction of the signal. The proposed method is compared with conventional approaches using objective Performance Evaluation of Speech Quality (PESQ), and subjective Mean Opinion Scores (MOS) in a range of packet loss from 5% to 20%. The evaluation results demonstrate that the proposed algorithm substantially improves performance by an average of 2.85% and 5.9% in terms of PESQ and MOS respectively. In the second part of the thesis, the proposed method is extended and modified to address challenges of excessive coherent noise arising from ultrasonic signals gathered during Guided Wave Testing (GWT). It is an advanced Non-destructive testing technique which is used over several branches of industry to inspect large structures for defects where the structural integrity is of concern. In such systems, signal interpretation can often be challenging due to the multi-modal and dispersive propagation of Ultrasonic Guided Waves (UGWs). The multi-modal and dispersive nature of the received signals hampers the ability to detect defects in a given structure. The Split-Spectrum Processing (SSP) method with application for such signal has been studied and reviewed quantitatively to measure the enhancement in terms of Signal-to-Noise Ratio (SNR) and spatial resolution. In this thesis, the influence of SSP filter bank parameters on these signals is studied and optimised to improve SNR and spatial resolution considerably. The proposed method is compared analytically and experimentally with conventional approaches. The proposed SSP algorithm substantially improves SNR by an average of 30dB. The conclusions reached in this thesis will contribute to the progression of the GWT technique through considerable improvement in defect detection capability.Centre for Electronic Systems Research (CESR) of Brunel University London, The National Structural Integrity Research Centre (NSIRC) and TWI Ltd