Spatial, Spectral, and Perceptual Nonlinear Noise Reduction for Hands-free Microphones in a Car
Speech enhancement in an automobile is a challenging problem because interference can come from engine noise, fans, music, wind, road noise, reverberation, echo, and passengers engaging in other conversations. Hands-free microphones make the situation worse because the strength of the desired speech signal reduces with increased distance between the microphone and talker. Automobile safety is improved when the driver can use a hands-free interface to phones and other devices instead of taking his eyes off the road. The demand for high quality hands-free communication in the automobile requires the introduction of more powerful algorithms. This thesis shows that a unique combination of five algorithms can achieve superior speech enhancement for a hands-free system when compared to beamforming or spectral subtraction alone. Several different designs were analyzed and tested before converging on the configuration that achieved the best results. Beamforming, voice activity detection, spectral subtraction, perceptual nonlinear weighting, and talker isolation via pitch tracking all work together in a complementary iterative manner to create a speech enhancement system capable of significantly enhancing real world speech signals. The following conclusions are supported by the simulation results using data recorded in a car and are in strong agreement with theory. Adaptive beamforming, like the Generalized Side-lobe Canceller (GSC), can be effectively used if the filters only adapt during silent data frames because too much of the desired speech is cancelled otherwise. Spectral subtraction removes stationary noise while perceptual weighting prevents the introduction of offensive audible noise artifacts. Talker isolation via pitch tracking can perform better when used after beamforming and spectral subtraction because of the higher accuracy obtained after initial noise removal. 
Iterating the algorithm once increases the accuracy of the Voice Activity Detection (VAD), which improves the overall performance of the algorithm. Placing the microphone(s) on the ceiling above the head and slightly forward of the desired talker appears to be the best location in an automobile based on the experiments performed in this thesis. Objective speech quality measures show that the algorithm removes a majority of the stationary noise in a hands-free environment of an automobile with relatively minimal speech distortion
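As a rough illustration of how spectral subtraction can be gated by voice activity detection, the sketch below updates its noise estimate only in frames marked as silent and subtracts that estimate from every frame. It is not the thesis's implementation: the function name `spectral_subtract`, the smoothing factor `alpha`, the spectral `floor`, and the assumption that VAD decisions arrive as a ready-made boolean array are all hypothetical choices made for this example.

```python
import numpy as np

def spectral_subtract(frames, is_speech, floor=0.05, alpha=0.9):
    """Magnitude spectral subtraction with a VAD-gated noise estimate.

    frames:    2-D array (n_frames, frame_len) of time-domain frames
    is_speech: boolean array, one VAD decision per frame (assumed given)
    floor:     spectral floor as a fraction of the noisy magnitude (assumed)
    alpha:     smoothing factor for the running noise estimate (assumed)
    """
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    noise = np.zeros(mag.shape[1])
    initialised = False
    out = np.empty_like(mag)
    for i in range(mag.shape[0]):
        if not is_speech[i]:
            # Update the noise estimate only in frames the VAD marks as silent.
            if initialised:
                noise = alpha * noise + (1 - alpha) * mag[i]
            else:
                noise = mag[i].copy()
                initialised = True
        # Subtract the noise magnitude and keep a small spectral floor.
        out[i] = np.maximum(mag[i] - noise, floor * mag[i])
    # Resynthesise each frame using the noisy phase.
    return np.fft.irfft(out * np.exp(1j * phase), n=frames.shape[1], axis=1)
```

The spectral floor is a crude stand-in for the perceptual weighting described in the abstract; it only limits how far each bin can be attenuated.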
Scalable and perceptual audio compression
This thesis deals with scalable perceptual audio compression. Two scalable perceptual solutions as well as a scalable to lossless solution are proposed and investigated. One of the scalable perceptual solutions is built around sinusoidal modelling of the audio signal whilst the other is built on a transform coding paradigm. The scalable coders are shown to scale both in a waveform matching manner as well as a psychoacoustic manner. In order to measure the psychoacoustic scalability of the systems investigated in this thesis, the original signal's psychoacoustic parameters are compared with those of the synthesized signal. The psychoacoustic parameters used are loudness, sharpness, tonality and roughness. This analysis technique is a novel method used in this thesis and it allows an insight into the perceptual distortion that has been introduced by any coder analyzed in this manner
Intelligent Tools for Multitrack Frequency and Dynamics Processing
This research explores the possibility of reproducing mixing decisions of a skilled audio
engineer with minimal human interaction that can improve the overall listening experience of
musical mixtures, i.e., intelligent mixing. By producing a balanced mix automatically,
musicians and mixing engineers can focus on their creativity while the productivity of music
production is increased. We focus on the two essential aspects of such a system, frequency
and dynamics. This thesis presents an intelligent strategy for multitrack frequency and
dynamics processing that exploits the interdependence of input audio features, incorporates
best practices in audio engineering, and is driven by perceptual models and subjective criteria.
The intelligent frequency processing research begins with a spectral characteristic analysis of
commercial recordings, where we discover a consistent leaning towards a target equalization
spectrum. A novel approach for automatically equalizing audio signals towards the observed
target spectrum is then described and evaluated. We proceed to dynamics processing, and
introduce an intelligent multitrack dynamic range compression algorithm, in which various
audio features are proposed and validated to better describe the transient nature and spectral
content of the signals. An experiment investigating human preferences in dynamics
processing is described to inform our choices of parameter automation. To provide a
perceptual basis for the intelligent system, we evaluate existing perceptual models, and
propose several masking metrics to quantify the masking behaviour within the multitrack
mixture. Ultimately, we integrate previous research on auditory masking, frequency and
dynamics processing, into one intelligent system of mix optimization that replicates the
iterative process of human mixing. Within the system, we explore the relationship between
equalization and dynamics processing, and propose a general frequency and dynamics
processing framework. Various implementations of the intelligent system are explored and
evaluated objectively and subjectively through listening experiments.
China Scholarship Council
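The idea of equalizing a signal towards a target spectrum, mentioned in the abstract above, can be sketched as computing per-band gains from the difference between a target band spectrum and the measured one. This is only an illustrative sketch under stated assumptions: the function name `eq_gains_db`, the equal-width band split, the level normalisation, and the gain limit `max_gain_db` are hypothetical and are not taken from the thesis.

```python
import numpy as np

def eq_gains_db(signal, target_db, n_bands, max_gain_db=12.0):
    """Per-band gains (dB) that move a signal's spectrum toward a target.

    signal:      1-D array of audio samples
    target_db:   target band levels in dB, one per band (assumed given)
    n_bands:     number of equal-width frequency bands (an assumption)
    max_gain_db: limit on boost or cut per band (an assumption)
    """
    power = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(power, n_bands)
    band_db = np.array([10 * np.log10(np.mean(b) + 1e-12) for b in bands])
    # Compare spectral shape rather than absolute level.
    band_db = band_db - band_db.mean()
    target = np.asarray(target_db, dtype=float)
    target = target - target.mean()
    # Gain is the gap between target and measured level, clipped for safety.
    return np.clip(target - band_db, -max_gain_db, max_gain_db)
```

For a spectrally flat input and a flat target the gains come out close to 0 dB, while a tilted target yields a matching tilt in the gains.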
Wavelet Filter Banks in Perceptual Audio Coding
This thesis studies the application of the wavelet filter bank (WFB) in perceptual audio coding by providing brief overviews of perceptual coding, psychoacoustics, wavelet theory, and existing wavelet coding algorithms. Furthermore, it describes the poor frequency localization property of the WFB and explores one filter design method, in particular, for improving channel separation between the wavelet bands. A wavelet audio coder has also been developed by the author to test the new filters. Preliminary tests indicate that the new filters provide some improvement over other wavelet filters when coding audio signals that are stationary-like and contain only a few harmonic components, and similar results for other types of audio signals that contain many spectral and temporal components.
It has been found that the WFB provides a flexible decomposition scheme through the choice of the tree structure and basis filter, but at the cost of poor localization properties. This flexibility can be a benefit in the context of audio coding but the poor localization properties represent a drawback. Determining ways to fully utilize this flexibility, while minimizing the effects of poor time-frequency localization, is an area that is still very much open for research
The removal of environmental noise in cellular communications by perceptual techniques
This thesis describes the application of a perceptually based spectral subtraction algorithm for
the enhancement of non-stationary noise corrupted speech. Through examination of speech enhancement
techniques, explanations are given for the choice of magnitude spectral subtraction
and how the human auditory system can be modelled for frequency domain speech enhancement.
It is found that the cochlea provides mechanical speech enhancement in the
auditory system through the use of masking. Frequency masking is used in spectral subtraction,
to improve the algorithm execution time, and to shape the enhancement process making it
sound natural to the ear.
A new technique for estimation of background noise is presented, which operates during speech
sections as well as pauses. This uses two microphones placed on opposite ends of the cellular
handset. Using these, the algorithm determines whether the signal is speech, or noise, by
examining the current and next frames presented to each microphone. This allows operation in
non-stationary conditions, as the estimation is calculated for each frame, and a speech pause is
not required for updating. A voting decision process decides the presence of speech or noise
which determines which microphone the estimation is calculated from.
The importance of an accurate noise estimate is highlighted with a new technique to reduce
the effect of musical noise artifacts in the processed speech. This is a classic drawback of
spectral subtraction techniques, and it is shown that the trade-off between noise reduction and
speech distortion can be extended by this process. A new method for dealing with musical
noise is described, which uses a combination of energy and variance examination of the spectrogram
to segregate potential musical noise from desired speech sections. By examination of
the spectrogram points surrounding musical noise sections, perceptually relevant values replace
the corruption leading to cleaner enhanced speech.
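To make the idea of replacing musical noise with values drawn from surrounding spectrogram points more concrete, here is a minimal sketch. It does not reproduce the energy and variance test described above; it substitutes a simpler criterion, flagging a point when it is much larger than the median of the same frequency bin over a few neighbouring frames. The function name `suppress_musical_noise` and the parameters `ratio` and `half_width` are hypothetical.

```python
import numpy as np

def suppress_musical_noise(mag, ratio=3.0, half_width=2):
    """Replace short-lived spectrogram peaks with a temporal median.

    mag:        2-D magnitude spectrogram, shape (n_frames, n_bins)
    ratio:      how far above the local median a point must be (assumed)
    half_width: frames on each side used for the median (assumed)
    """
    n_frames = mag.shape[0]
    # Pad along time so every frame has a full window of neighbours.
    padded = np.pad(mag, ((half_width, half_width), (0, 0)), mode="edge")
    windows = np.stack(
        [padded[k : k + n_frames] for k in range(2 * half_width + 1)]
    )
    local_median = np.median(windows, axis=0)
    # Points far above their temporal median are treated as musical noise.
    isolated = mag > ratio * (local_median + 1e-12)
    cleaned = np.where(isolated, local_median, mag)
    return cleaned, isolated
```

Because the median is taken along time within one bin, a component that persists across frames, such as a sustained harmonic, is left untouched, while a peak lasting a single frame is replaced.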
Any perceptual speech system requires accurate estimates of the clean speech masking thresholds,
to prevent noisy sections being passed through the enhancement untouched. In this thesis, a
method for the calculation of the estimated clean speech masking thresholds is derived. Classically,
this requires an estimation of the clean speech before the thresholds can be derived,
but this results in inaccuracy due to the presence of musical noise and spectral nulls. The
new algorithm examines the thresholds produced by the corrupted speech, and the background
noise, and from these determines the relationship between the two, to produce an estimate of
the clean thresholds, with no operation performed on the actual speech signal. A discrepancy is
found between the results for male and female speech, which, by examination of the perceptual
process, is shown to be due to the different formant positions in male and female speech.
Following the development of these parts, the entire enhancement algorithm is tested on a range
of noise scenarios, using male and female speech. The results show that the proposed algorithm
is able to provide adequate performance in terms of noise reduction and speech quality
…