Multimodal methods for blind source separation of audio sources
The focus of this thesis is enhancing the performance of frequency domain convolutive blind source separation (FDCBSS) techniques when applied to the problem of separating audio sources recorded in a room environment. This challenging application is termed the cocktail party problem, and the ultimate aim is to build a machine that matches the ability of a human being to solve this task. Human beings exploit both their eyes and their ears in solving it, i.e. they adopt a multimodal approach combining the audio and video modalities. New multimodal methods for blind source separation of audio sources are therefore proposed in this work as a step towards realizing such a machine.
The geometry of the room environment is initially exploited to improve
the separation performance of an FDCBSS algorithm. The positions
of the human speakers are monitored by video cameras, and this information is incorporated within the FDCBSS algorithm in the form of constraints added to the underlying cross-power spectral density matrix-based cost function that measures separation performance. [Continues.]
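
As an illustration of such a constrained cost, below is a minimal numpy sketch for one frequency bin: the off-diagonal energy of the output cross-power spectral density measures residual cross-talk, and a penalty ties the unmixing matrix to a steering matrix built from the video-derived source positions. The function names, the penalty form and the weight lam are assumptions for illustration, not the thesis's exact formulation.

    import numpy as np

    def geo_constrained_cost(W, Cx_blocks, D, lam=0.1):
        # Hypothetical sketch of a geometrically constrained FDCBSS cost
        # at ONE frequency bin (the thesis's exact cost may differ).
        # W         : (n_src, n_mic) unmixing matrix at this bin
        # Cx_blocks : (n_blocks, n_mic, n_mic) cross-power spectral density
        #             matrices of the mixtures over time blocks
        # D         : (n_mic, n_src) steering matrix from the video-derived
        #             source positions (assumed known)
        cost = 0.0
        for C in Cx_blocks:
            Cy = W @ C @ W.conj().T            # output CPSD
            off = Cy - np.diag(np.diag(Cy))    # off-diagonals = cross-talk
            cost += np.linalg.norm(off, 'fro') ** 2
        # geometric constraint: each output should pass its own source
        # direction with unit gain, i.e. W @ D close to the identity
        cost += lam * np.linalg.norm(W @ D - np.eye(W.shape[0])) ** 2
        return float(cost)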
A multimodal approach to blind source separation of moving sources
A novel multimodal approach is proposed to solve the
problem of blind source separation (BSS) of moving sources. The
challenge of BSS for moving sources is that the mixing filters are
time varying; thus, the unmixing filters should also be time varying, and such filters are difficult to calculate in real time. In the proposed approach,
the visual modality is utilized to facilitate the separation for
both stationary and moving sources. The movement of the sources
is detected by a 3-D tracker based on video cameras. Positions
and velocities of the sources are obtained from the 3-D tracker
based on a Markov Chain Monte Carlo particle filter (MCMC-PF),
which results in high sampling efficiency. The full BSS solution
is formed by integrating a frequency domain blind source separation
algorithm and beamforming: if the sources are identified
as stationary for a certain minimum period, a frequency domain
BSS algorithm is implemented with an initialization derived from
the positions of the source signals. Once the sources are moving, a
beamforming algorithm which requires no prior statistical knowledge
is used to perform real time speech enhancement and provide
separation of the sources. Experimental results confirm that
by utilizing the visual modality, the proposed algorithm not only
improves the performance of the BSS algorithm and mitigates the
permutation problem for stationary sources, but also provides a
good BSS performance for moving sources in a low-reverberation environment.
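
The stationary/moving switch described above reduces to a small piece of control logic. A hedged numpy sketch follows; the velocity threshold, the minimum stationary period and the mode labels are illustrative assumptions, not values from the paper.

    import numpy as np

    def choose_mode(velocities, still_for, v_thresh=0.05, t_min=2.0):
        # Sketch of the decision driving the hybrid BSS solution.
        # velocities : (n_src, 3) source velocities (m/s) from the
        #              video-based MCMC-PF tracker
        # still_for  : seconds for which all sources have been stationary
        speeds = np.linalg.norm(velocities, axis=1)
        if np.all(speeds < v_thresh) and still_for >= t_min:
            # stationary for a minimum period: run frequency-domain BSS,
            # initialized from the tracked source positions
            return 'fd_bss'
        # otherwise: real-time beamforming steered by the tracker
        return 'beamforming'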
Multimodal blind source separation for moving sources
A novel multimodal approach is proposed to solve the problem of
blind source separation (BSS) of moving sources. The challenge
of BSS for moving sources is that the mixing filters are time varying; thus, the unmixing filters should also be time varying, and such filters are difficult to track in real time. In the proposed approach, the visual
modality is utilized to facilitate the separation for both stationary and
moving sources. The movement of the sources is detected by a 3-D
tracker based on particle filtering. The full BSS solution is formed
by integrating a frequency domain blind source separation algorithm
and beamforming: if the sources are identified as stationary, a frequency
domain BSS algorithm is implemented with an initialization
derived from the visual information. Once the sources are moving,
a beamforming algorithm is used to perform real time speech
enhancement and provide separation of the sources. Experimental
results show that by utilizing the visual modality, the proposed algorithm
can not only improve the performance of the BSS algorithm
and mitigate the permutation problem for stationary sources, but also
provide good BSS performance for moving sources in a low-reverberation environment.
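
For the moving-source branch, the beamformer needs only the tracked geometry. Below is a minimal sketch of a frequency-domain delay-and-sum beamformer steered at the tracked position; the actual beamformer in these works may be more sophisticated, and all names here are illustrative.

    import numpy as np

    def delay_and_sum(X, mic_pos, src_pos, freqs, c=343.0):
        # Sketch: steer a delay-and-sum beamformer at the tracked source.
        # X       : (n_freq, n_mic) STFT of one frame of the mixtures
        # mic_pos : (n_mic, 3) microphone coordinates in metres
        # src_pos : (3,) source position from the 3-D video tracker
        # freqs   : (n_freq,) bin centre frequencies in Hz
        delays = np.linalg.norm(mic_pos - src_pos, axis=1) / c
        # phase terms that propagation applied at each microphone
        steer = np.exp(-2j * np.pi * freqs[:, None] * delays[None, :])
        # undo those phases and average coherently across microphones
        return np.mean(X * steer.conj(), axis=1)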
Informed algorithms for sound source separation in enclosed reverberant environments
While humans can separate a sound of interest amidst a cacophony of contending sounds in an echoic environment, machine-based methods lag behind in solving this task. This thesis therefore aims at improving the performance of audio separation algorithms when they are informed, i.e. when they have access to source location information. These locations are assumed to be known a priori in this work, for example from video processing.
Initially, a multi-microphone array-based method combined with binary time-frequency masking is proposed. A robust least-squares frequency-invariant data-independent beamformer designed with the location information is utilized to estimate the sources. To further enhance the estimated sources, binary time-frequency masking is applied as post-processing, but cepstral-domain smoothing is required to mitigate the resulting musical noise.
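
Cepstral-domain smoothing of a binary mask can be sketched as a low-pass operation across quefrency; a common recipe of this kind (whose details may differ from the thesis) is shown below. The number of retained coefficients and the mask floor are assumptions.

    import numpy as np
    from scipy.fft import dct, idct

    def smooth_mask_cepstral(mask, keep=30, floor=1e-3):
        # Sketch: smooth a binary time-frequency mask in the cepstral
        # domain to mitigate musical noise.
        # mask : (n_freq, n_frames) binary mask
        # keep : low-quefrency coefficients retained (assumption)
        log_mask = np.log(np.maximum(mask, floor))   # avoid log(0)
        ceps = dct(log_mask, axis=0, norm='ortho')   # to cepstral domain
        ceps[keep:, :] = 0.0                         # discard fine structure
        smoothed = np.exp(idct(ceps, axis=0, norm='ortho'))
        return np.clip(smoothed, 0.0, 1.0)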
To tackle the under-determined case and further improve separation performance at higher reverberation times, a two-microphone method inspired by human auditory processing, which generates soft time-frequency masks, is described. In this approach the interaural level difference, the interaural phase difference and the mixing vectors are probabilistically modeled in the time-frequency domain, and the model parameters are learned through the expectation-maximization (EM) algorithm. A direction vector is estimated for each source, using the location information, and serves as the mean parameter of the mixing vector model. Soft time-frequency masks are used to reconstruct the sources. A spatial covariance model is then integrated into the probabilistic framework; it encodes the spatial characteristics of the enclosure and further improves the separation performance in challenging scenarios, i.e. when sources are in close proximity and when the level of reverberation is high.
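
To make the soft-mask idea concrete, here is a deliberately reduced sketch that scores each time-frequency point against direction-derived interaural phase differences only; the thesis additionally models interaural level differences and mixing vectors and re-estimates the parameters with EM. The von Mises-style phase model and its concentration kappa are assumptions.

    import numpy as np

    def ipd_soft_masks(XL, XR, mean_ipd, kappa=5.0):
        # Sketch: soft masks from interaural phase differences alone.
        # XL, XR   : (n_freq, n_frames) STFTs of the two channels
        # mean_ipd : (n_src, n_freq) expected phase differences derived
        #            from the known source directions (assumption)
        ipd = np.angle(XL * XR.conj())   # observed phase difference
        # unnormalized likelihood of each source at each TF point
        like = np.exp(kappa * np.cos(ipd[None, :, :] - mean_ipd[:, :, None]))
        # normalize across sources -> posterior = soft mask
        return like / like.sum(axis=0, keepdims=True)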
Finally, new dereverberation-based pre-processing is proposed, based on a cascade of three dereverberation stages, each of which enhances the two-microphone reverberant mixture. The dereverberation stages are based on amplitude spectral subtraction, where the late reverberation is estimated and suppressed. The combination of such dereverberation-based pre-processing and soft-mask separation yields the best separation performance. All methods are evaluated with real and synthetic mixtures formed, for example, from speech signals from the TIMIT database and measured room impulse responses.
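
One dereverberation stage of this kind can be sketched with a simple exponential-decay model of late reverberation: a delayed, attenuated copy of the observed power spectrogram serves as the late-reverberation estimate and is subtracted. The decay model, delay and floor below are common choices assumed for illustration, not necessarily those of the thesis.

    import numpy as np

    def suppress_late_reverb(P, t60, hop, fs, delay_s=0.05, floor=0.1):
        # Sketch: one spectral-subtraction dereverberation stage.
        # P       : (n_freq, n_frames) power spectrogram of the mixture
        # t60     : reverberation time of the room in seconds
        # hop, fs : STFT hop size (samples) and sampling rate (Hz)
        delta = 3.0 * np.log(10.0) / t60             # decay rate from T60
        shift = max(1, int(round(delay_s * fs / hop)))
        gain = np.exp(-2.0 * delta * delay_s)        # late-reverb weight
        late = np.zeros_like(P)
        late[:, shift:] = gain * P[:, :-shift]       # delayed, scaled copy
        # subtract the estimate, keeping a spectral floor to limit
        # musical noise
        return np.maximum(P - late, floor * P)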
On the Suppression of Noise from a Fast Moving Acoustic Source using Multimodality
The problem of cancelling the noise from a moving acoustic source in an outdoor environment is investigated in this paper. By making use of the known instantaneous location of the moving source (provided by a second modality), we propose a time-domain method for removing the noise from a moving source in a mixture of acoustic sources. The proposed method consists of resampling the mixed data recorded at a reference sensor and linearly combining the resampled data with the non-resampled data of the other sensors to cancel the undesired source. Simulations on synthetic data show the effectiveness and usefulness of the proposed method.
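
The resampling idea can be sketched in a few lines: the known time-varying delay of the moving source between two sensors is used to warp one channel so that the undesired source aligns sample-by-sample with the other channel, after which a simple difference cancels it. Unit propagation gains are assumed here; the paper's actual combination weights may differ.

    import numpy as np

    def cancel_moving_source(x_ref, x_other, tau, fs):
        # Sketch: cancel a tracked moving source by time-warped
        # subtraction.
        # x_ref, x_other : equal-length signals from two sensors
        # tau            : per-sample delay (s) of the moving source
        #                  between the sensors, from its known location
        n = np.arange(len(x_ref))
        # read the reference channel at the warped instants n + tau*fs
        warped = np.interp(n + tau * fs, n, x_ref)
        return x_other - warped   # the moving source cancels (ideally)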
Video-aided model-based source separation in real reverberant rooms
Source separation algorithms that utilize only audio
data can perform poorly if multiple sources or reverberation
are present. In this paper we therefore propose a video-aided
model-based source separation algorithm for a two-channel
reverberant recording in which the sources are assumed static.
By exploiting cues from video, we first localize individual speech
sources in the enclosure and then estimate their directions.
The interaural spatial cues, the interaural phase difference and
the interaural level difference, as well as the mixing vectors
are probabilistically modeled. The models make use of the
source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known
expectation-maximization (EM) algorithm. The algorithm
outputs time-frequency masks that are used to reconstruct the
individual sources. Simulation results show that by utilizing the
visual modality the proposed algorithm can produce better time-frequency masks, thereby giving improved source estimates. We present experimental results for different scenarios, compare against other audio-only and audio-visual algorithms, and achieve improved performance on both synthetic and real data. We also include dereverberation-based pre-processing in our algorithm in order
to suppress the late reverberant components from the observed
stereo mixture and further enhance the overall output of the algorithm.
This advantage makes our algorithm a suitable candidate
for use in under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
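
The final reconstruction step shared by these mask-based methods is simple to sketch: the estimated masks weight the STFT of one mixture channel and the sources are resynthesized with the inverse STFT. The STFT parameters below are assumptions and must match those used to build the masks.

    import numpy as np
    from scipy.signal import stft, istft

    def reconstruct_sources(mixture, masks, fs, nperseg=1024):
        # Sketch: apply time-frequency masks and resynthesize.
        # mixture : (n_samples,) one channel of the recording
        # masks   : (n_src, n_freq, n_frames) soft or binary masks,
        #           with n_freq = nperseg // 2 + 1
        _, _, X = stft(mixture, fs=fs, nperseg=nperseg)
        sources = [istft(m * X, fs=fs, nperseg=nperseg)[1] for m in masks]
        return np.array(sources)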