Multimodal blind source separation for moving sources
A novel multimodal approach is proposed to solve the problem of
blind source separation (BSS) of moving sources. The challenge
of BSS for moving sources is that the mixing filters are time-varying; thus, the unmixing filters must also be time-varying, and they are difficult to track in real time. In the proposed approach, the visual
modality is utilized to facilitate the separation for both stationary and
moving sources. The movement of the sources is detected by a 3-D
tracker based on particle filtering. The full BSS solution is formed
by integrating a frequency domain blind source separation algorithm
and beamforming: if the sources are identified as stationary, a frequency
domain BSS algorithm is implemented with an initialization
derived from the visual information. Once the sources are moving,
a beamforming algorithm is used to perform real time speech
enhancement and provide separation of the sources. Experimental
results show that by utilizing the visual modality, the proposed algorithm
can not only improve the performance of the BSS algorithm
and mitigate the permutation problem for stationary sources, but also
provide good BSS performance for moving sources in a low-reverberation environment.
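
The abstract does not detail the tracker; as a minimal sketch of the particle-filtering idea behind the 3-D tracker, the code below tracks a single source position with a bootstrap particle filter. The constant-velocity motion model, the noise levels and all names are illustrative assumptions, not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def predict(particles, dt=0.1, accel_std=0.5):
        """Propagate 3-D position/velocity particles with a constant-velocity
        model plus Gaussian acceleration noise (assumed motion model)."""
        pos, vel = particles[:, :3], particles[:, 3:]
        vel = vel + accel_std * dt * rng.standard_normal(vel.shape)
        pos = pos + dt * vel
        return np.hstack([pos, vel])

    def update(particles, weights, z, meas_std=0.05):
        """Reweight particles by a Gaussian likelihood of the observed
        3-D position z (e.g. from the visual modality)."""
        d2 = np.sum((particles[:, :3] - z) ** 2, axis=1)
        weights = weights * np.exp(-0.5 * d2 / meas_std ** 2)
        return weights / weights.sum()

    def resample(particles, weights):
        """Systematic resampling to avoid weight degeneracy."""
        n = len(weights)
        positions = (np.arange(n) + rng.random()) / n
        idx = np.searchsorted(np.cumsum(weights), positions)
        return particles[idx], np.full(n, 1.0 / n)

    # Toy run: a source moving slowly along x, observed with noise.
    n = 500
    particles = np.hstack([rng.normal(0, 0.2, (n, 3)), np.zeros((n, 3))])
    weights = np.full(n, 1.0 / n)
    for t in range(50):
        truth = np.array([0.02 * t, 1.0, 2.0])      # hypothetical trajectory
        z = truth + 0.05 * rng.standard_normal(3)   # noisy visual measurement
        particles = predict(particles)
        weights = update(particles, weights, z)
        particles, weights = resample(particles, weights)
    estimate = weights @ particles[:, :3]           # posterior mean position
    # A speed estimate like this could drive the moving/stationary decision.
    speed = np.linalg.norm(weights @ particles[:, 3:])

In the full system, thresholding such a speed estimate is one plausible way to decide when to switch from the frequency domain BSS stage to the beamformer.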
A multimodal approach to blind source separation of moving sources
A novel multimodal approach is proposed to solve the
problem of blind source separation (BSS) of moving sources. The
challenge of BSS for moving sources is that the mixing filters are
time-varying; thus, the unmixing filters must also be time-varying, and they are difficult to calculate in real time. In the proposed approach,
the visual modality is utilized to facilitate the separation for
both stationary and moving sources. The movement of the sources
is detected by a 3-D tracker based on video cameras. Positions
and velocities of the sources are obtained from the 3-D tracker
based on a Markov Chain Monte Carlo particle filter (MCMC-PF),
which results in high sampling efficiency. The full BSS solution
is formed by integrating a frequency domain blind source separation
algorithm and beamforming: if the sources are identified
as stationary for a certain minimum period, a frequency domain
BSS algorithm is implemented with an initialization derived from
the positions of the source signals. Once the sources are moving, a
beamforming algorithm which requires no prior statistical knowledge
is used to perform real time speech enhancement and provide
separation of the sources. Experimental results confirm that
by utilizing the visual modality, the proposed algorithm not only
improves the performance of the BSS algorithm and mitigates the
permutation problem for stationary sources, but also provides good BSS performance for moving sources in a low-reverberation environment.
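
The abstract does not specify the beamformer; a delay-and-sum design is one data-independent choice that requires no prior statistical knowledge. The sketch below steers an array toward a direction supplied by the tracker; the geometry, sign convention and parameter names are assumptions for illustration.

    import numpy as np

    def delay_and_sum(X, mic_pos, doa, fs, nfft, c=343.0):
        """Frequency-domain delay-and-sum beamformer.

        X       : (n_mics, n_bins, n_frames) STFT of the array signals
        mic_pos : (n_mics, 3) microphone coordinates in metres
        doa     : (3,) unit vector pointing from the array to the source
                  (e.g. derived from the visual 3-D tracker)
        """
        freqs = np.arange(X.shape[1]) * fs / nfft             # bin centre frequencies
        delays = mic_pos @ doa / c                            # per-mic propagation delays (s)
        # Steering vector that phase-aligns all mics for the target direction;
        # the sign depends on the STFT and propagation conventions in use.
        steer = np.exp(-2j * np.pi * np.outer(delays, freqs)) # (n_mics, n_bins)
        Y = np.mean(steer[:, :, None] * X, axis=0)            # align and average
        return Y                                              # (n_bins, n_frames) enhanced STFT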
Speech separation with dereverberation-based pre-processing incorporating visual cues
Humans are skilled at selectively extracting a single sound source in the presence of multiple simultaneous sounds. Listeners with normal hearing can also adapt robustly to changing acoustic environments with great ease. There is a need to incorporate such abilities into machines, which would enable application areas such as human-computer interaction, automatic speech recognition, hearing aids and hands-free telephony. This work addresses the problem of
separating multiple speech sources in realistic reverberant
rooms using two microphones.
Different monaural and binaural cues have previously been modeled to enable separation. Binaural spatial cues, i.e. the interaural level difference (ILD) and the interaural phase difference (IPD), have been modeled [1] in the time-frequency (TF) domain; these cues exploit the differences in the intensity and the phase of the mixture signals observed by two microphones (or ears) due to the different spatial locations of the sources. The method performs well with little or no reverberation, but as the amount of reverberation increases and the sources approach each other, the binaural cues are distorted and become indistinct, degrading the separation performance. Thus, additional cues and further signal processing are required at higher levels of reverberation.
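
As a rough sketch of how such binaural cues are formed, the code below computes the ILD and IPD from the STFTs of the two microphone signals; the function and variable names are illustrative, not taken from [1].

    import numpy as np
    from scipy.signal import stft

    def binaural_cues(left, right, fs, nperseg=1024):
        """Compute time-frequency ILD (dB) and IPD (radians) from a stereo pair."""
        _, _, L = stft(left, fs=fs, nperseg=nperseg)
        _, _, R = stft(right, fs=fs, nperseg=nperseg)
        eps = 1e-12
        ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))  # level ratio in dB
        ipd = np.angle(L * np.conj(R))  # phase difference, wrapped to (-pi, pi]
        return ild, ipd

TF points are then typically assigned to sources by comparing these cues with the values predicted for each source location, which is exactly the decision that reverberation blurs.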
Blind source extraction of heart sound signals from lung sound recordings exploiting periodicity of the heart sound
A novel approach for separating heart sound signals (HSSs) from lung sound recordings is presented. The approach is based on blind source extraction (BSE) with second-order statistics (SOS), which exploits the quasi-periodicity of the HSSs. The method is evaluated both on synthetic periodic signals of known period mixed with temporally white Gaussian noise (WGN) and on real quasi-periodic HSSs mixed with lung sound signals (LSSs). For the real data, a qualitative evaluation is performed by comparing the power spectral densities (PSDs) of the signals extracted by the proposed method and by the JADE algorithm with the PSD of the original signal. Separation results confirm the utility of the proposed approach, although departure from strict periodicity may impact performance.
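
The abstract gives only the principle (second-order statistics at the quasi-period). One minimal sketch of such a criterion follows: after whitening the multichannel recording, the extraction vector is taken as the dominant eigenvector of the symmetrized lagged covariance at a lag equal to the assumed-known period, in the style of AMUSE-type algorithms. This is an illustration under those assumptions, not the authors' exact method.

    import numpy as np

    def extract_periodic(X, lag):
        """Extract the component with the strongest autocorrelation at `lag`.

        X   : (n_channels, n_samples) zero-mean multichannel recording
        lag : period of the target source in samples (assumed known)
        """
        # Whiten the observations.
        C0 = (X @ X.T) / X.shape[1]
        d, E = np.linalg.eigh(C0)
        W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T        # whitening matrix
        Z = W @ X
        # Symmetrized covariance of the whitened data at the period lag.
        Ct = (Z[:, :-lag] @ Z[:, lag:].T) / (Z.shape[1] - lag)
        Ct = 0.5 * (Ct + Ct.T)
        vals, vecs = np.linalg.eigh(Ct)
        w = vecs[:, np.argmax(vals)]                   # direction of maximal periodicity
        return w @ Z                                   # extracted source estimate

    # Toy check: a sinusoid of known period mixed with white Gaussian noise.
    rng = np.random.default_rng(1)
    n, period = 20000, 200
    s = np.sin(2 * np.pi * np.arange(n) / period)
    noise = rng.standard_normal((2, n))
    A = rng.standard_normal((3, 3))                    # random mixing matrix
    X = A @ np.vstack([s, noise])
    X = X - X.mean(axis=1, keepdims=True)
    y = extract_periodic(X, period)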
Evaluation of emerging frequency domain convolutive blind source separation algorithms based on real room recordings
This paper presents a comparative study of three emerging frequency domain convolutive blind source separation (FDCBSS) techniques: the convolutive blind separation of non-stationary sources due to Parra and Spence, the penalty function-based joint diagonalization approach for convolutive blind separation of non-stationary sources due to Wang et al., and the geometrically constrained multimodal approach for convolutive blind source separation due to Sanei et al. Objective evaluation is performed on the basis of the signal-to-interference ratio (SIR), the performance index (PI), and the solution to the permutation problem. The results confirm that a multimodal approach is necessary to properly mitigate the permutation problem in BSS and ultimately to solve the cocktail party problem; in other words, BSS should be made semi-blind by exploiting prior geometrical information, thereby providing a framework for robust solutions to more challenging source separation with moving speakers.
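
The SIR criterion used for the objective evaluation can be stated compactly; the sketch below computes it for one separator output, assuming its target and interference components are available separately (as they are in simulations with known mixing).

    import numpy as np

    def sir_db(target, interference):
        """Signal-to-interference ratio in dB for one separator output,
        given its target and interference components."""
        return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2))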
Convolutive speech separation by combining probabilistic models employing the interaural spatial cues and properties of the room assisted by vision
In this paper, a new combination of a model of the interaural spatial cues and a model that utilizes the spatial properties of the sources is proposed to enhance speech separation in reverberant environments. The algorithm exploits knowledge
of the locations of the speech sources estimated through vision.
The interaural phase difference, the interaural level difference
and the contribution of each source to all mixture channels are
each modeled as Gaussian distributions in the time-frequency
domain and evaluated at individual time-frequency points. An
expectation-maximization (EM) algorithm is employed to refine
the estimates of the parameters of the models. The algorithm outputs
enhanced time-frequency masks that are used to reconstruct
individual speech sources. Experimental results confirm that the combined video-assisted method shows promise for separating sources in real reverberant rooms.
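
As a highly simplified sketch of the mask-based idea (one cue only, a Gaussian per source, EM-style refinement of the variances), the code below builds soft TF masks from the IPD; the per-source predicted IPDs would come from the vision-derived locations. It omits the ILD and mixing-vector models of the paper, so it is an illustration rather than the authors' algorithm.

    import numpy as np

    def ipd_masks(ipd, pred_ipd, var=0.5, n_iter=5):
        """Soft time-frequency masks from a Gaussian IPD model per source.

        ipd      : (n_bins, n_frames) observed interaural phase difference
        pred_ipd : list of (n_bins,) IPDs predicted from each source's location
        """
        n_src = len(pred_ipd)
        variances = np.full(n_src, var)
        masks = None
        for _ in range(n_iter):
            # E-step: posterior probability of each source at every TF point,
            # using a wrapped phase error via angle(exp(1j * diff)).
            logp = np.stack([
                -0.5 * np.angle(np.exp(1j * (ipd - mu[:, None]))) ** 2 / v
                - 0.5 * np.log(v)
                for mu, v in zip(pred_ipd, variances)])
            logp -= logp.max(axis=0, keepdims=True)
            masks = np.exp(logp)
            masks /= masks.sum(axis=0, keepdims=True)
            # M-step: re-estimate each source's IPD variance (floored for stability).
            for k in range(n_src):
                err = np.angle(np.exp(1j * (ipd - pred_ipd[k][:, None]))) ** 2
                variances[k] = max((masks[k] * err).sum() / masks[k].sum(), 1e-3)
        return masks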
Video-aided model-based source separation in real reverberant rooms
Source separation algorithms that utilize only audio
data can perform poorly if multiple sources or reverberation
are present. In this paper we therefore propose a video-aided
model-based source separation algorithm for a two-channel
reverberant recording in which the sources are assumed static.
By exploiting cues from video, we first localize individual speech
sources in the enclosure and then estimate their directions.
The interaural spatial cues, the interaural phase difference and
the interaural level difference, as well as the mixing vectors
are probabilistically modeled. The models make use of the source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm
outputs time-frequency masks that are used to reconstruct the
individual sources. Simulation results show that, by utilizing the visual modality, the proposed algorithm can produce better time-frequency masks, thereby giving improved source estimates. We provide experimental results to test the proposed algorithm in different scenarios and provide comparisons with other audio-only and audio-visual algorithms, achieving improved performance on both synthetic and real data. We also include dereverberation-based pre-processing in our algorithm in order to suppress the late reverberant components from the observed
stereo mixture and further enhance the overall output of the algorithm.
This advantage makes our algorithm a suitable candidate
for use in under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
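
For completeness, a minimal sketch of the final reconstruction step: applying an estimated time-frequency mask to one mixture channel and inverting the STFT. Function and variable names are illustrative, and the mask is assumed to match the STFT grid.

    import numpy as np
    from scipy.signal import stft, istft

    def reconstruct(mixture, mask, fs, nperseg=1024):
        """Apply a time-frequency mask to a mixture channel and resynthesize."""
        _, _, X = stft(mixture, fs=fs, nperseg=nperseg)
        # mask must have the same (n_bins, n_frames) shape as X.
        _, source = istft(mask * X, fs=fs, nperseg=nperseg)
        return source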