IANS: Intelligibility-aware Null-steering Beamforming for Dual-Microphone Arrays
Beamforming techniques are popular in speech-related applications due to
their effective spatial filtering capabilities. Nonetheless, conventional
beamforming techniques generally depend heavily on prior knowledge of the
target's direction-of-arrival (DOA), relative transfer function (RTF), or
covariance matrix. This paper presents a new approach, the intelligibility-aware
null-steering (IANS) beamforming framework, which uses the STOI-Net
intelligibility prediction model to improve speech intelligibility without
prior knowledge of the speech signal parameters mentioned earlier. The IANS
framework combines a null-steering beamformer (NSBF), which generates a set of
beamformed outputs, with STOI-Net, which selects the optimal output. Experimental
results indicate that IANS can produce intelligibility-enhanced signals using a
small dual-microphone array. The results are comparable to those obtained by
null-steering beamformers given prior knowledge of the DOAs.
Comment: Preprint submitted to IEEE MLSP 202
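
The selection mechanism described in the abstract lends itself to a short sketch: sweep a dual-microphone null-steering beamformer over a grid of candidate null directions, score each beamformed output with a non-intrusive intelligibility predictor, and keep the highest-scoring candidate. The Python sketch below is illustrative rather than the authors' implementation: null_steer is a simple delay-and-subtract beamformer, the microphone spacing and angle grid are assumptions, and dummy_scorer is a placeholder where a trained STOI-Net model would be plugged in.

import numpy as np

def null_steer(x, fs=16000, mic_dist=0.05, c=343.0, theta_null=0.0):
    """Dual-microphone delay-and-subtract null-steering beamformer.

    Places a spatial null toward ``theta_null`` (radians, broadside = 0) by
    aligning mic 2 to mic 1 for that direction and subtracting.
    ``x`` has shape (2, n_samples). Geometry values are assumptions.
    """
    n = x.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    tau = mic_dist * np.sin(theta_null) / c        # inter-mic delay in seconds
    X = np.fft.rfft(x, axis=1)
    X2_aligned = X[1] * np.exp(2j * np.pi * freqs * tau)
    return np.fft.irfft(X[0] - X2_aligned, n=n)

def ians_select(x, scorer, fs=16000, n_angles=37):
    """IANS-style search: beamform toward each candidate null angle and
    keep the output the intelligibility predictor scores highest."""
    best = (None, None, -np.inf)                   # (signal, angle, score)
    for theta in np.linspace(-np.pi / 2, np.pi / 2, n_angles):
        y = null_steer(x, fs=fs, theta_null=theta)
        s = scorer(y, fs)
        if s > best[2]:
            best = (y, theta, s)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 16000))            # stand-in 2-channel capture
    # Placeholder scorer; a trained STOI-Net predictor would go here.
    dummy_scorer = lambda y, fs: -float(np.mean(y ** 2))
    y, theta, score = ians_select(x, dummy_scorer)
    print(f"selected null at {np.degrees(theta):.1f} deg (score {score:.4f})")

In practice the predictor dominates the runtime, so the angle grid would likely be kept coarse; the grid resolution here is an arbitrary choice.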
Time-Domain Multi-modal Bone/air Conducted Speech Enhancement
Previous studies have shown that integrating video signals, as a
complementary modality, can improve speech enhancement (SE) performance.
However, video clips usually contain large amounts of data, incur a high
computational cost, and may thus complicate the SE system. As an
alternative source, a bone-conducted speech
signal has a moderate data size while manifesting speech-phoneme structures,
and thus complements its air-conducted counterpart. In this study, we propose a
novel multi-modal SE structure in the time domain that leverages bone- and
air-conducted signals. In addition, we examine two ensemble-learning-based
strategies, early fusion (EF) and late fusion (LF), to integrate the two types
of speech signals, and adopt a deep learning-based fully convolutional network
to conduct the enhancement. The experimental results on the Mandarin corpus
indicate that this newly presented multi-modal (integrating bone- and
air-conducted signals) SE structure significantly outperforms the single-source
SE counterparts (with a bone- or air-conducted signal only) in various speech
evaluation metrics. In addition, adopting the LF strategy rather than the EF
strategy in this multi-modal SE structure achieves better results.
Comment: multi-modal, bone/air-conducted signals, speech enhancement, fully
convolutional network
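
The early/late fusion contrast can be made concrete with a small PyTorch sketch. This is not the paper's network: FCN is a generic time-domain fully convolutional enhancer with assumed layer widths and kernel sizes, EarlyFusionSE stacks the two waveforms as input channels of a single network, and LateFusionSE enhances each modality separately before merging with a 1x1 convolution (the merge operator is likewise an assumption).

import torch
import torch.nn as nn

class FCN(nn.Module):
    """Minimal time-domain fully convolutional enhancer (sizes are assumptions)."""
    def __init__(self, in_ch=1, width=64, n_layers=4, kernel=11):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(n_layers):
            layers += [nn.Conv1d(ch, width, kernel, padding=kernel // 2), nn.ReLU()]
            ch = width
        layers += [nn.Conv1d(ch, 1, kernel, padding=kernel // 2)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                # x: (batch, channels, samples)
        return self.net(x)

class EarlyFusionSE(nn.Module):
    """EF: stack bone- and air-conducted waveforms as a 2-channel input."""
    def __init__(self):
        super().__init__()
        self.fcn = FCN(in_ch=2)

    def forward(self, air, bone):
        return self.fcn(torch.cat([air, bone], dim=1))

class LateFusionSE(nn.Module):
    """LF: enhance each modality separately, then merge with a 1x1 conv."""
    def __init__(self):
        super().__init__()
        self.fcn_air, self.fcn_bone = FCN(), FCN()
        self.merge = nn.Conv1d(2, 1, kernel_size=1)

    def forward(self, air, bone):
        y = torch.cat([self.fcn_air(air), self.fcn_bone(bone)], dim=1)
        return self.merge(y)

if __name__ == "__main__":
    air = torch.randn(1, 1, 16000)       # 1 s of air-conducted audio at 16 kHz
    bone = torch.randn(1, 1, 16000)      # time-aligned bone-conducted audio
    print(EarlyFusionSE()(air, bone).shape, LateFusionSE()(air, bone).shape)

Under this reading, EF shares all parameters across the two modalities while LF lets each branch specialize before merging, which is one plausible reason the LF variant could perform better, though the abstract does not explain the mechanism.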