Soft margin estimation for automatic speech recognition
In this study, a new discriminative learning framework, called soft margin estimation (SME), is proposed for estimating the parameters of continuous-density hidden Markov models (HMMs). The proposed method draws directly on two successful ideas: the margin concept from support vector machines, to improve generalization capability, and decision feedback learning from discriminative training, to enhance model separation in classifier design. SME directly maximizes the separation between competing models so that test samples are still classified correctly as long as their deviation from the training samples stays within a safe margin. Frame and utterance selection are integrated into a unified framework that retains the training utterances and frames most critical for discriminating competing models. SME offers a flexible and rigorous framework for incorporating new margin-based optimization criteria into HMM training. The choice of various loss functions is illustrated, and different kinds of separation measures are defined under the unified SME framework. SME is also shown to be able to jointly optimize feature extraction and the HMMs. Both the generalized probabilistic descent algorithm and the extended Baum-Welch algorithm are applied to solve SME.
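To make the margin idea concrete, below is a minimal sketch of a soft-margin objective of the kind described above; the exact separation measure d(O_i, Λ), the hinge loss, and the λ/ρ balance term are illustrative assumptions rather than the thesis's precise criterion.

```latex
% Illustrative soft-margin objective (assumed form): jointly choose the HMM
% parameters \Lambda and a soft margin \rho so that utterances whose separation
% d(O_i, \Lambda) between the correct and best competing models falls below \rho
% are penalized, while a large margin is still encouraged.
\begin{equation}
\min_{\Lambda,\; \rho > 0} \;
\frac{\lambda}{\rho}
\;+\;
\frac{1}{N} \sum_{i=1}^{N} \max\!\bigl(0,\; \rho - d(O_i, \Lambda)\bigr)
\end{equation}
```

Under a form like this, only utterances whose separation is smaller than the margin contribute to the loss, which mirrors the frame and utterance selection behavior described in the abstract.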
SME has demonstrated a clear advantage over other discriminative training methods in several speech recognition tasks. Tested on the TIDIGITS digit recognition task, the proposed SME approach achieves a string accuracy of 99.61%, the best result reported in the literature at the time. On the 5k-word Wall Street Journal task, SME reduced the word error rate (WER) from 5.06% with MLE models to 3.81%, a relative WER reduction of about 25%. This is the first attempt to show the effectiveness of margin-based acoustic modeling for large-vocabulary continuous speech recognition in an HMM framework. The generalization ability of SME was also demonstrated on the Aurora 2 robust speech recognition task, with around 30% relative WER reduction from the clean-trained baseline.
Ph.D. thesis. Committee Chair: Dr. Chin-Hui Lee; Committee Members: Dr. Anthony Joseph Yezzi, Dr. Biing-Hwang (Fred) Juang, Dr. Mark Clements, Dr. Ming Yua
Preserving background sound in noise-robust voice conversion via multi-task learning
Background sound is an informative element that helps provide a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research on VC, which has mainly focused on clean voices, pays little attention to VC with background sound. The critical problems for preserving background sound in VC are the inevitable speech distortion introduced by the neural separation model and the cascade mismatch between the source separation model and the VC model. In this paper, we propose an end-to-end framework trained via multi-task learning that sequentially cascades a source separation (SS) module, a bottleneck feature extraction module, and a VC module. Specifically, the source separation task explicitly considers critical phase information and confines the distortion caused by the imperfect separation process. The source separation task, the typical VC task, and the unified task share a uniform reconstruction loss constrained by joint training to reduce the mismatch between the SS and VC modules. Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving quality and speaker similarity comparable to VC models trained with clean data.
Comment: Submitted to ICASSP 202
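As a rough illustration of the multi-task setup described above, the sketch below combines separation, conversion, and unified-path reconstruction losses into one joint objective. The module interfaces, the L1 criterion, and the loss weights are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class CascadedVC(nn.Module):
    """Hypothetical SS -> bottleneck -> VC cascade (illustrative sketch only)."""

    def __init__(self, ss_module: nn.Module, bnf_extractor: nn.Module, vc_module: nn.Module):
        super().__init__()
        self.ss = ss_module        # separates foreground speech from background sound
        self.bnf = bnf_extractor   # bottleneck features from the separated speech
        self.vc = vc_module        # converts speaker identity from bottleneck features

    def forward(self, noisy_mix: torch.Tensor, target_spk_emb: torch.Tensor):
        sep_speech, background = self.ss(noisy_mix)
        converted = self.vc(self.bnf(sep_speech), target_spk_emb)
        # Re-mix the converted speech with the preserved background sound.
        return converted + background, sep_speech, converted


def multitask_loss(model: CascadedVC,
                   noisy_mix: torch.Tensor,
                   clean_speech: torch.Tensor,       # ground-truth source speech
                   converted_ref: torch.Tensor,      # reference converted speech
                   converted_mix_ref: torch.Tensor,  # reference converted speech + background
                   target_spk_emb: torch.Tensor,
                   w_ss: float = 1.0, w_vc: float = 1.0, w_uni: float = 1.0) -> torch.Tensor:
    """SS task, VC task, and unified task share one reconstruction-loss form;
    the L1 criterion and equal weights are assumptions for illustration."""
    recon = nn.L1Loss()
    out_mix, sep_speech, converted = model(noisy_mix, target_spk_emb)
    loss_ss = recon(sep_speech, clean_speech)     # separation task
    loss_vc = recon(converted, converted_ref)     # conversion task
    loss_uni = recon(out_mix, converted_mix_ref)  # unified end-to-end task
    return w_ss * loss_ss + w_vc * loss_vc + w_uni * loss_uni
```

Training all three tasks against the same reconstruction-style loss is what lets the SS and VC modules adapt to each other instead of being cascaded as independently trained components.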
Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization
Audio-visual learning has demonstrated promising results in many classical
speech tasks (e.g., speech separation, automatic speech recognition, wake-word
spotting). We believe that introducing the visual modality will also benefit speaker diarization. To date, Target-Speaker Voice Activity Detection (TS-VAD) plays an important role in highly accurate speaker diarization. However, previous TS-VAD models take only audio features and rely on each speaker's acoustic footprint to distinguish his or her speech activity, which is easily affected by overlapped speech in multi-speaker scenarios. Although visual information naturally tolerates overlapped speech, it suffers from spatial occlusion, low resolution, and other degradations. This potential modality-missing problem has so far blocked the extension of TS-VAD to an audio-visual approach.
This paper proposes a novel Multi-Input Multi-Output Target-Speaker Voice
Activity Detection (MIMO-TSVAD) framework for speaker diarization. The proposed
method can take audio-visual input and leverage the speaker's acoustic
footprint or lip track to flexibly conduct audio-based, video-based, and
audio-visual speaker diarization in a unified sequence-to-sequence framework.
Experimental results show that the MIMO-TSVAD framework demonstrates
state-of-the-art performance on the VoxConverse, DIHARD-III, and MISP 2022
datasets under the corresponding evaluation metrics, obtaining Diarization Error Rates (DERs) of 4.18%, 10.10%, and 8.15%, respectively. In addition, it performs robustly in scenarios where lip tracks are heavily missing.
Comment: Under review at IEEE/ACM Transactions on Audio, Speech, and Language Processing
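As a rough sketch of how a multi-input multi-output interface of this kind might look, the toy module below accepts any subset of audio features, enrollment speaker embeddings, and lip tracks, and outputs per-frame, per-speaker activity probabilities. All layer sizes, fusion choices, and names are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn


class ToyMIMOTSVAD(nn.Module):
    """Illustrative multi-input multi-output TS-VAD-style module (not the paper's model).
    Missing modalities are replaced by learned placeholder embeddings."""

    def __init__(self, feat_dim=80, spk_dim=256, lip_dim=512, d_model=256,
                 n_heads=4, n_layers=4, max_speakers=8):
        super().__init__()
        self.audio_proj = nn.Linear(feat_dim, d_model)
        self.spk_proj = nn.Linear(spk_dim, d_model)
        self.lip_proj = nn.Linear(lip_dim, d_model)
        self.missing_spk = nn.Parameter(torch.zeros(d_model))  # used when no enrollment audio
        self.missing_lip = nn.Parameter(torch.zeros(d_model))  # used when no lip track
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.vad_head = nn.Linear(d_model, max_speakers)  # per-frame, per-speaker activity

    def forward(self, audio_feats, spk_embs=None, lip_feats=None):
        # audio_feats: (B, T, feat_dim); spk_embs: (B, S, spk_dim) or None;
        # lip_feats: (B, T, lip_dim) or None.
        x = self.audio_proj(audio_feats)
        if lip_feats is not None:
            x = x + self.lip_proj(lip_feats)      # frame-level audio-visual fusion
        else:
            x = x + self.missing_lip              # audio-only path
        if spk_embs is not None:
            # Condition on the target speakers via a pooled speaker summary.
            x = x + self.spk_proj(spk_embs).mean(dim=1, keepdim=True)
        else:
            x = x + self.missing_spk              # video-only path
        h = self.encoder(x)
        return torch.sigmoid(self.vad_head(h))    # (B, T, max_speakers) activity probabilities
```

The point of the sketch is the flexibility claim in the abstract: the same sequence-to-sequence backbone serves audio-based, video-based, and audio-visual diarization simply by which conditioning inputs are supplied.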
CMIR-NET : A Deep Learning Based Model For Cross-Modal Retrieval In Remote Sensing
We address the problem of cross-modal information retrieval in the domain of
remote sensing. In particular, we are interested in two application scenarios:
i) cross-modal retrieval between panchromatic (PAN) and multi-spectral imagery,
and ii) multi-label image retrieval between very high resolution (VHR) images
and speech-based label annotations. Note that these multi-modal retrieval scenarios are more challenging than traditional uni-modal retrieval, given the inherent differences in distributions between the modalities. However, with the growing availability of multi-source remote sensing data and the scarcity of semantic annotations, the task of multi-modal retrieval has recently become extremely important. In this regard, we propose a novel deep neural network based architecture which is designed to learn a discriminative shared feature space for all the input modalities, suitable for semantically coherent information retrieval. Extensive experiments are carried out on the benchmark large-scale PAN - multi-spectral DSRSID dataset and the multi-label UC-Merced dataset. For the UC-Merced dataset, we additionally generate a corpus of speech signals corresponding to the labels. Superior performance with respect to the current state of the art is observed in all cases.
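To illustrate what learning a discriminative shared feature space for cross-modal retrieval can look like, here is a minimal sketch with one encoder per modality and a symmetric contrastive alignment objective. The encoder shapes, the InfoNCE-style loss, and the function names are assumptions for illustration, not the published CMIR-NET design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Toy encoder mapping one modality (e.g., PAN image features or a speech-label
    embedding) into a shared space; layer sizes are illustrative assumptions."""

    def __init__(self, in_dim: int, shared_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, shared_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm shared embeddings


def cross_modal_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss that pulls paired samples from the two
    modalities together in the shared space (an assumed objective, used only
    to illustrate learning a discriminative shared feature space)."""
    logits = z_a @ z_b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def retrieve(query_emb: torch.Tensor, db_embs: torch.Tensor, top_k: int = 5):
    """At test time, embed a query from one modality and rank database items
    from the other modality by cosine similarity in the shared space."""
    scores = db_embs @ query_emb                         # cosine similarity (unit-norm inputs)
    return torch.topk(scores, k=top_k).indices
```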