4 research outputs found
Stream Attention for far-field multi-microphone ASR
A stream attention framework has been applied to the posterior probabilities of a deep neural network (DNN) to improve far-field automatic speech recognition (ASR) performance in a multi-microphone configuration. The stream attention scheme is realized through an attention vector, derived by predicting the ASR performance from the phoneme posterior distribution of each microphone stream, thereby focusing the recognizer's attention on the more reliable microphones. An investigation of various ASR performance measures has been carried out using a real recorded dataset. Experimental results show that the proposed framework yields substantial improvements in word error rate (WER).
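As a rough illustration of the idea (not the paper's implementation), the sketch below derives an attention vector over microphone streams from a hypothetical reliability measure, here the negative entropy of each stream's phoneme posteriors, and uses it to fuse the per-stream posteriors; the actual performance-prediction measures studied in the paper may differ.

```python
import torch

def stream_attention_combine(posteriors):
    """Fuse per-microphone DNN posteriors using an attention vector.

    posteriors: (num_streams, num_frames, num_phonemes), each slice the
    phoneme posterior distribution from one microphone stream.
    Hypothetical reliability score: negative mean entropy, so more
    confident (lower-entropy) streams receive higher attention.
    """
    eps = 1e-10
    entropy = -(posteriors * (posteriors + eps).log()).sum(dim=-1).mean(dim=-1)
    attention = torch.softmax(-entropy, dim=0)                 # (num_streams,)
    combined = (attention[:, None, None] * posteriors).sum(dim=0)
    return combined, attention

# Example: 4 microphone streams, 100 frames, 40 phoneme classes.
streams = torch.softmax(torch.randn(4, 100, 40), dim=-1)
fused, weights = stream_attention_combine(streams)
```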
Robust Multi-channel Speech Recognition using Frequency Aligned Network
Conventional speech enhancement technique such as beamforming has known
benefits for far-field speech recognition. Our own work in frequency-domain
multi-channel acoustic modeling has shown additional improvements by training a
spatial filtering layer jointly within an acoustic model. In this paper, we
further develop this idea and use frequency aligned network for robust
multi-channel automatic speech recognition (ASR). Unlike an affine layer in the
frequency domain, the proposed frequency aligned component prevents one
frequency bin influencing other frequency bins. We show that this modification
not only reduces the number of parameters in the model but also significantly
and improves the ASR performance. We investigate effects of frequency aligned
network through ASR experiments on the real-world far-field data where users
are interacting with an ASR system in uncontrolled acoustic environments. We
show that our multi-channel acoustic model with a frequency aligned network
shows up to 18% relative reduction in word error rate
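To make the contrast with a plain affine layer concrete, here is a minimal sketch of a per-bin ("frequency aligned") layer: each frequency bin combines the microphone channels with its own small filter, so no bin influences any other. The layer sizes and the einsum-based formulation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FrequencyAlignedLayer(nn.Module):
    """One small filter per frequency bin; channels are mixed, bins are not."""
    def __init__(self, num_channels, num_bins, out_per_bin):
        super().__init__()
        # (num_bins, out_per_bin, num_channels): independent weights per bin.
        self.weight = nn.Parameter(0.01 * torch.randn(num_bins, out_per_bin, num_channels))

    def forward(self, x):
        # x: (batch, num_channels, num_bins) real-valued frequency-domain features.
        # The shared index f keeps each bin's computation local to that bin.
        return torch.einsum('bcf,foc->bof', x, self.weight)

layer = FrequencyAlignedLayer(num_channels=7, num_bins=257, out_per_bin=1)
y = layer(torch.randn(8, 7, 257))        # -> (8, 1, 257)

# Hypothetical parameter counts (7 channels, 257 bins, 1 output per bin):
affine_params = (7 * 257) * 257          # full affine layer mixes all bins
aligned_params = 257 * 1 * 7             # frequency aligned: far fewer weights
```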
RAIM: Recurrent Attentive and Intensive Model of Multimodal Patient Monitoring Data
With the improvement of medical data capture, vast amounts of continuous patient monitoring data, e.g., electrocardiogram (ECG), real-time vital signs, and medications, have become available for clinical decision support at intensive care units (ICUs). However, it becomes increasingly challenging to model such data, due to the high density of the monitoring data, heterogeneous data types, and the requirement for interpretable models. Integrating these high-density monitoring data with discrete clinical events (including diagnoses, medications, and labs) is challenging but potentially rewarding, since the richness and granularity of such multimodal data increase the possibilities for accurately detecting complex problems and predicting outcomes (e.g., length of stay and mortality). We propose the Recurrent Attentive and Intensive Model (RAIM) for jointly analyzing continuous monitoring data and discrete clinical events. RAIM introduces an efficient attention mechanism for continuous monitoring data (e.g., ECG), which is guided by discrete clinical events (e.g., medication usage). We apply RAIM to predicting physiological decompensation and length of stay for critically ill patients in the ICU. With evaluations on the MIMIC-III Waveform Database Matched Subset, we obtain an AUC-ROC score of 90.18% for predicting decompensation and an accuracy of 86.82% for forecasting length of stay with our final model, which outperforms our six baseline models.
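The abstract does not spell out the attention mechanism, so the following is only a generic sketch of event-guided attention over continuous monitoring features: discrete event indicators (e.g., medication flags) modulate the attention scores over time steps of the continuous signal. All layer names and dimensions are assumptions; RAIM's actual recurrent, multi-channel design differs in its details.

```python
import torch
import torch.nn as nn

class EventGuidedAttention(nn.Module):
    """Attention over continuous monitoring features, guided by discrete events."""
    def __init__(self, signal_dim, event_dim, hidden_dim):
        super().__init__()
        self.signal_proj = nn.Linear(signal_dim, hidden_dim)
        self.event_proj = nn.Linear(event_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, signal_feats, event_feats):
        # signal_feats: (batch, time, signal_dim), e.g., windowed ECG features
        # event_feats:  (batch, event_dim), e.g., medication indicators
        guide = self.event_proj(event_feats).unsqueeze(1)          # (batch, 1, hidden)
        scores = self.score(torch.tanh(self.signal_proj(signal_feats) + guide))
        weights = torch.softmax(scores, dim=1)                     # attention over time
        context = (weights * signal_feats).sum(dim=1)              # (batch, signal_dim)
        return context, weights.squeeze(-1)
```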
Distilling Knowledge Using Parallel Data for Far-field Speech Recognition
To improve far-field speech recognition performance, this paper proposes to distill knowledge from a close-talking model to a far-field model using parallel data. The close-talking model is referred to as the teacher model, and the far-field model as the student model. The student model is trained to imitate the output distributions of the teacher model. This constraint can be realized by minimizing the Kullback-Leibler (KL) divergence between the output distributions of the student and teacher models. Experimental results on the AMI corpus show that the best student model achieves up to 4.7% absolute word error rate (WER) reduction compared with conventionally trained baseline models.
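A minimal sketch of the KL-divergence distillation objective described above, where the student sees far-field audio and the teacher sees the parallel close-talking audio; the temperature parameter and any interpolation with a hard-label cross-entropy loss are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between student and teacher output distributions.

    student_logits: student outputs on far-field audio, (batch, num_classes)
    teacher_logits: teacher outputs on parallel close-talking audio, same shape
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target;
    # 'batchmean' matches the mathematical definition of KL divergence.
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * temperature ** 2
```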