Earnings-21: A Practical Benchmark for ASR in the Wild
Commonly used speech corpora inadequately challenge academic and commercial
ASR systems. In particular, speech corpora lack metadata needed for detailed
analysis and WER measurement. In response, we present Earnings-21, a 39-hour
corpus of earnings calls containing entity-dense speech from nine different
financial sectors. This corpus is intended to benchmark ASR systems in the wild
with special attention towards named entity recognition. We benchmark four
commercial ASR models, two internal models built with open-source tools, and an
open-source LibriSpeech model and discuss their differences in performance on
Earnings-21. Using our recently released fstalign tool, we provide a candid
analysis of each model's recognition capabilities under different partitions.
Our analysis finds that ASR accuracy for certain NER categories is poor,
presenting a significant impediment to transcript comprehension and usage.
Earnings-21 bridges academic and commercial ASR system evaluation and enables
further research on entity modeling and WER on real-world audio.
Comment: Accepted to INTERSPEECH 2021. June 15, 2021: Addressing the comments
of reviewers and updating the results of our internal ESPNet model. The
results do not change our conclusions. April 28th, 2021: We found and
resolved an issue in our experimental evaluation that scored the LibriSpeech
model at ~20% worse relative WER than the actual WER. The updated results do
not affect our conclusions.
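The abstract above reports results as word error rate (WER) and mentions a relative WER difference. As a rough illustration only, WER can be computed as the word-level edit distance between a reference and a hypothesis transcript, divided by the reference length. The sketch below is not the fstalign tool and does not reproduce the paper's scoring pipeline; the function names and the toy transcripts are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_change(wer_measured: float, wer_actual: float) -> float:
    """Relative difference of a measured WER with respect to the actual WER."""
    return (wer_measured - wer_actual) / wer_actual


# Toy example (made-up transcripts, not data from the corpus).
wer = word_error_rate("revenue grew nine percent", "revenue grew five percent")
```

With these toy inputs, one substituted word out of four gives a WER of 0.25. A measured WER of 0.12 against an actual WER of 0.10 would be about 20% worse in relative terms; these numbers are illustrative and are not taken from the paper.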
Utterance-level Permutation Invariant Training with Latency-controlled BLSTM for Single-channel Multi-talker Speech Separation
Utterance-level permutation invariant training (uPIT) has achieved promising
progress on the single-channel multi-talker speech separation task. Long short-term
memory (LSTM) and bidirectional LSTM (BLSTM) are widely used as the separation
networks of uPIT, i.e. uPIT-LSTM and uPIT-BLSTM. uPIT-LSTM has lower latency
but worse performance, while uPIT-BLSTM has better performance but higher
latency. In this paper, we propose using latency-controlled BLSTM (LC-BLSTM)
during inference to achieve low-latency, high-quality speech separation.
To find a better training strategy for the BLSTM-based separation network,
chunk-level PIT (cPIT) and uPIT are compared. The experimental results show
that uPIT outperforms cPIT when LC-BLSTM is used during inference. It is also
found that inter-chunk speaker tracing (ST) can further improve the
separation performance of uPIT-LC-BLSTM. Evaluated on the WSJ0 two-talker
mixed-speech separation task, the absolute gap of signal-to-distortion ratio
(SDR) between uPIT-BLSTM and uPIT-LC-BLSTM is reduced to within 0.7 dB.
Comment: Proceedings of APSIPA Annual Summit and Conference 2019, 18-21
November 2019, Lanzhou, China
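The difference between utterance-level and chunk-level PIT in the abstract above can be sketched with a toy loss. This is a minimal illustration under strong assumptions: each source is a list of scalar frame values and the error is a plain mean squared error, whereas the paper works with network outputs on real mixtures. The function names are hypothetical and this is not the authors' implementation.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length frame sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def upit_loss(estimates, targets):
    """Utterance-level PIT: pick ONE output-to-speaker assignment for the
    whole sequence, namely the permutation with the lowest average error."""
    n_sources = len(targets)
    best_loss, best_perm = None, None
    for perm in permutations(range(n_sources)):
        loss = sum(mse(estimates[s], targets[perm[s]])
                   for s in range(n_sources)) / n_sources
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def cpit_loss(estimates, targets, chunk_size):
    """Chunk-level PIT: choose the best permutation independently per chunk,
    so the assignment may change from one chunk to the next."""
    n_frames = len(targets[0])
    total = 0.0
    for start in range(0, n_frames, chunk_size):
        end = start + chunk_size
        est_chunk = [e[start:end] for e in estimates]
        tgt_chunk = [t[start:end] for t in targets]
        loss, _ = upit_loss(est_chunk, tgt_chunk)
        total += loss * len(tgt_chunk[0])  # weight by chunk length
    return total / n_frames


# Toy case: the two outputs swap speakers halfway through the utterance.
targets = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
estimates = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
utterance_loss, utterance_perm = upit_loss(estimates, targets)
chunk_loss = cpit_loss(estimates, targets, chunk_size=2)
```

In this toy case the chunk-level loss is zero because each chunk is free to pick its own assignment, while the utterance-level loss penalizes the mid-utterance swap. That per-chunk freedom is one way to see why chunk-wise processing needs some form of speaker tracing to keep outputs consistent across chunks.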
Deep Learning for Human Affect Recognition: Insights and New Developments
Automatic human affect recognition is a key step towards more natural
human-computer interaction. Recent trends include recognition in the wild using
a fusion of audiovisual and physiological sensors, a challenging setting for
conventional machine learning algorithms. Since 2010, novel deep learning
algorithms have been applied increasingly in this field. In this paper, we
review the literature on human affect recognition between 2010 and 2017, with a
special focus on approaches using deep neural networks. By classifying a total
of 950 studies according to their usage of shallow or deep architectures, we
are able to show a trend towards deep learning. Reviewing a subset of 233
studies that employ deep neural networks, we comprehensively quantify their
applications in this field. We find that deep learning is used for learning of
(i) spatial feature representations, (ii) temporal feature representations, and
(iii) joint feature representations for multimodal sensor data. Exemplary
state-of-the-art architectures illustrate the progress. Our findings show the
role deep architectures will play in human affect recognition, and can serve as
a reference point for researchers working on related applications.
Comment: To be published in IEEE Transactions on Affective Computing. 20
pages, 7 figures, 6 tables