Attention-Based End-to-End Speech Recognition on Voice Search
Recently, there has been a growing interest in end-to-end speech recognition
that directly transcribes speech to text without any predefined alignments. In
this paper, we explore the use of an attention-based encoder-decoder model for
Mandarin speech recognition on a voice search task. Previous attempts have
shown that applying attention-based encoder-decoder models to Mandarin speech
recognition was quite difficult due to the logographic orthography of Mandarin,
the large vocabulary and the conditional dependency of the attention model. In
this paper, we use character embedding to deal with the large vocabulary.
Several tricks are used for effective model training, including L2
regularization, Gaussian weight noise and frame skipping. We compare two
attention mechanisms and use attention smoothing to cover long context in the
attention model. Taken together, these tricks allow us to finally achieve a
character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on
the MiTV voice search dataset. Combined with a trigram language model, the
CER and SER reach 2.81% and 5.77%, respectively.
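The two metrics quoted in this abstract can be illustrated with a minimal sketch. It assumes the usual definitions: CER is the total character-level edit distance divided by the total number of reference characters, and SER is the fraction of utterances whose hypothesis does not match the reference exactly. The function names are hypothetical and are not taken from the paper, whose scoring tool may differ in detail (for example, in text normalization).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Total character edits divided by total reference characters (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def sentence_error_rate(refs, hyps):
    """Fraction of utterances with at least one error (assumed definition)."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)
```

Because every character is a token, the same edit-distance routine works unchanged on Mandarin strings.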
Chinese Spoken Document Summarization Using Probabilistic Latent Topical Information
The purpose of extractive summarization is to automatically select a number of indicative sentences, passages, or paragraphs from the original document according to a target summarization ratio and then sequence them to form a concise summary. In the paper, we proposed the use of probabilistic latent topical information for extractive summarization of spoken documents. Various kinds of modeling structures and learning approaches were extensively investigated. In addition, the summarization capabilities were verified by comparison with the conventional vector space model and latent semantic indexing model, as well as the HMM model. The experiments were performed on the Chinese broadcast news collected in Taiwan. Noticeable performance gains were obtained.
Order-Preserving Abstractive Summarization for Spoken Content Based on Connectionist Temporal Classification
Connectionist temporal classification (CTC) is a powerful approach for
sequence-to-sequence learning, and has been popularly used in speech
recognition. The central ideas of CTC include adding a label "blank" during
training. With this mechanism, CTC eliminates the need for segment alignment,
and hence has been applied to various sequence-to-sequence learning problems.
In this work, we applied CTC to abstractive summarization for spoken content.
The "blank" in this case implies the corresponding input data are less
important or noisy; thus it can be ignored. This approach was shown to
outperform the existing methods in terms of ROUGE scores over Chinese Gigaword
and MATBN corpora. This approach also has the nice property that the ordering
of words or characters in the input documents can be better preserved in the
generated summaries. Comment: Accepted by Interspeech 201
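The role of the "blank" label can be illustrated with a minimal sketch of the standard CTC collapsing rule: merge consecutive repeated labels, then drop blanks. The symbol `_` for blank and the function name `ctc_collapse` are assumptions made for this illustration. The sketch covers only the output-side mapping, not the training loss or the summarization model described in the abstract.

```python
from itertools import groupby

BLANK = "_"  # assumed symbol for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a per-position label sequence to an output sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove blank labels.
    A blank between two identical labels keeps them as separate outputs.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]
```

For example, `ctc_collapse(list("__aa_ab__b"))` yields `['a', 'a', 'b', 'b']`. Positions labeled blank contribute nothing to the output, and the surviving labels keep their input order, which is consistent with the order-preserving property the abstract mentions.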
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics, namely: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the algorithm advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved.
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
English Broadcast News Speech Recognition by Humans and Machines
With recent advances in deep learning, considerable attention has been given
to achieving automatic speech recognition performance close to human
performance on tasks like conversational telephone speech (CTS) recognition. In
this paper we evaluate the usefulness of these proposed techniques on broadcast
news (BN), a similarly challenging task. We also perform a set of recognition
measurements to understand how close the achieved automatic speech recognition
results are to human performance on this task. On two publicly available BN
test sets, DEV04F and RT04, our speech recognition system using LSTM and
residual network based acoustic models with a combination of n-gram and neural
network language models performs at 6.5% and 5.9% word error rate. By achieving
new performance milestones on these test sets, our experiments show that
techniques developed on other related tasks, like CTS, can be transferred to
achieve similar performance. In contrast, the best measured human word error
rate on these test sets is much lower, at 3.6% and 2.8% respectively,
indicating that there is still room for new techniques and improvements in this
space, to reach human performance levels. Comment: © 2019 IEEE.