Modeling musicological information as trigrams in a system for simultaneous chord and local key extraction
In this paper, we discuss the introduction of a trigram musicological model in a simultaneous chord and local key extraction system. By enlarging the context of the musicological model, we hoped to achieve a higher accuracy that could justify the associated higher complexity and computational load of the search for the optimal solution. Experiments on multiple data sets have demonstrated that the trigram model does indeed have a larger predictive power (a lower perplexity). This increased predictive power resulted in an improvement in the key extraction capabilities, but no improvement in chord extraction when compared to a system with a bigram musicological model.
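The abstract equates larger predictive power with lower perplexity. A minimal sketch of that measure follows; the probability values are invented for illustration and are not taken from the paper, and the function name `perplexity` is an assumption of this example.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence, given the probability the model
    assigned to each observed symbol (here: each chord)."""
    log_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log_sum / len(probs))

# Hypothetical probabilities two models assign to the same chord sequence.
bigram_probs = [0.25, 0.25, 0.25, 0.25]
trigram_probs = [0.5, 0.5, 0.5, 0.5]

print(perplexity(bigram_probs))   # higher value: less predictive model
print(perplexity(trigram_probs))  # lower value: more predictive model
```

A model that assigns higher probability to what actually occurs gets a lower perplexity, which is the sense in which the trigram model is reported to predict better.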
Lessons from Building Acoustic Models with a Million Hours of Speech
This is a report of the lessons we learned building acoustic models from 1
million hours of unlabeled speech, while labeled speech is restricted to 7,000
hours. We employ student/teacher training on unlabeled data, helping scale out
target generation in comparison to confidence model based methods, which
require a decoder and a confidence model. To optimize storage and to
parallelize target generation, we store high valued logits from the teacher
model. Introducing the notion of scheduled learning, we interleave learning on
unlabeled and labeled data. To scale distributed training across a large number
of GPUs, we use BMUF with 64 GPUs, while performing sequence training only on
labeled data with gradient threshold compression SGD using 16 GPUs. Our
experiments show that extremely large amounts of data are indeed useful; with
little hyper-parameter tuning, we obtain relative WER improvements in the 10 to
20% range, with higher gains in noisier conditions.
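The acoustic-model abstract above mentions storing only the high-valued teacher logits as targets for the student. A rough sketch of that idea is given below. The abstract does not say how the stored logits are turned back into targets, so the renormalised softmax over the kept entries is an assumption of this example, as are the names `top_k_logits`, `soft_targets`, and `cross_entropy`.

```python
import heapq
import math

def top_k_logits(logits, k):
    """Keep the k largest teacher logits as (class_index, value) pairs."""
    return heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])

def soft_targets(stored, num_classes):
    """Assumed reconstruction: softmax over the stored logits only,
    with zero probability for every class that was not stored."""
    max_logit = max(value for _, value in stored)
    exps = {index: math.exp(value - max_logit) for index, value in stored}
    total = sum(exps.values())
    targets = [0.0] * num_classes
    for index, e in exps.items():
        targets[index] = e / total
    return targets

def cross_entropy(targets, student_logits):
    """Cross-entropy between the soft targets and the student's softmax."""
    max_logit = max(student_logits)
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in student_logits))
    return -sum(t * (x - log_z) for t, x in zip(targets, student_logits) if t > 0)

teacher_logits = [1.0, 5.0, 3.0, -2.0]   # hypothetical teacher output for one frame
stored = top_k_logits(teacher_logits, k=2)
targets = soft_targets(stored, num_classes=len(teacher_logits))
print(stored)
print(cross_entropy(targets, [0.0, 0.0, 0.0, 0.0]))
```

Keeping k entries per frame instead of the full output vector is what reduces storage; the student loss only needs the stored indices and values.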
The Microsoft 2016 Conversational Speech Recognition System
We describe Microsoft's conversational speech recognition system, in which we
combine recent developments in neural-network-based acoustic and language
modeling to advance the state of the art on the Switchboard recognition task.
Inspired by machine learning ensemble techniques, the system uses a range of
convolutional and recurrent neural networks. I-vector modeling and lattice-free
MMI training provide significant gains for all acoustic model architectures.
Language model rescoring with multiple forward and backward running RNNLMs, and
word posterior-based system combination provide a 20% boost. The best single
system uses a ResNet architecture acoustic model with RNNLM rescoring, and
achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The
combined system has an error rate of 6.2%, representing an improvement over
previously reported results on this benchmark task.
Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech
We describe a statistical approach for modeling dialogue acts in
conversational speech, i.e., speech-act-like units such as Statement, Question,
Backchannel, Agreement, Disagreement, and Apology. Our model detects and
predicts dialogue acts based on lexical, collocational, and prosodic cues, as
well as on the discourse coherence of the dialogue act sequence. The dialogue
model is based on treating the discourse structure of a conversation as a
hidden Markov model and the individual dialogue acts as observations emanating
from the model states. Constraints on the likely sequence of dialogue acts are
modeled via a dialogue act n-gram. The statistical dialogue grammar is combined
with word n-grams, decision trees, and neural networks modeling the
idiosyncratic lexical and prosodic manifestations of each dialogue act. We
develop a probabilistic integration of speech recognition with dialogue
modeling, to improve both speech recognition and dialogue act classification
accuracy. Models are trained and evaluated using a large hand-labeled database
of 1,155 conversations from the Switchboard corpus of spontaneous
human-to-human telephone speech. We achieved good dialogue act labeling
accuracy (65% based on errorful, automatically recognized words and prosody,
and 71% based on word transcripts, compared to a chance baseline accuracy of
35% and human accuracy of 84%) and a small reduction in word recognition error.
Comment: 35 pages, 5 figures. Changes in copy editing (note title spelling changed
RNN Language Model with Word Clustering and Class-based Output Layer
The recurrent neural network language model (RNNLM) has shown significant promise for statistical language modeling. In this work, a new class-based output layer method is introduced to further improve the RNNLM. In this method, word class information is incorporated into the output layer by using the Brown clustering algorithm to estimate a class-based language model. Experimental results show that the new output layer with word clustering not only noticeably improves convergence but also reduces the perplexity and word error rate in large vocabulary continuous speech recognition.
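A class-based output layer is commonly written as the factorisation P(w | h) = P(c(w) | h) · P(w | c(w), h), so that each softmax runs over far fewer entries than the full vocabulary. The sketch below illustrates that general factorisation only; it is not the paper's specific layer. The word classes are hand-assigned here rather than produced by Brown clustering, and all names and scores are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_probability(word, word_to_class, class_members, class_scores, word_scores):
    """P(word | history) = P(class | history) * P(word | class, history)."""
    cls = word_to_class[word]
    p_class = softmax(class_scores)[cls]
    members = class_members[cls]
    within_class = softmax([word_scores[w] for w in members])
    return p_class * within_class[members.index(word)]

# Hypothetical two-class vocabulary; a real system would obtain the classes
# from a clustering algorithm and the scores from the network's hidden state.
class_members = {0: ["the", "a"], 1: ["cat", "dog", "bird"]}
word_to_class = {w: c for c, ws in class_members.items() for w in ws}
class_scores = [0.2, 1.1]
word_scores = {"the": 0.5, "a": -0.3, "cat": 1.0, "dog": 0.7, "bird": -1.2}

total = sum(
    word_probability(w, word_to_class, class_members, class_scores, word_scores)
    for w in word_to_class
)
print(total)
```

Because the class distribution and each within-class distribution are normalised separately, the word probabilities still sum to one over the whole vocabulary.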
Variable Word Rate N-grams
The rate of occurrence of words is not uniform but varies from document to
document. Despite this observation, parameters for conventional n-gram language
models are usually derived using the assumption of a constant word rate. In
this paper we investigate the use of a variable word rate assumption, modelled by
a Poisson distribution or a continuous mixture of Poissons. We present an
approach to estimating the relative frequencies of words or n-grams taking
prior information of their occurrences into account. Discounting and smoothing
schemes are also considered. Using the Broadcast News task, the approach
demonstrates a reduction of perplexity up to 10%.
Comment: 4 pages, 4 figures, ICASSP-200