Towards a tool for the subjective assessment of speech system interfaces (SASSI)
Applications of speech recognition are now widespread, but user-centred evaluation methods are necessary to ensure their success. Objective evaluation techniques are fairly well established, but previous subjective techniques have been unstructured and unproven. This paper reports on the first stage of the development of a questionnaire measure for the Subjective Assessment of Speech System Interfaces (SASSI). The aim of the research programme is to produce a valid, reliable and sensitive measure of users' subjective experiences with speech recognition systems. Such a technique could make an important contribution to theory and practice in the design and evaluation of speech recognition systems according to best human factors practice. A prototype questionnaire was designed, based on established measures for evaluating the usability of other kinds of user interface, and on a review of the research literature on speech system design. It consisted of 50 statements with which respondents rated their level of agreement. The questionnaire was given to users of four different speech applications, and an Exploratory Factor Analysis of 214 completed questionnaires was conducted. This suggested the presence of six main factors in users' perceptions of speech systems: System Response Accuracy, Likeability, Cognitive Demand, Annoyance, Habitability and Speed. The six factors have face validity and a reasonable level of statistical reliability. The findings form a useful theoretical and practical basis for the subjective evaluation of any speech recognition interface. However, further work is recommended to establish the validity and sensitivity of the approach before a final tool can be produced that warrants general use.
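The factor-extraction step described above can be reproduced in outline with standard tools. A minimal sketch, assuming synthetic 7-point Likert responses and a varimax rotation (the abstract does not state the extraction or rotation method actually used):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical stand-in for the real data: 214 questionnaires x 50 statements,
# each rated on a 7-point agreement scale.
rng = np.random.default_rng(0)
responses = rng.integers(1, 8, size=(214, 50)).astype(float)

# Standardise items so every statement contributes on the same scale.
responses = (responses - responses.mean(axis=0)) / responses.std(axis=0)

# Extract six factors, mirroring the six dimensions SASSI reports.
fa = FactorAnalysis(n_components=6, rotation="varimax")  # sklearn >= 0.24
fa.fit(responses)

# List the statements loading strongly (|loading| > 0.4) on each factor.
for k, loadings in enumerate(fa.components_):
    items = np.flatnonzero(np.abs(loadings) > 0.4)
    print(f"Factor {k + 1}: statements {items.tolist()}")
```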
Speech Recognition on an FPGA Using Discrete and Continuous Hidden Markov Models
Speech recognition is a computationally demanding task, particularly the stage which uses Viterbi decoding to convert pre-processed speech data into words or sub-word units. Any device that can reduce the load on, for example, a PC's processor is advantageous. Hence we present FPGA implementations of the decoder based on discrete and continuous hidden Markov models (HMMs) representing monophones, and demonstrate that the discrete version can process speech nearly 5,000 times faster than real time, using just 12% of the slices of a Xilinx Virtex XCV1000, but with a lower recognition rate than the continuous implementation, which runs 75 times faster than real time and occupies 45% of the same device.
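For orientation, the computation both FPGA designs accelerate is the standard Viterbi recursion. A minimal log-domain sketch over a toy discrete HMM (the hardware pipelines the same recursion; state and symbol counts here are illustrative):

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely state path through a discrete HMM, in the log domain.

    obs:    sequence of observation symbol indices
    log_A:  log transition matrix, shape (S, S)
    log_B:  log emission matrix, shape (S, V)
    log_pi: log initial state distribution, shape (S,)
    """
    S, T = log_A.shape[0], len(obs)
    delta = np.empty((T, S))            # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # backtrace the best path
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 3-symbol model
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], log_A, log_B, log_pi))  # [0, 0, 1]
```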
The Microsoft 2017 Conversational Speech Recognition System
We describe the 2017 version of Microsoft's conversational speech recognition
system, in which we update our 2016 system with recent developments in
neural-network-based acoustic and language modeling to further advance the
state of the art on the Switchboard speech recognition task. The system adds a
CNN-BLSTM acoustic model to the set of model architectures we combined
previously, and includes character-based and dialog session aware LSTM language
models in rescoring. For system combination we adopt a two-stage approach,
whereby subsets of acoustic models are first combined at the senone/frame
level, followed by a word-level voting via confusion networks. We also added a
confusion network rescoring step after system combination. The resulting system
yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
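The two-stage combination can be illustrated schematically. The sketch below is not the paper's confusion-network code: it averages senone posteriors at the frame level, then uses a simple majority vote over pre-aligned word hypotheses as a stand-in for confusion-network voting; all names and the alignment are hypothetical.

```python
import numpy as np
from collections import Counter

def combine_frame_level(posteriors, weights=None):
    """Stage 1: weighted average of senone posteriors from several
    acoustic models; each array has shape (frames, senones)."""
    weights = weights or [1.0 / len(posteriors)] * len(posteriors)
    return sum(w * p for w, p in zip(weights, posteriors))

def vote_word_level(hypotheses):
    """Stage 2 stand-in: majority vote at each slot of pre-aligned word
    hypotheses (confusion-network voting generalises this to competing
    words weighted by their posteriors)."""
    return [Counter(slot).most_common(1)[0][0] for slot in zip(*hypotheses)]

# Hypothetical aligned outputs from three combined subsystems
hyps = [["i", "see", "the", "cat"],
        ["i", "sees", "the", "cat"],
        ["i", "see", "a", "cat"]]
print(vote_word_level(hyps))  # ['i', 'see', 'the', 'cat']
```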
The Microsoft 2016 Conversational Speech Recognition System
We describe Microsoft's conversational speech recognition system, in which we
combine recent developments in neural-network-based acoustic and language
modeling to advance the state of the art on the Switchboard recognition task.
Inspired by machine learning ensemble techniques, the system uses a range of
convolutional and recurrent neural networks. I-vector modeling and lattice-free
MMI training provide significant gains for all acoustic model architectures.
Language model rescoring with multiple forward and backward running RNNLMs, and
word posterior-based system combination provide a 20% boost. The best single
system uses a ResNet architecture acoustic model with RNNLM rescoring, and
achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The
combined system has an error rate of 6.2%, representing an improvement over
previously reported results on this benchmark task.
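As a rough illustration of the rescoring step, the sketch below re-ranks an n-best list by interpolating the acoustic score with the average of forward and backward RNNLM log-probabilities; the field names, scores, and weight are illustrative assumptions, not the paper's values.

```python
def rescore_nbest(nbest, lm_weight=0.8):
    """Re-rank an n-best list by interpolating the acoustic score with the
    average of forward and backward RNNLM scores (all log-probabilities)."""
    def total(h):
        lm = 0.5 * (h["fwd_lm"] + h["bwd_lm"])  # average the two directions
        return h["am_score"] + lm_weight * lm
    return max(nbest, key=total)

# Hypothetical 2-best list
nbest = [
    {"words": "the cat sat", "am_score": -10.2, "fwd_lm": -4.1, "bwd_lm": -4.3},
    {"words": "the cat sad", "am_score": -10.0, "fwd_lm": -6.7, "bwd_lm": -6.9},
]
print(rescore_nbest(nbest)["words"])  # 'the cat sat'
```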
Evaluating Competing Agent Strategies for a Voice Email Agent
This paper reports experimental results comparing a mixed-initiative to a
system-initiative dialog strategy in the context of a personal voice email
agent. To independently test the effects of dialog strategy and user expertise,
users interact with either the system-initiative or the mixed-initiative agent
to perform three successive tasks which are identical for both agents. We
report performance comparisons across agent strategies as well as over tasks.
This evaluation utilizes and tests the PARADISE evaluation framework, and
discusses the performance function derivable from the experimental data.
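For readers unfamiliar with PARADISE, its performance function combines a normalised task-success measure (the kappa statistic) with a weighted sum of normalised dialogue costs, the weights being fitted by regression against user satisfaction. A minimal sketch of that function under this reading, with all statistics and weights supplied by the caller as illustrative inputs:

```python
def paradise_performance(kappa, costs, alpha, weights, kappa_stats, cost_stats):
    """Performance = alpha * N(kappa) - sum_i w_i * N(c_i), where N is a
    Z-score normalisation against per-corpus (mean, stdev) statistics."""
    def z(x, stats):
        mean, sd = stats
        return (x - mean) / sd
    success = alpha * z(kappa, kappa_stats)
    cost = sum(w * z(c, s) for w, c, s in zip(weights, costs, cost_stats))
    return success - cost

# Illustrative numbers only: kappa = 0.8, two costs (turns, elapsed seconds)
score = paradise_performance(
    kappa=0.8, costs=[12, 95.0], alpha=0.5, weights=[0.3, 0.2],
    kappa_stats=(0.6, 0.2), cost_stats=[(15, 5), (120.0, 30.0)],
)
print(f"{score:.2f}")  # 0.85
```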
Learning An Invariant Speech Representation
Recognition of speech, and in particular the ability to generalize and learn
from small sets of labelled examples like humans do, depends on an appropriate
representation of the acoustic input. We formulate the problem of finding
robust speech features for supervised learning with small sample complexity as
a problem of learning representations of the signal that are maximally
invariant to intraclass transformations and deformations. We propose an
extension of a theory for unsupervised learning of invariant visual
representations to the auditory domain and empirically evaluate its validity
for voiced speech sound classification. Our version of the theory requires the
memory-based, unsupervised storage of acoustic templates -- such as specific
phones or words -- together with all the transformations of each that normally
occur. A quasi-invariant representation for a speech segment can be obtained by
projecting it to each template orbit, i.e., the set of transformed signals, and
computing the associated one-dimensional empirical probability distributions.
The computations can be performed by modules of filtering and pooling, and
extended to hierarchical architectures. In this paper, we apply a single-layer,
multicomponent representation for phonemes and demonstrate improved accuracy
and decreased sample complexity for vowel classification compared to standard
spectral, cepstral and perceptual features.
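The projection-and-pooling step the abstract describes can be sketched directly: compute normalised dot products of the input segment against every stored transformation of a template (filtering), then pool them into a one-dimensional histogram (pooling). The normalisation and bin choices below are illustrative assumptions.

```python
import numpy as np

def orbit_signature(x, orbit, bins=20):
    """Project a signal onto one template orbit and pool the projections
    into a 1-D empirical distribution (histogram).

    x:     input segment, shape (d,)
    orbit: transformed versions of one template, shape (n_transforms, d)
    """
    x = x / (np.linalg.norm(x) + 1e-12)
    orbit = orbit / (np.linalg.norm(orbit, axis=1, keepdims=True) + 1e-12)
    projections = orbit @ x                      # "filtering" step
    hist, _ = np.histogram(projections, bins=bins, range=(-1, 1))
    return hist / hist.sum()                     # "pooling" step

def represent(x, orbits, bins=20):
    """Representation: concatenated signatures over several template orbits."""
    return np.concatenate([orbit_signature(x, o, bins) for o in orbits])
```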