329 research outputs found
Discriminative Segmental Cascades for Feature-Rich Phone Recognition
Discriminative segmental models, such as segmental conditional random fields
(SCRFs) and segmental structured support vector machines (SSVMs), have had
success in speech recognition via both lattice rescoring and first-pass
decoding. However, such models suffer from slow decoding, hampering the use of
computationally expensive features, such as segment neural networks or other
high-order features. A typical solution is to use approximate decoding, either
by beam pruning in a single pass or by beam pruning to generate a lattice
followed by a second pass. In this work, we study discriminative segmental
models trained with a hinge loss (i.e., segmental structured SVMs). We show
that beam search is not suitable for learning rescoring models in this
approach, though it gives good approximate decoding performance when the model
is already well-trained. Instead, we consider an approach inspired by
structured prediction cascades, which use max-marginal pruning to generate
lattices. We obtain a high-accuracy phonetic recognition system with several
expensive feature types: a segment neural network, a second-order language
model, and second-order phone boundary features
Time and frequency domain algorithms for speech coding
The promise of digital hardware economies (due to recent advances in
VLSI technology), has focussed much attention on more complex and sophisticated
speech coding algorithms which offer improved quality at relatively
low bit rates.
This thesis describes the results (obtained from computer simulations)
of research into various efficient (time and frequency domain) speech
encoders operating at a transmission bit rate of 16 Kbps.
In the time domain, Adaptive Differential Pulse Code Modulation (ADPCM)
systems employing both forward and backward adaptive prediction were
examined. A number of algorithms were proposed and evaluated, including
several variants of the Stochastic Approximation Predictor (SAP). A
Backward Block Adaptive (BBA) predictor was also developed and found to
outperform the conventional stochastic methods, even though its complexity
in terms of signal processing requirements is lower. A simplified
Adaptive Predictive Coder (APC) employing a single tap pitch predictor
considered next provided a slight improvement in performance over ADPCM,
but with rather greater complexity.
The ultimate test of any speech coding system is the perceptual performance
of the received speech. Recent research has indicated that this
may be enhanced by suitable control of the noise spectrum according to
the theory of auditory masking. Various noise shaping ADPCM
configurations were examined, and it was demonstrated that a proposed
pre-/post-filtering arrangement which exploits advantageously the
predictor-quantizer interaction, leads to the best subjective
performance in both forward and backward prediction systems.
Adaptive quantization is instrumental to the performance of ADPCM systems.
Both the forward adaptive quantizer (AQF) and the backward oneword
memory adaptation (AQJ) were examined. In addition, a novel method
of decreasing quantization noise in ADPCM-AQJ coders, which involves the
application of correction to the decoded speech samples, provided
reduced output noise across the spectrum, with considerable high frequency
noise suppression.
More powerful (and inevitably more complex) frequency domain speech
coders such as the Adaptive Transform Coder (ATC) and the Sub-band Coder
(SBC) offer good quality speech at 16 Kbps. To reduce complexity and
coding delay, whilst retaining the advantage of sub-band coding, a novel
transform based split-band coder (TSBC) was developed and found to compare
closely in performance with the SBC.
To prevent the heavy side information requirement associated with a
large number of bands in split-band coding schemes from impairing coding
accuracy, without forgoing the efficiency provided by adaptive bit
allocation, a method employing AQJs to code the sub-band signals together
with vector quantization of the bit allocation patterns was also
proposed.
Finally, 'pipeline' methods of bit allocation and step size estimation
(using the Fast Fourier Transform (FFT) on the input signal) were examined.
Such methods, although less accurate, are nevertheless useful in
limiting coding delay associated with SRC schemes employing Quadrature
Mirror Filters (QMF)
Finding acoustic regularities in speech : applications to phonetic recognition
Also issued as Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1988.Includes bibliographical references.Supported by the National Sciences and Engineering Research Council of Canada and the Defense Advanced Research Projects Agency.James Robert Glass
Motion Compatibility for Indoor Localization
Indoor localization -- a device's ability to determine its location within an extended indoor environment -- is a fundamental enabling capability for mobile context-aware applications. Many proposed applications assume localization information from GPS, or from WiFi access points. However, GPS fails indoors and in urban canyons, and current WiFi-based methods require an expensive, and manually intensive, mapping, calibration, and configuration process performed by skilled technicians to bring the system online for end users. We describe a method that estimates indoor location with respect to a prior map consisting of a set of 2D floorplans linked through horizontal and vertical adjacencies. Our main contribution is the notion of "path compatibility," in which the sequential output of a classifier of inertial data producing low-level motion estimates (standing still, walking straight, going upstairs, turning left etc.) is examined for agreement with the prior map. Path compatibility is encoded in an HMM-based matching model, from which the method recovers the user s location trajectory from the low-level motion estimates. To recognize user motions, we present a motion labeling algorithm, extracting fine-grained user motions from sensor data of handheld mobile devices. We propose "feature templates," which allows the motion classifier to learn the optimal window size for a specific combination of a motion and a sensor feature function. We show that, using only proprioceptive data of the quality typically available on a modern smartphone, our motion labeling algorithm classifies user motions with 94.5% accuracy, and our trajectory matching algorithm can recover the user's location to within 5 meters on average after one minute of movements from an unknown starting location. Prior information, such as a known starting floor, further decreases the time required to obtain precise location estimate
- …