2,174 research outputs found
Analyzing analytical methods: The case of phonology in neural models of spoken language
Given the fast development of analysis techniques for NLP and speech
processing systems, few systematic studies have been conducted to compare the
strengths and weaknesses of each method. As a step in this direction we study
the case of representations of phonology in neural network models of spoken
language. We use two commonly applied analytical techniques, diagnostic
classifiers and representational similarity analysis, to quantify to what
extent neural activation patterns encode phonemes and phoneme sequences. We
manipulate two factors that can affect the outcome of analysis. First, we
investigate the role of learning by comparing neural activations extracted from
trained versus randomly-initialized models. Second, we examine the temporal
scope of the activations by probing both local activations corresponding to a
few milliseconds of the speech signal, and global activations pooled over the
whole utterance. We conclude that reporting analysis results with randomly
initialized models is crucial, and that global-scope methods tend to yield more
consistent results and we recommend their use as a complement to local-scope
diagnostic methods.Comment: ACL 202
A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.5 page(s
The Zero Resource Speech Challenge 2017
We describe a new challenge aimed at discovering subword and word units from
raw speech. This challenge is the followup to the Zero Resource Speech
Challenge 2015. It aims at constructing systems that generalize across
languages and adapt to new speakers. The design features and evaluation metrics
of the challenge are presented and the results of seventeen models are
discussed.Comment: IEEE ASRU (Automatic Speech Recognition and Understanding) 2017.
Okinawa, Japa
Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Neural models have become ubiquitous in automatic speech recognition systems.
While neural networks are typically used as acoustic models in more complex
systems, recent studies have explored end-to-end speech recognition systems
based on neural networks, which can be trained to directly predict text from
input acoustic features. Although such systems are conceptually elegant and
simpler than traditional systems, it is less obvious how to interpret the
trained models. In this work, we analyze the speech representations learned by
a deep end-to-end model that is based on convolutional and recurrent layers,
and trained with a connectionist temporal classification (CTC) loss. We use a
pre-trained model to generate frame-level features which are given to a
classifier that is trained on frame classification into phones. We evaluate
representations from different layers of the deep model and compare their
quality for predicting phone labels. Our experiments shed light on important
aspects of the end-to-end model such as layer depth, model complexity, and
other design choices.Comment: NIPS 201
Prosodic Representations of Prominence Classification Neural Networks and Autoencoders Using Bottleneck Features
Prominence perception has been known to correlate with a complex interplay of the acoustic features of energy, fundamental frequency, spectral tilt, and duration. The contribution and importance of each of these features in distinguishing between prominent and non-prominent units in speech is not always easy to determine, and more so, the prosodic representations that humans and automatic classifiers learn have been difficult to interpret. This work focuses on examining the acoustic prosodic representations that binary prominence classification neural networks and autoencoders learn for prominence. We investigate the complex features learned at different layers of the network as well as the 10-dimensional bottleneck features (BNFs), for the standard acoustic prosodic correlates of prominence separately and in combination. We analyze and visualize the BNFs obtained from the prominence classification neural networks as well as their network activations. The experiments are conducted on a corpus of Dutch continuous speech with manually annotated prominence labels. Our results show that the prosodic representations obtained from the BNFs and higher-dimensional non-BNFs provide good separation of the two prominence categories, with, however, different partitioning of the BNF space for the distinct features, and the best overall separation obtained for F0.Peer reviewe
- …