251 research outputs found
Support Vector Machines for Speech Recognition
Hidden Markov models (HMM) with Gaussian mixture observation densities are the dominant approach in speech recognition. These systems typically use a representational model for acoustic modeling which can often be prone to overfitting and does not translate to improved discrimination. We propose a new paradigm centered on principles of structural risk minimization using a discriminative framework for speech recognition based on support vector machines (SVMs). SVMs have the ability to simultaneously optimize the representational and discriminative ability of the acoustic classifiers. We have developed the first SVM-based large vocabulary speech recognition system that improves performance over traditional HMM-based systems. This hybrid system achieves a state-of-the-art word error rate of 10.6% on a continuous alphadigit task ? a 10% improvement relative to an HMM system. On SWITCHBOARD, a large vocabulary task, the system improves performance over a traditional HMM system from 41.6% word error rate to 40.6%. This dissertation discusses several practical issues that arise when SVMs are incorporated into the hybrid system
A Cascaded Broadcast News Highlighter
This paper presents a fully automatic news skimming system which takes a broadcast news audio stream and provides the user with the segmented, structured and highlighted transcript. This constitutes a system with three different, cascading stages: converting the audio stream to text using an automatic speech recogniser, segmenting into utterances and stories and finally determining which utterance should be highlighted using a saliency score. Each stage must operate on the erroneous output from the previous stage in the system; an effect which is naturally amplified as the data progresses through the processing stages. We present a large corpus of transcribed broadcast news data enabling us to investigate to which degree information worth highlighting survives this cascading of processes. Both extrinsic and intrinsic experimental results indicate that mistakes in the story boundary detection has a strong impact on the quality of highlights, whereas erroneous utterance boundaries cause only minor problems. Further, the difference in transcription quality does not affect the overall performance greatly
Recognition of Dialogue Acts in Multiparty Meetings using a Switching DBN
This paper is concerned with the automatic recognition of dialogue acts (DAs) in multiparty conversational speech. We present a joint generative model for DA recognition in which segmentation and classification of DAs are carried out in parallel. Our approach to DA recognition is based on a switching dynamic Bayesian network (DBN) architecture. This generative approach models a set of features, related to lexical content and prosody, and incorporates a weighted interpolated factored language model. The switching DBN coordinates the recognition process by integrating the component models. The factored language model, which is estimated from multiple conversational data corpora, is used in conjunction with additional task-specific language models. In conjunction with this joint generative model, we have also investigated the use of a discriminative approach, based on conditional random fields, to perform a reclassification of the segmented DAs. We have carried out experiments on the AMI corpus of multimodal meeting recordings, using both manually transcribed speech, and the output of an automatic speech recognizer, and using different configurations of the generative model. Our results indicate that the system performs well both on reference and fully automatic transcriptions. A further significant improvement in recognition accuracy is obtained by the application of the discriminative reranking approach based on conditional random fields
Toward summarization of communicative activities in spoken conversation
This thesis is an inquiry into the nature and structure of face-to-face conversation, with a
special focus on group meetings in the workplace. I argue that conversations are composed
of episodes, each of which corresponds to an identifiable communicative activity such as
giving instructions or telling a story. These activities are important because they are part
of participantsâ commonsense understanding of what happens in a conversation. They
appear in natural summaries of conversations such as meeting minutes, and participants
talk about them within the conversation itself. Episodic communicative activities therefore
represent an essential component of practical, commonsense descriptions of conversations.
The thesis objective is to provide a deeper understanding of how such activities may be
recognized and differentiated from one another, and to develop a computational method
for doing so automatically. The experiments are thus intended as initial steps toward future
applications that will require analysis of such activities, such as an automatic minute-taker
for workplace meetings, a browser for broadcast news archives, or an automatic decision
mapper for planning interactions.
My main theoretical contribution is to propose a novel analytical framework called participant
relational analysis. The proposal argues that communicative activities are principally
indicated through participant-relational features, i.e., expressions of relationships between
participants and the dialogue. Participant-relational features, such as subjective language,
verbal reference to the participants, and the distribution of speech activity amongst
the participants, are therefore argued to be a principal means for analyzing the nature and
structure of communicative activities.
I then apply the proposed framework to two computational problems: automatic discourse
segmentation and automatic discourse segment labeling. The first set of experiments
test whether participant-relational features can serve as a basis for automatically
segmenting conversations into discourse segments, e.g., activity episodes. Results show
that they are effective across different levels of segmentation and different corpora, and indeed sometimes more effective than the commonly-used method of using semantic links
between content words, i.e., lexical cohesion. They also show that feature performance is
highly dependent on segment type, suggesting that human-annotated âtopic segmentsâ are
in fact a multi-dimensional, heterogeneous collection of topic and activity-oriented units.
Analysis of commonly used evaluation measures, performed in conjunction with the
segmentation experiments, reveals that they fail to penalize substantially defective results
due to inherent biases in the measures. I therefore preface the experiments with a comprehensive
analysis of these biases and a proposal for a novel evaluation measure. A reevaluation
of state-of-the-art segmentation algorithms using the novel measure produces
substantially different results from previous studies. This raises serious questions about the
effectiveness of some state-of-the-art algorithms and helps to identify the most appropriate
ones to employ in the subsequent experiments.
I also preface the experiments with an investigation of participant reference, an important
type of participant-relational feature. I propose an annotation scheme with novel distinctions
for vagueness, discourse function, and addressing-based referent inclusion, each
of which are assessed for inter-coder reliability. The produced dataset includes annotations
of 11,000 occasions of person-referring.
The second set of experiments concern the use of participant-relational features to
automatically identify labels for discourse segments. In contrast to assigning semantic topic
labels, such as topical headlines, the proposed algorithm automatically labels segments
according to activity type, e.g., presentation, discussion, and evaluation. The method is
unsupervised and does not learn from annotated ground truth labels. Rather, it induces the
labels through correlations between discourse segment boundaries and the occurrence of
bracketing meta-discourse, i.e., occasions when the participants talk explicitly about what
has just occurred or what is about to occur. Results show that bracketing meta-discourse
is an effective basis for identifying some labels automatically, but that its use is limited if
global correlations to segment features are not employed.
This thesis addresses important pre-requisites to the automatic summarization of conversation.
What I provide is a novel activity-oriented perspective on how summarization
should be approached, and a novel participant-relational approach to conversational analysis.
The experimental results show that analysis of participant-relational features is
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR
The UEDIN ASR Systems for the IWSLT 2014 Evaluation
This paper describes the University of Edinburgh (UEDIN) ASR systems for the 2014 IWSLT Evaluation. Notable fea-tures of the English system include deep neural network acoustic models in both tandem and hybrid configuration with the use of multi-level adaptive networks, LHUC adapta-tion and Maxout units. The German system includes lightly supervised training and a new method for dictionary gener-ation. Our voice activity detection system now uses a semi-Markov model to incorporate a prior on utterance lengths. There are improvements of up to 30 % relative WER on the tst2013 English test set. 1
- âŠ