Named entity extraction for speech
Named entity extraction is a field that has generated much interest over recent years
with the explosion of the World Wide Web and the necessity for accurate information
retrieval. Named entity extraction, the task of finding specific entities within documents,
has proven of great benefit for numerous information extraction and information retrieval
tasks. As well as multiple language evaluations, named entity extraction has been investigated
on a variety of media forms with varying success. In general, these media forms
have all been based upon standard text and assumed that any variation from standard
text constitutes noise. We investigate how it is possible to find named entities in speech data. Where
others have focussed on applying named entity extraction techniques to transcriptions
of speech, we investigate a method for finding the named entities directly from the word
lattices associated with the speech signal. The results show that it is possible to improve
named entity recognition at the expense of word error rate (WER) in contrast to the
general view that F-score is directly proportional to WER. We use a Hidden Markov Model (HMM) style approach to the task of named entity
extraction and show how it is possible to utilise an HMM to find named entities
within speech lattices. We further investigate how it is possible to improve results by
considering an alternative derivation of the joint probability of words and entities than
is traditionally used. This new derivation is particularly appropriate to speech lattices
as no presumptions are made about the sequence of words. The HMM style approach that we use requires using a number of language models
in parallel. We have developed a system for discriminatively retraining these language
models based upon the results of the output, and we show how it is possible to improve
named entity recognition by iterations over both training data and development data.
We also consider how part-of-speech (POS) can be used within word lattices. We
devise a method of labelling a word lattice with POS tags and adapt the model to make
use of these POS tags when producing the best path through the lattice. The resulting
path provides the most likely sequence of words, entities and POS tags and we show
how this new path is better than the previous path, which ignored the POS tags.
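The idea of choosing a best path through a word lattice can be illustrated with a small dynamic-programming sketch. This is not the thesis's model: it scores edges with a single word log-probability only, ignoring entity and POS labels, and the function name `best_lattice_path`, the edge tuple layout, and the integer node numbering are all assumptions made for illustration.

```python
from collections import defaultdict


def best_lattice_path(edges, start, end):
    """Return (log_prob, words) for the highest-scoring path, or None.

    edges: list of (from_node, to_node, word, log_prob) tuples.
    Assumption: nodes are integers numbered in topological order,
    so every edge satisfies from_node < to_node.
    """
    outgoing = defaultdict(list)
    for u, v, word, log_prob in edges:
        outgoing[u].append((v, word, log_prob))

    # best[node] holds the best score and word sequence reaching that node.
    best = {start: (0.0, [])}

    # Visiting nodes in increasing order guarantees that all incoming
    # edges of a node have been relaxed before its outgoing edges are used.
    for u in sorted(outgoing):
        if u not in best:
            continue  # node is not reachable from the start node
        score, words = best[u]
        for v, word, log_prob in outgoing[u]:
            candidate = score + log_prob
            if v not in best or candidate > best[v][0]:
                best[v] = (candidate, words + [word])

    return best.get(end)


# Toy lattice with two competing hypotheses for the same audio span.
toy_edges = [
    (0, 1, "new", -1.0),
    (0, 1, "knew", -0.5),
    (1, 2, "york", -0.25),
    (0, 2, "newark", -2.0),
]
result = best_lattice_path(toy_edges, 0, 2)
```

In a joint model of the kind the abstract describes, each edge score would presumably combine word, entity and POS terms, so the path that is best overall need not be the path with the lowest word error rate.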
Implementation of a Human-Computer Interface for Computer Assisted Translation and Handwritten Text Recognition
A human-computer interface is developed to provide services of computer assisted machine translation (CAT) and computer assisted transcription of handwritten text images (CATTI). The back-end machine translation (MT) and handwritten text recognition (HTR) systems are provided by the Pattern Recognition and Human Language Technology (PRHLT) research group. The idea is to provide users with easy-to-use tools that make interactive translation and transcription feasible tasks. The assisted service is provided by remote servers with CAT or CATTI capabilities. The interface supplies the user with tools for efficient local editing: deletion, insertion and substitution.
Ocampo Sepúlveda, JC. (2009). Implementation of a Human-Computer Interface for Computer Assisted Translation and Handwritten Text Recognition. http://hdl.handle.net/10251/14318
Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition
In this work we present a framework for the recognition of natural scene
text. Our framework does not require any human-labelled data, and performs word
recognition on the whole image holistically, departing from the character based
recognition systems of the past. The deep neural network models at the centre
of this framework are trained solely on data produced by a synthetic text
generation engine -- synthetic data that is highly realistic and sufficient to
replace real data, giving us infinite amounts of training data. This excess of
data exposes new possibilities for word recognition models, and here we
consider three models, each one "reading" words in a different way: via 90k-way
dictionary encoding, character sequence encoding, and bag-of-N-grams encoding.
In the scenarios of language based and completely unconstrained text
recognition we greatly improve upon state-of-the-art performance on standard
datasets, using our fast, simple machinery and requiring zero data-acquisition
costs.
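As a rough illustration of the bag-of-N-grams encoding mentioned above, the sketch below turns a word into a binary vector marking which character N-grams it contains. The function names, the maximum N-gram length, and the tiny vocabulary are assumptions for illustration and are not taken from the paper.

```python
def char_ngrams(word, max_n=4):
    """Return the set of all character n-grams of length 1..max_n in word."""
    return {
        word[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(word) - n + 1)
    }


def encode_bag_of_ngrams(word, vocab, max_n=4):
    """Binary vector over vocab: 1 where the n-gram occurs in word, else 0.

    vocab is a hypothetical, fixed, ordered list of n-grams; order and
    repetition of n-grams within the word are deliberately discarded.
    """
    grams = char_ngrams(word, max_n)
    return [1 if g in grams else 0 for g in vocab]


toy_vocab = ["a", "ab", "b", "c", "ba"]
vector = encode_bag_of_ngrams("ab", toy_vocab)
```

Because the encoding discards position, different words can share most of their vector, so it describes what a word is composed of but not the order of its parts.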
Combining Multiple Views for Visual Speech Recognition
Visual speech recognition is a challenging research problem with a particular
practical application of aiding audio speech recognition in noisy scenarios.
Multiple camera setups can be beneficial for the visual speech recognition
systems in terms of improved performance and robustness. In this paper, we
explore this aspect and provide a comprehensive study on combining multiple
views for visual speech recognition. The thorough analysis covers fusion of all
possible view angle combinations both at feature level and decision level. The
employed visual speech recognition system in this study extracts features
through a PCA-based convolutional neural network, followed by an LSTM network.
Finally, these features are processed in a tandem system, being fed into a
GMM-HMM scheme. The decision fusion acts after this point by combining the
Viterbi path log-likelihoods. The results show that the complementary
information contained in recordings from different view angles improves the
results significantly. For example, the sentence correctness on the test set is
increased from 76% for the highest performing single view () to up to
83% when combining this view with the frontal and view angles.
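Decision-level fusion of per-view log-likelihoods can be sketched as a weighted sum over the hypotheses that every view has scored. The abstract does not state the exact combination rule, so the weighted sum, the function name `fuse_decisions`, and the toy scores below are assumptions for illustration only.

```python
def fuse_decisions(view_scores, weights=None):
    """Combine per-view log-likelihoods and pick the best hypothesis.

    view_scores: one dict per camera view, mapping a hypothesis (for
    example a sentence) to its log-likelihood under that view's model.
    weights: optional per-view weights; defaults to equal weighting.
    Returns (best_hypothesis, fused_scores).
    """
    if not view_scores:
        raise ValueError("at least one view is required")
    if weights is None:
        weights = [1.0] * len(view_scores)
    if len(weights) != len(view_scores):
        raise ValueError("need exactly one weight per view")

    # Only hypotheses scored by every view can be compared fairly.
    common = set(view_scores[0])
    for scores in view_scores[1:]:
        common &= set(scores)
    if not common:
        raise ValueError("views share no common hypothesis")

    fused = {
        hyp: sum(w * scores[hyp] for w, scores in zip(weights, view_scores))
        for hyp in common
    }
    best = max(fused, key=fused.get)
    return best, fused


# Toy example: the two views disagree, and fusion settles the choice.
view_a = {"bin blue": -10.0, "bin green": -12.0}
view_b = {"bin blue": -15.0, "bin green": -9.0}
best_hyp, fused_scores = fuse_decisions([view_a, view_b])
```

Summing log-likelihoods corresponds to multiplying likelihoods, which treats the views as independent; the weights offer a simple way to trust some views more than others.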
Reading Scene Text in Deep Convolutional Sequences
We develop a Deep-Text Recurrent Network (DTRN) that regards scene text
reading as a sequence labelling problem. We leverage recent advances of deep
convolutional neural networks to generate an ordered high-level sequence from a
whole word image, avoiding the difficult character segmentation problem. Then a
deep recurrent model, building on long short-term memory (LSTM), is developed
to robustly recognize the generated CNN sequences, departing from most existing
approaches recognising each character independently. Our model has a number of
appealing properties in comparison to existing scene text recognition methods:
(i) It can recognise highly ambiguous words by leveraging meaningful context
information, allowing it to work reliably without either pre- or
post-processing; (ii) the deep CNN feature is robust to various image
distortions; (iii) it retains the explicit order information in the word image,
which is essential to discriminate word strings; (iv) the model does not depend
on a pre-defined dictionary, and it can process unknown words and arbitrary
strings. Code for the DTRN will be available.
Comment: To appear in the 13th AAAI Conference on Artificial Intelligence
(AAAI-16), 201
Design and implementation of a user-oriented speech recognition interface: the synergy of technology and human factors
The design and implementation of a user-oriented speech recognition interface are described. The interface enables the use of speech recognition in so-called interactive voice response systems, which can be accessed via a telephone connection. In the design of the interface a synergy of technology and human factors is achieved. This synergy is very important for making speech interfaces a natural and acceptable form of human-machine interaction. Important concepts such as interfaces, human factors and speech recognition are discussed. Additionally, a sketch of the interface's implementation indicates how the synergy of human factors and technology can be realised. An explanation is also provided of how the interface might be fruitfully integrated in different applications.