1,047 research outputs found
Exploring Automatic Speech Recognition with TensorFlow
Speech recognition is the task of identifying words in spoken language and converting them into text. This bachelor's thesis focuses on using deep learning techniques to build an end-to-end speech recognition system. As a preliminary step, we give an overview of the most relevant methods of the last several years. Then, we study one of the latest proposals for this end-to-end approach, which uses a sequence-to-sequence model with an attention mechanism. Next, we successfully reproduce the model and evaluate it on the TIMIT database. We analyze the similarities and differences between the proposed implementation and the original theoretical work. Finally, we experiment with different parameters (e.g. number of units per layer, learning rate, and batch size) and reduce the Phoneme Error Rate by almost 12% relative.
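The abstract reports its Phoneme Error Rate gain as a relative reduction, which is easy to confuse with an absolute one. The sketch below shows the arithmetic only; the function name and the example rates are hypothetical and are not taken from the thesis.

```python
def relative_reduction(baseline_rate, improved_rate):
    """Return the relative reduction of an error rate, as a fraction.

    Both rates must use the same unit (e.g. percent). The values passed
    in the usage note are made-up illustrations, not thesis results.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - improved_rate) / baseline_rate
```

For instance, with a hypothetical baseline of 25.0% and an improved rate of 22.0%, `relative_reduction(25.0, 22.0)` gives 0.12, i.e. a 12% relative reduction that corresponds to only 3 absolute points.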
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, from both a human-computer interaction point of view and a purely technological one.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
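To make the idea of fuzzy matching against a translation memory concrete, here is a minimal sketch. It is not the SCATE method: it swaps in a plain token-level similarity from Python's standard `difflib` module, and the function name, the 0.7 threshold, and the sample memory entries are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(query, memory, threshold=0.7):
    """Return ((source, target), score) for the closest memory entry.

    `memory` is a list of (source_sentence, target_sentence) pairs.
    Similarity is difflib's ratio over lowercased word tokens, used
    here as a stand-in for a real fuzzy-match metric. If no entry
    reaches `threshold`, the match is None.
    """
    query_tokens = query.lower().split()
    best_entry, best_score = None, 0.0
    for source, target in memory:
        source_tokens = source.lower().split()
        score = SequenceMatcher(None, query_tokens, source_tokens).ratio()
        if score > best_score:
            best_entry, best_score = (source, target), score
    if best_score < threshold:
        return None, best_score
    return best_entry, best_score


# Hypothetical translation memory used only for this illustration.
memory = [
    ("Open the file menu", "Ouvrez le menu Fichier"),
    ("Close the door", "Fermez la porte"),
]
match, score = best_fuzzy_match("Open the edit menu", memory)
```

Here the query shares three of four tokens with the first entry, so that entry is returned as the suggestion a translator would then post-edit.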
Arabic Handwritten Word Recognition based on Bernoulli Mixture HMMs
This thesis presents new approaches to off-line Arabic handwriting recognition based on
conventional Bernoulli Hidden Markov Models. Off-line handwriting recognition, and
Arabic handwriting recognition in particular, is still far from
perfect. Hidden Markov Models (HMMs) are now widely used for off-line handwriting
recognition in many languages and, in particular, in Arabic. As in speech recognition, they
are usually built from shared, embedded HMMs at symbol level, in which state-conditional
probability density functions are modeled with Gaussian mixtures. In contrast to speech
recognition, however, it is unclear which kind of features should be used and, indeed, very
different feature sets are in use today. Among them, we have recently proposed to simply
use columns of raw, binary image pixels, which are directly fed into embedded Bernoulli
(mixture) HMMs, that is, embedded HMMs in which the emission probabilities are modeled
with Bernoulli mixtures. The idea is to bypass feature extraction and ensure that no
discriminative information is filtered out during feature extraction, which in some sense is
integrated into the recognition model. In this thesis, we review this idea along with some
extensions that are currently providing state-of-the-art results on Arabic handwritten word
recognition.
Alkhoury, I. (2010). Arabic Handwritten Word Recognition based on Bernoulli Mixture HMMs. http://hdl.handle.net/10251/11478
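As a rough illustration of an emission probability modeled with a Bernoulli mixture, the sketch below scores one binary pixel column against a small mixture. It assumes independent pixels within each mixture component; the function name, argument names, and example parameters are hypothetical and are not drawn from the thesis.

```python
import math


def bernoulli_mixture_log_prob(column, priors, prototypes):
    """Log-probability of a binary pixel column under a Bernoulli mixture.

    column:     list of 0/1 pixel values
    priors:     mixture weights, one per component, summing to 1
    prototypes: per component, a list of P(pixel == 1) values, each
                strictly between 0 and 1 so the logarithms are defined
    """
    component_logs = []
    for prior, prototype in zip(priors, prototypes):
        log_p = math.log(prior)
        for pixel, p_on in zip(column, prototype):
            log_p += math.log(p_on if pixel == 1 else 1.0 - p_on)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(v - peak) for v in component_logs))


# Made-up two-pixel example with two equally weighted components.
log_prob = bernoulli_mixture_log_prob(
    [1, 0], [0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]]
)
```

In an embedded HMM of the kind the abstract describes, a value like this would play the role of the state-conditional emission score for one image column.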
Phonetic study and text mining of Spanish for English to Spanish translation system
Project carried out in collaboration with the University of Southern Californi
Automatic Quality Estimation for ASR System Combination
Recognizer Output Voting Error Reduction (ROVER) has been widely used for
system combination in automatic speech recognition (ASR). In order to select
the most appropriate words to insert at each position in the output
transcriptions, some ROVER extensions rely on critical information such as
confidence scores and other ASR decoder features. This information, which is
not always available, highly depends on the decoding process and sometimes
tends to overestimate the real quality of the recognized words. In this paper
we propose a novel variant of ROVER that takes advantage of ASR quality
estimation (QE) for ranking the transcriptions at "segment level" instead of:
i) relying on confidence scores, or ii) feeding ROVER with randomly ordered
hypotheses. We first introduce an effective set of features to compensate for
the absence of ASR decoder information. Then, we apply QE techniques to perform
accurate hypothesis ranking at segment-level before starting the fusion
process. The evaluation is carried out on two different tasks, in which we
respectively combine hypotheses coming from independent ASR systems and
multi-microphone recordings. In both tasks, it is assumed that the ASR decoder
information is not available. The proposed approach significantly outperforms
standard ROVER and it is competitive with two strong oracles that exploit
prior knowledge about the real quality of the hypotheses to be combined.
Compared to standard ROVER, the absolute WER improvements in the two
evaluation scenarios range from 0.5% to 7.3%.
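The improvements above are stated in terms of word error rate (WER). As a reminder of how that metric is usually computed, here is a minimal sketch based on word-level edit distance. It is a generic illustration, not the paper's evaluation tooling, and the function name and example sentences are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    Both arguments are whitespace-tokenized strings. The edit distance
    is computed with the standard dynamic-programming recurrence,
    keeping only one previous row in memory.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Made-up sentences: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

An absolute WER improvement, as reported in the abstract, is the plain difference between two such values expressed in percentage points.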