36 research outputs found
Generation of a synthetic voice in Castilian Spanish based on HSMM for the Albayzín 2008 Evaluation: text-to-speech conversion
This article describes the process of generating a Castilian Spanish voice using the UPC ESMA corpus from UPC, provided for the Albayzín 2008 Evaluation: Text-to-Speech Conversion. One voice based on unit selection was implemented using Festival's Multisyn package, and another based on Hidden Semi-Markov Models (HSMM) using HTS. After a brief evaluation of the quality of both voices, the main characteristics of the HSMM-based voice, the final system submitted to the evaluation, are described.
Proposing a speech to gesture translation architecture for Spanish deaf people.
This article describes an architecture for translating speech into Spanish Sign Language (SSL). The proposed architecture is made up of four modules: speech recognizer, semantic analysis, gesture sequence generation and gesture playing. For the speech recognizer and the semantic analysis modules, we use software developed by IBM and CSLR (Center for Spoken Language Research at the University of Colorado), respectively. Gesture sequence generation and gesture animation are the modules on which we have focused our main effort. Gesture sequence generation uses semantic concepts (obtained from the semantic analysis), associating them with several SSL gestures. This association is carried out based on a number of generation rules. For gesture animation, we have developed an animated agent (a virtual representation of a human) and a strategy for reducing the effort required for gesture animation. This strategy consists of making the system automatically generate all the agent positions necessary for the gesture animation. In this process, the system uses a few main agent positions (two or three per second) and some interpolation strategies, both previously created by the service developer (the person who adapts the architecture proposed in this paper to a specific domain). Related to this module, we propose a distance measure between agent positions and a measure of gesture complexity. This measure can be used to analyze gesture perception versus gesture complexity. With the proposed architecture, we are not trying to build a domain-independent translator, but a system able to translate speech utterances into gesture sequences in a restricted domain: railway, flight or weather information.
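The two modules the article focuses on can be sketched briefly. The following is a minimal illustration only, assuming a rule table and keyframe representation of my own invention: the concept names, gesture labels and joint-coordinate tuples are hypothetical placeholders, not the article's actual rule set or animation format.

```python
# Hypothetical sketch: rule-based gesture sequence generation plus
# keyframe interpolation for the animated agent. All names are invented.

# Generation rules: map a semantic concept to one or more SSL gesture labels.
GENERATION_RULES = {
    "GREETING": ["HELLO"],
    "DESTINATION": ["CITY", "POINT"],
    "TIME": ["CLOCK"],
}

def generate_gesture_sequence(concepts):
    """Concatenate the gestures associated with each semantic concept."""
    sequence = []
    for concept in concepts:
        sequence.extend(GENERATION_RULES.get(concept, []))
    return sequence

def interpolate_positions(keyframes, steps_between):
    """Linearly interpolate agent positions between consecutive keyframes.

    Each keyframe is a tuple of joint coordinates; the service developer
    supplies only a few keyframes per second, and the system fills in the
    intermediate frames automatically.
    """
    frames = []
    for a, b in zip(keyframes, keyframes[1:]):
        for step in range(steps_between):
            t = step / steps_between
            frames.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    frames.append(keyframes[-1])
    return frames
```

Linear interpolation stands in here for whatever interpolation strategies the developer actually supplies; the point is only that dense frame sequences are derived automatically from sparse keyframes.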
Real Field Deployment of a Smart Fiber Optic Surveillance System for Pipeline Integrity Threat Detection: Architectural Issues and Blind Field Test Results
This paper presents an on-line augmented surveillance system aimed at real-time monitoring of activities along a pipeline. The system is deployed in a fully realistic scenario and exposed to real activities carried out at unknown places and unknown times within a given test interval (so-called blind field tests). We describe the system architecture, which includes specific modules to deal with the fact that continuous on-line monitoring needs to be carried out, while addressing the need to limit false alarms to reasonable rates. To the best of our knowledge, this is the first published work in which a pipeline integrity threat detection system is deployed in a realistic scenario (using a fiber optic cable along an active gas pipeline) and is thoroughly and objectively evaluated under realistic blind conditions. The system integrates two operation modes: the machine+activity identification mode identifies the machine that is carrying out a certain activity along the pipeline, and the threat detection mode directly identifies whether the activity along the pipeline is a threat or not. The blind field tests were carried out in two different pipeline sections: the first corresponds to the case where the sensor is close to the sensed area, while in the second the sensed area is about 35 km from the sensor. The machine+activity identification mode showed an average machine+activity classification rate of 46.6%. For the threat detection mode, 8 out of 10 threats were correctly detected, with only 1 false alarm in a 55.5-hour sensed period.
European Commission, Ministerio de Economía y Competitividad, Comunidad de Madrid
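The distinction between the two operation modes can be illustrated schematically. This is a sketch under assumptions of my own: the label pairs, the probabilistic threat score and the threshold are invented placeholders, since the paper's actual classifiers and signal features are not specified here.

```python
# Hypothetical sketch of the two operation modes. The classifiers are
# passed in as callables; machine/activity labels are invented examples.

def machine_activity_mode(features, ma_classifier):
    """Machine+activity identification: return a (machine, activity)
    pair for one windowed segment of the fiber-optic signal."""
    return ma_classifier(features)

def threat_mode(features, threat_classifier, threshold=0.5):
    """Threat detection: directly decide whether the sensed activity is
    a threat. Raising the threshold trades detection rate against the
    false-alarm rate, reflecting the need to keep false alarms low
    during continuous on-line monitoring."""
    return threat_classifier(features) >= threshold
```

The point of the sketch is that the second mode is a direct binary decision, not a post-processing of the first mode's output.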
Word Pair Speech
In this paper we present a speech understanding system that accepts continuous speech sentences as input to command a HIFI set. The string of words obtained from the recogniser is sent to the understanding system, which tries to fill in a set of frames specifying the triplet (SUBSYSTEM, PARAMETER, VALUE). The understanding module follows the philosophy presented in [1]. The triplets are finally translated into infrared commands by an actuator module and sent to the HIFI set, composed of a radio, a three-deck CD player and a two-tape cassette recorder/player. All circumstances (understanding incompleteness, HIFI set status, result of the command execution) are confirmed back to the user via a text-to-speech system with substitutable-concept, pattern-based generated messages. We have introduced a response module because some of the final users will be blind people, and because we are studying the possibility of establishing restricted dialogues with the users in order to complete or correct the commands. The understanding engine is based on semantic-like tagging.
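The frame-filling idea can be sketched as follows. This is a minimal illustration, assuming an invented semantic lexicon: the words, tags and slot fillers below are hypothetical and do not come from the paper.

```python
# Hypothetical sketch of semantic-like tagging driving frame filling
# into the (SUBSYSTEM, PARAMETER, VALUE) triplet. Lexicon is invented.

SEMANTIC_LEXICON = {
    "radio": ("SUBSYSTEM", "RADIO"),
    "cd": ("SUBSYSTEM", "CD_PLAYER"),
    "volume": ("PARAMETER", "VOLUME"),
    "up": ("VALUE", "INCREASE"),
    "down": ("VALUE", "DECREASE"),
}

def fill_frame(words):
    """Fill the triplet from the recogniser's word string. Missing slots
    stay None, so a response module can ask the user to complete or
    correct the command in a restricted dialogue."""
    frame = {"SUBSYSTEM": None, "PARAMETER": None, "VALUE": None}
    for word in words:
        tag = SEMANTIC_LEXICON.get(word.lower())
        if tag is not None:
            slot, value = tag
            if frame[slot] is None:  # keep the first filler per slot
                frame[slot] = value
    return frame
```

An incomplete frame (e.g. a missing VALUE) is exactly the "understanding incompleteness" circumstance that the system confirms back to the user via text-to-speech.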
EFFICIENT NN-BASED SEARCH SPACE REDUCTION IN A LARGE VOCABULARY SPEECH RECOGNITION SYSTEM
In very large vocabulary speech recognition systems using the hypothesis-verification paradigm, the verification stage is usually the most time-consuming. State-of-the-art systems combine fixed-size hypothesized search spaces with advanced pruning techniques. In this paper we propose a novel strategy to dynamically calculate the hypothesized search space, using neural networks as the estimation module and designing the input feature set with a careful greedy selection approach. The main achievement has been a statistically significant relative decrease in error rate of 33.53%, while obtaining a relative decrease in average computational demands of up to 19.40%.
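The estimation module can be sketched as a small feed-forward network mapping utterance-level features to a search-space size. This is an illustration only, assuming an arbitrary one-hidden-layer topology with invented weights and clipping range; the paper's actual network, features and ranges are not reproduced here.

```python
import math

# Hypothetical sketch: a tiny MLP estimates the hypothesized search
# space size per utterance. Weights and clipping bounds are invented;
# a trained network would replace them.

def mlp_estimate(features, w_hidden, w_out, min_len=50, max_len=2000):
    """One hidden tanh layer followed by a linear output, clipped to a
    plausible range of search-space sizes."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)))
              for row in w_hidden]
    raw = sum(w * h for w, h in zip(w_out, hidden))
    return max(min_len, min(max_len, int(round(raw))))
```

Clipping guards the verification stage against pathological estimates in either direction, which matters because an underestimate costs accuracy while an overestimate costs computation.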
Improved Variable Preselection List Length Estimation Using NNs
In very large vocabulary hypothesis-verification systems, the fine acoustic matcher is usually the most time-consuming stage, so the main concern is reducing the preselection list length as much as possible. Traditionally, these systems use an overly large fixed preselection list length, increasing computational demands beyond what is really needed. The idea we propose is to estimate a different preselection list length for every utterance, so that we can lower the average computational effort needed for the recognition process. As we will show, it is even possible for the resulting system to outperform the fixed-length one in error rate, even while reducing computational cost. This paper presents a detailed study of an NN-based approach to variable preselection list length estimation. The main achievement has been a relative decrease in error rate of up to 40%, while obtaining a relative decrease in average preselection list length of up to 31%.
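The fixed-versus-variable trade-off can be made concrete with a toy evaluation. This sketch assumes a simplified error model of my own: an utterance is counted correct whenever the true word's rank falls inside the preselection list, which ignores the fine acoustic matcher's own errors.

```python
# Hypothetical sketch: compare a fixed preselection list length with a
# per-utterance variable one. Ranks and lengths below are toy data.

def evaluate(ranks, lengths):
    """ranks[i]: rank of the correct word for utterance i (1 = best);
    lengths[i]: preselection list length used for that utterance.
    Returns (error_rate, average_list_length)."""
    errors = sum(1 for r, n in zip(ranks, lengths) if r > n)
    return errors / len(ranks), sum(lengths) / len(lengths)

# Toy comparison: four utterances, same errors, much shorter lists.
ranks = [3, 120, 40, 800]
fixed = evaluate(ranks, [500] * 4)       # one length for everyone
variable = evaluate(ranks, [10, 200, 100, 500])  # per-utterance estimates
```

In this toy case the variable scheme matches the fixed scheme's error rate at far lower average list length, which is the effect the paper quantifies.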