281 research outputs found
Fuzzy reasoning in confidence evaluation of speech recognition
Confidence measures provide a systematic way to express the reliability of speech recognition results. A common approach to confidence measurement is to exploit the information offered by several recognition-related features and to combine them, through a given compilation mechanism, into a more effective means of distinguishing between correct and incorrect recognition results. We propose to use a fuzzy reasoning scheme to perform the information compilation step. Our approach differs from previously proposed ones in that it treats the uncertainty of recognition hypotheses in terms of …
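As a minimal illustration of such a compilation step, the sketch below combines two hypothetical normalized predictors into one confidence score with a zero-order Sugeno-style fuzzy inference. The feature names ("acoustic", "lm") and the two-rule base are illustrative assumptions, not taken from the paper:

```python
def tri(x, a, b, c):
    """Triangular membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_confidence(features):
    """Combine normalized predictor values (0..1) into one confidence score.

    Zero-order Sugeno inference with two illustrative rules; the
    predictors and rule base are assumptions for illustration only.
    """
    a, l = features["acoustic"], features["lm"]
    low  = lambda x: tri(x, -0.5, 0.0, 0.5)
    high = lambda x: tri(x, 0.5, 1.0, 1.5)
    # R1: IF acoustic HIGH AND lm HIGH THEN confidence = 1.0  (AND = min)
    w1, z1 = min(high(a), high(l)), 1.0
    # R2: IF acoustic LOW  OR  lm LOW  THEN confidence = 0.0  (OR = max)
    w2, z2 = max(low(a), low(l)), 0.0
    # Weighted average of rule outputs; neutral 0.5 when no rule fires.
    return (w1 * z1 + w2 * z2) / (w1 + w2) if (w1 + w2) else 0.5
```

A real system would use more predictors and tune the memberships and rules, but the min/max aggregation and weighted defuzzification shown here are the core of the mechanism.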
Language modeling using X-grams
In this paper, an extension of n-grams, called x-grams, is proposed. In this extension, the memory of the model (n) is not fixed a priori. Instead, large memories are accepted first, and merging criteria are then applied to reduce the complexity and to ensure reliable estimation. The results show that the perplexity obtained with x-grams is smaller than that of n-grams. Furthermore, their complexity is smaller than that of trigrams and can approach that of bigrams.
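The merging idea can be sketched as follows: a long history is kept only when its next-word distribution differs appreciably from that of its shorter suffix. The KL-divergence criterion, the smoothing, and the threshold below are illustrative assumptions; the paper's actual merging criteria may differ:

```python
from collections import Counter
import math

def next_word_dist(counts, context, vocab, alpha=1.0):
    """Add-alpha smoothed P(w | context) from raw context -> word counts."""
    c = counts.get(context, Counter())
    total = sum(c.values()) + alpha * len(vocab)
    return {w: (c[w] + alpha) / total for w in vocab}

def should_merge(counts, long_ctx, short_ctx, vocab, threshold=0.05):
    """Merge a long context into its shorter suffix when their next-word
    distributions are close (small KL divergence), i.e. when the extra
    history adds little predictive information."""
    p = next_word_dist(counts, long_ctx, vocab)
    q = next_word_dist(counts, short_ctx, vocab)
    kl = sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
    return kl < threshold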
A second opinion approach for speech recognition verification
In order to improve the reliability of speech recognition results, a verifying system, that takes profit of the information given from an alternative recognition step is proposed. The alternative results are considered as a second opinion about the nature of the speech recognition process. Some features are extracted from both opinion sources and compiled, through a fuzzy inference system, into a more discriminant confidence measure able to verify correct results and disregard wrong ones. This approach is tested in a keyword spotting task taken form the Spanish SpeechDat database. Results show a considerable reduction of false rejections at a fixed false alarm rate compared to baseline systems.Peer ReviewedPostprint (published version
Contextual confidence measures for continuous speech recognition
This paper explores the repercussion of contextual information into confidence measuring for continuous speech recognition results. Our approach comprises three steps: to extract confidence predictors out of recognition results, to compile those predictors into confidence measures by means of a fuzzy inference system whose parameters have been estimated, directly from examples, with an evolutionary strategy and, finally, to upgrade the confidence measures by the inclusion of contextual information. Through experimentation with two different continuous speech application tasks, results show that the context re-scoring procedure improves the capabilities of confidence measures to discriminate between correct and incorrect recognition results for every level of thresholding, even when a rather simple method to add contextual information is considered.Peer ReviewedPostprint (published version
Using x-gram for efficient speech recognition
X-grams are a generalization of the n-grams, where the number of previous conditioning words is different for each case and decided from the training data. X-grams reduce perplexity with respect to trigrams and need less number of parameters. In this paper, the representation of the x-grams using finite state automata is considered. This representation leads to a new model, the non-deterministic x-grams, an approximation that is much more efficient, suffering small degradation on the modeling capability. Empirical experiments for a continuous speech recognition task show how, for each ending word, the number of transitions is reduced from 1222 (the size of the lexicon) to around 66.Peer ReviewedPostprint (published version
Monolingual and bilingual spanish-catalan speech recognizers developed from SpeechDat databases
Under the SpeechDat specifications, the Spanish member of SpeechDat consortium has recorded a Catalan database that includes one
thousand speakers. This communication describes some experimental work that has been carried out using both the Spanish and the
Catalan speech material.
A speech recognition system has been trained for the Spanish language using a selection of the phonetically balanced utterances from
the 4500 SpeechDat training sessions. Utterances with mispronounced or incomplete words and with intermittent noise were discarded.
A set of 26 allophones was selected to account for the Spanish sounds and clustered demiphones have been used as context dependent
sub-lexical units. Following the same methodology, a recognition system was trained from the Catalan SpeechDat database. Catalan
sounds were described with 32 allophones. Additionally, a bilingual recognition system was built for both the Spanish and Catalan
languages. By means of clustering techniques, the suitable set of allophones to cover simultaneously both languages was determined.
Thus, 33 allophones were selected. The training material was built by the whole Catalan training material and the Spanish material
coming from the Eastern region of Spain (the region where Catalan is spoken).
The performance of the Spanish, Catalan and bilingual systems were assessed under the same framework. The Spanish system exhibits
a significantly better performance than the rest of systems due to its better training. The bilingual system provides an equivalent
performance to that afforded by both language specific systems trained with the Eastern Spanish material or the Catalan SpeechDat
corpus.Peer ReviewedPostprint (published version
Comunicación oral con el computador
Concepts and techniques involved in speech communication with computers are reviewed. Although the speech output is, at present, closer to the computer ability than the speech understanding, at the moment only partial results for both purposes have been attained.Peer ReviewedPostprint (published version
Adaptive prediction and bit-assignment in subband coding of speech
The combination of time-domain harmonic scaling (TDHS) and sub-band coding (SBC) provides an encoding approach which allows 9.6 Kb/s speech encoding with good communication quality. Starting from this structure, this paper focuses the improvement of earlier designs. It is shown that adaptive prediction and bit-assigment enhances the subband signal coding and, hence, the performance of the overall system.
The prediction is realized by an adaptive lattice, the algorithm being GAL2. The dynamic bit allocation takes place from the step-sizes of the backward adaptive quantizers (Jayant) in each sub-band. Improvements as high as 5 dB can be achieved for the average segmented signal to noise ratio.Peer ReviewedPostprint (published version
- …