Multilingual audio information management system based on semantic knowledge in complex environments
This paper proposes a multilingual audio information management system based on semantic knowledge in complex environments. The complex environment is defined by limited resources (financial, material, human, and audio); the poor quality of the audio signal, taken from an internet radio channel; the multilingual context (Spanish, French, and Basque, the last of which is under-resourced in some areas); and the regular appearance of cross-lingual elements between the three languages. The system is further constrained by the requirements of the local multilingual industrial sector. We present the first evolutionary system based on a scalable architecture able to fulfill these specifications, adapting automatically through semantic speech recognition, folksonomies, automatic configuration selection, machine learning, neural computing methodologies, and collaborative networks. The initial goals have been accomplished, and the usability of the final application has been tested successfully, even with non-experienced users.
This work is funded by Grant TEC2016-77791-C4 from the Plan Nacional de I+D+i, Ministry of Economic Affairs and Competitiveness of Spain; the DomusVi Foundation (Kms para recordar); the Basque Government (ELKARTEK KK-2018/00114, GEJ IT1189-19); the Government of Gipuzkoa (DG18/14, DG17/16); UPV/EHU (GIU19/090); and COST Actions CA18106 and CA15225.
Real-Time Subtitle Generator for Sinhala Speech
In today’s digital era, the significance of speech recognition technology cannot be overstated: it plays a pivotal role in enabling human-computer interaction and supports a wide range of applications. This paper focuses on the development of a real-time subtitle generator for Sinhala speech using speech recognition techniques. The CMUSphinx toolkit, an open-source toolkit based on the Hidden Markov Model (HMM), is employed for the implementation of the application. Mel-frequency cepstral coefficients (MFCC) are used for feature extraction from the given ’wav’-format recordings. The paper emphasizes the importance of a real-time subtitle generator for Sinhala speech and surveys the existing literature in the field. It outlines the objectives of the research and discusses the achieved outcomes. By fine-tuning hyperparameters to enhance recognition accuracy, the system attains 88.28% training accuracy and an 11.72% Word Error Rate (WER). The significance of this research is underscored by its methodological advancements, robust performance metrics, and potential impact on enabling seamless interactions and applications in the Sinhala speech domain.
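The reported 11.72% WER is the standard word-level edit distance divided by the number of reference words. As a minimal, self-contained sketch (not taken from the paper), the metric can be computed like this:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution (or match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))
# → 0.16666666666666666 (one deletion out of six reference words)
```

Training accuracy and WER are complementary figures here: 88.28% + 11.72% = 100%, i.e. the WER is reported on the same evaluation.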
Keywords: Speech recognition, Real-time, Subtitle, CMUSphinx, Open source, Hidden Markov Model, Mel-frequency cepstral coefficients, ’wav’, Accuracy, Word Error Rate
Confidence Scoring and Speaker Adaptation in Mobile Automatic Speech Recognition Applications
Generally, the user group of a language is remarkably diverse in terms of speaker-specific characteristics such as dialect and speaking style. Hence, the quality of spoken content varies notably from one individual to another. This diversity causes problems for Automatic Speech Recognition (ASR) systems, which should therefore be able to assess their hypothesized results. This can be done by evaluating a confidence measure on the recognition results and comparing the resulting value, referred to as a confidence score, against a specified threshold. The confidence score indicates how reliable a particular recognition result is for the given speech.
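The accept/reject scheme described above can be sketched in a few lines. The averaging of per-word posteriors and the threshold value are illustrative assumptions; any real system defines its own measure and tunes its own threshold:

```python
import math

def confidence(log_posteriors):
    """Average per-word posterior probability as an utterance-level confidence measure."""
    return sum(math.exp(lp) for lp in log_posteriors) / len(log_posteriors)

def accept(log_posteriors, threshold=0.7):
    # Hypotheses whose confidence falls below the threshold are flagged as unreliable.
    return confidence(log_posteriors) >= threshold

print(accept([-0.05, -0.10, -0.02]))  # → True  (high word posteriors)
print(accept([-1.5, -2.0, -0.9]))     # → False (low word posteriors)
```

Rejected hypotheses can then be routed to a fallback, e.g. asking the user to repeat the utterance.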
A system should perform optimally irrespective of the input speaker's characteristics. However, most systems are inflexible and non-adaptive, so there is room to improve speaker adaptability. Achieving this requires a solid criterion for evaluating the quality of spoken content, and the system must also be made robust and adaptive towards new speakers.
This thesis implements a confidence score based on posterior probabilities to examine output quality, using the speech data and corpora provided by Devoca Oy. Furthermore, two speaker adaptation algorithms, Maximum Likelihood Linear Regression (MLLR) and Maximum a Posteriori (MAP), are applied to a GMM-HMM system and their results compared. Experiments show that MAP adaptation brings a 2% to 25% improvement in the word error rate of the semi-continuous model and is recommended for use in the commercial product. The results of other methods are also reported. In addition, the word graph is suggested as the method for obtaining posterior probabilities. Since the confidence score does not guarantee a comparable improvement in results, it is proposed as an optional feature for the system.
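The MAP mean update used for GMM adaptation interpolates each component's speaker-independent mean with the mean of the new speaker's data, weighted by a relevance factor. The sketch below is a generic illustration of that update, not code from the thesis; the relevance factor `tau` and the toy data are assumptions:

```python
import numpy as np

def map_adapt_means(prior_means, frames, responsibilities, tau=10.0):
    """MAP adaptation of GMM component means with relevance factor tau.

    prior_means:      (K, D) speaker-independent component means
    frames:           (T, D) adaptation data from the new speaker
    responsibilities: (T, K) posterior occupancy of each component per frame
    """
    n_k = responsibilities.sum(axis=0)   # (K,) soft counts per component
    ex_k = responsibilities.T @ frames   # (K, D) responsibility-weighted sums
    # Interpolate prior mean and data mean; more data (larger n_k) pulls harder.
    return (tau * prior_means + ex_k) / (tau + n_k)[:, None]

rng = np.random.default_rng(0)
prior = np.zeros((2, 3))                         # speaker-independent means at 0
frames = rng.normal(1.0, 0.1, size=(50, 3))      # new speaker's data near 1
resp = np.full((50, 2), 0.5)                     # equal occupancy for the toy case
adapted = map_adapt_means(prior, frames, resp)
print(adapted)  # means shifted from 0 toward 1 as adaptation data accumulates
```

With little data the update stays close to the prior, which is why MAP behaves well when only a few adaptation utterances per speaker are available.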
Effective weakly supervised semantic frame induction using expression sharing in hierarchical hidden Markov models
We present a framework for the induction of semantic frames from utterances
in the context of an adaptive command-and-control interface. The system is
trained on an individual user's utterances and the corresponding semantic
frames representing controls. During training, no prior information on the
alignment between utterance segments and frame slots and values is available.
In addition, semantic frames in the training data can contain information that
is not expressed in the utterances. To tackle this weakly supervised
classification task, we propose a framework based on Hidden Markov Models
(HMMs). Structural modifications, resulting in a hierarchical HMM, and an
extension called expression sharing are introduced to minimize the amount of
training time and effort required for the user.
The dataset used for the present study is PATCOR, which contains commands
uttered in the context of a vocally guided card game, Patience. Experiments
were carried out on orthographic and phonetic transcriptions of commands,
segmented on different levels of n-gram granularity. The experimental results
show positive effects of all the studied system extensions, with some effect
differences between the different input representations. Moreover, evaluation
experiments on held-out data with the optimal system configuration show that
the extended system is able to achieve high accuracies with relatively small
amounts of training data.
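The slot-filling idea behind such a framework can be illustrated, in heavily simplified form, with a plain (non-hierarchical) HMM that aligns command words to frame slots via Viterbi decoding. All states, probabilities, and vocabulary below are invented for the sketch; the paper's hierarchical HMM with expression sharing is considerably richer:

```python
import numpy as np

# Toy slot states for a card-game command, with made-up log-space parameters.
states = ["ACTION", "CARD", "POSITION"]
trans = np.log(np.array([[0.1, 0.8, 0.1],    # from ACTION
                         [0.1, 0.3, 0.6],    # from CARD
                         [0.2, 0.3, 0.5]]))  # from POSITION
start = np.log(np.array([0.8, 0.1, 0.1]))
emit = {
    "ACTION":   {"put": 0.7, "move": 0.25, "<unk>": 0.05},
    "CARD":     {"ace": 0.5, "hearts": 0.45, "<unk>": 0.05},
    "POSITION": {"stack": 0.6, "three": 0.35, "<unk>": 0.05},
}

def viterbi(words):
    """Most likely slot sequence for a word sequence under the toy HMM."""
    def e(s, w):
        return np.log(emit[s].get(w, emit[s]["<unk>"]))
    v = start + np.array([e(s, words[0]) for s in states])
    back = []
    for w in words[1:]:
        scores = v[:, None] + trans          # scores[from, to]
        back.append(scores.argmax(axis=0))   # best predecessor for each state
        v = scores.max(axis=0) + np.array([e(s, w) for s in states])
    path = [int(v.argmax())]
    for b in reversed(back):                 # follow backpointers
        path.append(int(b[path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi(["put", "ace", "hearts", "stack"]))
# → ['ACTION', 'CARD', 'CARD', 'POSITION']
```

In the weakly supervised setting described above, such parameters would not be hand-specified but estimated from utterance/frame pairs without segment-level alignment.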