Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments such as mobile communication services and smart homes.
AdaStreamLite: Environment-adaptive Streaming Speech Recognition on Mobile Devices
Streaming speech recognition aims to transcribe speech to text in a streaming manner, providing real-time speech interaction for smartphone users. However, it is not trivial to build a high-performance streaming speech recognition system that runs entirely on mobile platforms, owing to complex real-world acoustic environments and the limited computational resources of smartphones. Most existing solutions generalize poorly to unseen environments and have difficulty working with streaming speech. In this paper, we design AdaStreamLite, an environment-adaptive streaming speech recognition tool for smartphones. AdaStreamLite interacts with its surroundings to capture the characteristics of the current acoustic environment and improve robustness against ambient noise in a lightweight manner. We design an environment representation extractor that models acoustic environments as compact feature vectors, and construct a representation lookup table to improve the generalization of AdaStreamLite to unseen environments. We train our system on large, publicly available speech datasets covering different languages, and conduct experiments across a wide range of real acoustic environments with different smartphones. The results show that AdaStreamLite outperforms state-of-the-art methods in recognition accuracy, computational resource consumption, and robustness against unseen environments.
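The abstract describes the representation lookup table only at a high level. As a rough illustration of the general idea, matching a segment's environment embedding against a table of stored environment vectors by cosine similarity could be sketched as follows; the function name, table contents, and tiny 3-dimensional embeddings are invented for illustration and are not the authors' implementation:

```python
import numpy as np

def nearest_environment(query, table):
    """Return the key of the stored environment vector most similar
    to the query embedding, by cosine similarity."""
    keys = list(table)
    mat = np.stack([table[k] for k in keys])            # (n_envs, dim)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = mat @ q                                    # cosine similarities
    return keys[int(np.argmax(scores))]

# toy 3-dimensional embeddings, for illustration only
table = {"cafe":   np.array([1.0, 0.0, 0.0]),
         "street": np.array([0.0, 1.0, 0.0]),
         "office": np.array([0.0, 0.0, 1.0])}
print(nearest_environment(np.array([0.9, 0.1, 0.0]), table))  # cafe
```

A real system would use learned embeddings of far higher dimension, but the lookup step itself reduces to this kind of nearest-neighbour match.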
Proceedings of the SAB'06 Workshop on Adaptive Approaches for Optimizing Player Satisfaction in Computer and Physical Games
These proceedings contain the papers presented at the Workshop on Adaptive Approaches for Optimizing Player Satisfaction in Computer and Physical Games, held at the Ninth International Conference on the Simulation of Adaptive Behavior (SAB’06): From Animals to Animats 9 in Rome, Italy, on 1 October 2006.
We were motivated by the current state of the art in intelligent game design using adaptive approaches. Artificial Intelligence (AI) techniques are mainly focused on generating human-like, intelligent character behaviors, yet there is generally little further analysis of whether these behaviors contribute to the satisfaction of the player. The implicit hypothesis motivating this research is that intelligent opponent behaviors enable the player to gain more satisfaction from the game. This hypothesis may well be true; however, since no notion of entertainment or enjoyment is explicitly defined, there is little evidence that a specific character behavior generates enjoyable games.
Our objective in holding this workshop was to encourage the study, development, integration, and evaluation of adaptive methodologies based on richer forms of human-machine interaction for augmenting the player's gameplay experience. We wanted to encourage a dialogue among researchers in AI, human-computer interaction, and psychology who investigate dissimilar methodologies for improving gameplay experiences. We expected that this workshop would yield an understanding of state-of-the-art approaches for capturing and augmenting player satisfaction in interactive systems such as computer games.
Our invited speaker was Hakon Steinø, Technical Producer at IO-Interactive, who discussed applied AI research at IO-Interactive, portrayed future trends of AI in the computer game industry, and debated the use of academic-oriented methodologies for augmenting player satisfaction. The sessions of presentations and discussions were classified into three themes: Adaptive Learning, Examples of Adaptive Games, and Player Modeling.
The Workshop Committee did a great job in providing suggestions and informative
reviews for the submissions; thank you! This workshop was in part supported by the
Danish National Research Council (project no: 274-05-0511). Finally, thanks to all the
participants; we hope you found it useful!
I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating condition in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database, featuring 1.6k utterances from 30 gender-balanced German speakers (mean age: 26.1 years, standard deviation: 2.66 years), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and both read and spontaneous speech; the database is made publicly available for research purposes. We start by demonstrating that, for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification based both on brute-forced low-level acoustic features and on higher-level features related to intelligibility, obtained from an automatic speech recognizer. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier in a leave-one-speaker-out evaluation framework. Results show that binary prediction of the eating condition (i.e., eating or not eating) can be solved easily, independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food as well as not eating. Early fusion of the intelligibility-related features with the brute-forced acoustic feature set improves performance on read speech, reaching 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to a 56.2% coefficient of determination.
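The evaluation protocol named in the abstract, leave-one-speaker-out cross-validation scored by average (unweighted) recall, can be sketched independently of any particular classifier. This minimal sketch uses invented toy data and is not the authors' code:

```python
import numpy as np

def leave_one_speaker_out(speakers):
    """Yield (train_idx, test_idx) index pairs, holding out one speaker per fold."""
    speakers = np.asarray(speakers)
    for spk in np.unique(speakers):
        yield np.where(speakers != spk)[0], np.where(speakers == spk)[0]

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls -- the 'average recall' metric reported above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean([np.mean(y_pred[y_true == c] == c)
                          for c in np.unique(y_true)]))

# toy setup: 6 utterances from 3 speakers, binary eating / not-eating labels
speakers = ["s1", "s1", "s2", "s2", "s3", "s3"]
folds = list(leave_one_speaker_out(speakers))
print(len(folds))                                      # one fold per speaker: 3
print(unweighted_average_recall([0, 0, 1, 1], [0, 1, 1, 1]))  # (0.5 + 1.0) / 2 = 0.75
```

Unweighted average recall is preferred over plain accuracy here because the class distribution (six foods vs. not eating) is imbalanced.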
Improving Dysarthric Speech Recognition by Enriching Training Datasets
Dysarthria is a motor speech disorder that results from disruptions in the neuro-motor interface; it is characterised by poor articulation of phonemes and hyper-nasality, and is characteristically different from normal speech. Many modern automatic speech recognition systems focus on a narrow range of speech diversity and, as a consequence, exclude groups of speakers who deviate in gender, race, age, or speech impairment when building training datasets. This study attempts to develop an automatic speech recognition system that deals with dysarthric speech using only limited dysarthric speech data. Speech utterances collected from the TORGO database are used to conduct experiments on a wav2vec 2.0 model trained only on the Librispeech 960h dataset, to obtain a baseline word error rate (WER) when recognising dysarthric speech. A version of the Librispeech model fine-tuned on multi-language datasets was tested to see whether it would improve accuracy, achieving a top WER reduction of 24.15% for one of the male dysarthric speakers in the dataset. Transfer learning with speech recognition models, and preprocessing dysarthric speech to improve its intelligibility using generative adversarial networks, were limited in their potential by the lack of a dysarthric speech dataset of adequate size. The main conclusion drawn from this study is that improving performance requires a large, diverse dysarthric speech dataset, comparable in size to the datasets used to train machine-learning ASR systems like Librispeech, with different types of speech, scripted and unscripted.
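The word error rate used as the baseline metric above is standard Levenshtein (edit) distance over word tokens, normalised by the reference length. A minimal self-contained sketch, not tied to the wav2vec 2.0 pipeline, might look like:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is common with heavily disordered speech.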
Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System
This thesis presents a novel two-stage multimodal speech enhancement system, making use of both visual and audio information to filter speech, and explores the extension of this system with fuzzy logic to demonstrate proof of concept for an envisaged autonomous, adaptive, and context-aware multimodal system. The proposed cognitively inspired framework is scalable by design: the techniques used in individual parts of the system can be upgraded, and there is scope for the initial framework presented here to be expanded.
In the proposed system, the concept of single-modality two-stage filtering is extended to include the visual modality. Noisy speech received by a microphone array is first pre-processed by visually derived Wiener filtering, employing the novel use of the Gaussian Mixture Regression (GMR) technique and making use of associated visual speech information extracted with a state-of-the-art Semi Adaptive Appearance Models (SAAM) based lip-tracking approach. The pre-processed speech is then enhanced further by audio-only beamforming using a state-of-the-art Transfer Function Generalised Sidelobe Canceller (TFGSC) approach. The resulting system is designed to function in challenging noisy speech environments, evaluated using speech sentences from different GRID-corpus speakers mixed with a range of noise recordings. Both objective and subjective test results, employing the widely used Perceptual Evaluation of Speech Quality (PESQ) measure, a composite objective measure, and subjective listening tests, show that this initial system delivers very encouraging results when filtering speech mixtures in difficult reverberant environments.
Some limitations of this initial framework are identified, and the extension of the multimodal system is explored through the development of a fuzzy logic based framework and a proof-of-concept demonstration. Results show that the proposed autonomous, adaptive, and context-aware multimodal framework delivers very positive results in difficult noisy speech environments, with cognitively inspired use of audio and visual information depending on environmental conditions. Finally, some concluding remarks are made along with proposals for future work.
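The visually derived Wiener filter described in the thesis uses GMR to estimate clean-speech statistics from lip features; the Wiener gain itself takes the classic per-frequency-bin form G = S / (S + N), where S is the estimated speech power spectral density and N the noise PSD. A minimal sketch with illustrative PSD values and an assumed gain floor (the floor value is an invented parameter, not taken from the thesis):

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, floor=1e-3):
    """Frequency-domain Wiener gain G = S / (S + N), with a lower floor
    to limit musical-noise artefacts in bins dominated by noise."""
    g = speech_psd / (speech_psd + noise_psd)
    return np.maximum(g, floor)

# toy PSDs: speech dominates bin 0, is equal in bin 1, absent in bin 2
S = np.array([4.0, 1.0, 0.0])
N = np.array([1.0, 1.0, 1.0])
print(wiener_gain(S, N))  # [0.8, 0.5, 0.001] -- last bin hits the floor
```

The gain is then applied multiplicatively to the noisy spectrum before resynthesis; in the thesis the distinctive step is that S is predicted from the visual modality rather than estimated from the audio alone.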
Phoneme and sentence-level ensembles for speech recognition
We address the question of whether and how boosting and bagging can be used for speech recognition. To do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods over a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.
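Bagging, the method the paper finds most effective, is simple to state: train each ensemble member on a bootstrap resample of the training set and combine their predictions by majority vote. A toy sketch with an invented threshold base learner (not an HMM, and not the authors' setup) illustrates the mechanics:

```python
import random
from collections import Counter

def train_stump(sample):
    """Toy base learner: threshold at the mean input of the bootstrap sample."""
    thr = sum(x for x, _ in sample) / len(sample)
    return lambda x: int(x > thr)

def bag_predict(models, x):
    """Majority vote over the ensemble of trained models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

random.seed(0)
data = [(x / 10, int(x >= 5)) for x in range(10)]   # separable toy data
# each member trains on a bootstrap resample (sampling with replacement)
models = [train_stump(random.choices(data, k=len(data))) for _ in range(11)]
print(bag_predict(models, 1.0), bag_predict(models, -0.1))  # 1 0
```

For sequence models the vote would instead be taken over phoneme or utterance hypotheses, as in the paper's phoneme-level scheme, but the bootstrap-then-vote structure is the same.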
Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey
Speaker-independent VSR is a complex task that involves identifying spoken
words or phrases from video recordings of a speaker's facial movements. Over
the years, there has been a considerable amount of research in the field of VSR
involving different algorithms and datasets to evaluate system performance.
These efforts have resulted in significant progress in developing effective VSR
models, creating new opportunities for further research in this area. This
survey provides a detailed examination of the progression of VSR over the past
three decades, with a particular emphasis on the transition from
speaker-dependent to speaker-independent systems. We also provide a
comprehensive overview of the various datasets used in VSR research and the
preprocessing techniques employed to achieve speaker independence. The survey covers works published from 1990 to 2023, thoroughly analyzing each and comparing them on various parameters, and outlines the development of VSR systems over time, highlighting the need for end-to-end pipelines for speaker-independent VSR. Pictorial representations offer a clear and concise overview of the techniques used in speaker-independent VSR, aiding comprehension and analysis of the various methodologies. The survey also highlights the strengths and limitations of each technique and provides insights into developing novel approaches for analyzing visual speech cues. Overall, this comprehensive review provides insights into the current state of the art in speaker-independent VSR and highlights potential areas for future research.
Ethics of AI in Education: Towards a Community-Wide Framework
While Artificial Intelligence in Education (AIED) research has at its core the desire to support student learning, experience from other AI domains suggests that such ethical intentions are not by themselves sufficient. There is also a need to consider explicitly issues such as fairness, accountability, transparency, bias, autonomy, agency, and inclusion. At a more general level, there is also a need to differentiate between doing ethical things and doing things ethically, to understand and make pedagogical choices that are ethical, and to account for the ever-present possibility of unintended consequences. However, addressing these and related questions is far from trivial. As a first step towards addressing this critical gap, we invited 60 of the AIED community's leading researchers to respond to a survey of questions about ethics and the application of AI in educational contexts. In this paper, we first introduce issues around the ethics of AI in education. Next, we summarise the contributions of the 17 respondents and discuss the complex issues that they raised. Specific outcomes include the recognition that most AIED researchers are not trained to tackle the emerging ethical questions. A well-designed framework for engaging with the ethics of AIED, combining a multidisciplinary approach with a set of robust guidelines, seems vital in this context.