
    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.
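    As a rough illustration of the feature-extraction stage mentioned above, the sketch below computes MFCC features for a single utterance. It is a minimal example assuming the librosa library and a hypothetical file name; it is not taken from the book.

```python
# Minimal sketch of a common ASR front end: MFCC feature extraction.
# The librosa library and the file name "utterance.wav" are illustrative
# assumptions, not part of the book being summarised.
import librosa
import numpy as np

def extract_mfcc(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load an utterance and return a (frames x n_mfcc) MFCC matrix."""
    signal, sr = librosa.load(wav_path, sr=16000)              # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)     # 25 ms frames, 10 ms hop
    return mfcc.T                                              # rows = time frames

if __name__ == "__main__":
    features = extract_mfcc("utterance.wav")                   # hypothetical file
    print(features.shape)
```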

    Proceedings of the SAB'06 Workshop on Adaptive Approaches for Optimizing Player Satisfaction in Computer and Physical Games

    These proceedings contain the papers presented at the Workshop on Adaptive Approaches for Optimizing Player Satisfaction in Computer and Physical Games, held at the Ninth International Conference on the Simulation of Adaptive Behavior (SAB’06): From Animals to Animats 9 in Rome, Italy, on 1 October 2006. We were motivated by the current state of the art in intelligent game design using adaptive approaches. Artificial Intelligence (AI) techniques are mainly focused on generating human-like and intelligent character behaviors. Meanwhile, there is generally little further analysis of whether these behaviors contribute to the satisfaction of the player. The implicit hypothesis motivating this research is that intelligent opponent behaviors enable the player to gain more satisfaction from the game. This hypothesis may well be true; however, since no notion of entertainment or enjoyment is explicitly defined, there is little evidence that a specific character behavior generates enjoyable games. Our objective for holding this workshop was to encourage the study, development, integration, and evaluation of adaptive methodologies based on richer forms of human-machine interaction for augmenting gameplay experiences for the player. We wanted to encourage a dialogue among researchers in AI, human-computer interaction, and psychology who investigate dissimilar methodologies for improving gameplay experiences. We expected that this workshop would yield an understanding of state-of-the-art approaches for capturing and augmenting player satisfaction in interactive systems such as computer games. Our invited speaker was Hakon Steinø, Technical Producer of IO-Interactive, who discussed applied AI research at IO-Interactive, portrayed future trends of AI in the computer game industry, and debated the use of academic-oriented methodologies for augmenting player satisfaction. The sessions of presentations and discussions were classified into three themes: Adaptive Learning, Examples of Adaptive Games, and Player Modeling. The Workshop Committee did a great job in providing suggestions and informative reviews for the submissions; thank you! This workshop was in part supported by the Danish National Research Council (project no: 274-05-0511). Finally, thanks to all the participants; we hope you found this to be useful!

    I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification using both brute-forced low-level acoustic features and higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to a 56.2% coefficient of determination.
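    To make the evaluation protocol concrete, the following is a minimal sketch of a leave-one-speaker-out SVM evaluation scored by unweighted average recall. It assumes scikit-learn and random placeholder data rather than the actual iHEARu-EAT features.

```python
# Hedged sketch of the evaluation setup described in the abstract: an SVM
# classifier in a leave-one-speaker-out framework, scored by unweighted
# average recall. X, y, and speaker IDs below are placeholders.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC
from sklearn.metrics import recall_score

def loso_uar(X: np.ndarray, y: np.ndarray, speakers: np.ndarray) -> float:
    """Unweighted average recall over leave-one-speaker-out folds."""
    logo = LeaveOneGroupOut()
    recalls = []
    for train_idx, test_idx in logo.split(X, y, groups=speakers):
        clf = SVC(kernel="linear", C=1.0)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        recalls.append(recall_score(y[test_idx], pred,
                                    average="macro", zero_division=0))
    return float(np.mean(recalls))

# Example with random data: 30 speakers, 7 classes (6 foods + "not eating").
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = rng.integers(0, 7, size=600)
speakers = np.repeat(np.arange(30), 20)
print(f"UAR: {loso_uar(X, y, speakers):.3f}")
```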

    Improving Dysarthric Speech Recognition by Enriching Training Datasets

    Dysarthria is a motor speech disorder that results from disruptions in the neuro-motor interface and is characterised by poor articulation of phonemes and hyper-nasality; it is characteristically different from normal speech. Many modern automatic speech recognition systems focus on a narrow range of speech diversity and, as a consequence, exclude groups of speakers who deviate in aspects of gender, race, age, and speech impairment when building training datasets. This study attempts to develop an automatic speech recognition system that deals with dysarthric speech using limited dysarthric speech data. Speech utterances collected from the TORGO database are used to conduct experiments on a wav2vec 2.0 model trained only on the Librispeech 960h dataset to obtain a baseline word error rate (WER) when recognising dysarthric speech. A version of the Librispeech model fine-tuned on multi-language datasets was tested to see whether it would improve accuracy, and it achieved a top reduction of 24.15% in the WER for one of the male dysarthric speakers in the dataset. Transfer learning with speech recognition models and preprocessing dysarthric speech to improve its intelligibility using generative adversarial networks were limited in their potential due to the lack of a dysarthric speech dataset of adequate size. The main conclusion drawn from this study is that a large, diverse dysarthric speech dataset, comparable in size to the datasets used to train machine learning ASR systems such as Librispeech, with different types of speech, scripted and unscripted, is required to improve performance.
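    The baseline measurement described above can be sketched as follows: transcribe an utterance with a wav2vec 2.0 model trained on Librispeech 960h and compute the word error rate against a reference transcript. The checkpoint name, audio file, and reference text below are assumptions for illustration, not the study's TORGO setup.

```python
# Sketch of a baseline WER measurement with a wav2vec 2.0 model trained on
# Librispeech 960h. Checkpoint, file path, and reference text are illustrative.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from jiwer import wer

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(wav_path: str) -> str:
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16000:                                   # wav2vec 2.0 expects 16 kHz audio
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0].lower()

reference = "the quick brown fox"                     # hypothetical transcript
hypothesis = transcribe("speaker_utterance.wav")      # hypothetical file
print(f"WER: {wer(reference, hypothesis):.2%}")
```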

    Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System

    This thesis presents a novel two stage multimodal speech enhancement system, making use of both visual and audio information to filter speech, and explores the extension of this system with the use of fuzzy logic to demonstrate proof of concept for an envisaged autonomous, adaptive, and context aware multimodal system. The design of the proposed cognitively inspired framework is scalable, meaning that the techniques used in individual parts of the system can be upgraded and there is scope for the initial framework presented here to be expanded. In the proposed system, the concept of single modality two stage filtering is extended to include the visual modality. Noisy speech information received by a microphone array is first pre-processed by visually derived Wiener filtering, employing the novel use of the Gaussian Mixture Regression (GMR) technique and making use of associated visual speech information extracted with a state of the art Semi Adaptive Appearance Models (SAAM) based lip tracking approach. This pre-processed speech is then enhanced further by audio only beamforming using a state of the art Transfer Function Generalised Sidelobe Canceller (TFGSC) approach. The resulting system is designed to function in challenging noisy speech environments and is evaluated using speech sentences from different speakers in the GRID corpus together with a range of noise recordings. Both objective and subjective test results, employing the widely used Perceptual Evaluation of Speech Quality (PESQ) measure, a composite objective measure, and subjective listening tests, show that this initial system is capable of delivering very encouraging results with regard to filtering speech mixtures in difficult reverberant speech environments. Some limitations of this initial framework are identified, and the extension of this multimodal system is explored through the development of a fuzzy logic based framework and a proof of concept demonstration. Results show that this proposed autonomous, adaptive, and context aware multimodal framework is capable of delivering very positive results in difficult noisy speech environments, with cognitively inspired use of audio and visual information depending on environmental conditions. Finally, some concluding remarks are made, along with proposals for future work.
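    As a rough sketch of the Wiener-filtering stage, the code below applies a per-bin spectral Wiener gain given an estimate of the noise power spectrum. In the thesis the clean-speech estimate is derived from visual features via GMR; here the noise PSD is simply passed in as an argument, and NumPy/SciPy are assumed.

```python
# Minimal sketch of a spectral Wiener filter: the clean-speech PSD is estimated
# by spectral subtraction from a supplied noise PSD, then a per-bin gain is
# applied. This is a generic illustration, not the thesis's visually derived filter.
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(noisy: np.ndarray, noise_psd: np.ndarray,
                   fs: int = 16000, nperseg: int = 512) -> np.ndarray:
    """Apply a per-frequency-bin Wiener gain to a noisy speech signal."""
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    noisy_psd = np.abs(Y) ** 2
    # Spectral subtraction (floored at zero) as a crude clean-speech PSD estimate.
    speech_psd = np.maximum(noisy_psd - noise_psd[:, None], 0.0)
    gain = speech_psd / (speech_psd + noise_psd[:, None] + 1e-10)
    _, enhanced = istft(gain * Y, fs=fs, nperseg=nperseg)
    return enhanced

# Usage: estimate the noise PSD from a speech-free segment, then filter.
# noise_psd = np.mean(np.abs(stft(noise_only, fs=16000, nperseg=512)[2]) ** 2, axis=1)
# clean_est = wiener_enhance(noisy_signal, noise_psd)
```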

    Phoneme and sentence-level ensembles for speech recognition

    We address the question of whether and how boosting and bagging can be used for speech recognition. In order to do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods compared to a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.
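    The two ensemble schemes being contrasted can be illustrated with a generic scikit-learn comparison of bagging and boosting over decision trees. This is an assumed stand-in for the paper's phoneme- and utterance-level HMM ensembles, not its actual setup.

```python
# Illustrative comparison of bagging vs. boosting on a synthetic classification
# task, using scikit-learn. It only demonstrates the two ensemble schemes being
# contrasted; it does not reproduce the paper's HMM-based speech experiments.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           n_classes=4, random_state=0)

base = DecisionTreeClassifier(max_depth=5, random_state=0)
bagging = BaggingClassifier(estimator=base, n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(estimator=base, n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```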

    Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey

    Speaker-independent VSR is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements. Over the years, there has been a considerable amount of research in the field of VSR involving different algorithms and datasets to evaluate system performance. These efforts have resulted in significant progress in developing effective VSR models, creating new opportunities for further research in this area. This survey provides a detailed examination of the progression of VSR over the past three decades, with a particular emphasis on the transition from speaker-dependent to speaker-independent systems. We also provide a comprehensive overview of the various datasets used in VSR research and the preprocessing techniques employed to achieve speaker independence. The survey covers works published from 1990 to 2023, thoroughly analyzing each work and comparing them on various parameters. It provides an in-depth analysis of the evolution of speaker-independent VSR systems over this period and highlights the need to develop end-to-end pipelines for speaker-independent VSR. The pictorial representation offers a clear and concise overview of the techniques used in speaker-independent VSR, thereby aiding in the comprehension and analysis of the various methodologies. The survey also highlights the strengths and limitations of each technique and provides insights into developing novel approaches for analyzing visual speech cues. Overall, this comprehensive review provides insights into the current state of the art in speaker-independent VSR and highlights potential areas for future research.

    Ethics of AI in Education: Towards a Community-Wide Framework

    While Artificial Intelligence in Education (AIED) research has at its core the desire to support student learning, experience from other AI domains suggests that such ethical intentions are not by themselves sufficient. There is also the need to consider explicitly issues such as fairness, accountability, transparency, bias, autonomy, agency, and inclusion. At a more general level, there is also a need to differentiate between doing ethical things and doing things ethically, to understand and to make pedagogical choices that are ethical, and to account for the ever-present possibility of unintended consequences. However, addressing these and related questions is far from trivial. As a first step towards addressing this critical gap, we invited 60 of the AIED community’s leading researchers to respond to a survey of questions about ethics and the application of AI in educational contexts. In this paper, we first introduce issues around the ethics of AI in education. Next, we summarise the contributions of the 17 respondents and discuss the complex issues that they raised. Specific outcomes include the recognition that most AIED researchers are not trained to tackle the emerging ethical questions. A well-designed framework for engaging with the ethics of AIED that combines a multidisciplinary approach and a set of robust guidelines seems vital in this context.