158 research outputs found

    Study to determine potential flight applications and human factors design guidelines for voice recognition and synthesis systems

    Get PDF
    A study was conducted to determine potential commercial aircraft flight deck applications and implementation guidelines for voice recognition and synthesis. At first, a survey of voice recognition and synthesis technology was undertaken to develop a working knowledge base. Then, numerous potential aircraft and simulator flight deck voice applications were identified and each proposed application was rated on a number of criteria in order to achieve an overall payoff rating. The potential voice recognition applications fell into five general categories: programming, interrogation, data entry, switch and mode selection, and continuous/time-critical action control. The ratings of the first three categories showed the most promise of being beneficial to flight deck operations. Possible applications of voice synthesis systems were categorized as automatic or pilot selectable and many were rated as being potentially beneficial. In addition, voice system implementation guidelines and pertinent performance criteria are proposed. Finally, the findings of this study are compared with those made in a recent NASA study of a 1995 transport concept

    Malay articulation system for early screening diagnostic using hidden markov model and genetic algorithm

    Get PDF
    Speech recognition is an important technology and can be used as a great aid for individuals with sight or hearing disabilities today. There are extensive research interest and development in this area for over the past decades. However, the prospect in Malaysia regarding the usage and exposure is still immature even though there is demand from the medical and healthcare sector. The aim of this research is to assess the quality and the impact of using computerized method for early screening of speech articulation disorder among Malaysian such as the omission, substitution, addition and distortion in their speech. In this study, the statistical probabilistic approach using Hidden Markov Model (HMM) has been adopted with newly designed Malay corpus for articulation disorder case following the SAMPA and IPA guidelines. Improvement is made at the front-end processing for feature vector selection by applying the silence region calibration algorithm for start and end point detection. The classifier had also been modified significantly by incorporating Viterbi search with Genetic Algorithm (GA) to obtain high accuracy in recognition result and for lexical unit classification. The results were evaluated by following National Institute of Standards and Technology (NIST) benchmarking. Based on the test, it shows that the recognition accuracy has been improved by 30% to 40% using Genetic Algorithm technique compared with conventional technique. A new corpus had been built with verification and justification from the medical expert in this study. In conclusion, computerized method for early screening can ease human effort in tackling speech disorders and the proposed Genetic Algorithm technique has been proven to improve the recognition performance in terms of search and classification task

    I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

    Get PDF
    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i. e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient

    Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation

    Get PDF
    Over the past few years, speech recognition technology performance on tasks ranging from isolated digit recognition to conversational speech has dramatically improved. Performance on limited recognition tasks in noiseree environments is comparable to that achieved by human transcribers. This advancement in automatic speech recognition technology along with an increase in the compute power of mobile devices, standardization of communication protocols, and the explosion in the popularity of the mobile devices, has created an interest in flexible voice interfaces for mobile devices. However, speech recognition performance degrades dramatically in mobile environments which are inherently noisy. In the recent past, a great amount of effort has been spent on the development of front ends based on advanced noise robust approaches. The primary objective of this thesis was to analyze the performance of two advanced front ends, referred to as the QIO and MFA front ends, on a speech recognition task based on the Wall Street Journal database. Though the advanced front ends are shown to achieve a significant improvement over an industry-standard baseline front end, this improvement is not operationally significant. Further, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end-specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Finally, we also show that mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends

    Representation learning for unsupervised speech processing

    Get PDF
    Automatic speech recognition for our most widely used languages has recently seen substantial improvements, driven by improved training procedures for deep artificial neural networks, cost-effective availability of computational power at large scale, and, crucially, availability of large quantities of labelled training data. This success cannot be transferred to low and zero resource languages where the requisite transcriptions are unavailable. Unsupervised speech processing promises better methods for dealing with under-resourced languages. Here we investigate unsupervised neural network based models for learning frame- and sequence- level representations with the goal of improving zero-resource speech processing. Good representations eliminate differences in accent, gender, channel characteristics, and other factors to model subword or whole-term units for within- and across- speaker speech unit discrimination. We present two contributions focussing on unsupervised learning of frame-level representations: (1) an improved version of the correspondence autoencoder applied to the INTERSPEECH 2015 Zero Resource Challenge, and (2) a proposed model for learning representations that explicitly optimize speech unit discrimination. We also present two contributions focussing on efficiency and scalability of unsupervised speech processing: (1) a proposed model and pilot experiments for learning a linear-time approximation of the quadratic-time dynamic time warping algorithm, and (2) a series of model proposals for learning fixed size representations of variable length speech segments enabling efficient vector space similarity measures

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Acoustic Approaches to Gender and Accent Identification

    Get PDF
    There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this is a similar problem to language identification, di�erent accents of a language exhibit more fine-grained di�erences between classes than languages. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions to state of the art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification, and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of this thesis is concerned with the application of the i-Vector technique to accent identification, which is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high accuracy accent identification without reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector based accent classification that improve the standard approaches usually applied for speaker or language identification, which are insu�cient. We demonstrate that very good accent identification performance is possible with acoustic methods by considering di�erent i-Vector projections, frontend parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers we can obtain from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to 90% identification rate. This performance is even better than previously reported acoustic-phonotactic based systems on the same corpus, and is very close to performance obtained via transcription based accent identification. Finally, we demonstrate that the utilization of our techniques for speech recognition purposes leads to considerably lower word error rates. Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition

    Emotion Embeddings \unicode{x2014} Learning Stable and Homogeneous Abstractions from Heterogeneous Affective Datasets

    Full text link
    Human emotion is expressed in many communication modalities and media formats and so their computational study is equally diversified into natural language processing, audio signal analysis, computer vision, etc. Similarly, the large variety of representation formats used in previous research to describe emotions (polarity scales, basic emotion categories, dimensional approaches, appraisal theory, etc.) have led to an ever proliferating diversity of datasets, predictive models, and software tools for emotion analysis. Because of these two distinct types of heterogeneity, at the expressional and representational level, there is a dire need to unify previous work on increasingly diverging data and label types. This article presents such a unifying computational model. We propose a training procedure that learns a shared latent representation for emotions, so-called emotion embeddings, independent of different natural languages, communication modalities, media or representation label formats, and even disparate model architectures. Experiments on a wide range of heterogeneous affective datasets indicate that this approach yields the desired interoperability for the sake of reusability, interpretability and flexibility, without penalizing prediction quality. Code and data are archived under https://doi.org/10.5281/zenodo.7405327 .Comment: 18 pages, 6 figure

    Structured prediction and generative modeling using neural networks

    Get PDF
    Cette thèse traite de l'usage des Réseaux de Neurones pour modélisation de données séquentielles. La façon dont l'information a été ordonnée et structurée est cruciale pour la plupart des données. Les mots qui composent ce paragraphe en constituent un exemple. D'autres données de ce type incluent les données audio, visuelles et génomiques. La Prédiction Structurée est l'un des domaines traitant de la modélisation de ces données. Nous allons aussi présenter la Modélisation Générative, qui consiste à générer des points similaires aux données sur lesquelles le modèle a été entraîné. Dans le chapitre 1, nous utiliserons des données clients afin d'expliquer les concepts et les outils de l'Apprentissage Automatique, incluant les algorithmes standards d'apprentissage ainsi que les choix de fonction de coût et de procédure d'optimisation. Nous donnerons ensuite les composantes fondamentales d'un Réseau de Neurones. Enfin, nous introduirons des concepts plus complexes tels que le partage de paramètres, les Réseaux Convolutionnels et les Réseaux Récurrents. Le reste du document, nous décrirons de plusieurs types de Réseaux de Neurones qui seront à la fois utiles pour la prédiction et la génération et leur application à des jeux de données audio, d'écriture manuelle et d'images. Le chapitre 2 présentera le Réseau Neuronal Récurrent Variationnel (VRNN pour variational recurrent neural network). Le VRNN a été développé dans le but de générer des échantillons semblables aux exemples de la base d'apprentissage. Nous présenterons des modèles entraînées de manière non-supervisée afin de générer du texte manuscrites, des effets sonores et de la parole. Non seulement ces modèles prouvent leur capacité à apprendre les caractéristiques de chaque type de données mais établissent aussi un standard en terme de performance. Dans le chapitre 3 sera présenté ReNet, un modèle récemment développé. ReNet utilise les sorties structurées d'un Réseau Neuronal Récurrent pour classifier des objets. Ce modèle atteint des performances compétitives sur plusieurs tâches de reconnaissance d'images, tout en utilisant une architecture conçue dès le départ pour de la Prédiction Structurée. Dans ce cas-ci, les résultats du modèle sont utilisés simplement pour de la classification mais des travaux suivants (non inclus ici) ont utilisé ce modèle pour de la Prédiction Structurée. Enfin, au Chapitre 4 nous présentons les résultats récents non-publiés en génération acoustique. Dans un premier temps, nous fournissons les concepts musicaux et représentations numériques fondamentaux à la compréhension de notre approche et introduisons ensuite une base de référence et de nouveaux résultats de recherche avec notre modèle, RNN-MADE. Ensuite, nous introduirons le concept de synthèse vocale brute et discuterons de notre recherche en génération. Dans notre dernier Chapitre, nous présenterons enfin un résumé des résultats et proposerons de nouvelles pistes de recherche.In this thesis we utilize neural networks to effectively model data with sequential structure. There are many forms of data for which both the order and the structure of the information is incredibly important. The words in this paragraph are one example of this type of data. Other examples include audio, images, and genomes. The work to effectively model this type of ordered data falls within the field of structured prediction. We also present generative models, which attempt to generate data that appears similar to the data which the model was trained on. In Chapter 1, we provide an introduction to data and machine learning. First, we motivate the need for machine learning by describing an expert system built on a customer database. This leads to a discussion of common algorithms, losses, and optimization choices in machine learning. We then progress to describe the basic building blocks of neural networks. Finally, we add complexity to the models, discussing parameter sharing and convolutional and recurrent layers. In the remainder of the document, we discuss several types of neural networks which find common use in both prediction and generative modeling and present examples of their use with audio, handwriting, and images datasets. In Chapter 2, we introduce a variational recurrent neural network (VRNN). Our VRNN is developed with to generate new sequential samples that resemble the dataset that is was trained on. We present models that learned in an unsupervised manner how to generate handwriting, sound effects, and human speech setting benchmarks in performance. Chapter 3 shows a recently developed model called ReNet. In ReNet, intermediate structured outputs from recurrent neural networks are used for object classification. This model shows competitive performance on a number of image recognition tasks, while using an architecture designed to handle structured prediction. In this case, the final model output is only used for simple classification, but follow-up work has expanded to full structured prediction. Lastly, in Chapter 4 we present recent unpublished experiments in sequential audio generation. First we provide background in musical concepts and digital representation which are fundamental to understanding our approach and then introduce a baseline and new research results using our model, RNN-MADE. Next we introduce the concept of raw speech synthesis and discuss our investigation into generation. In our final chapter, we present a brief summary of results and postulate future research directions
    corecore