22 research outputs found

    Automatic transcription and phonetic labelling of dyslexic children's reading in Bahasa Melayu

    Get PDF
    Automatic speech recognition (ASR) is potentially helpful for children who suffer from dyslexia. However, the highly phonetically similar errors in dyslexic children's reading affect the accuracy of ASR. This study therefore evaluates whether automatic transcription and phonetic labelling of dyslexic children's reading in Bahasa Melayu (BM) gives acceptable ASR accuracy. Three objectives were set: first, to produce manual transcription and phonetic labelling; second, to construct automatic transcription and phonetic labelling using forced alignment; and third, to compare the accuracy obtained with automatic transcription and phonetic labelling against that obtained with manual transcription and phonetic labelling. To accomplish these goals, the methods used include manual speech labelling and segmentation, forced alignment, Hidden Markov Model (HMM) and Artificial Neural Network (ANN) training, and, to measure ASR accuracy, Word Error Rate (WER) and False Alarm Rate (FAR). A total of 585 speech files were used for manual transcription, forced alignment and the training experiments. The ASR engine using automatic transcription and phonetic labelling obtained an optimum accuracy of 76.04%, with a WER as low as 23.96% and a FAR of 17.9%. These results are very close to those of the ASR engine using manual transcription, namely an accuracy of 76.26%, a WER as low as 23.97% and a FAR of 17.9%. In conclusion, the accuracy obtained with automatic transcription and phonetic labelling is acceptable for using ASR to help dyslexic children learn to read in BM.
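
    Word Error Rate, the metric quoted above, is conventionally computed from the Levenshtein alignment between the reference transcription and the recogniser hypothesis. The sketch below is a minimal illustration of that standard computation; the example sentences are invented and not taken from the study's BM corpus.

```python
# Illustrative sketch (not the study's code): Word Error Rate (WER) computed
# from the standard Levenshtein alignment between reference and hypothesis.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example (invented words): one substitution in a four-word reference -> WER 0.25.
print(word_error_rate("saya suka makan nasi", "saya suka makan roti"))
```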

    Read my lips: Speech distortions in musical lyrics can be overcome (slightly) by facial information

    No full text
    Understanding the lyrics of many contemporary songs is difficult, and an earlier study [Hidalgo-Barnes, M., Massaro, D.W., 2007. Read my lips: an animated face helps communicate musical lyrics. Psychomusicology 19, 3–12] showed a benefit for lyrics recognition when seeing a computer-animated talking head (Baldi) mouthing the lyrics along with hearing the singer. However, the contribution of visual information was relatively small compared to what is usually found for speech. In the current experiments, our goal was to determine why the face appears to contribute less when aligned with sung lyrics than when aligned with normal speech presented in noise. The first experiment compared the contribution of the talking head when aligned with the originally sung lyrics versus when aligned with Festival text-to-speech synthesis (TtS) spoken at the original duration of the song’s lyrics. A small and similar influence of the face was found in both conditions. In the three experiments, we compared the presence of the face when the TtS durations were equated with the durations of the original musical lyrics against the case in which the lyrics were read with typical TtS durations and the speech was embedded in noise. The results indicated that the unusual, temporally distorted durations of musical lyrics decrease the contribution of the visible speech from the face.

    Intelligibility research in Brazil: empirical findings and methodological issues

    Get PDF
    The current paper addresses intelligibility, a dimension used to assess second language speech which has also been proposed as one of the goals of pronunciation instruction. Studies carried out on this construct in Brazil are revisited (BECKER, 2013; CRUZ, 2005, 2006, 2008, 2012a, 2012b; CRUZ; PEREIRA, 2006; GONÇALVES, 2014; REIS; CRUZ, 2010; RIELLA, 2013; SCHADECH, 2013), and their main findings are discussed taking into account Jenkins’ (2002) Lingua Franca core. Furthermore, methodological issues are discussed, pointing out the different foci of the studies conducted in Brazil, the variables currently examined by the Brazilian studies, and the myriad of variables contemplated by international studies that still need investigation in the Brazilian context. Some of these variables are related to the speaker or listener, or are of a linguistic nature (e.g., L2 proficiency, accent familiarity, lexical frequency), all of which could help us understand the intelligibility construct. Finally, the paper offers concluding remarks about the investigation of intelligibility and possible implications for the classroom and research realms. Keywords: Intelligibility; Brazilian English; Research method; Pronunciation assessment.

    Automated Semantic Understanding of Human Emotions in Writing and Speech

    Get PDF
    Affective Human Computer Interaction (A-HCI) will be critical for the success of new technologies that will be prevalent in the 21st century. If cell phones and the internet are any indication, there will be continued rapid development of automated assistive systems that help humans live better, more productive lives. These will not be just passive systems such as cell phones, but active assistive systems such as the robot aides in use in hospitals, homes, entertainment rooms, offices, and other work environments. Such systems will need to be able to properly deduce human emotional state before they determine how best to interact with people. This dissertation explores and extends the body of knowledge related to Affective HCI. New semantic methodologies are developed and studied for reliable and accurate detection of human emotional states and magnitudes in written and spoken speech, and for mapping emotional states and magnitudes to 3-D facial expression outputs. The automatic detection of affect in language is based on natural language processing and machine learning approaches. Two affect corpora were developed to perform this analysis. Emotion classification is performed at the sentence level using a step-wise approach which incorporates sentiment flow and sentiment composition features. For emotion magnitude estimation, a regression model was developed to predict the evolving emotional magnitude of actors. Emotional magnitudes at any point during a story or conversation are determined by 1) the previous emotional state magnitude; 2) new text and speech inputs that might act upon that state; and 3) information about the context the actors are in. Acoustic features are also used to capture additional information from the speech signal. Evaluation of the automatic understanding of affect is performed by testing the model on a testing subset of the newly extended corpus. To visualize actor emotions as perceived by the system, a methodology was also developed to map predicted emotion class magnitudes to 3-D facial parameters using vertex-level mesh morphing. The developed sentence-level emotion state detection approach achieved classification accuracies as high as 71% for the neutral vs. emotion classification task on a test corpus of children's stories. After class re-sampling, the step-wise classification methodology achieved accuracies in the 56% to 84% range for each emotion class and polarity on a test subset of a medical drama corpus. For emotion magnitude prediction, the developed recurrent (prior-state feedback) regression model, using both text-based and acoustic features, achieved correlation coefficients in the range of 0.69 to 0.80. This prediction function was modeled using a non-linear approach based on Support Vector Regression (SVR) and performed better than approaches based on Linear Regression or Artificial Neural Networks.
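
    The "recurrent (prior-state feedback) regression" described above can be read as an SVR whose input vector includes the previously predicted magnitude. Below is a minimal sketch under that reading; the synthetic data and feature dimensions are stand-ins, not the dissertation's actual corpus or feature set.

```python
# Sketch only: prior-state-feedback regression with SVR on synthetic data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_turns, n_feats = 200, 8
X_lex_acoustic = rng.normal(size=(n_turns, n_feats))   # stand-in text + acoustic features
y_magnitude = rng.uniform(0, 1, size=n_turns)          # stand-in emotion magnitudes

# Training: append the *previous* ground-truth magnitude as an extra feature.
prev = np.concatenate([[0.0], y_magnitude[:-1]])
X_train = np.hstack([X_lex_acoustic, prev[:, None]])
model = SVR(kernel="rbf", C=1.0).fit(X_train, y_magnitude)

# Inference: feed the model's own previous prediction back in at each step.
state = 0.0
predictions = []
for x in X_lex_acoustic:
    state = float(model.predict(np.hstack([x, state])[None, :])[0])
    predictions.append(state)
```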

    Human-human multi-threaded spoken dialogs in the presence of driving

    Get PDF
    The problem addressed in this research is that engineers designing interfaces do not have enough data about the interaction between multi-threaded dialogs and manual-visual tasks. Our goal was to investigate this interaction. We proposed to analyze how humans handle multi-threaded dialogs while engaged in a manual-visual task; more specifically, we looked at the interaction between performance on two spoken tasks and driving. The novelty of this dissertation is in its focus on the intersection between a manual-visual task and a multi-threaded speech communication between two humans. We proposed an experiment setup that is suitable for investigating multi-threaded spoken dialogs while subjects are involved in a manual-visual task. In our experiments one participant drove a simulated vehicle while talking with another participant located in a different room. The participants communicated using headphones and microphones. Both participants performed an ongoing task, which was interrupted by an interrupting task; both tasks were done using speech. We collected corpora of annotated data from our experiments and analyzed the data to verify the suitability of the proposed experiment setup. We found that, as expected, driving and our spoken tasks influenced each other. We also found that the timing of interruption influenced the spoken tasks. Unexpectedly, the data indicate that the ongoing task was more influenced by driving than the interrupting task was; on the other hand, the interrupting task influenced driving more than the ongoing task did. This suggests that the multiple resource model [1] does not capture the complexity of the interactions between the manual-visual and spoken tasks. We proposed that perceived urgency or perceived task difficulty plays a role in how the tasks influence each other.

    Using auxiliary sources of knowledge for automatic speech recognition

    Get PDF
    Standard hidden Markov model (HMM) based automatic speech recognition (ASR) systems usually use cepstral features as acoustic observations and phonemes as subword units. The speech signal exhibits a wide range of variability, for example due to environmental and speaker variation. This leads to different kinds of mismatch, such as mismatch between acoustic features and acoustic models, or between acoustic features and pronunciation models (given the acoustic models). The main focus of this work is on integrating auxiliary knowledge sources into standard ASR systems so as to make the acoustic models more robust to the variabilities in the speech signal. We refer to sources of knowledge that are able to provide additional information about the sources of variability as auxiliary sources of knowledge. The auxiliary knowledge sources primarily investigated in the present work are auxiliary features and auxiliary subword units. Auxiliary features are a secondary source of information outside the standard cepstral features. They can be estimated from the speech signal (e.g., pitch frequency, short-term energy and rate-of-speech) or obtained from additional measurements (e.g., articulator positions or visual information). They are correlated with the standard acoustic features, and thus can aid in estimating better acoustic models that are more robust to the variabilities present in the speech signal. The auxiliary features investigated are pitch frequency, short-term energy and rate-of-speech. These features can be modelled in standard ASR either by concatenating them to the standard acoustic feature vectors (see the sketch after this abstract) or by using them to condition the emission distribution (as is done in gender-based acoustic modelling). We have studied these two approaches within the framework of hybrid HMM/artificial neural network based ASR, dynamic Bayesian network based ASR and the TANDEM system on different ASR tasks. Our studies show that by modelling auxiliary features along with standard acoustic features, the performance of the ASR system can be improved in both clean and noisy conditions. We have also proposed an approach to evaluate the adequacy of the baseform pronunciation model of words. This approach allows us to compare different acoustic models as well as to extract pronunciation variants. Through the proposed approach to evaluating the baseform pronunciation model, we show that the matching and discriminative properties of a single baseform pronunciation can be improved by integrating auxiliary knowledge sources into standard ASR. Standard ASR systems usually use phonemes as the subword units in a Markov chain to model words. In the present thesis, we also study a system where word models are described by two parallel chains of subword units: one for phonemes and the other for graphemes (phoneme-grapheme based ASR). Models for both types of subword units are jointly learned using maximum likelihood training. During recognition, decoding is performed using either or both of the subword unit chains. In doing so, we thus use graphemes as auxiliary subword units. The main advantage of using graphemes is that the word models can be defined easily from the orthographic transcription, and are thus relatively noise-free compared to word models based on phoneme units. At the same time, there are drawbacks to using graphemes as subword units, since there is a weak correspondence between graphemes and phonemes in languages such as English.
    Experimental studies conducted on different American English ASR tasks have shown that the proposed phoneme-grapheme based ASR system can perform better than the standard ASR system that uses only phonemes as its subword units. Furthermore, while modelling context-dependent graphemes (similar to context-dependent phonemes), we observed that context-dependent graphemes behave like phonemes. ASR studies conducted on different tasks showed that by modelling context-dependent graphemes only (without any phonetic information), performance competitive with the state-of-the-art context-dependent phoneme-based ASR system can be obtained.
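
    The simpler of the two modelling options mentioned in the abstract above, concatenating auxiliary features to the cepstral vectors frame by frame, can be pictured as follows. The front-end below (librosa MFCCs, YIN pitch and RMS energy on a synthetic tone) is an assumption for illustration only, not the thesis's actual feature extraction.

```python
# Sketch only: concatenating auxiliary features (pitch, short-term energy)
# to standard cepstral features to form the per-frame observation vector.
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)   # 1 s synthetic tone standing in for speech

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # standard cepstral features
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)               # pitch frequency per frame
energy = librosa.feature.rms(y=y)[0]                        # short-term energy per frame

# Align on the shortest stream, then stack: each column is one observation
# vector o_t = [c_1..c_13, f0, energy] handed to the acoustic model.
n = min(mfcc.shape[1], len(f0), len(energy))
observations = np.vstack([mfcc[:, :n], f0[None, :n], energy[None, :n]])
print(observations.shape)  # (15, n)
```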

    Programming Language Techniques for Natural Language Applications

    Get PDF
    It is easy to imagine machines that can communicate in natural language. Constructing such machines is more difficult. The aim of this thesis is to demonstrate how declarative grammar formalisms that distinguish between abstract and concrete syntax make it easier to develop natural language applications. We describe how the type-theoretical grammar formalism Grammatical Framework (GF) can be used as a high-level language for natural language applications. By taking advantage of techniques from the field of programming language implementation, we can use GF grammars to perform portable and efficient parsing and linearization, generate speech recognition language models, implement multimodal fusion and fission, generate support code for abstract syntax transformations, generate dialogue managers, and implement speech translators and web-based syntax-aware editors. By generating application components from a declarative grammar, we can reduce duplicated work, ensure consistency, make it easier to build multilingual systems, improve linguistic quality, enable re-use across system domains, and make systems more portable.
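
    The abstract/concrete split that GF builds on can be illustrated, very loosely, outside GF itself: one abstract syntax shared by several concrete syntaxes, so that parsing with one concrete syntax and linearising with another yields translation. The toy Python sketch below is an invented analogy, not GF code or its API; the lexicons and rules are made up.

```python
# Toy analogy of abstract vs. concrete syntax (not GF itself).
from dataclasses import dataclass

@dataclass
class Pred:           # abstract syntax: Pred(subject, verb)
    subj: str
    verb: str

# Two "concrete syntaxes": mappings from abstract symbols to surface words.
english = {"john": "John", "mary": "Mary", "walk": "walks", "sleep": "sleeps"}
swedish = {"john": "John", "mary": "Mary", "walk": "gÄr", "sleep": "sover"}

def linearize(tree: Pred, lexicon: dict) -> str:
    return f"{lexicon[tree.subj]} {lexicon[tree.verb]}"

def parse(sentence: str, lexicon: dict) -> Pred:
    rev = {v: k for k, v in lexicon.items()}
    subj, verb = sentence.split()
    return Pred(rev[subj], rev[verb])

# "Translation" in miniature: parse English, linearise Swedish.
tree = parse("Mary sleeps", english)
print(linearize(tree, swedish))   # Mary sover
```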

    The SSPNet-Mobile Corpus: from the detection of non-verbal cues to the inference of social behaviour during mobile phone conversations

    Get PDF
    Mobile phones are one of the main channels of communication in contemporary society. However, the effect of the mobile phone on both the process of, and the non-verbal behaviours used during, conversations mediated by this technology remains poorly understood. This thesis investigates the role of the phone in the negotiation process, as well as the automatic analysis of non-verbal behavioural cues during conversations held over mobile telephones, following the Social Signal Processing approach. The work in this thesis includes the collection of a corpus of 60 mobile phone conversations involving 120 subjects; the development of methods for the detection of non-verbal behavioural events (laughter, fillers, speech and silence) and for the inference of characteristics that influence social interactions (personality traits and conflict handling style) from speech and movement while using the mobile telephone; and the analysis of several factors that influence the outcome of decision-making processes carried out over mobile phones (gender, age, personality, conflict handling style, and caller versus receiver role). The findings show that behavioural events can be recognised at levels well above chance by employing statistical language models, and that personality traits and conflict handling styles can be partially recognised. Among the factors analysed, participant role (caller versus receiver) was the most important in determining the outcome of negotiation processes in cases of disagreement between the parties. Finally, the corpus collected for the experiments (the SSPNet-Mobile Corpus) has been used in an international benchmarking campaign and constitutes a valuable resource for future research in Social Signal Processing and, more generally, in human-human communication.
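
    A statistical language model over sequences of non-verbal event labels, of the general kind mentioned in the abstract above, can be as simple as a smoothed bigram model over the four event classes. The sketch below uses invented event sequences rather than the SSPNet-Mobile annotations.

```python
# Sketch only: add-one-smoothed bigram model over non-verbal event labels.
from collections import Counter, defaultdict

EVENTS = ["speech", "silence", "filler", "laughter"]
training_sequences = [                                  # invented annotation sequences
    ["silence", "speech", "filler", "speech", "laughter", "silence"],
    ["speech", "silence", "speech", "filler", "speech", "silence"],
]

# Count bigram transitions between consecutive event labels.
bigrams = defaultdict(Counter)
for seq in training_sequences:
    for prev, cur in zip(seq, seq[1:]):
        bigrams[prev][cur] += 1

def transition_prob(prev: str, cur: str) -> float:
    counts = bigrams[prev]
    return (counts[cur] + 1) / (sum(counts.values()) + len(EVENTS))

def sequence_prob(seq: list) -> float:
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= transition_prob(prev, cur)
    return p

print(sequence_prob(["speech", "filler", "speech"]))
```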