
    Shared-hidden-layer Deep Neural Network for Under-resourced Language

    Training a speech recognizer with under-resourced language data remains difficult. Indonesian is considered under-resourced because of the lack of a standard speech corpus, text corpus, and dictionary. In this research, the efficacy of augmenting limited Indonesian speech training data with training data from a highly resourced language, such as English, to train an Indonesian speech recognizer was analyzed. The training was performed in the form of shared-hidden-layer deep-neural-network (SHL-DNN) training. An SHL-DNN has language-independent hidden layers and can be pre-trained and trained on multilingual data in the same way as a monolingual deep neural network. The SHL-DNN trained on Indonesian and English speech proved effective in decreasing the word error rate (WER) when decoding Indonesian dictated speech, achieving a 3.82% absolute decrease compared with a monolingual Indonesian hidden Markov model with Gaussian mixture model emissions (GMM-HMM). This was confirmed when the SHL-DNN was also used to decode Indonesian spontaneous speech, achieving a 4.19% absolute WER decrease.
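    The WER figures above are absolute reductions in word error rate, which is the word-level edit distance between the recognizer's hypothesis and a reference transcript, divided by the reference length. A minimal sketch of the standard computation (the function name and example sentences are illustrative, not taken from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

    For example, `word_error_rate("saya pergi ke pasar", "saya pergi pasar")` scores one deletion against four reference words, giving 0.25; an "absolute decrease" of 3.82% means this ratio dropped by 0.0382.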

    A Cognitive Science Reasoning in Recognition of Emotions in Audio-Visual Speech

    In this report we summarize the state of the art of speech emotion recognition from the signal-processing point of view. On the basis of multi-corpus experiments with machine-learning classifiers, we observe that existing supervised machine-learning approaches lead to database-dependent classifiers which cannot be applied to multi-language speech emotion recognition without additional training, because they discriminate the emotion classes according to the training language used. As experimental results show that humans can perform language-independent categorisation, we drew a parallel between machine recognition and the cognitive process and tried to discover the sources of these divergent results. The analysis suggests that the main difference is that speech perception allows the extraction of language-independent features, although language-dependent features are incorporated at all levels of the speech signal and play a strong discriminative role in human perception. Based on several results in related domains, we suggest that, in addition, the cognitive process of emotion recognition is based on categorisation, assisted by a hierarchical structure of the emotional categories existing in the cognitive space of all humans. We propose a strategy for developing language-independent machine emotion recognition, based on the identification of language-independent speech features and the use of additional information from visual (expression) features.

    Voice Recognition Systems for The Disabled Electorate: Critical Review on Architectures and Authentication Strategies

    An inevitable factor that makes the concept of electronic voting irresistible is that it offers the possibility of surpassing the manual voting process in terms of convenience, widespread participation, and consideration for people living with disabilities. The underlying voting technology and ballot design can determine the credibility of election results, influence how voters feel about their ability to exercise their right to vote, and shape their willingness to accept the legitimacy of electoral results. However, the adoption of e-voting systems has unveiled a new set of problems, such as security threats and the trust and reliability of voting systems and the electoral process itself. This paper presents a critical literature review of concepts, architectures, and existing authentication strategies in voice recognition systems for e-voting by the disabled electorate. Subsequently, an intelligent yet secure scheme for electronic voting systems, specifically for people living with disabilities, is presented.

    A cross-cultural investigation of the vocal correlates of emotion

    PhD Thesis. Universal and culture-specific properties of the vocal communication of human emotion are investigated in this balanced study focussing on the encoding and decoding of Happy, Sad, Angry, Fearful and Calm by English and Japanese participants (eight female encoders for each culture, and eight female and eight male decoders for each culture). Previous methodologies and findings are compared. This investigation is novel in its design of symmetrical procedures to facilitate cross-cultural comparison of the results of decoding tests and acoustic analysis; a simulation/self-induction method was used in which participants from both cultures produced, as far as possible, the same pseudo-utterances. All emotions were distinguished beyond chance irrespective of culture, except for Japanese participants' decoding of English Fearful, which was decoded at a level borderline with chance. Angry and Sad were well recognised both in-group and cross-culturally, and Happy was identified well in-group. Confusions between emotions tended to follow dimensional lines of arousal or valence. Acoustic analysis found significant distinctions between all emotions for each culture, except between the two low-arousal emotions Sad and Calm. Evidence of 'in-group advantage' was found for English decoding of Happy, Fearful and Calm and for Japanese decoding of Happy; there is support for previous evidence of East/West cultural differences in display rules. A novel concept is suggested for the finding that Japanese decoders identified Happy, Sad and Angry more reliably from English than from Japanese expressions. Whilst duration, fundamental frequency and intensity all contributed to distinctions between emotions for English, only measures of fundamental frequency significantly distinguished emotions in Japanese. Acoustic cues tended to be less salient in Japanese than in English when compared to the expected cues for high- and low-arousal emotions.
    In addition, new evidence was found of a cross-cultural influence of vowel quality upon emotion recognition.

    An Investigation of Intelligibility and Lingua Franca Core Features in Indonesian Accented English

    Recent approaches to teaching English pronunciation in second or foreign language contexts have favoured a role for students' L1 accents in the teaching and learning process, with the emphasis on intelligibility and the use of English as a Lingua Franca rather than on achieving native-like pronunciation. As far as English teaching in Indonesia is concerned, there is limited information on the intelligibility of Indonesian Accented English, as well as insufficient guidance on key pronunciation features for effective teaching. This research investigates features of Indonesian Accented English and critically assesses the intelligibility of different levels of Indonesian Accented English. English speech data were elicited from 50 Indonesian speakers using reading texts. Key phonological features of Indonesian Accented English were investigated through acoustic analysis involving spectrographic observation using the Praat speech analysis software. The intelligibility of different levels of Indonesian Accented English was measured using a transcription task performed by 24 native and non-native English listeners. The overall intelligibility of each accent was measured by examining the correctness of the transcriptions, and the key pronunciation features which caused intelligibility failure were identified by analysing the incorrect transcriptions. The analysis of the key phonological features of Indonesian Accented English showed that while there was some degree of regularity in the production of vowel duration and consonant clusters, more individual variation was observed in segmental features, particularly in the production of the consonants /v, z, ʃ/, which are absent from the Indonesian phonemic inventory.
    The results of the intelligibility analysis revealed that although lightly and moderately accented speech was significantly more intelligible than the more heavily accented speech, the native and non-native listeners did not have major problems with the intelligibility of Indonesian Accented English across the different accent levels. The analysis of incorrect transcriptions suggested that intelligibility failures were associated with combined phonological miscues rather than with a single factor. These results indicate that while Indonesian Accented English can be used effectively in international communication, they can also inform English language teaching in Indonesia.

    Recognition and cortical haemodynamics of vocal emotions-an fNIRS perspective

    Normal-hearing (NH) listeners rely heavily on variations in the fundamental frequency (F0) of speech to identify vocal emotions. Without reliable F0 cues, as is the case for cochlear implant users, listeners' ability to extract emotional meaning from speech is reduced. This thesis describes the development of an objective measure of vocal emotion recognition. The program of three experiments investigates: 1) NH listeners' ability to use F0, intensity, and speech-rate cues to recognise emotions; 2) cortical activity associated with individual vocal emotions, assessed using functional near-infrared spectroscopy (fNIRS); and 3) cortical activity evoked by vocal emotions in natural speech and in speech with uninformative F0, using fNIRS.

    Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors

    Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. Given the significant shortage of Speech and Language Pathologists (SLPs), there is increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capability. However, the scarcity and unreliable annotation of disordered child speech corpora, along with the high acoustic variation in child speech data, have impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. This thesis therefore investigates two types of detection systems that can be achieved with minimal dependency on annotated mispronounced speech data. First, a novel approach was proposed that adopts paralinguistic features representing the prosodic, spectral, and voice-quality characteristics of speech to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal. Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach was introduced to detect the pronunciation errors made due to SSDs and provide low-level diagnostic information that can be used to construct formative feedback and a detailed diagnostic report. Unlike existing MDD methods, where detection and diagnosis are performed at the phoneme level, the proposed method performs MDD at the speech-attribute level, namely the manners and places of articulation. The speech-attribute features describe the articulators involved and their interactions when making a speech sound, allowing a low-level description of the pronunciation error to be provided.
    Two novel methods to model speech attributes are further proposed in this thesis: a frame-based (phoneme-alignment) method leveraging the Multi-Task Learning (MTL) criterion and training a separate model for each attribute, and an alignment-free, jointly learnt method based on the Connectionist Temporal Classification (CTC) sequence-to-sequence criterion. The proposed techniques were evaluated using standard, publicly accessible adult and child speech corpora, while the MDD method was validated using L2 speech corpora.
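    The alignment-free method relies on CTC, whose output at inference time can be illustrated with greedy best-path decoding: take the most probable symbol per frame, collapse consecutive repeats, then drop blanks. A minimal sketch (the attribute inventory and frame probabilities below are invented for illustration, not taken from the thesis):

```python
BLANK = "<b>"

def ctc_greedy_decode(frame_probs, symbols):
    """Best-path CTC decoding: argmax per frame, collapse consecutive
    repeats, then remove blank symbols.
    frame_probs: list of per-frame probability lists aligned with `symbols`."""
    # 1) pick the most likely symbol in each frame
    path = [symbols[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    # 2) collapse consecutive duplicates and 3) strip the blank symbol
    decoded, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            decoded.append(s)
        prev = s
    return decoded

# Hypothetical manner-of-articulation attribute inventory
symbols = [BLANK, "stop", "fricative", "nasal"]
frames = [
    [0.10, 0.70, 0.10, 0.10],  # stop
    [0.10, 0.80, 0.05, 0.05],  # stop again (collapsed as a repeat)
    [0.90, 0.03, 0.03, 0.04],  # blank, separating distinct emissions
    [0.10, 0.10, 0.70, 0.10],  # fricative
]
print(ctc_greedy_decode(frames, symbols))  # ['stop', 'fricative']
```

    Because CTC marginalises over all frame-level alignments during training, such a model needs only the attribute sequence as supervision, which is what makes the method alignment-free.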

    Ultrasound cleaning of microfilters


    Foreigner-directed speech and L2 speech learning in an understudied interactional setting: the case of foreign-domestic helpers in Oman

    PhD (Integrated) Thesis. Set in Arabic-speaking Oman, the present study investigates whether speech directed to foreign domestic helpers (FDH-directed speech) is modified when compared with speech addressed to native Arabic speakers. It also explores the FDHs' ability to learn the sound system of their L2 in a near-naturalistic setting. In relation to input, the study explores whether there are any adaptations in native speakers' realizations of complex Arabic consonants, consonant clusters, and vowels in FDH-directed speech. In doing so, it compares the phonetic features of FDH-directed speech with those of other speech registers such as foreigner-directed speech (FDS), infant-directed speech (IDS) and clear speech. The study also investigates whether foreign accentedness, religion and Arabic language experience, as indexed by length of residence (LoR), play a role in the extent of the adaptations present in FDH-directed speech. In relation to L2 speech learning, the study investigates the extent to which FDHs are sensitive to the phonemic contrasts of Arabic and whether their production of complex Arabic consonants and consonant clusters is target-like. It also examines the social and linguistic factors (LoR, first and second language literacy) that play a role in the learnability of these sounds. Speech recordings were collected from 22 Omani female native Arabic speakers who interacted 1) with their FDHs and 2) with a native-speaking adult (the order was reversed for half of the participants), in both instances using a spot-the-difference task. A picture-naming task was then used to collect production data from the same FDHs, while perception data consisted of an AX forced-choice task. Results demonstrate the distinctiveness of FDH-directed speech from other speech registers.
    Neither simplification of complex sounds nor hyperarticulation of consonant contrasts was attested in FDH-directed speech, despite these being reported in other studies on FDS and IDS. We attribute this to the native speakers' familiarity with their FDHs and the formulaic nature of their daily interactions. Expansion of the vowel space was evident in this study, in line with other FDS studies. Results from the perception and production tasks revealed that FDHs fell short of native-like performance, despite the more naturalistic setting and regardless of LoR. L1 and L2 literacy played varying roles in FDHs' phonological sensitivity and production of certain contrasts. The study is original in showing that FDS is not an automatic outcome of interactions with L2 speakers, and it links these results with the unusual social setting.