
    Saudi Accented Arabic Voice Bank

    The aim of this paper is to present an Arabic speech database that represents Arabic native speakers from all the cities of Saudi Arabia. The database is called the Saudi Accented Arabic Voice Bank (SAAVB). Preparing the prompt sheets, selecting the right speakers and transcribing their speech are some of the challenges that faced the project team, and the procedures that meet these challenges are highlighted. SAAVB consists of recordings from 1033 speakers speaking Modern Standard Arabic with a Saudi accent. The SAAVB content is analyzed and the results are illustrated. The content was verified internally and externally by IBM Cairo and can be used to train speech engines such as automatic speech recognition and speaker verification systems.
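    Corpora like SAAVB are typically consumed by training pipelines through a manifest that pairs each audio file with its transcript and speaker metadata. As a purely illustrative sketch (the file name, CSV format, and column names below are assumptions, not the published SAAVB layout), such a manifest could be loaded and balance-checked like this:

```python
# Hypothetical loader for a SAAVB-style corpus manifest. The manifest file
# name and the columns (speaker_id, city, wav_path, transcript) are invented
# for illustration; consult the SAAVB documentation for the real structure.
import csv
from collections import Counter

def load_manifest(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

utterances = load_manifest("saavb_manifest.csv")
print(f"{len(utterances)} utterances loaded")

# A quick balance check across cities, the kind of content analysis the
# paper reports for its 1033 speakers.
print(Counter(u["city"] for u in utterances).most_common(5))
```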

    Voice-controlled in-vehicle infotainment system

    Speech is a form of human-to-human communication that can convey information in a context-rich way that is natural to humans. This naturalness enables us to speak while doing other things, such as driving a vehicle. With the advancement of computing technologies, more and more personal services are being introduced for the in-vehicle environment. A limiting factor for these advancements is the driver distraction they cause through increased cognitive load. This has led to the development of in-vehicle devices and applications with a heightened focus on lessening distraction. Amazon Alexa is a natural language processing system that enables its users to receive information and operate smart devices with their voices. This Master’s thesis aims to demonstrate how Alexa could be utilized to operate in-vehicle infotainment (IVI) systems. The research was conducted using the design science research methodology. The feasibility of voice-based interaction was assessed by implementing the system as a demonstrable use case in collaboration with the APPSTACLE project. Prior research was gathered by conducting a literature review on voice-based interaction and its integration into the vehicular domain. The system was designed by applying existing theories together with the requirements of the application domain. The designed system utilized the Amazon Alexa ecosystem and AWS services to provide the vehicular environment with new functionalities. Access to cloud-based speech processing and decision-making makes it possible to design an extendable speech interface through which the driver can carry out secondary tasks by voice, such as requesting navigation information. The evaluation was done by comparing the system’s performance against the derived requirements. With the results of the evaluation process, the feasibility of the system could be assessed against the objectives of the study: the resulting artefact enables the user to operate the in-vehicle infotainment system while focusing on a separate task. The research showed that speech interfaces built with modern technology can improve the handling of secondary tasks while driving, and the resulting system was operable without introducing additional distractions to the driver. The resulting artefact can be integrated into similar systems and used as a base tool for future research on voice-controlled interfaces.
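    The abstract describes the architecture (Alexa skill plus AWS services) without code. As a rough sketch of how a cloud-side skill handler might route a navigation request in such a system, consider the following; the intent name, slot name, and fetch_route() helper are hypothetical, not taken from the thesis:

```python
# Minimal sketch of an Alexa skill handler answering an in-vehicle navigation
# request, using the real ASK SDK for Python. "GetNavigationIntent", the
# "destination" slot, and fetch_route() are invented for illustration.
from ask_sdk_core.skill_builder import SkillBuilder
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_intent_name

def fetch_route(destination):
    # Placeholder for a call to a cloud routing service; a real system
    # would combine live vehicle position with a navigation API here.
    return f"Head north for two kilometres to reach {destination}."

class GetNavigationHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return is_intent_name("GetNavigationIntent")(handler_input)

    def handle(self, handler_input):
        slots = handler_input.request_envelope.request.intent.slots
        destination = slots["destination"].value
        return handler_input.response_builder.speak(
            fetch_route(destination)).response

sb = SkillBuilder()
sb.add_request_handler(GetNavigationHandler())
handler = sb.lambda_handler()  # entry point when deployed as an AWS Lambda
```

    Keeping the speech processing and decision-making in the cloud, as the thesis argues, is what makes an interface like this extendable: new secondary tasks become new intents and handlers rather than changes to the in-vehicle hardware.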

    Automatic Accent Recognition Systems and the Effects of Data on Performance

    This paper considers automatic accent recognition system performance in relation to the specific nature of the accent data. This is relevant to forensic applications, where an accent recogniser may have a place in casework involving accent classification tasks with different challenges attached. The study presented here is composed of two main parts. Firstly, it examines the performance of five different automatic accent recognition systems when distinguishing between geographically-proximate accents. Geographically-proximate accents are expected to challenge the systems by increasing the degree of similarity between the varieties to be distinguished. The second part of the study is concerned with identifying the specific phonemes which are important in a given accent recognition task, and eliminating those which are not: depending on the varieties being classified, the phonemes most useful to the task will vary. This study therefore integrates feature selection methods into the accent recognition system shown to be the highest performer, the Y-ACCDIST-SVM system, to help identify the most valuable speech segments and to increase accent recognition rates.
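    The paper does not reproduce the Y-ACCDIST-SVM implementation here. As a generic sketch of the overall idea, segment-level feature selection feeding an SVM accent classifier, the following uses scikit-learn with invented data; the feature layout (one fixed-size block per phoneme) is an assumption for illustration, not the authors' system:

```python
# Generic sketch of phoneme-level feature selection before SVM accent
# classification. Data and dimensions are synthetic placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_speakers, n_phonemes, feats_per_phoneme = 120, 40, 13
X = rng.normal(size=(n_speakers, n_phonemes * feats_per_phoneme))
y = rng.integers(0, 4, size=n_speakers)  # four geographically-proximate accents

clf = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=100),  # keep only the most discriminative features
    LinearSVC(),
)
print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```

    With real data, inspecting which feature columns survive selection indicates which phonemes carry the accent-discriminating information for a given pair of varieties, which is the paper's second research question.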

    The effect of hyperarticulation on speech comprehension under adverse listening conditions

    © 2021 The Authors. Published by Springer. This is an open access article available under a Creative Commons licence. The published version can be accessed on the publisher’s website at https://doi.org/10.1007/s00426-021-01595-2. Comprehension assesses a listener’s ability to construe the meaning of an acoustic signal in order to answer questions about its contents, while intelligibility indicates the extent to which a listener can precisely retrieve the acoustic signal. Previous comprehension studies asking listeners for sentence-level or narrative-level information used native listeners as participants. This is the first study to examine whether clear speech properties (e.g. an expanded vowel space) produce a clear speech benefit at the word level for L2 learners for speech produced in naturalistic settings. The study explored whether hyperarticulated speech was more comprehensible than non-hyperarticulated speech for L1 British English speakers and for early and late L2 British English learners, in quiet and in noise. Sixteen British English listeners, 16 native Mandarin Chinese listeners who were early L2 learners and 16 native Mandarin Chinese listeners who were late L2 learners rated hyperarticulated versus non-hyperarticulated word samples for comprehension under four listening conditions of varying white noise level (quiet, or SNRs of +16 dB, +12 dB or +8 dB), in a 3x2x4 mixed design. Mean ratings showed that all three groups found hyperarticulated speech samples easier to understand than non-hyperarticulated speech under all listening conditions. Results are discussed in terms of other findings (Uther et al., 2012) suggesting that hyperarticulation may generally improve speech processing for all language groups. The studies were funded by an Isambard Scholarship from Brunel University.
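    The +16/+12/+8 dB conditions imply scaling white noise against each speech sample to hit a target signal-to-noise ratio. A minimal sketch of that standard manipulation, using the definition SNR_dB = 10·log10(P_speech / P_noise), follows; this is illustrative, not the authors' stimulus-preparation code:

```python
# Mix white noise into a signal at a target SNR by scaling the noise so that
# 10*log10(P_speech / P_noise_scaled) equals snr_db.
import numpy as np

def add_white_noise(speech, snr_db, rng=np.random.default_rng(0)):
    noise = rng.normal(size=speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in signal
for snr in (16, 12, 8):
    noisy = add_white_noise(tone, snr)
    print(f"+{snr} dB condition prepared, {noisy.shape[0]} samples")
```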

    Investigating the build-up of precedence effect using reflection masking

    The auditory processing level involved in the build-up of precedence [Freyman et al., J. Acoust. Soc. Am. 90, 874–884 (1991)] has been investigated here by employing reflection masked threshold (RMT) techniques. Given that RMT techniques are generally assumed to address lower levels of auditory signal processing, such an approach represents a bottom-up approach to the build-up of precedence. Three conditioner configurations measuring a possible build-up of reflection suppression were compared to the baseline RMT for four reflection delays ranging from 2.5 to 15 ms. No build-up of reflection suppression was observed for any of the conditioner configurations. Build-up of template (a decrease in RMT for two of the conditioners), on the other hand, was found to be delay dependent: for five of six listeners, RMT decreased relative to the baseline at reflection delays of 2.5 and 15 ms, while at 5- and 10-ms delays no change in threshold was observed. It is concluded that the low-level auditory processing involved in RMT is not sufficient to realize a build-up of reflection suppression. This confirms suggestions that higher-level processing is involved in the build-up of the precedence effect. The observed enhancement of reflection detection (RMT) may contribute to active suppression at higher processing levels.

    Methods for pronunciation assessment in computer aided language learning

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 149-176). Learning a foreign language is a challenging endeavor that entails acquiring a wide range of new knowledge, including words, grammar, gestures, sounds, etc. Mastering these skills requires extensive practice by the learner, and opportunities may not always be available. Computer Aided Language Learning (CALL) systems provide non-threatening environments where foreign language skills can be practiced wherever and whenever a student desires. These systems often incorporate several technologies to identify the different types of errors made by a student. This thesis focuses on the problem of identifying mispronunciations made by a foreign language student using a CALL system. We make several assumptions about the nature of the learning activity: it takes place using a dialogue system, it is a task- or game-oriented activity, the student should not be interrupted by the pronunciation feedback system, and the goal of the feedback system is to identify severe mispronunciations with high reliability. Detecting mispronunciations requires a corpus of speech with human judgements of pronunciation quality. Typical approaches to collecting such a corpus use an expert phonetician to both phonetically transcribe and assign judgements of quality to each phone in a corpus. This is time consuming and expensive, and it places an extra burden on the transcriber. We describe a novel method for obtaining phone-level judgements of pronunciation quality by utilizing non-expert, crowd-sourced, word-level judgements of pronunciation. Foreign language learners typically exhibit high variation and pronunciation patterns distinct from those of native speakers, which make analysis for mispronunciation difficult. We detail a simple but effective method for transforming the vowel space of non-native speakers to make mispronunciation detection more robust and accurate. We show that this transformation not only enhances performance on a simple classification task, but also results in distributions that can be better exploited for mispronunciation detection. This transformation of the vowel space is exploited to train a mispronunciation detector using a variety of features derived from acoustic model scores and vowel class distributions. We confirm that the transformation technique results in a more robust and accurate identification of mispronunciations than traditional acoustic models. By Mitchell A. Peabody, Ph.D.
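    The abstract names a vowel-space transformation without specifying it. One generic way to normalize a learner's vowel space before mispronunciation detection is per-speaker z-scoring of formants (Lobanov normalization); the sketch below shows that standard technique with invented formant data, and is not claimed to be the thesis's exact method:

```python
# Lobanov normalization: z-score each formant dimension within one speaker's
# data, making speakers with different vocal-tract lengths comparable.
# The formant values below are synthetic placeholders.
import numpy as np

def lobanov_normalize(formants):
    """formants: array of shape (n_vowel_tokens, 2) holding (F1, F2) in Hz.

    Returns the speaker-relative vowel space; downstream, vowel-class
    distributions in this space can feed a mispronunciation detector.
    """
    mu = formants.mean(axis=0)
    sigma = formants.std(axis=0)
    return (formants - mu) / sigma

rng = np.random.default_rng(1)
learner = rng.normal(loc=[550, 1700], scale=[120, 300], size=(50, 2))
print(lobanov_normalize(learner).mean(axis=0))  # ~[0, 0] after normalization
```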