35 research outputs found

    Comparative analysis of majority language influence on North Sámi prosody using WaveNet-based modeling

    Finnmark North Sami is a variety of North Sami, an indigenous, endangered minority language spoken in the northernmost parts of Norway and Finland. Speakers of this language are bilingual and regularly speak the majority language (Finnish or Norwegian) as well as their own North Sami variety. In this paper we investigate possible influences of these majority languages on prosodic characteristics of Finnmark North Sami, and associate them with prosodic patterns prevalent in the majority languages. We present a novel methodology that: (a) automatically finds the portions of speech (words) where the prosodic differences based on majority languages are most robustly manifested; and (b) analyzes the nature of these differences in terms of intonational patterns. For the first step, we trained convolutional WaveNet speech synthesis models on North Sami speech material, modified to contain purely prosodic information, and used conditioning embeddings to find words with the greatest differences between the varieties. The subsequent exploratory analysis suggests that the differences in intonational patterns between the two Finnmark North Sami varieties are not manifested uniformly across word types (based on part-of-speech category). Instead, we argue that the differences reflect phrase-level prosodic characteristics of the majority languages.
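
    The word-selection step can be illustrated with a short sketch: given a per-variety conditioning embedding for each word type, words are ranked by the distance between their two embeddings. The data layout and function below are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch (not the authors' implementation): rank word types by
# the distance between the per-variety conditioning embeddings that a
# prosody-only WaveNet might learn. Names and shapes are assumptions.
import numpy as np

def rank_words_by_variety_difference(word_embeddings):
    """word_embeddings: dict mapping word -> dict with keys
    'finnish_majority' and 'norwegian_majority', each a 1-D embedding vector."""
    scores = {}
    for word, emb in word_embeddings.items():
        a = np.asarray(emb["finnish_majority"], dtype=float)
        b = np.asarray(emb["norwegian_majority"], dtype=float)
        # Cosine distance between the two variety-conditioned embeddings:
        # larger values suggest the word's prosody differs more across varieties.
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        scores[word] = 1.0 - cos
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with 4-dimensional embeddings:
toy = {
    "word_a": {"finnish_majority": [0.1, 0.9, 0.2, 0.0],
               "norwegian_majority": [0.2, 0.8, 0.1, 0.1]},
    "word_b": {"finnish_majority": [0.9, 0.1, 0.0, 0.3],
               "norwegian_majority": [0.0, 0.2, 0.9, 0.4]},
}
print(rank_words_by_variety_difference(toy))
```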

    Nonlinear Dynamic Invariants for Continuous Speech Recognition

    In this work, nonlinear acoustic information is combined with traditional linear acoustic information in order to produce a noise-robust set of features for speech recognition. Classical acoustic modeling techniques for speech recognition have relied on a standard assumption of linear acoustics where signal processing is primarily performed in the signal's frequency domain. While these conventional techniques have demonstrated good performance under controlled conditions, the performance of these systems suffers significant degradations when the acoustic data is contaminated with previously unseen noise. The objective of this thesis was to determine whether nonlinear dynamic invariants are able to boost speech recognition performance when combined with traditional acoustic features. Several sets of experiments are used to evaluate both clean and noisy speech data. The invariants resulted in a maximum relative increase of 11.1% for the clean evaluation set. However, an average relative decrease of 7.6% was observed for the noise-contaminated evaluation sets. The fact that recognition performance decreased with the use of dynamic invariants suggests that additional research is required for robust filtering of phase spaces constructed from noisy time series.
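
    The abstract does not name the specific invariants used, but a typical route to nonlinear dynamic features is to reconstruct a phase space from a speech frame by time-delay embedding and compute quantities such as the correlation sum over it. The sketch below illustrates that route; the delay, embedding dimension, and radius are assumptions.

```python
# Illustrative sketch only: Takens-style time-delay embedding of a speech
# frame followed by a Grassberger-Procaccia-style correlation sum.
import numpy as np

def delay_embed(x, dim=3, tau=5):
    """Return the delay-embedded trajectory of a 1-D signal x."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def correlation_sum(traj, radius):
    """Fraction of point pairs closer than `radius` in the reconstructed space."""
    d = np.linalg.norm(traj[:, None, :] - traj[None, :, :], axis=-1)
    pairs = d[np.triu_indices(len(traj), k=1)]
    return np.mean(pairs < radius)

# Toy usage on a synthetic 'frame':
frame = np.sin(np.linspace(0, 20 * np.pi, 400)) + 0.05 * np.random.randn(400)
traj = delay_embed(frame, dim=3, tau=5)
print(correlation_sum(traj, radius=0.2))
```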

    On the use of colour-based segmentation in evolutionary image composition

    Part of the IEEE World Congress on Computational Intelligence (IEEE WCCI 2018), which includes the IJCNN 2018, IEEE CEC 2018, and FUZZ-IEEE 2018 conferences. Evolutionary algorithms have been widely used in the area of creativity in order to help create art and music. We consider the recently introduced evolutionary image composition approach based on feature covariance matrices [1], which allows composing two images into a new one based on their feature characteristics. When using evolutionary image composition it is important to obtain a good weighting of interesting regions of the two images. We use colour-based segmentation based on K-Means clustering to obtain such a weighting of the images. Our results show that this preserves the chosen colour regions of the images and leads to composed images that preserve colours better than the previous approach based on saliency masks [1]. Furthermore, we evaluate our composed images in terms of aesthetic features and show that our approach based on colour-based segmentation leads to higher feature values for most of the investigated features.
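
    The weighting idea can be sketched simply: cluster the pixels of an image by colour with K-Means and turn selected colour clusters into a weight mask that the composition step can use. The cluster count and the choice of clusters to keep below are assumptions, not the paper's settings.

```python
# A minimal sketch (not the paper's implementation) of a colour-based
# weighting mask obtained by K-Means clustering of pixel colours.
import numpy as np
from sklearn.cluster import KMeans

def colour_weight_mask(image, n_clusters=4, keep=(0,)):
    """image: H x W x 3 float array in [0, 1]. Returns an H x W weight mask."""
    h, w, _ = image.shape
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(image.reshape(-1, 3))
    # Pixels belonging to the kept colour clusters get weight 1, others 0.
    return np.isin(labels, keep).astype(float).reshape(h, w)

# Toy usage on a random 'image':
img = np.random.rand(32, 32, 3)
mask = colour_weight_mask(img, n_clusters=3, keep=(1, 2))
print(mask.shape, mask.mean())
```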

    Acoustic Modelling for Under-Resourced Languages

    Automatic speech recognition systems have so far been developed for only a very few of the 4,000-7,000 existing languages. In this thesis we examine methods to rapidly create acoustic models for new, possibly under-resourced languages in a time- and cost-effective manner. For this we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.

    Efficient error correction for speech systems using constrained re-recognition

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 71-75). Efficient error correction of recognition output is a major barrier to the adoption of speech interfaces. This thesis addresses this problem through a novel correction framework and user interface. The system uses constraints provided by the user to enhance re-recognition, correcting errors with minimal user effort and time. In our web interface, users listen to the recognized utterance, marking incorrect words as they hear them. After they have finished marking errors, they submit the edits back to the speech recognizer, where they are merged with previous edits and then converted into a finite state transducer. This FST, modeling the regions of correct and incorrect words in the recognition output, is then composed with the recognizer's language model and the utterance is re-recognized. We explored the use of our error correction technique in both the lecture and restaurant domains, evaluating the types of errors and the correction performance in each domain. With our system, we have found significant improvements over other error correction techniques such as n-best lists, re-speaking or verbal corrections, and retyping, in terms of actions per correction step, corrected output rate, and ease of use.
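
    The constraint imposed by the user's markings can be pictured as a per-position lattice over the recognized word sequence: positions left unmarked are pinned to the recognized word, while marked positions are opened up to the full vocabulary. The sketch below shows that lattice as plain Python data; in the actual system the constraint is encoded as an FST and composed with the language model.

```python
# Conceptual sketch only: a per-position word lattice standing in for the
# constraint FST built from the user's correct/incorrect markings.
def constraint_lattice(hypothesis, marked_wrong, vocabulary):
    """hypothesis: list of recognized words; marked_wrong: set of indices the
    user flagged as errors; vocabulary: words the re-recognizer may propose."""
    lattice = []
    for i, word in enumerate(hypothesis):
        if i in marked_wrong:
            lattice.append(set(vocabulary))   # free slot: any word allowed
        else:
            lattice.append({word})            # constrained slot: keep the word
    return lattice

hyp = ["the", "whether", "is", "nice", "today"]
lat = constraint_lattice(hyp, marked_wrong={1},
                         vocabulary=["weather", "whether", "feather"])
print([sorted(s) for s in lat])
```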

    Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation

    Over the past few years, speech recognition technology performance on tasks ranging from isolated digit recognition to conversational speech has dramatically improved. Performance on limited recognition tasks in noise-free environments is comparable to that achieved by human transcribers. This advancement in automatic speech recognition technology, along with an increase in the compute power of mobile devices, standardization of communication protocols, and the explosion in the popularity of mobile devices, has created an interest in flexible voice interfaces for mobile devices. However, speech recognition performance degrades dramatically in mobile environments, which are inherently noisy. In the recent past, a great amount of effort has been spent on the development of front ends based on advanced noise-robust approaches. The primary objective of this thesis was to analyze the performance of two advanced front ends, referred to as the QIO and MFA front ends, on a speech recognition task based on the Wall Street Journal database. Though the advanced front ends are shown to achieve a significant improvement over an industry-standard baseline front end, this improvement is not operationally significant. Further, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end-specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Finally, we also show that mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends.
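
    A note on the figures quoted above: a "relative" improvement expresses the gap between two word error rates as a fraction of the reference system's error rate, as in the sketch below. The WER values shown are placeholders for illustration, not results from the thesis.

```python
# Relative improvement between two word error rates (WER), in percent.
def relative_improvement(wer_reference, wer_new):
    return 100.0 * (wer_reference - wer_new) / wer_reference

# Placeholder values chosen only to illustrate how a ~9.6% relative gap arises:
print(relative_improvement(wer_reference=15.0, wer_new=13.56))  # ~9.6
```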

    Pre-aspiration in Bethesda Welsh: a sociophonetic analysis

    Previous research has shown that pre-aspiration can be either a phonemic or variable linguistic feature susceptible to linguistic and extra-linguistic influences. In the case of Welsh, previous exploratory work has found the presence of pre-aspiration (Ball 1984; Morris 2010; Iosad, forthcoming; Spooner 2016), but the phonetic and phonological properties of this feature and its sociophonetic patterning in the language are not known. This paper presents analyses of the variety of Welsh spoken in Bethesda (Gwynedd). It reports the frequency of occurrence of pre-aspiration, its duration, and its noisiness. As well as describing pre-aspiration, it attempts to ascertain the extent to which this feature is influenced by linguistic and extra-linguistic factors. Wordlist data were analysed from 16 Welsh–English bilinguals from Bethesda (Gwynedd, north Wales). Speakers were aged between 16 and 18 years old and the sample was stratified by speaker sex and home language (either Welsh or English). The results indicate that pre-aspiration is frequent in both fortis and lenis plosives (the latter of which are typically devoiced in Welsh). In addition to a number of linguistic influences on its production, both speaker sex and home language were found to be significant predictors of variation for some measures. The results are discussed with reference to previous studies of pre-aspiration in other languages and work on phonetic variation in Welsh-English bilingual speech.
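
    One common pair of measures in the pre-aspiration literature is the absolute duration of the pre-aspirated interval and its duration relative to the surrounding vowel-plus-closure sequence. The sketch below computes both from annotated boundary times; the argument names and the proportional measure are assumptions, not necessarily the measures used in this study.

```python
# Hedged sketch: pre-aspiration duration (ms) and its proportion of the
# vowel + pre-aspiration + closure sequence, from annotated boundary times.
def preaspiration_measures(vowel_onset, preasp_onset, closure_onset, release):
    """All arguments are times in seconds from a segmental annotation."""
    preasp_dur = closure_onset - preasp_onset
    total = release - vowel_onset
    return {
        "preaspiration_ms": 1000.0 * preasp_dur,
        "proportion_of_sequence": preasp_dur / total if total > 0 else float("nan"),
    }

print(preaspiration_measures(vowel_onset=0.120, preasp_onset=0.210,
                             closure_onset=0.245, release=0.330))
```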

    A TUTORIAL ON FORMANT-BASED SPEECH SYNTHESIS FOR THE DOCUMENTATION OF CRITICALLY ENDANGERED LANGUAGES

    Smaller languages, that is, those spoken by 5,000 people or fewer, are dying at an alarming rate (Krauss 1992). Many are disappearing without having been studied acoustically. The methodology discussed in this paper can help build formant-based speech synthesis systems for the documentation and revitalization of these languages. Developing Text-to-Speech (TTS) functionalities for use in smart devices can breathe new life into dying languages (Crystal 2000). In the first tutorial on this topic, Koffi (2020) explained how the Arpabet transcription system can be expanded for use in African languages and beyond. In the present tutorial, Author 1 and Author 2 lay the foundations for formant-based speech synthesis patterned after Klatt (1980) and Klatt and Klatt (1990). Betine (ISO 639-3: eot), a critically endangered language in Côte d’Ivoire, West Africa, is used to illustrate the processes involved in building a speech synthesis system from the ground up for moribund languages. The steps include constructing a language model, a speaker model, a software model, and an intonation model, extracting relevant acoustic-phonetic data, and coding them. Ancillary topics such as text normalization, downsampling, and bandwidth calculations are also discussed.
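
    The source-filter idea behind Klatt-style formant synthesis can be sketched as a periodic glottal source passed through a cascade of second-order resonators, one per formant. The sketch below is a minimal illustration; the formant and bandwidth values are generic textbook-style figures, not measurements of Betine.

```python
# Minimal sketch of cascade formant synthesis in the spirit of Klatt (1980).
import numpy as np
from scipy.signal import lfilter

def resonator_coeffs(freq, bandwidth, fs):
    """Klatt-style two-pole resonator with unity gain at DC."""
    r = np.exp(-np.pi * bandwidth / fs)
    c = -r * r
    b = 2.0 * r * np.cos(2.0 * np.pi * freq / fs)
    a = 1.0 - b - c
    return [a], [1.0, -b, -c]      # numerator, denominator for lfilter

def synth_vowel(formants, bandwidths, f0=120, dur=0.4, fs=16000):
    n = int(dur * fs)
    source = np.zeros(n)
    source[:: int(fs / f0)] = 1.0           # crude impulse-train glottal source
    out = source
    for f, bw in zip(formants, bandwidths): # cascade one resonator per formant
        num, den = resonator_coeffs(f, bw, fs)
        out = lfilter(num, den, out)
    return out / (np.max(np.abs(out)) + 1e-12)

# An [a]-like vowel with three formants (illustrative values):
wave = synth_vowel([700, 1200, 2600], [90, 110, 170])
print(wave.shape)
```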

    Humanistiske data nr 3 1991
