
    Decomposed Temporal Dynamic CNN: Efficient Time-Adaptive Network for Text-Independent Speaker Verification Explained with Speaker Activation Map

    Temporal dynamic models for text-independent speaker verification extract consistent speaker information regardless of phonemes by using a temporal dynamic CNN (TDY-CNN), in which kernels adapt to each time bin. However, TDY-CNN has two limitations: the model is too large, and the diversity of the adaptive kernels is not guaranteed. To address these limitations, we propose the decomposed temporal dynamic CNN (DTDY-CNN), which constructs adaptive kernels by combining a static kernel with a dynamic residual based on matrix decomposition. The baseline model using DTDY-CNN maintained speaker verification performance while reducing the number of model parameters by 35% compared to the model using TDY-CNN. In addition, the detailed behavior of temporal dynamic models in extracting speaker information was explained using speaker activation maps (SAM), modified from gradient-weighted class activation mapping (Grad-CAM). In DTDY-CNN, the static kernel activates on voiced features of utterances, and the dynamic residual activates on unvoiced high-frequency features of phonemes. DTDY-CNN effectively extracts speaker information not only from formant frequencies and harmonics but also from detailed information in unvoiced phonemes, which explains its outstanding performance on text-independent speaker verification.
    Comment: Submitted to InterSpeech 202
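The decomposition described in the abstract above can be sketched as follows. This is a hypothetical NumPy illustration of the general idea (a shared static kernel plus a low-rank, time-dependent residual), with made-up dimensions; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: C_out output channels, C_in input channels,
# k kernel width, r rank of the decomposed residual, T time bins.
C_out, C_in, k, r, T = 8, 4, 3, 2, 10

# Static kernel shared across all time bins.
K_static = rng.standard_normal((C_out, C_in * k))

# Low-rank factors for the dynamic residual: U and V are fixed; the
# per-time mixing matrices A[t] would be predicted from the input
# (here they are random stand-ins).
U = rng.standard_normal((C_out, r))
V = rng.standard_normal((r, C_in * k))
A = rng.standard_normal((T, r, r))

def adaptive_kernel(t):
    """Adaptive kernel for time bin t: static part plus decomposed residual."""
    return K_static + U @ A[t] @ V

# Parameter-count comparison: T independent full kernels vs. the decomposition.
full = T * C_out * C_in * k
decomposed = K_static.size + U.size + V.size + T * r * r
print(full, decomposed)  # the decomposition stores far fewer parameters
```

For large T, only the small r-by-r mixing matrices grow with the number of time bins, which is consistent with the paper's reported parameter reduction relative to storing a full adaptive kernel per time bin.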

    Early phonetic learning without phonetic categories: Insights from large-scale simulations on realistic input

    Before they even speak, infants become attuned to the sounds of the language(s) they hear, processing native phonetic contrasts more easily than non-native ones. For example, between 6-8 months and 10-12 months, infants learning American English get better at distinguishing English [ɹ] and [l], as in 'rock' vs 'lock', relative to infants learning Japanese. Influential accounts of this early phonetic learning phenomenon initially proposed that infants group sounds into native vowel- and consonant-like phonetic categories—like [ɹ] and [l] in English—through a statistical clustering mechanism dubbed 'distributional learning'. The feasibility of this mechanism for learning phonetic categories has been challenged, however. Here we demonstrate that a distributional learning algorithm operating on naturalistic speech can predict early phonetic learning as observed in Japanese and American English infants, suggesting that infants might learn through distributional learning after all. We further show, however, that contrary to the original distributional learning proposal, our model learns units too brief and too fine-grained acoustically to correspond to phonetic categories. This challenges the influential idea that what infants learn are phonetic categories. More broadly, our work introduces a novel mechanism-driven approach to the study of early phonetic learning, together with a quantitative modeling framework that can handle realistic input. This allows, for the first time, accounts of early phonetic learning to be linked to concrete, systematic predictions regarding infants' attunement.
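The core of distributional learning is statistical clustering over acoustic observations. A minimal, self-contained sketch of that mechanism, using plain k-means on synthetic two-dimensional "acoustic frames" as a stand-in for the more sophisticated models used in work like the above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic acoustic frames: two Gaussian clouds standing in for the
# distributions of two sound categories in some acoustic feature space.
frames = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([3.0, 3.0], 0.5, size=(200, 2)),
])

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: the simplest distributional-clustering mechanism."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Recompute centers, keeping the old one if a cluster empties.
        centers = np.array([
            X[labels == i].mean(0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
    return centers, labels

centers, labels = kmeans(frames, k=2)
print(np.round(centers, 1))  # centers land near the two generating means
```

The paper's point is precisely that when such clustering runs on realistic speech, the units it discovers need not line up with phonetic categories, even when the model's discrimination behavior matches infants'.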

    A computational model for studying L1’s effect on L2 speech learning

    Much evidence has shown that the first language (L1) plays an important role in the formation of the second-language (L2) phonological system during the L2 learning process. Combined with the fact that different L1s have distinct phonological patterns, this suggests diverse L2 speech learning outcomes for speakers from different L1 backgrounds. This dissertation hypothesizes that phonological distances between accented speech and speakers' L1 speech are also correlated with perceived accentedness, and that the correlations are negative for some phonological properties. Moreover, contrastive phonological distinctions between L1s and the L2 will manifest themselves in the accented speech produced by speakers from these L1s. To test these hypotheses, this study develops a computational model to analyze accented speech properties in both the segmental (short-term speech measurements at the short-segment or phoneme level) and suprasegmental (long-term speech measurements at the word, long-segment, or sentence level) feature spaces. The benefit of using a computational model is that it enables quantitative analysis of the L1's effect on accent in terms of different phonological properties. The core parts of this computational model are feature extraction schemes that extract pronunciation and prosody representations of accented speech based on existing techniques from the speech processing field. Correlation analysis on both the segmental and suprasegmental feature spaces is conducted to examine the relationship between acoustic measurements related to L1s and perceived accentedness across several L1s. Multiple regression analysis is employed to investigate how the L1's effect impacts the perception of foreign accent, and how accented speech produced by speakers from different L1s behaves distinctly in the segmental and suprasegmental feature spaces.
    The results unveil the potential of this methodology to provide quantitative analysis of accented speech and to extend current studies in L2 speech learning theory to large scale. Practically, this study further shows that the proposed computational model can benefit automatic accentedness evaluation systems by adding features related to speakers' L1s.
    Doctoral Dissertation, Speech and Hearing Science, 201
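The correlation analysis described above can be illustrated with a toy example. The numbers below are fabricated purely for illustration of the hypothesized negative relationship between a phonological-distance measure and perceived accentedness; they are not from the dissertation.

```python
import numpy as np

# Hypothetical per-speaker data: a phonological distance between each
# speaker's accented L2 speech and their L1 speech, and mean perceived
# accentedness ratings for the same speakers (made-up values).
distance     = np.array([0.9, 0.7, 0.6, 0.4, 0.3, 0.2])
accentedness = np.array([2.1, 2.8, 3.0, 3.9, 4.2, 4.6])

# Pearson correlation; the dissertation hypothesizes that for some
# phonological properties this correlation is negative.
r = np.corrcoef(distance, accentedness)[0, 1]
print(round(r, 2))
```

In the actual study this kind of correlation, plus multiple regression over segmental and suprasegmental features, is what links the acoustic measurements to the accentedness ratings.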

    Phoneme Weighting and Energy-Based Weighting for Speaker Recognition

    This dissertation focuses on determining which specific vowel phonemes work best for speaker identification and speaker verification, and on developing new algorithms to improve speaker identification accuracy. Results from the first part of our research indicate that the vowels /i/, /E/ and /u/ had the highest recognition scores for both the Gaussian mixture model (GMM) and vector quantization (VQ) methods (at most one classification error). For VQ, /i/, /I/, /e/, /E/ and /@/ had no classification errors. Speakers producing /E/, /o/ and /u/ were verified well by both the GMM and VQ methods in our experiments. For VQ, the verification results are consistent with the identification results, since the same five phonemes performed the best and had at most one verification error.
    After determining several ideal vowel phonemes, we developed new algorithms for improved speaker identification accuracy. We used phoneme weighting methods (which performed classification based on the ideal phonemes found in the previous experiments) and other weighting methods based on energy. The energy weighting methods performed better than the phoneme weighting methods in our experiments. The first energy weighting method ignores the speech frames which have relatively small magnitude. Instead of ignoring those frames, the second method emphasizes speech frames which have relatively large magnitude. The third method and the adjusted third method are combinations of the previous two. The error reduction rate was 7.9% after applying the first method relative to a baseline system (which used Mel frequency cepstral coefficients (MFCCs) as features and VQ as the classifier). After applying the second method and the adjusted third method, the error reduction rate was 28.9% relative to the baseline system.
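The two basic energy-weighting schemes described above (drop low-energy frames vs. emphasize high-energy frames) can be sketched as frame-weight computations. This is a hypothetical NumPy illustration with synthetic energies and a made-up threshold, not the dissertation's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-frame energies for one utterance (stand-ins for, e.g.,
# per-frame log energy computed alongside the MFCCs).
energies = rng.uniform(0.0, 1.0, size=100)
threshold = 0.3  # illustrative cutoff, not a value from the dissertation

# First method: ignore frames with relatively small magnitude entirely
# (hard 0/1 weights on the frames used for classification).
weights_drop = (energies >= threshold).astype(float)

# Second method: keep all frames but emphasize high-magnitude ones,
# here by weighting each frame by its normalized energy.
weights_emph = energies / energies.max()

print(int(weights_drop.sum()), round(float(weights_emph.mean()), 2))
```

A combined method in the spirit of the "third method" would apply both ideas, e.g. zeroing out sub-threshold frames and energy-weighting the rest.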

    Evaluation of room acoustic qualities and defects by use of auralization


    Sentence repetition in adolescents with specific language impairments and autism: an investigation of complex syntax

    Background: Recent studies have indicated that many children with autism spectrum disorders present with language difficulties that are similar to those of children with specific language impairments, leading some to argue for similar structural deficits in these two disorders. Aims: Repetition of sentences involving long-distance dependencies was used to investigate complex syntax in these groups. Methods & Procedures: Adolescents with specific language impairments (mean age = 15;3, n = 14) and autism spectrum disorders plus language impairment (autism plus language impairment; mean age = 14;8, n = 16) were recruited alongside typically developing adolescents (mean age = 14;4, n = 17). They were required to repeat sentences containing relative clauses that varied in syntactic complexity. Outcomes & Results: The adolescents with specific language impairments presented with greater syntactic difficulties than the adolescents with autism plus language impairment, as manifested by higher error rates on the more complex object relative clauses and a greater tendency to make syntactic changes during repetition. Conclusions & Implications: Adolescents with specific language impairments may have more severe syntactic difficulties than adolescents with autism plus language impairment, possibly due to their short-term memory limitations.