
    Examining speaker variability using low-dimensional and high-dimensional phonetic representations

    Speech events are unique; speakers do not produce the same sound in exactly the same way twice. They vary their speech depending on a whole range of factors: speaker-internal (emotion, health, etc.) and speaker-external (interlocutor, topic, environment, etc.). Intra-speaker variation is significant because it is a leading cause of incorrect speaker identification (Zhang et al. 2006), shows socially meaningful patterning (Podesva 2007), and can represent potential cues to the origin and spread of sound change (Mielke et al. 2019). Holistically tracking speaker variability is, however, very challenging. For example, different linguistic features may show different degrees of variability, and different measurements may also produce different conclusions (Rhodes 2012). This study aims to address these issues by examining the accuracy of speaker classification across multiple samples per speaker, focussing on comparing different phonetic representations. A speaker classification experiment was conducted on 20 male speakers aged 18-24, from two UK dialects (Manchester and Newcastle; Haddican & Foulkes 2017). Three 30-second spontaneous speech samples were extracted for each speaker at three different time points within the same recording, which allows us to examine the robustness of different speaker modelling methods in light of within-speaker variation. Specifically, we compare vowel formants (a low-dimensional representation) and 13 MFCCs (a high-dimensional representation) for each speaker in order to observe how the multiple samples from each speaker cluster together across these representations. Clustering was performed using Gaussian Mixture Models (GMMs) and agglomerative cluster analyses, while the success of each model was assessed in terms of how accurately it classified speech samples from the same speaker. The results show that the extent of intra-speaker variation is sufficient to inhibit accurate speaker classification. MFCCs performed better than formant measurements in identifying contemporaneous samples within speakers, although the effect of formants vs MFCCs was also variable between speakers. This points towards differential weighting of information between speakers in determining speaker individuality. These results are discussed in terms of the extent of speaker variability and the need for greater interpretability in high-dimensional feature sets. I further outline some remaining challenges in the study of intra-speaker variation and its relevance to applied phonetics.
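
    As a rough illustration of the classification task described above, the Python sketch below (hypothetical file names, arbitrary hyperparameters, and librosa/scikit-learn in place of the study's actual tooling) fits one diagonal-covariance GMM per speaker on 13 MFCCs from one sample and then checks whether that speaker's remaining samples score highest under their own model.

        # Illustrative sketch, not the paper's pipeline: one GMM per speaker trained on
        # 13 MFCCs from a single sample; later samples are classified by log-likelihood.
        import librosa
        from sklearn.mixture import GaussianMixture

        def mfcc_features(path, sr=16000, n_mfcc=13):
            """Return a (frames x 13) MFCC matrix for one speech sample."""
            y, sr = librosa.load(path, sr=sr)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

        # Hypothetical layout: three 30-second samples per speaker.
        samples = {"spk01": ["spk01_t1.wav", "spk01_t2.wav", "spk01_t3.wav"],
                   "spk02": ["spk02_t1.wav", "spk02_t2.wav", "spk02_t3.wav"]}

        # Train one GMM per speaker on the first sample only.
        models = {spk: GaussianMixture(n_components=8, covariance_type="diag",
                                       random_state=0).fit(mfcc_features(paths[0]))
                  for spk, paths in samples.items()}

        # Score the remaining samples against every speaker model.
        correct = total = 0
        for spk, paths in samples.items():
            for path in paths[1:]:
                feats = mfcc_features(path)
                predicted = max(models, key=lambda m: models[m].score(feats))
                correct += predicted == spk
                total += 1
        print(f"held-out sample classification accuracy: {correct / total:.2f}")

    The same held-out samples could be re-scored with formant measurements in place of MFCCs to reproduce the low- vs high-dimensional comparison, with agglomerative clustering as an alternative to the per-speaker GMMs.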

    Rhetorical questions as aggressive, friendly or sarcastic/ironical questions with imposed answers

    Rhetorical questions (RQs), as a cross-breed of questions and statements, represent an effective tool in putting forward the Speaker's ideas, as well as influencing the ideas and opinions of other people. Because of their communicative effectiveness and multifunctionality, they are frequently used in different contexts and for different purposes, and, as such, they represent an interesting topic for further research. The aim of this paper is threefold: (i) to explore the nature of the implied answer to RQs, (ii) to offer a classification of RQs based on the Speaker's communication style, and (iii) to examine whether (or to what extent) the Speaker-Addressee relationship (peer-to-peer, superior-to-inferior, inferior-to-superior) influences the selection and frequency of use of different types of RQs. Using Stalnaker's (2002) model of Common Ground and Caponigro and Sprouse's (2007) concepts of Speaker's and Addressee's Beliefs, the author redefines the nature of the answers implied by RQs, claiming that they are imposed on the Addressee rather than mutually recognized as obvious. Based on the model of communication styles as defined by Yuan et al. (2018), RQs are classified into aggressive, friendly and sarcastic/ironical questions with imposed answers. The analysis of the corpus, which consisted of 275 RQs taken from ten American movie scripts, showed that friendly RQs are more common than the other two types, and that, in instances where one of the interlocutors is in a superior position, superior-to-inferior RQs are by far more common than vice versa. The finding that RQs asked by inferiors make up less than a third of RQs occurring between interlocutors with different social standing is in line with the view that answers to RQs are imposed on Addressees.

    The Role of Demonstratives in Poetry: The Case of Misuzu Kaneko's Poems

    In this article, I analyze poems by Misuzu Kaneko, focusing on how Japanese demonstratives help the reader perceive the world the poet creates. The analysis is based on the classification of Japanese demonstratives discussed in Iori (2007).

    First, demonstratives with exophoric reference based on spatial relations presuppose concrete relations between the referent, the speaker, and the hearer in space. In the case of poems, there is one more perspective involved: the reader, who may or may not be the same as the hearer. This means that a poem may assume the existence of two distinct worlds, one with the speaker (and the hearer) in it, and the other with the reader and the poem in it. The demonstrative kono (this) with exophoric reference based on a spatial relationship can bring the two worlds together, as in the poem Fusuma-no E (The Picture on the Sliding Doors), in which the poet pretends to be in (a picture of) a forest on sliding doors. Not only does she succeed in making the reader feel that she is close by, but she also takes the reader into another world deep in the forest. In the poem Kono Michi (This Path), on the other hand, the same demonstrative captures the moment the poet comes out of the poem and urges the reader to follow the path that lies ahead, with the poet walking beside them.

    Second, demonstratives with exophoric reference based on a notional relationship signal that the referent is in the speaker's memory. They are used in cases where the speaker and the hearer share information or experience, or in monologues. In poems, this usage prompts the reader to infer that the referent has had a certain relationship with the speaker in the past. The demonstrative in the first stanza of Naka-naori (Friends Again) tells the reader that it is a monologue, which in turn implies that the referent of ano ko (that girl) is the speaker's only friend. The poem Wasureta Uta (The Song I Cannot Recall) uses ano (that) based on a notional relationship and kono (this) based on a spatial relation. Together they emphasize the gap between the speaker's distant memory of her mother and the reality in which she cannot even recall the second half of the song her mother used to sing.

    Finally, demonstratives with endophoric reference build cohesive relations among the expressions in the discourse. In Hasu-to Niwatori (The Lotus and the Chicken), the use of cohesive relations in a recursive pattern creates a structure that suggests to the reader that the being that watches over and guides the lotus, the chicken, and the speaker's consciousness is perhaps even watching over the reader as well.

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Identification

    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models. National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624).
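
    The strip-map/ART circuits themselves are not reproduced here, but the minimal Python sketch below illustrates the speaker-normalization idea on steady-state vowels using a classical Lobanov z-score of formants per speaker followed by nearest-centroid categorization; the formant values are synthetic stand-ins, not Peterson and Barney measurements, and the method is a deliberate simplification rather than the neural model.

        # Not the cortical strip-map/ART model: a classical Lobanov z-score per speaker,
        # shown only to illustrate why removing speaker-dependent scaling helps
        # steady-state vowel categorization. All formant values below are synthetic.
        import numpy as np

        # (speaker, vowel, F1 Hz, F2 Hz) -- synthetic illustration data
        data = [("s1", "i", 270, 2290), ("s1", "a", 730, 1090), ("s1", "u", 300, 870),
                ("s2", "i", 370, 2670), ("s2", "a", 850, 1220), ("s2", "u", 430, 1020)]

        def lobanov(rows):
            """Z-score F1/F2 within each speaker so vowel categories align across speakers."""
            out = []
            for spk in {r[0] for r in rows}:
                spk_rows = [r for r in rows if r[0] == spk]
                f = np.array([[r[2], r[3]] for r in spk_rows], dtype=float)
                z = (f - f.mean(axis=0)) / f.std(axis=0)
                out += [(r[0], r[1], *zr) for r, zr in zip(spk_rows, z)]
            return out

        normed = lobanov(data)
        # Centroids from speaker s1's normalized vowels; classify speaker s2's vowels.
        centroids = {v: np.array([r[2:] for r in normed if r[0] == "s1" and r[1] == v]).mean(axis=0)
                     for v in "iau"}
        for spk, vowel, *z in (r for r in normed if r[0] == "s2"):
            pred = min(centroids, key=lambda v: np.linalg.norm(np.array(z) - centroids[v]))
            print(f"{spk} true={vowel} predicted={pred}")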

    Prosodic Event Recognition using Convolutional Neural Networks with Context Information

    This paper demonstrates the potential of convolutional neural networks (CNN) for detecting and classifying prosodic events on words, specifically pitch accents and phrase boundary tones, from frame-based acoustic features. Typical approaches use not only feature representations of the word in question but also its surrounding context. We show that adding position features indicating the current word benefits the CNN. In addition, this paper discusses the generalization from a speaker-dependent modelling approach to a speaker-independent setup. The proposed method is simple and efficient and yields strong results not only in speaker-dependent but also in speaker-independent cases. Comment: Interspeech 2017, 4 pages, 1 figure.
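
    A minimal PyTorch sketch of the general approach follows; the window length, feature count, and layer sizes are assumptions for illustration rather than the paper's configuration. It shows how a binary position channel marking the current word's frames can be concatenated with the frame-based acoustic features before the convolutional layers.

        # Sketch only: a 1-D CNN over frame-based acoustic features plus a position
        # channel that flags the frames of the word being classified.
        import torch
        import torch.nn as nn

        N_FRAMES = 150      # assumed context window length in frames
        N_ACOUSTIC = 6      # assumed per-frame acoustic features (e.g. F0, energy, ...)
        N_CLASSES = 2       # e.g. pitch accent vs. no accent

        class ProsodicEventCNN(nn.Module):
            def __init__(self):
                super().__init__()
                self.conv = nn.Sequential(
                    nn.Conv1d(N_ACOUSTIC + 1, 64, kernel_size=5, padding=2), nn.ReLU(),
                    nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
                    nn.AdaptiveMaxPool1d(1),   # pool over time
                )
                self.out = nn.Linear(64, N_CLASSES)

            def forward(self, acoustic, position):
                # acoustic: (batch, N_ACOUSTIC, N_FRAMES); position: (batch, 1, N_FRAMES)
                x = torch.cat([acoustic, position], dim=1)
                return self.out(self.conv(x).squeeze(-1))

        model = ProsodicEventCNN()
        acoustic = torch.randn(8, N_ACOUSTIC, N_FRAMES)   # dummy batch
        position = torch.zeros(8, 1, N_FRAMES)
        position[:, :, 60:90] = 1.0                       # frames of the current word
        print(model(acoustic, position).shape)            # torch.Size([8, 2])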

    Voicing classification of visual speech using convolutional neural networks

    The application of neural network and convolutional neural network (CNN) architectures is explored for the tasks of voicing classification (classifying frames as being either non-speech, unvoiced, or voiced) and voice activity detection (VAD) of visual speech. Experiments are conducted for both speaker dependent and speaker independent scenarios. A Gaussian mixture model (GMM) baseline system is developed using standard image-based two-dimensional discrete cosine transform (2D-DCT) visual speech features, achieving speaker dependent accuracies of 79% and 94% for voicing classification and VAD respectively. Additionally, a single-layer neural network system trained using the same visual features achieves accuracies of 86% and 97%. A novel technique using convolutional neural networks for visual speech feature extraction and classification is presented. The voicing classification and VAD results using this system are further improved to 88% and 98% respectively. The speaker independent results show the neural network system to outperform both the GMM and CNN systems, achieving accuracies of 63% for voicing classification and 79% for voice activity detection.
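
    As a hedged illustration of the GMM baseline described above (not the authors' implementation; the image size, DCT block size, and mixture counts are assumptions), the Python sketch below extracts low-order 2D-DCT coefficients from mouth-region frames and classifies each frame by maximum likelihood under one GMM per voicing class.

        # Illustrative GMM baseline for frame-wise voicing classification from 2D-DCT
        # visual features; the random images stand in for real mouth-region video frames.
        import numpy as np
        from scipy.fft import dctn
        from sklearn.mixture import GaussianMixture

        CLASSES = ["non-speech", "unvoiced", "voiced"]
        N_COEF = 6  # keep a low-frequency 6x6 block of DCT coefficients -> 36-dim feature

        def dct_features(frame):
            """2-D grayscale mouth-region image -> flattened low-order 2D-DCT block."""
            return dctn(frame, norm="ortho")[:N_COEF, :N_COEF].ravel()

        rng = np.random.default_rng(0)
        # Dummy training data: 200 random 32x48 'frames' per class.
        train = {c: np.stack([dct_features(rng.random((32, 48))) for _ in range(200)])
                 for c in CLASSES}

        # One GMM per voicing class, classification by maximum average log-likelihood.
        gmms = {c: GaussianMixture(n_components=4, covariance_type="diag",
                                   random_state=0).fit(x) for c, x in train.items()}

        def classify(frame):
            feats = dct_features(frame).reshape(1, -1)
            return max(CLASSES, key=lambda c: gmms[c].score(feats))

        print(classify(rng.random((32, 48))))  # meaningless on random input, but runs end to end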