    Encoding of phonology in a recurrent neural model of grounded speech

    We study the representation and encoding of phonemes in a recurrent neural network model of grounded speech. We use a model which processes images and their spoken descriptions, and projects the visual and auditory representations into the same semantic space. We perform a number of analyses on how information about individual phonemes is encoded in the MFCC features extracted from the speech signal, and in the activations of the layers of the model. Via experiments with phoneme decoding and phoneme discrimination we show that phoneme representations are most salient in the lower layers of the model, where low-level signals are processed at a fine-grained level, although a large amount of phonological information is retained at the top recurrent layer. We further find that the attention mechanism following the top recurrent layer significantly attenuates the encoding of phonology and makes the utterance embeddings much more invariant to synonymy. Moreover, a hierarchical clustering of phoneme representations learned by the network shows an organizational structure of phonemes similar to those proposed in linguistics. Comment: Accepted at CoNLL 201
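A phoneme-decoding experiment of this kind can be illustrated with a minimal diagnostic probe: train a simple classifier to recover phoneme labels from activation vectors and measure its accuracy. The sketch below uses synthetic Gaussian "activations" and a nearest-centroid probe; the data, dimensionality, and classifier are illustrative assumptions, not the authors' actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for layer activations: 3 "phoneme" classes,
# each a Gaussian cloud in a 16-dimensional activation space.
n_per_class, dim = 100, 16
centroids = rng.normal(0, 3, size=(3, dim))
X = np.concatenate([rng.normal(c, 1.0, size=(n_per_class, dim)) for c in centroids])
y = np.repeat(np.arange(3), n_per_class)

# Split into train/test halves.
idx = rng.permutation(len(y))
train, test = idx[: len(y) // 2], idx[len(y) // 2 :]

# Nearest-centroid "diagnostic probe": decode the phoneme label
# from the activation vector alone.
fitted = np.stack([X[train][y[train] == k].mean(axis=0) for k in range(3)])
pred = np.argmin(((X[test][:, None, :] - fitted[None]) ** 2).sum(-1), axis=1)
accuracy = (pred == y[test]).mean()
print(f"decoding accuracy: {accuracy:.2f}")
```

Comparing this accuracy across layers (MFCC input, each recurrent layer, post-attention embedding) is what reveals where phonemic information is concentrated.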

    Robust lexical access using context sensitive dynamic programming and macro-substitutions


    The effect of literacy in the speech temporal modulation structure

    The temporal modulation structure of adult-directed speech is conceptualised as a modulation hierarchy comprising four temporal bands: delta, 1 – 3 Hz; theta, 4 – 8 Hz; beta, 15 – 30 Hz; and low gamma, 30 – 50 Hz. Neuronal oscillatory entrainment to amplitude modulations (AMs) in these four bands may provide a basis for encoding speech and parsing the continuous signal into linguistic units (delta – syllable stress patterns, theta – syllables, beta – onset-rime units, low gamma – phonetic information). While adult-directed speech is theta-dominant and shows tighter theta-beta/low gamma phase alignment, infant-directed speech is delta-dominant and shows tighter delta-theta phase alignment. Although this change in speech representations could be maturational, it was hypothesized that literacy may also influence the rhythmic structure of speech. In fact, literacy and schooling are known to change auditory speech entrainment, enhancing phonemic specification and augmenting the phonological detail of lexical representations. Thus, we hypothesized that a corresponding difference in speech production could also emerge. In this work, spontaneous speech samples were recorded from literate (with lower and higher literacy) and illiterate subjects, and their energy modulation spectra across delta, theta and beta/low gamma AMs, as well as the phase synchronization between nested AMs, were analysed. Measures of the participants’ phonology skills and vocabulary were also retrieved, and a specific task was conducted to confirm the sensitivity of the analysis method used (S-AMPH) to speech rhythm. Results showed no differences in the energy of delta, theta and beta/low gamma AMs in spontaneous speech. However, phase alignment between slower and faster speech AMs was significantly enhanced by literacy, showing moderately strong correlations with the phonology measures and literacy.
Our data suggest that literacy affects not only cortical entrainment and speech perception but also the physical/rhythmic properties of speech production.
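The band-limited amplitude-modulation analysis described above can be sketched with standard signal-processing tools: bandpass the speech envelope into delta (1–3 Hz) and theta (4–8 Hz) AM bands, then compute an n:m phase-locking value between them. The synthetic envelope, filter order, and sampling rate below are illustrative assumptions, not the S-AMPH implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 100  # envelope sampling rate in Hz (illustrative)
t = np.arange(0, 20, 1 / fs)

# Synthetic amplitude envelope with a 2 Hz (delta) component and a
# phase-locked 6 Hz (theta) component, standing in for real speech.
env = 1 + 0.5 * np.sin(2 * np.pi * 2 * t) + 0.3 * np.sin(2 * np.pi * 6 * t)

def band(x, lo, hi):
    """Zero-phase Butterworth bandpass between lo and hi Hz."""
    b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

delta = band(env, 1, 3)   # delta AM band, 1-3 Hz
theta = band(env, 4, 8)   # theta AM band, 4-8 Hz

# n:m phase-locking value (m = 3, since 6 Hz = 3 x 2 Hz);
# 1 means perfect phase alignment, 0 means none.
ph_d = np.angle(hilbert(delta))
ph_t = np.angle(hilbert(theta))
plv = np.abs(np.mean(np.exp(1j * (3 * ph_d - ph_t))))
print(f"delta-theta phase-locking value: {plv:.2f}")
```

For the synthetic envelope the two components are constructed to be phase-locked, so the PLV comes out near 1; in real spontaneous speech this value is what the study found to increase with literacy.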

    A Deep Generative Model of Vowel Formant Typology

    What makes some types of languages more probable than others? For instance, we know that almost all spoken languages contain the vowel phoneme /i/; why should that be? The field of linguistic typology seeks to answer these questions and, thereby, divine the mechanisms that underlie human language. In our work, we tackle the problem of vowel system typology, i.e., we propose a generative probability model of which vowels a language contains. In contrast to previous work, we work directly with the acoustic information -- the first two formant values -- rather than modeling discrete sets of phonemic symbols (IPA). We develop a novel generative probability model and report results based on a corpus of 233 languages. Comment: NAACL 201
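A minimal version of modeling vowels directly in formant space treats each vowel category as a Gaussian over (F1, F2). In the sketch below the formant means are rough textbook approximations and the shared diagonal covariance is an assumption; this is not the paper's learned generative model, only an illustration of working with formant values rather than IPA symbols.

```python
import numpy as np

# Illustrative (F1, F2) means in Hz for three point vowels; the numbers
# are rough textbook approximations, not values from the paper's corpus.
vowel_means = {"i": (280, 2250), "a": (750, 1300), "u": (310, 870)}
cov = np.diag([60.0**2, 150.0**2])  # assumed shared diagonal covariance

def log_density(x, mean, cov):
    """Log of a 2-D Gaussian density over formant space."""
    d = np.asarray(x, float) - np.asarray(mean, float)
    inv = np.linalg.inv(cov)
    return -0.5 * (d @ inv @ d + np.log(np.linalg.det(cov)) + 2 * np.log(2 * np.pi))

def classify(formants):
    """Assign a formant pair to the highest-likelihood vowel category."""
    return max(vowel_means, key=lambda v: log_density(formants, vowel_means[v], cov))

print(classify((300, 2200)))  # a high front token -> "i"
```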

    Are words easier to learn from infant- than adult-directed speech? A quantitative corpus-based investigation

    We investigate whether infant-directed speech (IDS) could facilitate word form learning when compared to adult-directed speech (ADS). To study this, we examine the distribution of word forms at two levels, acoustic and phonological, using a large database of spontaneous speech in Japanese. At the acoustic level we show that, as has been documented before for phonemes, the realizations of words are more variable and less discriminable in IDS than in ADS. At the phonological level, we find an effect in the opposite direction: the IDS lexicon contains more distinctive words (such as onomatopoeias) than the ADS counterpart. Combining the acoustic and phonological metrics together in a global discriminability score reveals that the bigger separation of lexical categories in the phonological space does not compensate for the opposite effect observed at the acoustic level. As a result, IDS word forms are still globally less discriminable than ADS word forms, even though the effect is numerically small. We discuss the implications of these findings for the view that the functional role of IDS is to improve language learnability. Comment: Draft
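The phonological-distinctiveness side of such a comparison can be sketched as a mean pairwise edit distance over the word forms in each lexicon. The toy Japanese word lists below are illustrative stand-ins; the actual study used a large spontaneous-speech corpus and combined this phonological metric with an acoustic discriminability measure.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def mean_pairwise_distance(lexicon):
    """Average edit distance over all word pairs: higher = more distinctive forms."""
    pairs = [(a, b) for i, a in enumerate(lexicon) for b in lexicon[i + 1:]]
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)

# Toy lexicons: the IDS list includes reduplicated/onomatopoeic forms.
ads = ["inu", "kuruma", "gohan"]       # adult word forms (romanized)
ids = ["wanwan", "buubuu", "manma"]    # infant-directed counterparts
print(mean_pairwise_distance(ads), mean_pairwise_distance(ids))
```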

    Brain Network Connectivity During Language Comprehension: Interacting Linguistic and Perceptual Subsystems.

    The dynamic neural processes underlying spoken language comprehension require the real-time integration of general perceptual and specialized linguistic information. We recorded combined electro- and magnetoencephalographic measurements of participants listening to spoken words varying in perceptual and linguistic complexity. Combinatorial linguistic complexity processing was consistently localized to left perisylvian cortices, whereas competition-based perceptual complexity triggered distributed activity over both hemispheres. Functional connectivity showed that linguistically complex words engaged a distributed network of oscillations in the gamma band (20-60 Hz), which only partially overlapped with the network supporting perceptual analysis. Both processes enhanced cross-talk between left temporal regions and bilateral pars orbitalis (BA47). The left-lateralized synchrony between temporal regions and pars opercularis (BA44) was specific to the linguistically complex words, suggesting a specific role of left frontotemporal cross-cortical interactions in morphosyntactic computations. Synchronizations in oscillatory dynamics reveal the transient coupling of functional networks that support specific computational processes in language comprehension. This work was supported by an EPSRC grant to W.M.-W. (EP/F030061/1), an ERC Advanced Grant (Neurolex) to W.M.-W., and by MRC Cognition and Brain Sciences Unit (CBU) funding to W.M.-W. (U.1055.04.002.00001.01). Computing resources were provided by the MRC-CBU. Funding to pay the Open Access publication charges for this article was provided by the Advanced Investigator Grant (Neurolex) to W.M.-W. This is the final published version, which appears at http://dx.doi.org/10.1093/cercor/bhu28

    A neural oscillations perspective on phonological development and phonological processing in developmental dyslexia

    Children’s ability to reflect upon and manipulate the sounds in words (’phonological awareness’) develops as part of natural language acquisition, supports reading acquisition, and develops further as reading and spelling are learned. Children with developmental dyslexia typically have impairments in phonological awareness. Many developmental factors contribute to individual differences in phonological development. One important source of individual differences may be the child’s sensory/neural processing of the speech signal from an amplitude modulation (~ energy or intensity variation) perspective, which may affect the quality of the sensory/neural representations (’phonological representations’) that support phonological awareness. During speech encoding, brain electrical rhythms (oscillations, rhythmic variations in neural excitability) re-calibrate their temporal activity to be in time with rhythmic energy variations in the speech signal. The accuracy of this neural alignment or ’entrainment’ process is related to speech intelligibility. Recent neural studies demonstrate atypical oscillatory function at slower rates in children with developmental dyslexia. Potential relations with the development of phonological awareness by children with dyslexia are discussed. Medical Research Council, G0400574 and G090237

    Automatic recognition of schwa variants in spontaneous Hungarian speech

    This paper analyzes the nature of the process involved in optional vowel reduction in Hungarian, and the acoustic structure of schwa variants in spontaneous speech. The study focuses on the acoustic patterns of both the basic realizations of Hungarian vowels and their realizations as neutral vowels (schwas), as well as on the design, implementation, and evaluation of a set of algorithms for the recognition of both types of realizations from the speech waveform. The authors address the question of whether schwas form a unified group of vowels or whether they show some dependence on the originally intended articulation of the vowels they stand for. The acoustic study uses a database consisting of over 4,000 utterances extracted from continuous speech, recorded from 19 speakers. The authors propose methods for the recognition of neutral vowels depending on the various vowels they replace in spontaneous speech. Mel-Frequency Cepstral Coefficients are calculated and used for the training of Hidden Markov Models. The recognition system was trained on 2,500 utterances and then tested on 1,500 utterances. The results show that a neutral vowel can be detected in 72% of all occurrences. Stressed and unstressed syllables can be distinguished in 92% of all cases. Neutralized vowels do not form a unified group of phoneme realizations. The pronunciation of schwa heavily depends on the original articulation configuration of the intended vowel.
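The MFCC front end feeding such Hidden Markov Models can be sketched in plain NumPy: frame the waveform, take power spectra, apply a triangular mel filterbank, and DCT the log filterbank energies. The frame length, hop size, and filterbank dimensions below are common defaults, not necessarily the study's settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=400, hop=160):
    """Frame the signal, take power spectra, apply a mel filterbank,
    and DCT the log energies into cepstral coefficients."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_e = np.log(power @ fbank.T + 1e-10)
    # Type-II DCT over the filterbank energies.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_e @ dct.T

# One second of a synthetic vowel-like signal at 16 kHz.
t = np.arange(16000) / 16000
sig = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 700 * t)
feats = mfcc(sig)
print(feats.shape)  # -> (98, 13): one 13-coefficient vector per frame
```

Sequences of such per-frame vectors are what an HMM recognizer, like the one described in the paper, would be trained on.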
    • 

    corecore