
    Neural approaches to spoken content embedding

    Comparing spoken segments is a central operation in speech processing. Traditional approaches in this area have favored frame-level dynamic programming algorithms, such as dynamic time warping, because they require no supervision, but they are limited in performance and efficiency. As an alternative, acoustic word embeddings -- fixed-dimensional vector representations of variable-length spoken word segments -- have begun to be considered for such tasks as well. However, the current space of such discriminative embedding models, training approaches, and their application to real-world downstream tasks is limited. We start by considering "single-view" training losses, where the goal is to learn an acoustic word embedding model that separates same-word and different-word spoken segment pairs. Then, we consider "multi-view" contrastive losses. In this setting, acoustic word embeddings are learned jointly with embeddings of character sequences to generate acoustically grounded embeddings of written words, or acoustically grounded word embeddings. In this thesis, we contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs). We improve model training in terms of both efficiency and performance. We take these developments beyond English to several low-resource languages and show that multilingual training improves performance when labeled data is limited. We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition. Finally, we show how our embedding approaches compare with and complement more recent self-supervised speech models. (Comment: PhD thesis)
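
    The multi-view contrastive idea lends itself to a short illustration. Below is a minimal NumPy sketch of a triplet-style loss that pulls an acoustic embedding toward the character-sequence embedding of the same word and pushes it away from that of a different word; the toy vectors, margin value, and function name are assumptions for illustration, not the thesis's actual RNN models or training objective.

        import numpy as np

        def multi_view_triplet_loss(acoustic, char_same, char_diff, margin=0.5):
            """Hinge loss encouraging the acoustic embedding to lie closer to the
            written form of its own word than to that of a different word."""
            d_pos = np.linalg.norm(acoustic - char_same)   # same-word distance
            d_neg = np.linalg.norm(acoustic - char_diff)   # different-word distance
            return max(0.0, margin + d_pos - d_neg)

        # Toy 4-dimensional embeddings standing in for RNN encoder outputs.
        acoustic  = np.array([0.9, 0.1, 0.0, 0.2])
        char_same = np.array([1.0, 0.0, 0.1, 0.1])  # character view, same word
        char_diff = np.array([0.0, 1.0, 0.9, 0.0])  # character view, other word
        print(multi_view_triplet_loss(acoustic, char_same, char_diff))

    Training would average such losses over many sampled triplets; the thesis's actual objectives and sampling schemes differ in detail.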

    Are words easier to learn from infant- than adult-directed speech? A quantitative corpus-based investigation

    We investigate whether infant-directed speech (IDS) could facilitate word form learning when compared to adult-directed speech (ADS). To study this, we examine the distribution of word forms at two levels, acoustic and phonological, using a large database of spontaneous speech in Japanese. At the acoustic level we show that, as has been documented before for phonemes, the realizations of words are more variable and less discriminable in IDS than in ADS. At the phonological level, we find an effect in the opposite direction: the IDS lexicon contains more distinctive words (such as onomatopoeias) than the ADS counterpart. Combining the acoustic and phonological metrics into a global discriminability score reveals that the greater separation of lexical categories in the phonological space does not compensate for the opposite effect observed at the acoustic level. As a result, IDS word forms are still globally less discriminable than ADS word forms, even though the effect is numerically small. We discuss the implications of these findings for the view that the functional role of IDS is to improve language learnability. (Comment: Draft)
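
    As a rough illustration of what such a discriminability comparison can look like, here is a small NumPy sketch that scores how separable word categories are from their token realizations, using the ratio of mean between-word to mean within-word distance; the metric, the two-dimensional toy "realizations", and the noise levels are assumptions, not the study's actual acoustic or phonological measures.

        import numpy as np

        def separability(tokens_by_word):
            """Mean between-word distance divided by mean within-word distance;
            higher values mean word forms are easier to tell apart."""
            words = list(tokens_by_word)
            within, between = [], []
            for i, w in enumerate(words):
                toks = tokens_by_word[w]
                for a in range(len(toks)):
                    for b in range(a + 1, len(toks)):
                        within.append(np.linalg.norm(toks[a] - toks[b]))
                for v in words[i + 1:]:
                    for ta in toks:
                        for tb in tokens_by_word[v]:
                            between.append(np.linalg.norm(ta - tb))
            return np.mean(between) / np.mean(within)

        rng = np.random.default_rng(0)
        centers = {"mama": np.zeros(2), "wanwan": np.full(2, 4.0)}
        # IDS-like tokens: more variable realizations around each word's center.
        ids = {w: c + rng.normal(0, 1.0, (5, 2)) for w, c in centers.items()}
        ads = {w: c + rng.normal(0, 0.3, (5, 2)) for w, c in centers.items()}
        print(separability(ids), separability(ads))  # ADS scores higher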

    Articulatory features for robust visual speech recognition


    The structure and perception of budgerigar (Melopsittacus undulatus) warble songs

    The warble song of male budgerigars (Melopsittacus undulatus) is an extraordinarily complex, multi-syllabic, learned vocalization that is produced continuously in streams lasting from a few seconds to a few minutes without obvious repetition of particular patterns. As a follow-up to the warble analysis of Farabaugh et al. (1992), an automatic categorization program based on neural networks was developed and used to efficiently and reliably classify more than 25,000 warble elements from four budgerigars. The relative proportion of the resultant seven basic acoustic groups and one compound group is similar across individuals. In psychophysical tests, budgerigars showed higher discriminability for warble elements drawn from different acoustic categories and lower discriminability for elements drawn from the same category, suggesting that they form seven perceptual categories corresponding to those established acoustically. Budgerigars also perceive individual voice characteristics in addition to the acoustic measures delineating categories. Acoustic analyses of long sequences of natural warble revealed that the elements were not randomly arranged and that warble has at least a 5th-order Markovian structure. Perceptual experiments provided convergent evidence that budgerigars are able to master a novel sequence between 4 and 7 elements in length. Through gradual training with chunking (about 5 elements), birds are able to master sequences of up to 50 elements. The ability of budgerigars to detect targets inserted into a long, running background of natural warble appears to be species-specific and related to the acoustic structure of warble sounds.
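
    The claim of at least 5th-order Markovian structure can be checked, in spirit, by asking whether conditioning on longer histories keeps lowering the entropy of the next element. Here is a small self-contained sketch of that test on a toy symbol sequence; the sequence and the n-gram estimator are illustrative assumptions, not the study's analysis.

        import math
        from collections import Counter

        def conditional_entropy(seq, k):
            """H(next element | previous k elements) from n-gram counts; a drop
            as k grows indicates sequential (Markovian) dependencies."""
            ctx = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k))
            ngram = Counter(tuple(seq[i:i + k + 1]) for i in range(len(seq) - k))
            total = sum(ngram.values())
            return -sum(c * math.log2(c / ctx[g[:k]]) for g, c in ngram.items()) / total

        # Toy "warble" over seven element categories with a fixed dependency cycle.
        seq = list("ABCDEFG" * 200)
        for k in range(4):
            print(k, round(conditional_entropy(seq, k), 3))  # falls toward 0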

    Speaking for listening

    Speech production is constrained at all levels by the demands of speech perception. The speaker's primary aim is successful communication, and to this end semantic, syntactic and lexical choices are directed by the needs of the listener. Even at the articulatory level, some aspects of production appear to be perceptually constrained, for example the blocking of phonological distortions under certain conditions. An apparent exception to this pattern is word boundary information, which ought to be extremely useful to listeners, but which is not reliably coded in speech. It is argued that the solution to this apparent problem lies in rethinking the concept of the boundary of the lexical access unit. Speech rhythm provides clear information about the location of stressed syllables, and listeners do make use of this information. If stressed syllables can serve as the determinants of lexical access codes, then once again speakers are providing precisely the form of speech information needed to facilitate perception.

    Symbol Emergence in Robotics: A Survey

    Humans can learn the use of language through physical interaction with their environment and semiotic communication with other people. It is very important to obtain a computational understanding of how humans can form a symbol system and acquire semiotic skills through their autonomous mental development. Recently, many studies have been conducted on the construction of robotic systems and machine-learning methods that can learn the use of language through embodied multimodal interaction with their environment and other systems. Understanding the dynamics of symbol systems is crucially important both for understanding human social interactions and for developing robots that can communicate smoothly with human users over the long term. The embodied cognition and social interaction of participants gradually change a symbol system in a constructive manner. In this paper, we introduce a field of research called symbol emergence in robotics (SER). SER is a constructive approach towards an emergent symbol system. The emergent symbol system is socially self-organized through both semiotic communication and physical interaction with autonomous cognitive developmental agents, i.e., humans and developmental robots. Specifically, we describe some state-of-the-art research topics concerning SER, e.g., multimodal categorization, word discovery, and double articulation analysis, which enable a robot to obtain words and their embodied meanings from raw sensorimotor information, including visual, haptic, and auditory information and acoustic speech signals, in a totally unsupervised manner. Finally, we suggest future directions for research in SER. (Comment: submitted to Advanced Robotics)
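
    Of the topics listed, word discovery is the easiest to miniaturize. The sketch below proposes boundaries in an unsegmented symbol stream wherever the branching entropy of the next symbol is high, a classic boundary cue; this crude unigram-context version, the toy stream, and the threshold are illustrative assumptions; actual double articulation analyzers use far richer models (e.g., hierarchical Bayesian ones).

        import math
        from collections import Counter, defaultdict

        def branching_entropy_cuts(seq, threshold=1.5):
            """Propose boundaries after symbols whose successor distribution has
            high entropy (many possible continuations suggest a word edge)."""
            follow = defaultdict(Counter)
            for a, b in zip(seq, seq[1:]):
                follow[a][b] += 1
            def entropy(counter):
                n = sum(counter.values())
                return -sum(c / n * math.log2(c / n) for c in counter.values())
            return [i + 1 for i, ch in enumerate(seq[:-1])
                    if entropy(follow[ch]) > threshold]

        # Unsegmented "phoneme" stream built from two recurring word forms.
        stream = "greenrobotgreenhellorobothellogreen"
        print(branching_entropy_cuts(stream))  # candidate boundary positions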

    Joint morphological-lexical language modeling for processing morphologically rich languages with application to dialectal Arabic

    Language modeling for an inflected language such as Arabic poses new challenges for speech recognition and machine translation due to its rich morphology. Rich morphology results in large increases in the out-of-vocabulary (OOV) rate and in poor language model parameter estimation in the absence of large quantities of data. In this study, we present a joint morphological-lexical language model (JMLLM) that takes advantage of Arabic morphology. JMLLM combines morphological segments with the underlying lexical items, along with additional available information sources about both, in a single joint model. Joint representation and modeling of morphological and lexical items reduces the OOV rate and provides smooth probability estimates while keeping the predictive power of whole words. Speech recognition and machine translation experiments on dialectal Arabic show improvements over word- and morpheme-based trigram language models. We also show that as the tightness of integration between the different information sources increases, performance improves for both speech recognition and machine translation.
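
    The OOV argument is easy to see in miniature: building the vocabulary from morphemes instead of whole words shrinks the set of unseen test tokens. A small sketch follows, with hypothetical romanized, pre-segmented data ('+' marks given morpheme boundaries); the examples and helper names are assumptions, not the paper's JMLLM.

        def oov_rate(train_units, test_units):
            """Fraction of test tokens never seen in training."""
            vocab = set(train_units)
            return sum(u not in vocab for u in test_units) / len(test_units)

        train = ["wa+katab+tu", "katab+at", "al+kitab", "wa+al+kitab+u"]
        test  = ["katab+tum", "kitab+ha", "wa+katab+at"]

        def words(data):
            return [w.replace("+", "") for w in data]

        def morphemes(data):
            return [m for w in data for m in w.split("+")]

        print("word OOV:    ", oov_rate(words(train), words(test)))  # 1.0 here
        print("morpheme OOV:", round(oov_rate(morphemes(train), morphemes(test)), 2))  # ~0.29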

    Robust parameters for automatic segmentation of speech

    Automatic segmentation of speech is an important problem that is useful in speech recognition, synthesis and coding. In this paper we explore the robust parameter set, weighting function and distance measure for reliable segmentation of noisy speech. It is found that the MFCC parameters, successful in speech recognition, hold the best promise for robust segmentation as well. We also explored a variety of symmetric and asymmetric weighting lifters, from which it is found that a symmetric lifter of the form $1 + A\sin^{1/2}(\pi n/L)$, $0 \leq n \leq L-1$, for MFCC dimension $L$, is most effective. With regard to the distance measure, the direct $L_2$ norm is found adequate.
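
    For concreteness, here is a minimal NumPy sketch of the liftered-MFCC frame distance the abstract describes: the lifter weights follow the stated form $1 + A\sin^{1/2}(\pi n/L)$, while the gain $A$, the random stand-in frames, and the boundary-peak interpretation are assumptions for illustration.

        import numpy as np

        def lifter_weights(L, A=10.0):
            """Symmetric lifter w[n] = 1 + A * sin(pi * n / L) ** 0.5 for
            0 <= n <= L-1 (A is a free gain; 10.0 is an assumed value)."""
            n = np.arange(L)
            return 1.0 + A * np.sin(np.pi * n / L) ** 0.5

        def frame_distance(a, b, w):
            """Weighted L2 distance between two MFCC frames; peaks across time
            would suggest spectral change points, i.e., segment boundaries."""
            return np.linalg.norm(w * (a - b))

        L = 13                               # typical MFCC dimension
        w = lifter_weights(L)
        rng = np.random.default_rng(1)
        frames = rng.normal(size=(100, L))   # stand-in for real MFCC frames
        dists = [frame_distance(frames[t], frames[t + 1], w) for t in range(99)]
        print(max(dists))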