
    Intonation modelling using a muscle model and perceptually weighted matching pursuit

    We propose a physiologically based intonation model using perceptual relevance. Motivated by speech synthesis from a speech-to-speech translation (S2ST) point of view, we aim at a language-independent way of modelling intonation. The model presented in this paper can be seen as a generalisation of the command response (CR) model, albeit with the same modelling power. It is an additive model which decomposes intonation contours into a sum of critically damped system impulse responses. To decompose the intonation contour, we use a weighted correlation-based atom decomposition (WCAD) algorithm built around a matching pursuit framework. The algorithm allows an arbitrary precision to be reached using an iterative procedure that adds more elementary atoms to the model. Experiments are presented demonstrating that this generalised CR (GCR) model is able to model intonation as would be expected. Experiments also show that the model produces a similar number of parameters or elements to the CR model. We conclude that the GCR model is appropriate as an engineering solution for modelling prosody, and hope that it contributes to a deeper scientific understanding of the neurobiological process of intonation.
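
    The decomposition described above follows the general matching-pursuit recipe: greedily pick the dictionary atom (here, a critically damped impulse response) whose shifted copy best matches the current residual, subtract its scaled contribution, and repeat until the desired precision is reached. The sketch below only illustrates that loop under stated assumptions; it uses plain, unweighted correlation and arbitrary atom parameters (betas, atom_len, sampling rate), so it is a minimal stand-in for the framework the WCAD algorithm builds on, not the authors' perceptually weighted method.

```python
import numpy as np

def damped_atom(beta, atom_len, fs=100.0):
    """Unit-norm impulse response of a critically damped 2nd-order system: t * exp(-beta * t)."""
    t = np.arange(atom_len) / fs
    atom = t * np.exp(-beta * t)
    return atom / np.linalg.norm(atom)

def matching_pursuit_f0(contour, betas=(2.0, 4.0, 8.0), atom_len=100,
                        max_atoms=10, tol=0.01):
    """Greedily decompose an f0 contour into shifted, scaled damped atoms."""
    residual = np.asarray(contour, dtype=float).copy()
    atoms = [damped_atom(b, atom_len) for b in betas]
    selected = []
    for _ in range(max_atoms):
        best = None  # (|gain|, atom index, shift, gain)
        for bi, atom in enumerate(atoms):
            # inner products of the residual with every full placement of this atom
            corr = np.correlate(residual, atom, mode="valid")
            shift = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[shift]) > best[0]:
                best = (abs(corr[shift]), bi, shift, corr[shift])
        _, bi, shift, gain = best
        # subtract the chosen atom's contribution and record its parameters
        residual[shift:shift + atom_len] -= gain * atoms[bi]
        selected.append({"beta": betas[bi], "shift": shift, "gain": gain})
        # stop once the residual energy is below the requested precision
        if np.linalg.norm(residual) < tol * np.linalg.norm(contour):
            break
    return selected, residual
```

    The tol/max_atoms stopping rule corresponds to the "arbitrary precision" property: each added atom reduces the residual, so precision is traded directly against the number of model elements.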

    Lexical Structure and the Nature of Linguistic Representations

    This dissertation addresses a foundational debate regarding the role of structure and abstraction in linguistic representation, focusing on representations at the lexical level. Under one set of views, positing abstract morphologically structured representations, words are decomposable into morpheme-level basic units; however, alternative views now challenge the need for abstract structured representation in lexical representation, claiming non-morphological whole-word storage and processing either across-the-board or depending on factors like transparency/productivity/surface form. Our cross-method/cross-linguistic results regarding morphological-level decomposition argue for initial, automatic decomposition, regardless of factors like semantic transparency, surface formal overlap, word frequency, and productivity, contrary to alternative views of the lexicon positing non-decomposition for some or all complex words. Using simultaneous lexical decision and time-sensitive brain activity measurements from magnetoencephalography (MEG), we demonstrate effects of initial, automatic access to morphemic constituents of compounds, regardless of whole-word frequency, lexicalization, and length, both in the psychophysical measure (response time) and in the MEG component indexing initial lexical activation (M350), which we also utilize to test distinctions in lexical representation among ambiguous words in a further experiment. Two masked priming studies further demonstrate automatic decomposition of compounds into morphemic constituents, showing equivalent facilitation regardless of semantic transparency. A fragment-priming study with spoken Japanese compounds argues that compounds indeed activate morphemic candidates, even when the surface form of a spoken compound fragment segmentally mismatches its potential underlying morpheme completion due to a morpho-phonological alternation (rendaku), whereas simplex words do not facilitate segment-mismatching continuations, supporting morphological structure-based prediction regardless of surface-form overlap. A masked priming study on productive and non-productive Japanese de-adjectival nominal derivations shows priming of constituents regardless of productivity, and provides evidence that affixes have independent morphological-level representations. The results together argue that the morpheme, not the word, is the basic unit of lexical processing, supporting a view of lexical representations in which there are abstract morphemes, and revealing immediate, automatic decomposition regardless of semantic transparency, morphological productivity, and surface formal overlap, counter to views in which some/all complex words are treated as unanalyzed wholes. Instead, we conclude that morphologically complex words are decomposed into abstract morphemic units immediately and automatically by rule, not by exception.

    Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis

    Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially in terms of intelligibility. Synthetic speech is often clear and understandable, but it can also be bland and monotonous. Proper generation of natural speech prosody is still a largely unsolved problem. This is especially relevant in the context of expressive audiobook speech synthesis, where speech is expected to be fluid and captivating. In general, prosody can be seen as a layer that is superimposed on the segmental (phone) sequence. Listeners can perceive the same melody or rhythm in different utterances, and the same segmental sequence can be uttered with a different prosodic layer to convey a different message. For this reason, prosody is commonly accepted to be inherently suprasegmental. It is governed by longer units within the utterance (e.g. syllables, words, phrases) and beyond the utterance (e.g. discourse). However, common techniques for the modeling of speech prosody - and speech in general - operate mainly on very short intervals, either at the state or frame level, in both hidden Markov model (HMM) and deep neural network (DNN) based speech synthesis. This thesis presents contributions supporting the claim that stronger representations of suprasegmental variation are essential for the natural generation of fundamental frequency for statistical parametric speech synthesis. We conceptualize the problem by dividing it into three sub-problems: (1) representations of acoustic signals, (2) representations of linguistic contexts, and (3) the mapping of one representation to another. The contributions of this thesis provide novel methods and insights relating to these three sub-problems. In terms of sub-problem 1, we propose a multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform, as well as a wavelet-based decomposition strategy that is linguistically and perceptually motivated. In terms of sub-problem 2, we investigate additional linguistic features such as text-derived word embeddings and syllable bag-of-phones features, and we propose a novel method for learning word vector representations based on acoustic counts. Finally, considering sub-problem 3, insights are given regarding hierarchical models such as parallel and cascaded deep neural networks.
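
    The two acoustic representations named above (a wavelet-based multi-level decomposition of f0 and a DCT parametrisation) can be pictured with a short sketch. The snippet below is an illustration only: the Mexican-hat mother wavelet, the dyadic scales, the per-syllable DCT order, and the helper names are assumptions made for this sketch, not the configuration used in the thesis.

```python
import numpy as np
import pywt              # PyWavelets, for the continuous wavelet transform
from scipy.fft import dct

def cwt_levels(logf0, n_levels=5):
    """Coarse-to-fine decomposition of an interpolated log-f0 contour.
    The dyadic scales are illustrative stand-ins for phone/syllable/word/phrase levels."""
    scales = 2.0 ** np.arange(1, n_levels + 1)
    coefs, _ = pywt.cwt(np.asarray(logf0, dtype=float), scales, "mexh")
    return coefs         # shape (n_levels, n_frames): one prediction stream per level

def syllable_dct(logf0, boundaries, n_coefs=4):
    """Compact per-syllable parametrisation: keep the first few DCT coefficients
    of each syllable-sized f0 segment (boundaries are frame indices)."""
    logf0 = np.asarray(logf0, dtype=float)
    feats = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        c = dct(logf0[start:end], norm="ortho")[:n_coefs]
        feats.append(np.pad(c, (0, n_coefs - len(c))))   # zero-pad very short syllables
    return np.vstack(feats)                              # shape (n_syllables, n_coefs)
```

    In a synthesis pipeline, each wavelet level (or each block of per-syllable DCT coefficients) can then be treated as a separate regression target, which is one concrete way of giving a model an explicitly suprasegmental view of f0.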

    K + K = 120: Papers dedicated to László Kálmán and András Kornai on the occasion of their 60th birthdays


    A Sensory-Motor Linguistic Framework for Human Activity Understanding

    We empirically discovered that the space of human actions has a linguistic structure. This is a sensory-motor space consisting of the evolution of joint angles of the human body in movement. The space of human activity has its own phonemes, morphemes, and sentences. We present a Human Activity Language (HAL) for symbolic, non-arbitrary representation of sensory and motor information of human activity. This language was learned from large amounts of motion capture data. Kinetology, the phonology of human movement, finds basic primitives for human motion (segmentation) and associates them with symbols (symbolization). This way, kinetology provides a symbolic representation for human movement that allows synthesis, analysis, and symbolic manipulation. We introduce a kinetological system and propose five basic principles on which such a system should be based: compactness, view-invariance, reproducibility, selectivity, and reconstructivity. We demonstrate the kinetological properties of our sensory-motor primitives. Further evaluation is accomplished with experiments on compression and decompression of motion data. The morphology of a human action relates to the inference of essential parts of movement (morpho-kinetology) and its structure (morpho-syntax). To learn morphemes and their structure, we present a grammatical inference methodology and introduce a parallel learning algorithm to induce a grammar system representing a single action. The algorithm infers components of the grammar system as a subset of essential actuators, a context-free grammar (CFG) for the language of each component representing the motion pattern performed in a single actuator, and synchronization rules modeling coordination among actuators. The syntax of human activities involves the construction of sentences using action morphemes. A sentence may range from a single action morpheme (nuclear syntax) to a sequence of sets of morphemes. A single morpheme is decomposed into analogs of lexical categories: nouns, adjectives, verbs, and adverbs. The sets of morphemes represent simultaneous actions (parallel syntax), and a sequence of movements is related to the concatenation of activities (sequential syntax). We demonstrate this linguistic framework on real motion capture data from a large-scale database containing around 200 different actions corresponding to English verbs associated with voluntary, meaningful, observable movement.
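
    The segmentation and symbolization step of kinetology can be illustrated with a toy sketch: cut each joint-angle trajectory where the angular velocity changes sign, then label each piece by its direction and a coarse amplitude bin. This is a hedged stand-in built on assumptions of my own (the zero-crossing criterion, the quantile-based binning, and the function names), not the paper's actual kinetological system or its five principles.

```python
import numpy as np

def segment_joint_angle(theta, fs=120.0, min_len=3):
    """Split one joint-angle trajectory into motion primitives at zero crossings of
    the angular velocity, i.e. wherever the joint reverses its direction of motion."""
    theta = np.asarray(theta, dtype=float)
    vel = np.gradient(theta) * fs
    sign = np.sign(vel)
    cuts = [0] + [i for i in range(1, len(sign)) if sign[i] != sign[i - 1]] + [len(theta)]
    # drop segments shorter than min_len frames, which are usually measurement jitter
    return [(a, b) for a, b in zip(cuts[:-1], cuts[1:]) if b - a >= min_len]

def symbolize(theta, segments, n_bins=3):
    """Map each segment to a discrete symbol: direction of change (+/-/0) plus a
    coarse amplitude bin. These symbols stand in for the kinetological 'phonemes'."""
    theta = np.asarray(theta, dtype=float)
    amps = [abs(theta[b - 1] - theta[a]) for a, b in segments]
    edges = np.quantile(amps, np.linspace(0, 1, n_bins + 1)[1:-1]) if amps else []
    symbols = []
    for (a, b), amp in zip(segments, amps):
        delta = theta[b - 1] - theta[a]
        direction = "+" if delta > 0 else "-" if delta < 0 else "0"
        symbols.append(f"{direction}{int(np.searchsorted(edges, amp))}")
    return symbols
```

    The resulting symbol strings, one per actuator, are the kind of input a downstream grammar-induction step would consume, with coordination across actuators handled by separate synchronization rules.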