1,497 research outputs found

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Categorization

    Full text link
    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models. National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624)
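
    As a rough illustration of the general idea only, not of the paper's model, the sketch below (Python, assumed data) classifies hypothetical Peterson-and-Barney-style (F1, F2) vowel tokens after a crude per-speaker normalization, subtracting each speaker's mean log formant, and assigns each token to the nearest vowel centroid. The strip maps, pitch-based normalization and Adaptive Resonance Theory circuits of the actual model are not reproduced here; the vowel_tokens values and the log-formant normalization are assumptions for illustration.

        import numpy as np

        # Hypothetical (vowel, F1, F2) tokens in Hz, grouped by speaker; the paper's
        # simulations use the full Peterson & Barney (1952) database instead.
        vowel_tokens = {
            "speaker_A": [("iy", 270, 2290), ("aa", 730, 1090), ("uw", 300, 870)],
            "speaker_B": [("iy", 310, 2790), ("aa", 850, 1220), ("uw", 370, 950)],
        }

        def normalize(formants):
            """Crude speaker normalization: log formants minus the speaker's mean log
            formant, removing an overall per-speaker scale factor."""
            logf = np.log(formants)
            return logf - logf.mean(axis=0, keepdims=True)

        # Build per-vowel centroids in the normalized space (training step).
        features, labels = [], []
        for tokens in vowel_tokens.values():
            fmts = np.array([[f1, f2] for _, f1, f2 in tokens], dtype=float)
            features.append(normalize(fmts))
            labels += [v for v, _, _ in tokens]
        features = np.vstack(features)
        centroids = {v: features[[i for i, l in enumerate(labels) if l == v]].mean(axis=0)
                     for v in set(labels)}

        def classify(f1, f2, other_speaker_tokens):
            """Classify one token, normalizing it together with the speaker's other tokens."""
            fmts = np.array(other_speaker_tokens + [[f1, f2]], dtype=float)
            x = normalize(fmts)[-1]
            return min(centroids, key=lambda v: np.linalg.norm(x - centroids[v]))

        print(classify(320, 2700, [[850, 1220], [370, 950]]))  # expected: "iy"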

    Development of human-robot interaction based on multimodal emotion recognition

    Get PDF
    The electronic version of this dissertation does not include the publications. Automatic multimodal emotion recognition is a fundamental subject of interest in affective computing, with its main applications in human-computer interaction. Systems developed for this purpose combine different modalities based on vocal and visual cues. This thesis takes both modalities into account in order to develop an automatic multimodal emotion recognition system; more specifically, it exploits information extracted from speech and face signals. From the speech signal, Mel-frequency cepstral coefficients, filter-bank energies and prosodic features are extracted. Two different strategies are considered for analysing the facial data. First, geometric relations between facial landmarks, i.e. distances and angles, are computed. Second, each emotional video is summarised into a reduced set of key-frames, which are fed to a convolutional neural network trained to discriminate visually between the emotions. The output confidence values of all the classifiers from both modalities (one acoustic, two visual) are then used to define a new feature space, and these values are learned in a late-fusion stage for the final emotion label prediction. Experiments are conducted on the SAVEE, Polish, Serbian, eNTERFACE'05 and RML datasets. The results show significant performance improvements by the proposed system in comparison to existing alternatives, defining the current state of the art on all the datasets. Additionally, we provide a review of emotional body gesture recognition systems proposed in the literature, with the aim of identifying future research directions for enhancing the proposed system: incorporating data representing gestures, which constitute another major component of the visual modality, could result in a more efficient framework.
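
    As a hedged sketch of the late-fusion step described above (stacking the confidence outputs of one acoustic and two visual classifiers into a new feature space and learning the final emotion label from it), the Python snippet below uses randomly generated placeholder confidences and a logistic-regression fusion stage; the classifier choice, dataset sizes and label set are assumptions, not the thesis implementation.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n_samples, n_emotions = 200, 6   # placeholder sizes (e.g. six emotion classes)

        def fake_confidences(n, k):
            """Placeholder per-class confidence vectors (rows sum to 1), standing in for the
            acoustic classifier, the geometric facial-feature classifier and the key-frame CNN."""
            x = rng.random((n, k))
            return x / x.sum(axis=1, keepdims=True)

        conf_speech = fake_confidences(n_samples, n_emotions)
        conf_geometry = fake_confidences(n_samples, n_emotions)
        conf_keyframe_cnn = fake_confidences(n_samples, n_emotions)
        y = rng.integers(0, n_emotions, size=n_samples)   # placeholder ground-truth labels

        # Late fusion: concatenate the three confidence vectors per sample into one
        # feature vector and train a final-stage classifier on top of them.
        fused = np.hstack([conf_speech, conf_geometry, conf_keyframe_cnn])
        fusion_clf = LogisticRegression(max_iter=1000).fit(fused, y)
        print(fusion_clf.predict(fused[:5]))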

    The effects of English proficiency on the processing of Bulgarian-accented English by Bulgarian-English bilinguals

    Get PDF
    This dissertation explores the potential benefit of listening to and with one's first-language accent, as suggested by the Interlanguage Speech Intelligibility Benefit (ISIB) hypothesis. Previous studies have not consistently supported this hypothesis. According to major second language learning theories, the listener's second language proficiency determines the extent to which the listener relies on their first language phonetics. Hence, this thesis provides a novel approach by focusing on the role of English proficiency in the understanding of Bulgarian-accented English by Bulgarian-English bilinguals. The first experiment investigated whether evoking the listeners' L1 Bulgarian phonetics would improve the speed of processing Bulgarian-accented English words compared to Standard British English (SBE) words, and vice versa. Listeners with lower English proficiency processed Bulgarian-accented English faster than SBE, while high-proficiency listeners tended to have an advantage with SBE over the Bulgarian accent. The second experiment measured accuracy and reaction times (RT) in a lexical decision task with single-word stimuli produced by two L1 English speakers and two Bulgarian-English bilinguals. Listeners with high proficiency in English responded more slowly and less accurately to Bulgarian-accented speech compared to L1 English speech and compared to lower-proficiency listeners. These accent preferences were also supported by the listeners' RT adaptation across the first experimental block. A follow-up investigation compared the results of L1 UK English listeners to the bilingual listeners with the highest proficiency in English. The L1 English listeners and the bilinguals processed both accents with similar speed, accuracy and adaptation patterns, showing no advantage or disadvantage for the bilinguals. These studies support existing models of second language phonetics: higher proficiency in L2 is associated with less reliance on L1 phonetics during speech processing. In addition, the listeners with the highest English proficiency had no advantage when understanding Bulgarian-accented English compared to L1 English listeners, contrary to ISIB. Keywords: Bulgarian-English bilinguals, bilingual speech processing, L2 phonetic development, lexical decision, proficiency

    Time-frequency distributions for automatic speech recognition

    Full text link

    Allophones, not phonemes in spoken-word recognition

    Get PDF
    We thank Nadia Klijn for helping to prepare and test participants in Experiment 1 and Rosa Franzke for help with Experiments 2 and 3. The second author is funded by an Emmy-Noether grant (nr. RE 3047/1-1) from the German Research Council (DFG). This work was also supported by a University of Malta Research Grant to the first author. What are the phonological representations that listeners use to map information about the segmental content of speech onto the mental lexicon during spoken-word recognition? Recent evidence from perceptual-learning paradigms seems to support (context-dependent) allophones as the basic representational units in spoken-word recognition. But recent evidence from a selective-adaptation paradigm seems to suggest that context-independent phonemes also play a role. We present three experiments using selective adaptation that constitute strong tests of these representational hypotheses. In Experiment 1, we tested generalization of selective adaptation using different allophones of Dutch /r/ and /l/, a case where generalization has not been found with perceptual learning. In Experiments 2 and 3, we tested generalization of selective adaptation using German back fricatives in which allophonic and phonemic identity were varied orthogonally. In all three experiments, selective adaptation was observed only if adaptors and test stimuli shared allophones. Phonemic identity, in contrast, was neither necessary nor sufficient for generalization of selective adaptation to occur. These findings and other recent data using the perceptual-learning paradigm suggest that pre-lexical processing during spoken-word recognition is based on allophones, and not on context-independent phonemes. Peer-reviewed.

    Idealized computational models for auditory receptive fields

    Full text link
    This paper presents a theory by which idealized models of auditory receptive fields can be derived in a principled axiomatic manner, from a set of structural properties to enable invariance of receptive field responses under natural sound transformations and ensure internal consistency between spectro-temporal receptive fields at different temporal and spectral scales. For defining a time-frequency transformation of a purely temporal sound signal, it is shown that the framework allows for a new way of deriving the Gabor and Gammatone filters as well as a novel family of generalized Gammatone filters, with additional degrees of freedom to obtain different trade-offs between the spectral selectivity and the temporal delay of time-causal temporal window functions. When applied to the definition of a second layer of receptive fields from a spectrogram, it is shown that the framework leads to two canonical families of spectro-temporal receptive fields, in terms of spectro-temporal derivatives of either spectro-temporal Gaussian kernels for non-causal time or the combination of a time-causal generalized Gammatone filter over the temporal domain and a Gaussian filter over the log-spectral domain. For each filter family, the spectro-temporal receptive fields can be either separable over the time-frequency domain or adapted to local glissando transformations that represent variations in logarithmic frequencies over time. Within each domain of either non-causal or time-causal time, these receptive field families are derived by uniqueness from the assumptions. It is demonstrated how the presented framework allows for computation of basic auditory features for audio processing and that it leads to predictions about auditory receptive fields with good qualitative similarity to biological receptive fields measured in the inferior colliculus (ICC) and primary auditory cortex (A1) of mammals. Comment: 55 pages, 22 figures, 3 tables
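
    Purely as a hedged numerical illustration of the classical Gammatone filter that the paper generalizes, with impulse response g(t) = t^(n-1) exp(-2*pi*b*t) cos(2*pi*f_c*t), the Python sketch below builds a small filter bank and checks that a 1 kHz tone excites the matching channel most strongly. The centre frequencies, the Glasberg-Moore ERB bandwidth formula and the order n = 4 are conventional choices assumed for the example, not parameters taken from the paper.

        import numpy as np

        fs = 16000          # sampling rate (Hz)
        order = 4           # classical Gammatone order n
        duration = 0.064    # impulse-response length (s)

        def erb(fc):
            """Glasberg & Moore equivalent rectangular bandwidth (Hz) at centre frequency fc."""
            return 24.7 * (4.37 * fc / 1000.0 + 1.0)

        def gammatone_ir(fc):
            """Impulse response g(t) = t**(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
            t = np.arange(int(duration * fs)) / fs
            b = 1.019 * erb(fc)                      # conventional bandwidth scaling
            g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
            return g / np.max(np.abs(g))             # crude peak normalization

        centre_freqs = [250, 500, 1000, 2000, 4000]  # assumed example centre frequencies (Hz)
        bank = [gammatone_ir(fc) for fc in centre_freqs]

        # Filter a 1 kHz test tone; the 1 kHz channel should carry the most energy.
        t = np.arange(fs) / fs
        tone = np.sin(2 * np.pi * 1000 * t)
        energies = [np.sum(np.convolve(tone, ir, mode="same") ** 2) for ir in bank]
        print(centre_freqs[int(np.argmax(energies))])  # expected: 1000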

    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Get PDF
    Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output.
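
    As a purely illustrative, hedged sketch of one family of modifications such reviews survey, redistributing signal energy toward higher frequencies while holding overall level constant, the Python snippet below applies a fixed high-frequency emphasis and rescales the result to the original RMS. The filter design, cutoff and gain are assumptions for the example, not a modification evaluated in the review.

        import numpy as np
        from scipy.signal import butter, lfilter

        fs = 16000
        t = np.arange(fs) / fs
        speech = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)   # placeholder "speech" signal

        def rms(x):
            return np.sqrt(np.mean(x ** 2))

        def high_frequency_emphasis(x, cutoff_hz=1000.0, gain_db=6.0):
            """Boost content above cutoff_hz by gain_db, then restore the original RMS,
            so energy is redistributed toward higher frequencies at equal overall level."""
            b, a = butter(2, cutoff_hz / (fs / 2), btype="high")
            boosted = x + (10 ** (gain_db / 20.0) - 1.0) * lfilter(b, a, x)
            return boosted * (rms(x) / rms(boosted))

        modified = high_frequency_emphasis(speech)
        print(round(rms(speech), 4), round(rms(modified), 4))   # RMS preserved by design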