18,413 research outputs found

    Neural population coding: combining insights from microscopic and mass signals

    Get PDF
    Behavior relies on the distributed and coordinated activity of neural populations. Population activity can be measured using multi-neuron recordings and neuroimaging. Neural recordings reveal how the heterogeneity, sparseness, timing, and correlation of population activity shape information processing in local networks, whereas neuroimaging shows how long-range coupling and brain states impact on local activity and perception. To obtain an integrated perspective on neural information processing we need to combine knowledge from both levels of investigation. We review recent progress of how neural recordings, neuroimaging, and computational approaches begin to elucidate how interactions between local neural population activity and large-scale dynamics shape the structure and coding capacity of local information representations, make them state-dependent, and control distributed populations that collectively shape behavior

    Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI

    Full text link
    Vocal tract configurations play a vital role in generating distinguishable speech sounds, by modulating the airflow and creating different resonant cavities in speech production. They contain abundant information that can be utilized to better understand the underlying speech production mechanism. As a step towards automatic mapping of vocal tract shape geometry to acoustics, this paper employs effective video action recognition techniques, like Long-term Recurrent Convolutional Networks (LRCN) models, to identify different vowel-consonant-vowel (VCV) sequences from dynamic shaping of the vocal tract. Such a model typically combines a CNN based deep hierarchical visual feature extractor with Recurrent Networks, that ideally makes the network spatio-temporally deep enough to learn the sequential dynamics of a short video clip for video classification tasks. We use a database consisting of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The comparative performances of this class of algorithms under various parameter settings and for various classification tasks are discussed. Interestingly, the results show a marked difference in the model performance in the context of speech classification with respect to generic sequence or video classification tasks.Comment: To appear in the INTERSPEECH 2018 Proceeding

    Idealized computational models for auditory receptive fields

    Full text link
    This paper presents a theory by which idealized models of auditory receptive fields can be derived in a principled axiomatic manner, from a set of structural properties to enable invariance of receptive field responses under natural sound transformations and ensure internal consistency between spectro-temporal receptive fields at different temporal and spectral scales. For defining a time-frequency transformation of a purely temporal sound signal, it is shown that the framework allows for a new way of deriving the Gabor and Gammatone filters as well as a novel family of generalized Gammatone filters, with additional degrees of freedom to obtain different trade-offs between the spectral selectivity and the temporal delay of time-causal temporal window functions. When applied to the definition of a second-layer of receptive fields from a spectrogram, it is shown that the framework leads to two canonical families of spectro-temporal receptive fields, in terms of spectro-temporal derivatives of either spectro-temporal Gaussian kernels for non-causal time or the combination of a time-causal generalized Gammatone filter over the temporal domain and a Gaussian filter over the logspectral domain. For each filter family, the spectro-temporal receptive fields can be either separable over the time-frequency domain or be adapted to local glissando transformations that represent variations in logarithmic frequencies over time. Within each domain of either non-causal or time-causal time, these receptive field families are derived by uniqueness from the assumptions. It is demonstrated how the presented framework allows for computation of basic auditory features for audio processing and that it leads to predictions about auditory receptive fields with good qualitative similarity to biological receptive fields measured in the inferior colliculus (ICC) and primary auditory cortex (A1) of mammals.Comment: 55 pages, 22 figures, 3 table

    Histogram of gradients of Time-Frequency Representations for Audio scene detection

    Full text link
    This paper addresses the problem of audio scenes classification and contributes to the state of the art by proposing a novel feature. We build this feature by considering histogram of gradients (HOG) of time-frequency representation of an audio scene. Contrarily to classical audio features like MFCC, we make the hypothesis that histogram of gradients are able to encode some relevant informations in a time-frequency {representation:} namely, the local direction of variation (in time and frequency) of the signal spectral power. In addition, in order to gain more invariance and robustness, histogram of gradients are locally pooled. We have evaluated the relevance of {the novel feature} by comparing its performances with state-of-the-art competitors, on several datasets, including a novel one that we provide, as part of our contribution. This dataset, that we make publicly available, involves 1919 classes and contains about 900900 minutes of audio scene recording. We thus believe that it may be the next standard dataset for evaluating audio scene classification algorithms. Our comparison results clearly show that our HOG-based features outperform its competitor

    Environmental Sound Classification with Parallel Temporal-spectral Attention

    Full text link
    Convolutional neural networks (CNN) are one of the best-performing neural network architectures for environmental sound classification (ESC). Recently, temporal attention mechanisms have been used in CNN to capture the useful information from the relevant time frames for audio classification, especially for weakly labelled data where the onset and offset times of the sound events are not applied. In these methods, however, the inherent spectral characteristics and variations are not explicitly exploited when obtaining the deep features. In this paper, we propose a novel parallel temporal-spectral attention mechanism for CNN to learn discriminative sound representations, which enhances the temporal and spectral features by capturing the importance of different time frames and frequency bands. Parallel branches are constructed to allow temporal attention and spectral attention to be applied respectively in order to mitigate interference from the segments without the presence of sound events. The experiments on three environmental sound classification (ESC) datasets and two acoustic scene classification (ASC) datasets show that our method improves the classification performance and also exhibits robustness to noise.Comment: submitted to INTERSPEECH202
    • …
    corecore