
    Linking Speech Perception and Neurophysiology: Speech Decoding Guided by Cascaded Oscillators Locked to the Input Rhythm

    The premise of this study is that current models of speech perception, which are driven by acoustic features alone, are incomplete, and that the role of decoding time during memory access must be incorporated to account for observed recognition phenomena. It is postulated that decoding time is governed by a cascade of neuronal oscillators, which guide template-matching operations at a hierarchy of temporal scales. Cascaded cortical oscillations in the theta, beta, and gamma frequency bands are argued to be crucial for speech intelligibility. Intelligibility remains high so long as these oscillations stay phase-locked to the auditory input rhythm. A model (Tempo) is presented that is capable of emulating recent psychophysical data on the intelligibility of spoken sentences as a function of “packaging” rate (Ghitza and Greenberg, 2009). The data show that the intelligibility of speech that is time-compressed by a factor of 3 (i.e., a high syllabic rate) is poor (above 50% word error rate), but is substantially restored when the information stream is re-packaged by the insertion of silent gaps between successive compressed-signal intervals – a counterintuitive finding that is difficult to explain with classical models of speech perception but emerges naturally from the Tempo architecture.
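    A minimal numpy sketch of the “packaging” manipulation described in the abstract: naive uniform decimation stands in for pitch-preserving time compression, and silent gaps are inserted between successive compressed-signal chunks. The chunk and gap durations here are illustrative assumptions, not the values used by Ghitza and Greenberg (2009).

```python
import numpy as np

def repackage(signal, fs, compress=3, chunk_ms=40, gap_ms=80):
    """Time-compress a signal by decimation (a crude stand-in for
    pitch-preserving compression), then re-package it by inserting
    silent gaps between successive compressed-signal chunks."""
    compressed = signal[::compress]            # naive uniform compression
    chunk = int(fs * chunk_ms / 1000)          # samples per compressed interval
    gap = np.zeros(int(fs * gap_ms / 1000))    # silent gap restoring the rhythm
    pieces = []
    for start in range(0, len(compressed), chunk):
        pieces.append(compressed[start:start + chunk])
        pieces.append(gap)
    return np.concatenate(pieces)

fs = 16000
x = np.random.randn(fs)          # 1 s of noise standing in for speech
y = repackage(x, fs)             # roughly the original duration again
```

    Note that the re-packaged signal is close to the original duration even though two thirds of the samples have been discarded, which is the point of the manipulation: the information rate is high but the packaging rate is restored.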

    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
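    A minimal sketch of the log-mel front end mentioned in the abstract, using numpy only. The FFT size, hop, and number of mel bands are illustrative assumptions; production systems typically use a tuned library implementation.

```python
import numpy as np

def log_mel_spectrogram(x, fs=16000, n_fft=512, hop=160, n_mels=40):
    """Frame the signal, take the magnitude STFT, apply a triangular
    mel filterbank, and return log energies of shape (frames, mels)."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Mel scale conversions and filterbank edge frequencies
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / fs).astype(int)

    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    return np.log(power @ fbank.T + 1e-10)

x = np.random.randn(16000)       # 1 s of noise at 16 kHz
feat = log_mel_spectrogram(x)    # (n_frames, n_mels) feature matrix
```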

    Brain connectivity analysis from EEG signals using stable phase-synchronized states during face perception tasks

    The degree of phase synchronization between different electroencephalogram (EEG) channels is known to be a manifestation of the underlying mechanism of information coupling between different brain regions. In this paper, we apply a continuous wavelet transform (CWT) based analysis technique to EEG data, captured during face perception tasks, to explore the temporal evolution of phase synchronization from the onset of a stimulus. Our explorations show that there exists a small set (typically 3-5) of unique synchronized patterns or synchrostates, each of which is stable on the order of milliseconds. In particular, in the beta (β) band, which has been reported to be associated with visual processing tasks, the number of such stable states has consistently been found to be three. During processing of the stimulus, the switching between these states occurs abruptly, but the switching characteristic follows a well-behaved and repeatable sequence. This is observed in a single-subject analysis as well as in a multiple-subject group analysis in adults during face perception. We also show that although these patterns remain topographically similar for the general category of face perception tasks, the sequence of their occurrence and their temporal stability vary markedly between different face perception scenarios (stimuli), pointing toward different dynamical characteristics of information processing that are stimulus-specific in nature. Subsequently, we translated these stable states into brain complex networks and derived informative network measures for characterizing the degree of segregated processing and information integration in those synchrostates, leading to a new methodology for characterizing information processing in the human brain.
The proposed methodology of modeling functional brain connectivity through synchrostates may be viewed as a new way of quantitatively characterizing the subject's cognitive ability, the stimuli, and information integration/segregation capability. The work presented in this paper was supported by the FP7 EU-funded MICHELANGELO project, Grant Agreement #288241. Website: www.michelangelo-project.eu/
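    A common scalar summary of the kind of inter-channel phase synchronization discussed in this abstract is the phase-locking value (PLV). The sketch below is a generic numpy illustration (it uses an FFT-based analytic signal rather than the paper's CWT pipeline, and the signals are synthetic):

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (the standard Hilbert-transform trick)."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1
    h[1:(n + 1) // 2] = 2
    if n % 2 == 0:
        h[n // 2] = 1
    return np.fft.ifft(X * h)

def plv(x, y):
    """Phase-locking value: magnitude of the mean phasor of the
    instantaneous phase difference; ranges from 0 to 1."""
    dphi = np.angle(analytic_signal(x)) - np.angle(analytic_signal(y))
    return np.abs(np.mean(np.exp(1j * dphi)))

fs = 256
t = np.arange(2 * fs) / fs
carrier = 2 * np.pi * 20 * t                 # 20 Hz, i.e. beta-band range
a = np.sin(carrier)
b = np.sin(carrier + 0.5) + 0.1 * np.random.randn(len(t))
locked = plv(a, b)                           # high despite the constant lag
```

    A constant phase lag between two channels still yields a PLV near 1, which is why PLV-style measures capture coupling rather than waveform identity.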

    Analysis of very low quality speech for mask-based enhancement

    The complexity of the speech enhancement problem has motivated many different solutions. However, most techniques address situations in which the target speech is fully intelligible and the background noise energy is low in comparison with that of the speech. Thus, while current enhancement algorithms can improve the perceived quality, the intelligibility of the speech is not increased significantly and may even be reduced. Recent research shows that the intelligibility of very noisy speech can be improved by the use of a binary mask, in which a binary weight is applied to each time-frequency bin of the input spectrogram. There are several alternative goals for the binary mask estimator, based either on the signal-to-noise ratio (SNR) of each time-frequency bin or on the speech signal characteristics alone. Our approach to the binary mask estimation problem aims to preserve the important speech cues independently of the noise present, by identifying time-frequency regions that contain significant speech energy. The speech power spectrum varies greatly for different types of speech sound. The energy of voiced speech sounds is concentrated in the harmonics of the fundamental frequency, while that of unvoiced sounds is, in contrast, distributed across a broad range of frequencies. To identify the presence of speech energy in a noisy speech signal we have therefore developed two detection algorithms. The first is a robust algorithm that identifies voiced speech segments and estimates their fundamental frequency. The second detects the presence of sibilants and estimates their energy distribution. In addition, we have developed a robust algorithm to estimate the active level of the speech. The outputs of these algorithms are combined with other features estimated from the noisy speech to form the input to a classifier which estimates a mask that accurately reflects the time-frequency distribution of speech energy even at low SNR levels.
We evaluate a mask-based speech enhancer on a range of speech and noise signals and demonstrate a consistent increase in an objective intelligibility measure with respect to noisy speech.
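    The SNR-based masking goal described in the abstract can be sketched in a few lines of numpy. This is the textbook "ideal" binary mask (it assumes oracle access to the separate speech and noise spectrograms, which the thesis's classifier is designed to avoid needing); the toy spectrograms and the 0 dB local criterion are illustrative assumptions.

```python
import numpy as np

def ideal_binary_mask(speech_spec, noise_spec, lc_db=0.0):
    """1 where the local SNR of a time-frequency bin exceeds the
    local criterion lc_db, 0 elsewhere."""
    snr_db = 10 * np.log10(np.abs(speech_spec) ** 2 /
                           (np.abs(noise_spec) ** 2 + 1e-12) + 1e-12)
    return (snr_db > lc_db).astype(float)

def apply_mask(noisy_spec, mask):
    """Zero out the time-frequency bins classified as noise-dominated."""
    return noisy_spec * mask

# Toy spectrograms: speech energy concentrated in a few 'harmonic' rows,
# mimicking the voiced-speech structure described in the abstract.
rng = np.random.default_rng(0)
speech = np.zeros((64, 100)); speech[::8, :] = 5.0
noise = rng.rayleigh(1.0, size=(64, 100))
noisy = speech + noise
mask = ideal_binary_mask(speech, noise)
enhanced = apply_mask(noisy, mask)
```

    The mask keeps the harmonic rows, where speech dominates, and zeroes nearly everything else; an estimator such as the classifier described above tries to predict this pattern from the noisy signal alone.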

    A Musicological Analysis of Nature's Best

    Academic research on New Zealand popular music has primarily been conducted from historical and cultural perspectives. While asking important questions, these sources have rarely engaged with the musical details of New Zealand popular music. This thesis is a musicological analysis of the 100 songs from the three Nature’s Best albums. The musical perspective complements the socio-cultural research on New Zealand popular music. The Nature’s Best project was instigated by Mike Chunn in 2001 to celebrate the 75th anniversary of the Australasian Performing Right Association (APRA). All songwriting members of APRA and 100 celebrities and critics were invited to vote for their ten favourite New Zealand popular songs. Fourmyula’s 1969 hit ‘Nature’ gained the most votes. The three Nature’s Best CDs ranked the top 100 songs. The albums were a commercial success upon release in 2002 and 2003. This thesis analyses the 100 songs with regard to eight musical parameters: harmony, melodic construction, form, beat, length, tempo, introductory hooks and instrumental solos. The analytical methods were drawn from classical and popular musicology. Interviews with twelve songwriters were also conducted to gain alternative viewpoints on the analysis. The 100 songs provide a sample of New Zealand popular music from 1970 until 2000; thus, the analysis is useful for addressing questions of New Zealand musical style and traits. The results suggest New Zealand songwriters follow fundamental principles of Anglo-American songwriting, such as arched and balanced melodies, and forms based on repeated and contrasting sections. The harmonic language is similar to that of international artists of the same period; however, it appears 1970s and 1980s songwriters were more adventurous in this area than their 1990s counterparts. The instrumental solos were notable for an anti-virtuosic trait. It is argued that this feature mirrors aspects of New Zealand identity.

    How Do We Think: Modeling Interactions of Perception and Memory

    A model of artificial perception based on the self-organization of data into hierarchical structures is generalized to abstract thinking. This approach is illustrated using a two-level perception model, which is justified theoretically and tested empirically. The model can be extended to an arbitrary number of levels, with abstract concepts being understood as patterns of stable relationships between data aggregates at high representation levels.