
    Exploiting correlogram structure for robust speech recognition with multiple speech sources

    This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located at the delays corresponding to multiples of the pitch period. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are passed to a `speech fragment decoder' which employs `missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments over different conditions, which results in significantly better recognition accuracy.
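
    As a rough illustration of the correlogram representation described above (not the authors' implementation), the sketch below computes a short-time autocorrelation for each channel of an auditory filterbank; for a periodic source, peaks align across channels at lags equal to multiples of the pitch period, which gives rise to the tree-like structures the paper exploits. The array name `channels` and all parameter values are assumptions.

        # Minimal correlogram sketch (illustrative; not the paper's code).
        # Assumes `channels` is an (n_channels x n_samples) array holding the
        # outputs of an auditory filterbank, e.g. gammatone-filtered speech.
        import numpy as np

        def correlogram(channels, frame_start, frame_len, max_lag):
            """Short-time autocorrelation of each filterbank channel.

            Requires frame_start + max_lag + frame_len <= n_samples.
            Returns an (n_channels x max_lag) array in which a periodic source
            produces peaks across channels at multiples of its pitch period.
            """
            acg = np.zeros((channels.shape[0], max_lag))
            frame = channels[:, frame_start:frame_start + frame_len]
            for lag in range(max_lag):
                shifted = channels[:, frame_start + lag:frame_start + lag + frame_len]
                acg[:, lag] = np.sum(frame * shifted, axis=1)
            return acg

        # Example: at fs = 16 kHz, max_lag = 320 samples (20 ms) covers
        # fundamental frequencies down to 50 Hz, i.e. several multiples of a
        # typical pitch period.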

    Independent Educational Evaluations as Issues of Dispute in Special Education Due Process Hearings

    This study examined the pertinent details and outcomes of special education due process hearings (n = 100) that addressed independent educational evaluations (IEEs) as an issue of dispute in a 14-state sample. Variables related to the frequency of these cases, the characteristics of students involved, the specific types of IEEs requested, and the other related issues and outcomes were coded and analyzed. Psycho-educational evaluations were the type most frequently addressed in due process hearings, followed by speech-language evaluations and neuro-psychological evaluations. Statistically significant associations were identified between states regarding a) the extent to which IEEs are issues of dispute in due process hearings, b) the prevailing parties in these hearings, and c) the types of legal representation used by parents. Recommendations for policy, practice, and additional research related to IEEs and special education due process hearings are discussed.

    Acoustic signal processing based on the short-time spectrum

    The frequency domain representation of a time signal afforded by the Fourier transform is a powerful tool in acoustic signal processing. The usefulness of this representation is rooted in the mechanisms of sound production and perception. Many sources of sound exhibit normal modes or natural frequencies of vibration, and can be described concisely in the frequency domain. The human auditory system performs frequency analysis early in the hearing process, so perception is often best described by frequency domain parameters. This dissertation investigates a new approach to acoustic signal processing based on the short-time Fourier transform, a two-dimensional representation which shows the time and frequency structure of sounds. This representation is appropriate for signals such as speech and music, where the natural frequencies of the source change and the timing of these changes is important to perception. The principal advantage of this approach is that the signal processing domain is similar to the perceptual domain, so that signal modifications can be related to perceptual criteria. The mathematical basis for this type of processing is developed, and four examples are described: removal of broad-band background noise, isolation of perceptually important speech features, dynamic range compression and expansion, and removal of locally periodic interfering signals.
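
    One of the four examples listed above, removal of broad-band background noise, is commonly realized as spectral subtraction in the short-time Fourier domain. The sketch below is a generic, minimal version of that idea, not the dissertation's method; the frame length, noise-estimation window, and spectral floor are illustrative assumptions.

        # Generic spectral-subtraction sketch (not the dissertation's method).
        # Assumes the first `noise_frames` STFT frames contain noise only.
        import numpy as np
        from scipy.signal import stft, istft

        def spectral_subtract(x, fs, noise_frames=10, floor=0.05):
            f, t, X = stft(x, fs=fs, nperseg=512)
            mag, phase = np.abs(X), np.angle(X)
            # Estimate the noise magnitude spectrum from the leading frames.
            noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
            # Subtract it and clamp to a spectral floor to limit musical noise.
            clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)
            _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
            return y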

    The effects of musical training on perception and neural representation of temporal fine structure

    One of the most common complaints of persons with sensorineural hearing loss is difficulty hearing in background noise. Temporal fine structure (TFS) is one of the factors that contributes to understanding speech in the presence of background noise. TFS refers to the periodic information in speech which helps us to identify which speech sound we are listening to. TFS is also negatively affected by hearing loss, as well as age. In a quest to discover how TFS processing, and thus speech-in-noise understanding, can be improved, this study examined the effects of musical training on behavioral and physiological measures of temporal fine structure, as well as the brain-behavior relationship as it relates to frequency representation in the brainstem. This relationship was measured by two behavioral tests, frequency discrimination and the Hearing-in-Noise Test (HINT), a measure of speech understanding in background noise, and one physiologic measure, the frequency following response (FFR). The stimuli for frequency discrimination and the FFR were tone bursts of 500 Hz in quiet, 1000 Hz in quiet, 500 Hz in noise, and 1000 Hz in noise. A total of 28 subjects were tested, 16 musicians and 12 non-musicians. The results showed that musicians had better frequency difference limens (FDLs) than non-musicians. For the physiologic measure, musical experience did not affect phase-locked representations of TFS. Musicians also did not have better signal-to-noise ratios on the HINT. There were no significant brain-behavior relationships between measures, except that better (lower) FDL thresholds at 1000 Hz in quiet were associated with worse (lower) phase coherence at 1000 Hz in quiet. A greater number of years of musical experience was related to better (lower) FDLs for the conditions in quiet but not in noise. The years of training did not relate to FFR phase coherence, FFR amplitude, or HINT scores. It was concluded that musical training significantly enhanced behavioral TFS processing; however, no significant effects were noted for the neural representation of TFS or speech-in-noise understanding.
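
    The phase-coherence measure referred to above can be illustrated as inter-trial phase coherence of the FFR at the tone-burst frequency. The following sketch is an assumption about how such a measure might be computed, not the study's analysis code.

        # Hypothetical inter-trial phase coherence of the FFR at one frequency.
        # `trials` is an (n_trials x n_samples) array of single-trial epochs.
        import numpy as np

        def phase_coherence(trials, fs, freq):
            n = trials.shape[1]
            k = int(round(freq * n / fs))        # FFT bin nearest `freq`
            spectra = np.fft.rfft(trials, axis=1)[:, k]
            phasors = spectra / np.abs(spectra)  # keep phase, discard amplitude
            # 1.0 = identical phase in every trial; ~0 = random phase.
            return np.abs(phasors.mean())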

    The neural representation of frequency in quiet and noise across the adult life span

    The purpose of the present study was to examine why older adults have trouble with speech-in-noise understanding. Difficulty with speech-in-noise comprehension has been associated with age-related degradation in frequency processing. Our study sought to investigate this relationship by examining the neural representation of frequency in quiet and in noise across the adult life span. To do this, one behavioral correlate of frequency processing, frequency difference limens (FDLs), and one electrophysiological correlate, the frequency following response (FFR), were utilized. In the present study, we specifically focus on the electrophysiological measures of frequency processing across the adult life span. It was hypothesized that as age increased, FFR phase coherence and FFR amplitude would decrease (i.e., neural synchrony was expected to degrade with age). It was also hypothesized that masking noise would have an adverse effect on both FFR phase coherence and amplitude, with older adults showing larger effects than younger adults. Properly identifying the underlying source(s) of impairment is essential to designing appropriate treatment plans that effectively target these underlying deficits. Thus, the present study aims to determine how frequency processing is affected by aging and what consequences this may have on speech-in-noise understanding in older adults.

    Neural encoding of the speech envelope by children with developmental dyslexia.

    Developmental dyslexia is consistently associated with difficulties in processing phonology (linguistic sound structure) across languages. One view is that dyslexia is characterised by a cognitive impairment in the "phonological representation" of word forms, which arises long before the child presents with a reading problem. Here we investigate a possible neural basis for developmental phonological impairments. We assess the neural quality of speech encoding in children with dyslexia by measuring the accuracy of low-frequency speech envelope encoding using EEG. We tested children with dyslexia and chronological age-matched (CA) and reading-level matched (RL) younger children. Participants listened to semantically unpredictable sentences in a word report task. The sentences were noise-vocoded to increase reliance on envelope cues. Envelope reconstruction for envelopes between 0 and 10 Hz showed that the children with dyslexia had significantly poorer speech encoding in the 0-2 Hz band compared to both CA and RL controls. These data suggest that impaired neural encoding of low-frequency speech envelopes, related to speech prosody, may underpin the phonological deficit that causes dyslexia across languages.
    Medical Research Council (Grant ID: G0902375). This is the final version of the article; it first appeared from Elsevier via http://dx.doi.org/10.1016/j.bandl.2016.06.00
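
    For readers unfamiliar with envelope-reconstruction analyses, the sketch below shows one common way to extract the low-frequency speech envelope used as the reconstruction target. The Hilbert-plus-low-pass recipe and the filter order are assumptions, with the 2 Hz cutoff chosen to match the 0-2 Hz band reported above.

        # Assumed envelope-extraction recipe: Hilbert envelope followed by a
        # zero-phase low-pass filter; 2 Hz matches the 0-2 Hz band above.
        import numpy as np
        from scipy.signal import hilbert, butter, sosfiltfilt

        def low_freq_envelope(speech, fs, cutoff=2.0):
            envelope = np.abs(hilbert(speech))   # broadband amplitude envelope
            sos = butter(3, cutoff, btype='low', fs=fs, output='sos')
            return sosfiltfilt(sos, envelope)    # zero-phase filtering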

    Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

    Discrete audio representation, also known as audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in the audio domain. To this end, various compression- and representation-learning-based tokenization schemes have been proposed. However, there is limited investigation into the performance of compression-based audio tokens compared to well-established mel-spectrogram features across various speaker and speech related tasks. In this paper, we evaluate compression-based audio tokens on three tasks: Speaker Verification, Diarization and (Multi-lingual) Speech Recognition. Our findings indicate that (i) the models trained on audio tokens perform competitively, on average within 1% of mel-spectrogram features for all the tasks considered, but do not surpass them yet; (ii) these models exhibit robustness to out-of-domain narrowband data, particularly in speaker tasks; (iii) audio tokens allow for 20x compression relative to mel-spectrogram features with minimal loss of performance in speech and speaker related tasks, which is crucial for low bit-rate applications; and (iv) the examined Residual Vector Quantization (RVQ) based audio tokenizer exhibits a low-pass frequency response characteristic, offering a plausible explanation for the observed results and providing insight for future tokenizer designs.
    Comment: Preprint. Submitted to ICASSP 202
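
    To make the RVQ tokenizer family evaluated above concrete, the sketch below encodes feature frames with a stack of codebooks, each quantizing the residual left by the previous stage. It is a generic illustration, not the paper's tokenizer: the codebooks here are random stand-ins, whereas a real tokenizer learns them from data.

        # Minimal residual vector quantization (RVQ) encoder. Codebooks are
        # random stand-ins; real tokenizers learn them from data.
        import numpy as np

        def rvq_encode(frames, codebooks):
            """frames: (n_frames x dim); codebooks: list of (size x dim) arrays.
            Returns integer tokens of shape (n_frames x n_stages)."""
            residual = frames.copy()
            tokens = []
            for cb in codebooks:
                # Nearest codeword for each frame's current residual.
                dists = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=2)
                idx = dists.argmin(axis=1)
                tokens.append(idx)
                residual = residual - cb[idx]    # next stage quantizes what remains
            return np.stack(tokens, axis=1)

        # Example: 4 stages of 256 codewords over 64-dim frames gives
        # 4 tokens (32 bits) per frame.
        rng = np.random.default_rng(0)
        codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]
        tokens = rvq_encode(rng.normal(size=(100, 64)), codebooks)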