637 research outputs found

    A Generative Product-of-Filters Model of Audio

    We propose the product-of-filters (PoF) model, a generative model that decomposes audio spectra as sparse linear combinations of "filters" in the log-spectral domain. PoF makes similar assumptions to those used in the classic homomorphic filtering approach to signal processing, but replaces hand-designed decompositions built of basic signal processing operations with a learned decomposition based on statistical inference. This paper formulates the PoF model and derives a mean-field method for posterior inference and a variational EM algorithm to estimate the model's free parameters. We demonstrate PoF's potential for audio processing on a bandwidth expansion task, and show that PoF can serve as an effective unsupervised feature extractor for a speaker identification task. Comment: ICLR 2014 conference-track submission. Added link to the source cod
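    The name "product of filters" follows from the log-spectral formulation in the abstract above: a linear combination of filters in the log domain is a product of exponentiated filters in the linear domain. The sketch below only illustrates that identity. The filter values, the activations, and the function names `log_spectrum` and `product_of_filters` are hypothetical, and the priors, noise model, and inference procedure of the actual PoF model are omitted.

```python
import math

def log_spectrum(filters, activations):
    """Weighted sum of filters, evaluated per frequency bin in the log domain."""
    n_bins = len(filters[0])
    return [
        sum(a * f[k] for a, f in zip(activations, filters))
        for k in range(n_bins)
    ]

def product_of_filters(filters, activations):
    """Same spectrum built in the linear domain as a product of exp(a * filter)."""
    n_bins = len(filters[0])
    spectrum = [1.0] * n_bins
    for a, f in zip(activations, filters):
        for k in range(n_bins):
            spectrum[k] *= math.exp(a * f[k])
    return spectrum

# Hypothetical toy values: two filters over three frequency bins.
filters = [[0.0, 1.0, 2.0], [1.0, 0.0, -1.0]]
activations = [0.5, 2.0]

log_spec = log_spectrum(filters, activations)
spec = product_of_filters(filters, activations)

# Taking the log of the product recovers the linear combination.
assert all(abs(math.log(s) - l) < 1e-9 for s, l in zip(spec, log_spec))
```

    In the model described by the abstract, the activations would be sparse and inferred from data rather than fixed by hand as they are here.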

    A kepstrum approach to filtering, smoothing and prediction

    The kepstrum (or complex cepstrum) method is revisited and applied to the problem of spectral factorization where the spectrum is directly estimated from observations. The solution to this problem in turn leads to a new approach to optimal filtering, smoothing and prediction using the Wiener theory. Unlike previous approaches to adaptive and self-tuning filtering, the technique, when implemented, does not require a priori information on the type or order of the signal generating model. And unlike other approaches - with the exception of spectral subtraction - no state-space or polynomial model is necessary. In this first paper results are restricted to stationary signal and additive white noise

    Development of rotorcraft interior noise control concepts. Phase 1: Definition study

    A description of helicopter noise, diagnostic techniques for source and path identification, an interior noise prediction model, and a measurement program for model validation are provided

    Unsupervised Voice Activity Detection by Modeling Source and System Information using Zero Frequency Filtering

    Voice activity detection (VAD) is an important pre-processing step for speech technology applications. The task consists of deriving segment boundaries of audio signals which contain voicing information. In recent years, it has been shown that voice source and vocal tract system information can be extracted using zero-frequency filtering (ZFF) without making any explicit model assumptions about the speech signal. This paper investigates the potential of zero-frequency filtering for jointly modeling voice source and vocal tract system information, and proposes two approaches for VAD. The first approach demarcates voiced regions using a composite signal composed of different zero-frequency filtered signals. The second approach feeds the composite signal as input to the rVAD algorithm. These approaches are compared with other supervised and unsupervised VAD methods in the literature, and are evaluated on the Aurora-2 database, across a range of SNRs (20 to -5 dB). Our studies show that the proposed ZFF-based methods perform comparably to state-of-the-art VAD methods and are more invariant to added degradation and different channel characteristics. Comment: Accepted at Interspeech 202

    Robust Auditory-Based Speech Processing Using the Average Localized Synchrony Detection

    In this paper, a new auditory-based speech processing system based on the biologically rooted property of the average localized synchrony detection (ALSD) is proposed. The system detects periodicity in the speech signal at Bark-scaled frequencies while reducing the response’s spurious peaks and sensitivity to implementation mismatches, and hence presents a consistent and robust representation of the formants. The system is evaluated for its formant extraction ability while reducing spurious peaks. It is compared with other auditory-based and traditional systems in the tasks of vowel and consonant recognition on clean speech from the TIMIT database and in the presence of noise. The results illustrate the advantage of the ALSD system in extracting the formants and reducing the spurious peaks. They also indicate the superiority of the synchrony measures over the mean-rate in the presence of noise

    Speech Processing in Computer Vision Applications

    Deep learning has recently proven to be a viable asset in determining features in the field of speech analysis. Deep learning methods like Convolutional Neural Networks facilitate the expansion of specific feature information in waveforms, allowing networks to create more feature-dense representations of data. Our work attempts to address the problems of re-creating a face given a speaker's voice and of speaker identification using deep learning methods. In this work, we first review the fundamental background in speech processing and its related applications. Then we introduce novel deep learning-based methods for speech feature analysis. Finally, we present our deep learning approaches to speaker identification and speech-to-face synthesis. The presented method can convert a speaker audio sample to an image of their predicted face. This framework is composed of several networks chained together, each performing an essential step in the conversion process. These include audio embedding, encoding, and face generation networks, respectively. Our experiments show that certain features can map to the face, that DNNs can create a speaker's face from their voice, and that a GUI could be used in conjunction to display a speaker recognition network's data

    Analysis of nonmodal glottal event patterns with application to automatic speaker recognition

    Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2008. Includes bibliographical references (p. 211-215). Regions of phonation exhibiting nonmodal characteristics are likely to contain information about speaker identity, language, dialect, and vocal-fold health. As a basis for testing such dependencies, we develop a representation of patterns in the relative timing and height of nonmodal glottal pulses. To extract the timing and height of candidate pulses, we investigate a variety of inverse-filtering schemes including maximum-entropy deconvolution that minimizes predictability of a signal and minimum-entropy deconvolution that maximizes pulse-likeness. Hybrid formulations of these methods are also considered. We then derive a theoretical framework for understanding frequency- and time-domain properties of a pulse sequence, a process that sheds light on the transformation of nonmodal pulse trains into useful parameters. In the frequency domain, we introduce the first comprehensive mathematical derivation of the effect of deterministic and stochastic source perturbation on the short-time spectrum. We also propose a pitch representation of nonmodality that provides an alternative viewpoint on the frequency content that does not rely on Fourier bases. In developing time-domain properties, we use projected low-dimensional histograms of feature vectors derived from pulse timing and height parameters. For these features, we have found clusters of distinct pulse patterns, reflecting a wide variety of glottal-pulse phenomena including near-modal phonation, shimmer and jitter, diplophonia and triplophonia, and aperiodicity. Using temporal relationships between successive feature vectors, an algorithm by which to separate these different classes of glottal-pulse characteristics has also been developed.
    We have used our glottal-pulse-pattern representation to automatically test for one signal dependency: speaker dependence of glottal-pulse sequences. This choice is motivated by differences observed between talkers in our separated feature space. Using an automatic speaker verification experiment, we investigate tradeoffs in speaker dependency for short-time pulse patterns, reflecting local irregularity, as well as long-time patterns related to higher-level cyclic variations. Results, using speakers with a broad array of modal and nonmodal behaviors, indicate a high accuracy in speaker recognition performance, complementary to the use of conventional mel-cepstral features. These results suggest that there is rich structure to the source excitation that provides information about a particular speaker's identity. By Nicolas Malyska. Ph.D.

    Noise-Robust Voice Conversion

    A persistent challenge in speech processing is the presence of noise that reduces the quality of speech signals. Whether natural speech is used as input or speech is the desirable output to be synthesized, noise degrades the performance of these systems and causes output speech to be unnatural. Speech enhancement addresses this problem, typically seeking to improve the input speech or to post-process the (re)synthesized speech. An intriguing complement to post-processing speech signals is voice conversion, in which speech by one person (source speaker) is made to sound as if spoken by a different person (target speaker). Traditionally, the majority of speech enhancement and voice conversion methods rely on parametric modeling of speech. A promising complement to parametric models is an inventory-based approach, which is the focus of this work. In inventory-based speech systems, one records an inventory of clean speech signals as a reference. Noisy speech (in the case of enhancement) or target speech (in the case of conversion) can then be replaced by the best-matching clean speech in the inventory, which is found via a correlation search method. Such an approach has the potential to alleviate the intelligibility and unnaturalness issues often encountered by speech processing systems based on parametric modeling. This work investigates and compares inventory-based speech enhancement methods with conventional ones. In addition, the inventory search method is applied to estimate source speaker characteristics for voice conversion in noisy environments. Two noisy-environment voice conversion systems were constructed for a comparative study: a direct voice conversion system and an inventory-based voice conversion system, both with limited noise filtering at the front end. Results from this work suggest that the inventory method offers encouraging improvements over the direct conversion method
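    The inventory lookup described in the abstract above can be pictured as a nearest-match search. The abstract does not specify the correlation measure or the features, so the sketch below assumes normalized correlation (cosine similarity) between short feature vectors. The names `normalized_correlation` and `best_match` and the toy data are hypothetical, not taken from the work itself.

```python
import math

def normalized_correlation(x, y):
    """Cosine-style correlation between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # a silent frame matches nothing
    return dot / (norm_x * norm_y)

def best_match(query, inventory):
    """Return (index, score) of the inventory entry most correlated with query."""
    scores = [normalized_correlation(query, entry) for entry in inventory]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]

# Hypothetical clean inventory and one noisy query frame.
inventory = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
noisy_frame = [0.1, 0.9, 1.1]

index, score = best_match(noisy_frame, inventory)
clean_replacement = inventory[index]
```

    Here the noisy frame is replaced by the second inventory entry, the one it correlates with most strongly. A real system would search over many frames and would likely add continuity constraints between successive matches.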

    Wavelet-based techniques for speech recognition

    In this thesis, new wavelet-based techniques have been developed for the extraction of features from speech signals for the purpose of automatic speech recognition (ASR). One of the advantages of the wavelet transform over the short-time Fourier transform (STFT) is its capability to process non-stationary signals. Since speech signals are not strictly stationary, the wavelet transform is a better choice for time-frequency transformation of these signals. In addition, it has compactly supported basis functions, thereby reducing the amount of computation compared with the STFT, where an overlapping window is needed. [Continues.]
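    To make "compactly supported basis functions" concrete, the sketch below applies one level of the Haar wavelet transform, the simplest compactly supported wavelet. The abstract does not say which wavelet family the thesis uses, so Haar is an assumption chosen for brevity. The function names and the sample signal are hypothetical.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)

def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length signal."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    # Each coefficient depends on only two neighbouring samples,
    # so no overlapping analysis window is required.
    approx = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_step, interleaving the reconstructed sample pairs."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) * SCALE)
        signal.append((a - d) * SCALE)
    return signal

# Hypothetical short signal with a flat pair in the middle.
samples = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0]
approx, detail = haar_step(samples)
restored = haar_inverse(approx, detail)
```

    The detail coefficient for the flat pair (8.0, 8.0) is zero, and the inverse step recovers the original samples up to floating-point error.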