485 research outputs found

    Percepcijska utemeljenost kepstranih mjera udaljenosti za primjene u obradi govora

    Get PDF
    Currently, one of the most widely used distance measures in speech and speaker recognition is the Euclidean distance between mel frequency cepstral coefficients (MFCC). MFCCs are based on filter bank algorithm whose filters are equally spaced on a perceptually motivated mel frequency scale. The value of mel cepstral vector, as well as the properties of the corresponding cepstral distance, are determined by several parameters used in mel cepstral analysis. The aim of this work is to examine compatibility of MFCC measure with human perception for different values of parameters in the analysis. By analysing mel filter bank parameters it is found that filter bank with 24 bands, 220 mels bandwidth and band overlap coefficient equal and higher than one gives optimal spectral distortion (SD) distance measures. For this kind of mel filter bank, the difference between vowels can be recognised for full-length mel cepstral SD RMS measure higher than 0.4 - 0.5 dB. Further on, we will show that usage of truncated mel cepstral vector (12 coefficients) is justified for speech recognition, but may be arguable for speaker recognition. We also analysed the impact of aliasing in cepstral domain on cepstral distortion measures. The results showed high correlation of SD distances calculated from aperiodic and periodic mel cepstrum, leading to the conclusion that the impact of aliasing is generally minor. There are rare exceptions where aliasing is present, and these were also analysed.Jedna od danas najčešće korištenih mjera u automatskom prepoznavanju govora i govornika je mjera euklidske udaljenosti MFCC vektora. Algoritam za izračunavanje mel frekvencijskih kepstralnih koeficijenata zasniva se na filtarskom slogu kod kojeg su pojasi ekvidistantno raspoređeni na percepcijski motiviranoj mel skali. Na vrijednost mel kepstralnog vektora, a samim time i na svojstva kepstralne mjere udaljenosti glasova, utječe veći broj parametara sustava za kepstralnu analizu. Tema ovog rada je ispitati usklađenost MFCC mjere sa stvarnim percepcijskim razlikama za različite vrijednosti parametara analize. Analizom parametara mel filtarskog sloga utvrdili smo da filtar sa 24 pojasa, širine 220 mel-a i faktorom preklapanja filtra većim ili jednakim jedan, daje optimalne SD mjere koje se najbolje slažu s percepcijom. Za takav mel filtarski slog granica čujnosti razlike između glasova je 0.4-0.5 dB, mjereno SD RMS razlikom potpunih mel kepstralnih vektora. Također, pokazat ćemo da je korištenje mel kepstralnog vektora odrezanog na konačnu dužinu (12 koeficijenata) opravdano za prepoznavanje govora, ali da bi moglo biti upitno u primjenama prepoznavanja govornika. Analizirali smo i utjecaj preklapanja spektara u kepstralnoj domeni na mjere udaljenosti glasova. Utvrđena je izrazita koreliranost SD razlika izračunatih iz aperiodskog i periodičkog mel kepstra iz čega zaključujemo da je utjecaj preklapanja spektara generalno zanemariv. Postoje rijetke iznimke kod kojih je utjecaj preklapanja spektara prisutan, te su one posebno analizirane

    Low-delay nonuniform pseudo-QMF banks with application to speech enhancement

    Get PDF
    Journal ArticleAbstract-This paper presents a method for designing low-delay nonuniform pseudo quadrature mirror filter (QMF) banks. This method is motivated by the work of Li, Nguyen, and Tantaratana, in which the nonuniform filter bank is realized by combining an appropriate number of adjacent sub-bands of a uniform pseudo-QMF bank. In prior work, the prototype filter of the uniform pseudo-QMF bank was constrained to have linear phase and the overall delay associated with the filter bank was often unacceptably large for filter banks with a large number of sub-bands. This paper proposes a pseudo-QMF filter bank design technique that significantly reduces the delay by relaxing the linear phase constraints. An example in which an oversampled critical-band nonuniform filter bank is designed and applied to a two-state modeling speech enhancement system is presented in this paper. Comparison of the performance of this system to competing methods employing tree-structured, linear phase multiresolution analysis indicates that the approach described in this paper strikes a good balance between system performance and low delay

    Sparse representation for audio noise removal using zero-zone quantizers

    Get PDF
    In zero zone quantization, bins around zero are quantized to a zero value. This kind of quantization can be applied on orthogonal transforms to remove the unwanted or redundant signal. Transforms reveal structures and properties of a signal and hence careful application of a zero zone over the transform coefficients leads to noise removal. In this thesis, such quantizers are applied over Discrete Fourier Transform and Karhunen Loeve Transform coefficients separately, and outputs compared. Further, the localization of the zero zones to certain frequencies leads to better performance in terms of noise removal. PEAQ (Perceptual Evaluation of Audio Quality) scores have been used to measure the objective quality of the denoised signal

    Real-time Digital Simulation of Guitar Amplifiers as Audio Effects

    Get PDF
    Práce se zabývá číslicovou simulací kytarových zesilovačů, jakož to nelineárních analogových hudebních efektů, v reálném čase. Hlavním cílem práce je návrh algoritmů, které by umožnily simulaci složitých systémů v reálném čase. Tyto algoritmy jsou prevážně založeny na automatizované DK-metodě a aproximaci nelineárních funkcí. Kvalita navržených algoritmů je stanovana pomocí poslechových testů.The work deals with the real-time digital simulation of guitar amplifiers considered as nonlinear analog audio effects. The main aim is to design algorithms which are able to simulate complex systems in real-time. These algorithms are mainly based on the automated DK-method and the approximation of nonlinear functions. Quality of the designed algorithms is evaluated using listening tests.

    Audio Coding Based on Integer Transforms

    Get PDF
    Die Audiocodierung hat sich in den letzten Jahren zu einem sehr populären Forschungs- und Anwendungsgebiet entwickelt. Insbesondere gehörangepasste Verfahren zur Audiocodierung, wie etwa MPEG-1 Layer-3 (MP3) oder MPEG-2 Advanced Audio Coding (AAC), werden häufig zur effizienten Speicherung und Übertragung von Audiosignalen verwendet. Für professionelle Anwendungen, wie etwa die Archivierung und Übertragung im Studiobereich, ist hingegen eher eine verlustlose Audiocodierung angebracht. Die bisherigen Ansätze für gehörangepasste und verlustlose Audiocodierung sind technisch völlig verschieden. Moderne gehörangepasste Audiocoder basieren meist auf Filterbänken, wie etwa der überlappenden orthogonalen Transformation "Modifizierte Diskrete Cosinus-Transformation" (MDCT). Verlustlose Audiocoder hingegen verwenden meist prädiktive Codierung zur Redundanzreduktion. Nur wenige Ansätze zur transformationsbasierten verlustlosen Audiocodierung wurden bisher versucht. Diese Arbeit präsentiert einen neuen Ansatz hierzu, der das Lifting-Schema auf die in der gehörangepassten Audiocodierung verwendeten überlappenden Transformationen anwendet. Dies ermöglicht eine invertierbare Integer-Approximation der ursprünglichen Transformation, z.B. die IntMDCT als Integer-Approximation der MDCT. Die selbe Technik kann auch für Filterbänke mit niedriger Systemverzögerung angewandt werden. Weiterhin ermöglichen ein neuer, mehrdimensionaler Lifting-Ansatz und eine Technik zur Spektralformung von Quantisierungsfehlern eine Verbesserung der Approximation der ursprünglichen Transformation. Basierend auf diesen neuen Integer-Transformationen werden in dieser Arbeit neue Verfahren zur Audiocodierung vorgestellt. Die Verfahren umfassen verlustlose Audiocodierung, eine skalierbare verlustlose Erweiterung eines gehörangepassten Audiocoders und einen integrierten Ansatz zur fein skalierbaren gehörangepassten und verlustlosen Audiocodierung. Schließlich wird mit Hilfe der Integer-Transformationen ein neuer Ansatz zur unhörbaren Einbettung von Daten mit hohen Datenraten in unkomprimierte Audiosignale vorgestellt.In recent years audio coding has become a very popular field for research and applications. Especially perceptual audio coding schemes, such as MPEG-1 Layer-3 (MP3) and MPEG-2 Advanced Audio Coding (AAC), are widely used for efficient storage and transmission of music signals. Nevertheless, for professional applications, such as archiving and transmission in studio environments, lossless audio coding schemes are considered more appropriate. Traditionally, the technical approaches used in perceptual and lossless audio coding have been separate worlds. In perceptual audio coding, the use of filter banks, such as the lapped orthogonal transform "Modified Discrete Cosine Transform" (MDCT), has been the approach of choice being used by many state of the art coding schemes. On the other hand, lossless audio coding schemes mostly employ predictive coding of waveforms to remove redundancy. Only few attempts have been made so far to use transform coding for the purpose of lossless audio coding. This work presents a new approach of applying the lifting scheme to lapped transforms used in perceptual audio coding. This allows for an invertible integer-to-integer approximation of the original transform, e.g. the IntMDCT as an integer approximation of the MDCT. The same technique can also be applied to low-delay filter banks. A generalized, multi-dimensional lifting approach and a noise-shaping technique are introduced, allowing to further optimize the accuracy of the approximation to the original transform. Based on these new integer transforms, this work presents new audio coding schemes and applications. The audio coding applications cover lossless audio coding, scalable lossless enhancement of a perceptual audio coder and fine-grain scalable perceptual and lossless audio coding. Finally an approach to data hiding with high data rates in uncompressed audio signals based on integer transforms is described

    Parametric spatial audio processing utilising compact microphone arrays

    Get PDF
    This dissertation focuses on the development of novel parametric spatial audio techniques using compact microphone arrays. Compact arrays are of special interest since they can be adapted to fit in portable devices, opening the possibility of exploiting the potential of immersive spatial audio algorithms in our daily lives. The techniques developed in this thesis consider the use of signal processing algorithms adapted for human listeners, thus exploiting the capabilities and limitations of human spatial hearing. The findings of this research are in the following three areas of spatial audio processing: directional filtering, spatial audio reproduction, and direction of arrival estimation.  In directional filtering, two novel algorithms have been developed based on the cross-pattern coherence (CroPaC). The method essentially exploits the directional response of two different types of beamformers by using their cross-spectrum to estimate a soft masker. The soft masker provides a probability-like parameter that indicates whether there is sound present in specific directions. It is then used as a post-filter to provide further suppression of directionally distributed noise at the output of a beamformer. The performance of these algorithms represent a significant improvement over previous state-of-the-art methods.  In parametric spatial audio reproduction, an algorithm is developed for multi-channel loudspeaker and headphone rendering. Current limitations in spatial audio reproduction are related to high inter-channel coherence between the channels, which is common in signal-independent systems, or time-frequency artefacts in parametric systems. The developed algorithm focuses on solving these limitations by utilising two sets of beamformers. The first set of beamformers, namely analysis beamformers, is used to estimate a set of perceptually-relevant sound-field parameters, such as the separate channel energies, inter-channel time differences and inter-channel coherences of the target-output-setup signals. The directionality of the analysis beamformers is defined so that it follows that of typical loudspeaker panning functions and, for headphone reproduction, that of the head-related transfer functions (HRTFs). The directionality of the second set of high audio quality beamformers is then enhanced with the parametric information derived from the analysis beamformers. Listening tests confirm the perceptual benefit of such type of processing. In direction of arrival (DOA) estimation, histogram analysis of beamforming and active intensity based DOA estimators has been proposed. Numerical simulations and experiments with prototype and commercial microphone arrays show that the accuracy of DOA estimation is improved

    Audio Coding Based on Integer Transforms

    Get PDF
    Die Audiocodierung hat sich in den letzten Jahren zu einem sehr populären Forschungs- und Anwendungsgebiet entwickelt. Insbesondere gehörangepasste Verfahren zur Audiocodierung, wie etwa MPEG-1 Layer-3 (MP3) oder MPEG-2 Advanced Audio Coding (AAC), werden häufig zur effizienten Speicherung und Übertragung von Audiosignalen verwendet. Für professionelle Anwendungen, wie etwa die Archivierung und Übertragung im Studiobereich, ist hingegen eher eine verlustlose Audiocodierung angebracht. Die bisherigen Ansätze für gehörangepasste und verlustlose Audiocodierung sind technisch völlig verschieden. Moderne gehörangepasste Audiocoder basieren meist auf Filterbänken, wie etwa der überlappenden orthogonalen Transformation "Modifizierte Diskrete Cosinus-Transformation" (MDCT). Verlustlose Audiocoder hingegen verwenden meist prädiktive Codierung zur Redundanzreduktion. Nur wenige Ansätze zur transformationsbasierten verlustlosen Audiocodierung wurden bisher versucht. Diese Arbeit präsentiert einen neuen Ansatz hierzu, der das Lifting-Schema auf die in der gehörangepassten Audiocodierung verwendeten überlappenden Transformationen anwendet. Dies ermöglicht eine invertierbare Integer-Approximation der ursprünglichen Transformation, z.B. die IntMDCT als Integer-Approximation der MDCT. Die selbe Technik kann auch für Filterbänke mit niedriger Systemverzögerung angewandt werden. Weiterhin ermöglichen ein neuer, mehrdimensionaler Lifting-Ansatz und eine Technik zur Spektralformung von Quantisierungsfehlern eine Verbesserung der Approximation der ursprünglichen Transformation. Basierend auf diesen neuen Integer-Transformationen werden in dieser Arbeit neue Verfahren zur Audiocodierung vorgestellt. Die Verfahren umfassen verlustlose Audiocodierung, eine skalierbare verlustlose Erweiterung eines gehörangepassten Audiocoders und einen integrierten Ansatz zur fein skalierbaren gehörangepassten und verlustlosen Audiocodierung. Schließlich wird mit Hilfe der Integer-Transformationen ein neuer Ansatz zur unhörbaren Einbettung von Daten mit hohen Datenraten in unkomprimierte Audiosignale vorgestellt.In recent years audio coding has become a very popular field for research and applications. Especially perceptual audio coding schemes, such as MPEG-1 Layer-3 (MP3) and MPEG-2 Advanced Audio Coding (AAC), are widely used for efficient storage and transmission of music signals. Nevertheless, for professional applications, such as archiving and transmission in studio environments, lossless audio coding schemes are considered more appropriate. Traditionally, the technical approaches used in perceptual and lossless audio coding have been separate worlds. In perceptual audio coding, the use of filter banks, such as the lapped orthogonal transform "Modified Discrete Cosine Transform" (MDCT), has been the approach of choice being used by many state of the art coding schemes. On the other hand, lossless audio coding schemes mostly employ predictive coding of waveforms to remove redundancy. Only few attempts have been made so far to use transform coding for the purpose of lossless audio coding. This work presents a new approach of applying the lifting scheme to lapped transforms used in perceptual audio coding. This allows for an invertible integer-to-integer approximation of the original transform, e.g. the IntMDCT as an integer approximation of the MDCT. The same technique can also be applied to low-delay filter banks. A generalized, multi-dimensional lifting approach and a noise-shaping technique are introduced, allowing to further optimize the accuracy of the approximation to the original transform. Based on these new integer transforms, this work presents new audio coding schemes and applications. The audio coding applications cover lossless audio coding, scalable lossless enhancement of a perceptual audio coder and fine-grain scalable perceptual and lossless audio coding. Finally an approach to data hiding with high data rates in uncompressed audio signals based on integer transforms is described
    corecore