36 research outputs found
Äänisisällön automaattisen luokittelun menetelmiä (Methods for the automatic classification of audio content)
This study presents an overview of digital signal processing and pattern recognition methods that are frequently applicable to the automatic recognition, classification and description of audio content. Strategies for combining these methods are also discussed, and published practical applications from several areas are cited to illustrate the use of the basic methods and of the combined recognition strategies. A brief overview of human auditory perception is given as well, with emphasis on the aspects that are important for audio recognition.
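As a minimal sketch of the generic pipeline such a survey covers, the following combines short-term spectral features with a simple statistical classifier. All function names, parameter values and the crude band-averaging "filter bank" are illustrative assumptions, not the thesis's own implementation:

```python
# Sketch: short-term spectral features + nearest-centroid classification.
import numpy as np

def frame_features(signal, sr, frame_len=0.025, hop=0.010, n_bands=20):
    """Log-energy features from band-averaged magnitude spectra of frames."""
    n, h = int(sr * frame_len), int(sr * hop)
    window = np.hamming(n)
    feats = []
    for start in range(0, len(signal) - n, h):
        spec = np.abs(np.fft.rfft(signal[start:start + n] * window))
        bands = np.array_split(spec, n_bands)           # crude filter bank
        feats.append(np.log([b.mean() + 1e-10 for b in bands]))
    return np.array(feats)

def classify(feats, class_means):
    """Assign a clip to the class whose mean feature vector is nearest."""
    v = feats.mean(axis=0)
    return min(class_means, key=lambda c: np.linalg.norm(v - class_means[c]))
```

In practice the feature extractor and classifier would each be replaced by one of the more refined methods the study reviews (e.g. mel-scale filter banks, Gaussian mixture or hidden Markov models).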
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
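A minimal sketch of one representative front-end technique from this family, mask-based speech enhancement: a small network maps noisy log-spectra to a time-frequency mask that suppresses additive noise before recognition. The architecture, layer sizes and training target below are illustrative assumptions, not any specific system from the overview:

```python
# Sketch: supervised mask estimation for a speech enhancement front-end.
import torch
import torch.nn as nn

n_bins = 257  # e.g. a 512-point STFT yields 257 frequency bins

mask_net = nn.Sequential(
    nn.Linear(n_bins, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, n_bins), nn.Sigmoid(),   # mask values in [0, 1]
)

def train_step(noisy_logspec, ideal_mask, opt, loss_fn=nn.MSELoss()):
    """One supervised step: predict a mask and compare it against the ideal
    ratio mask computed from parallel clean/noisy training data."""
    opt.zero_grad()
    loss = loss_fn(mask_net(noisy_logspec), ideal_mask)
    loss.backward()
    opt.step()
    return loss.item()

# At run time the front-end multiplies the noisy magnitude spectrogram by the
# predicted mask and resynthesizes enhanced speech for the recognizer:
#   enhanced_mag = noisy_mag * mask_net(noisy_logspec)
```

Joint front-end/back-end training, also discussed in the overview, would instead backpropagate the recognizer's loss through such a mask network.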
Comparing spectrum estimators in speaker verification under additive noise degradation
This work was presented as a paper at the IEEE International Conference on Acoustics, Speech and Signal Processing, held in Kyoto, Japan, on 25-30 March 2012. Different short-term spectrum estimators for speaker verification under additive noise are considered. Conventionally, mel-frequency cepstral coefficients (MFCCs) are computed from discrete Fourier transform (DFT) spectra of windowed speech frames. Recently, linear prediction (LP) and its temporally weighted variants have been substituted for the DFT as the spectrum analysis method in speech and speaker recognition. In this paper, 12 different short-term spectrum estimation methods are compared for speaker verification under additive noise contamination. Experimental results on NIST 2002 SRE show that the spectrum estimation method has a large effect on recognition performance: in terms of equal error rate (EER), the stabilized weighted LP (SWLP) and minimum variance distortionless response (MVDR) methods yield approximately 7 % and 8 % relative improvements over the standard DFT method at the -10 dB SNR level of factory and babble noise, respectively.
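The following sketch shows, in plain NumPy, the two ends of the comparison: the conventional DFT magnitude spectrum versus a standard LP envelope for the same windowed frame (MFCCs would then be computed from either spectrum via a mel filter bank and a DCT). The LP order, FFT size and windowing are illustrative assumptions; the paper's SWLP and MVDR estimators plug into the same slot but are not reproduced here:

```python
# Sketch: DFT vs. autocorrelation-method LP spectrum estimation of a frame.
import numpy as np

def levinson(r, order):
    """Levinson-Durbin recursion for autocorrelation-method LP coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * a[i - 1::-1]      # a[i - 1::-1] has length i
        err *= 1.0 - k * k
    return a, err

def spectra(frame, order=12, nfft=512):
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a, err = levinson(r, order)
    dft_spec = np.abs(np.fft.rfft(x, nfft))                 # conventional DFT
    lp_spec = np.sqrt(err) / np.abs(np.fft.rfft(a, nfft))   # all-pole envelope
    return dft_spec, lp_spec
```

The smooth all-pole envelope is less sensitive to narrow spectral notches and peaks introduced by additive noise, which is one intuition behind the relative improvements the paper reports.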
Using group delay functions from all-pole models for speaker recognition
This work was presented as a paper at the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), held in Lyon, France, on 25-29 August 2013. Popular features for speech processing, such as mel-frequency cepstral coefficients (MFCCs), are derived from the short-term magnitude spectrum, whereas the phase spectrum remains unused. While the common argument for using only the magnitude spectrum is that the human ear is phase-deaf, phase-based features have remained less explored due to the additional signal processing difficulties they introduce. A useful representation of the phase is the group delay function, but its robust computation remains difficult. This paper advocates the use of group delay functions derived from parametric all-pole models instead of their direct computation from the discrete Fourier transform. Using a subset of the vocal effort data in the NIST 2010 speaker recognition evaluation (SRE) corpus, we show that group delay features derived via parametric all-pole models improve recognition accuracy, especially under high vocal effort. Additionally, the group delay features provide comparable or improved accuracy over conventional magnitude-based MFCC features. Thus, group delay functions derived from all-pole models provide an effective way to utilize information from the phase spectrum of speech signals.
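A minimal sketch of the central idea, under stated assumptions: instead of differentiating the unwrapped DFT phase, evaluate the group delay of the parametric all-pole filter H(z) = 1/A(z) fitted to the frame (the `levinson` helper sketched above is one way to obtain `a`). The paper's feature post-processing is not reproduced here:

```python
# Sketch: group delay of an all-pole model H(z) = 1/A(z).
import numpy as np
from scipy.signal import group_delay

def allpole_group_delay(a, nfft=512):
    """Group delay tau(w) = -d arg H(e^jw) / dw of the all-pole model 1/A(z).

    `a` holds the LP denominator coefficients of one frame; [1.0] is the
    trivial numerator of the all-pole transfer function.
    """
    w, tau = group_delay((np.array([1.0]), np.asarray(a)), w=nfft)
    return w, tau  # tau in samples, at nfft frequencies in [0, pi)
```

Because the all-pole model has no spectral zeros near the unit circle, its group delay is smooth and well behaved, avoiding the spikes that make direct DFT-based group delay computation unreliable.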