
    BaNa: a noise resilient fundamental frequency detection algorithm for speech and music

    Fundamental frequency (F0) is one of the essential features in many acoustic-related applications. Although numerous F0 detection algorithms have been developed, detection accuracy in noisy environments still needs improvement. We present a hybrid noise-resilient F0 detection algorithm named BaNa that combines the approaches of harmonic ratios and Cepstrum analysis. A Viterbi algorithm with a cost function is used to identify the F0 value among several F0 candidates. Speech and music databases with eight different types of additive noise are used to evaluate the performance of the BaNa algorithm and several classic and state-of-the-art F0 detection algorithms. Results show that for almost all types of noise and signal-to-noise ratio (SNR) values investigated, BaNa achieves the lowest Gross Pitch Error (GPE) rate among all the algorithms. Moreover, for the 0 dB SNR scenarios, the BaNa algorithm is shown to achieve a 20% to 35% GPE rate for speech and a 12% to 39% GPE rate for music. We also describe implementation issues that must be addressed to run the BaNa algorithm as a real-time application on a smartphone platform.
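
    The Viterbi candidate-selection step described above can be sketched as follows in Python. The candidate frequencies and confidence scores are assumed to come from the harmonic-ratio and Cepstrum analyses (not shown), and the cost used here, low confidence plus a penalty on large frame-to-frame pitch jumps, is a simplified stand-in for BaNa's actual cost function rather than a reproduction of it.

        import numpy as np

        def select_f0_path(candidates, confidences, jump_weight=1.0):
            """Pick one F0 candidate per frame so that the path minimizes a total cost.

            candidates, confidences: lists of 1-D arrays, one pair per frame.
            """
            n_frames = len(candidates)
            costs = [1.0 - confidences[0]]          # local cost of the frame-0 candidates
            backptr = []
            for t in range(1, n_frames):
                prev_f0 = candidates[t - 1][None, :]                       # (1, n_prev)
                cur_f0 = candidates[t][:, None]                            # (n_cur, 1)
                trans = jump_weight * np.abs(np.log(cur_f0 / prev_f0))     # pitch-jump penalty
                total = trans + costs[-1][None, :] + (1.0 - confidences[t])[:, None]
                backptr.append(total.argmin(axis=1))    # best predecessor for each candidate
                costs.append(total.min(axis=1))
            # backtrack the cheapest path from the last frame
            path = [int(np.argmin(costs[-1]))]
            for bp in reversed(backptr):
                path.append(int(bp[path[-1]]))
            path.reverse()
            return np.array([candidates[t][i] for t, i in enumerate(path)])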

    Exploiting Contextual Information for Prosodic Event Detection Using Auto-Context

    Prosody and prosodic boundaries carry significant information regarding linguistics and paralinguistics and are important aspects of speech. In the field of prosodic event detection, many local acoustic features have been investigated; however, contextual information has not yet been thoroughly exploited. The most difficult aspect of this lies in learning the long-distance contextual dependencies effectively and efficiently. To address this problem, we introduce the use of an algorithm called auto-context. In this algorithm, a classifier is first trained on a set of local acoustic features, after which the generated probabilities are used together with the local features as contextual information to train new classifiers. By iteratively using the updated probabilities as contextual information, the algorithm can accurately model contextual dependencies and improve classification ability. The advantages of this method include its flexible structure and its ability to capture contextual relationships. When using the auto-context algorithm based on support vector machines, we can improve the detection accuracy by about 3% and the F-score by more than 7% on both two-way and four-way pitch accent detection in combination with the acoustic context. For boundary detection, the accuracy improvement is about 1% and the F-score improvement reaches 12%. The new algorithm outperforms conditional random fields, especially on boundary detection in terms of F-score. It also outperforms an n-gram language model on the task of pitch accent detection.
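
    A rough sketch of the auto-context iteration, here using scikit-learn SVMs. The local feature matrix, labels and context width are placeholders, and for brevity the in-sample probabilities are reused directly (in practice out-of-fold probabilities are preferable to limit overfitting).

        import numpy as np
        from sklearn.svm import SVC

        def train_auto_context(X_local, y, n_iters=3, context_width=2):
            """Iteratively append neighbouring class probabilities to the local features."""
            models, X = [], X_local
            for _ in range(n_iters):
                clf = SVC(probability=True).fit(X, y)
                models.append(clf)
                proba = clf.predict_proba(X)
                # contextual features: class probabilities of the neighbouring frames
                context = [np.roll(proba, k, axis=0)
                           for k in range(-context_width, context_width + 1)]
                X = np.hstack([X_local] + context)
            return models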

    Automatic Measurement of Pre-aspiration

    Pre-aspiration is defined as the period of glottal friction occurring in sequences of vocalic/consonantal sonorants and phonetically voiceless obstruents. We propose two machine learning methods for the automatic measurement of pre-aspiration duration: a feedforward neural network, which works at the frame level, and a structured prediction model, which relies on manually designed feature functions and works at the segment level. The input to both algorithms is a speech signal of arbitrary length containing a single obstruent, and the output is a pair of times which constitutes the pre-aspiration boundaries. We train both models on a set of manually annotated examples. Results suggest that the structured model is superior to the frame-based model, as it yields higher accuracy in predicting the boundaries and generalizes to new speakers and new languages. Finally, we demonstrate the applicability of our structured prediction algorithm by replicating a linguistic analysis of pre-aspiration in Aberystwyth English with high correlation.
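
    As a toy illustration of the segment-level search, the sketch below assumes per-frame scores (e.g. log-odds that a frame is pre-aspirated, from any acoustic front end) and exhaustively scores every admissible (onset, offset) pair; the paper's structured model uses richer hand-designed feature functions rather than this simple cumulative-sum score.

        import numpy as np

        def best_boundaries(scores, min_len=2, max_len=50):
            """Return the (onset, offset) frame pair with the highest segment score."""
            n = len(scores)
            cum = np.concatenate([[0.0], np.cumsum(scores)])    # prefix sums of frame scores
            best, best_pair = -np.inf, (0, 0)
            for onset in range(n):
                for offset in range(onset + min_len, min(n, onset + max_len) + 1):
                    inside = cum[offset] - cum[onset]           # evidence inside the segment
                    outside = cum[-1] - inside                  # evidence left outside it
                    score = inside - outside
                    if score > best:
                        best, best_pair = score, (onset, offset)
            return best_pair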

    Jitter and Shimmer measurements for speaker diarization

    Jitter and shimmer voice quality features have been successfully used to characterize speaker voice traits and detect voice pathologies. Jitter and shimmer measure variations in the fundamental frequency and amplitude of the speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features in the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is addressed in the framework of the Augmented Multi-party Interaction (AMI) corpus, a set of multi-party, spontaneous speech recordings. Both sets of features are independently modeled using mixtures of Gaussians and fused at the score likelihood level. The experiments carried out on the AMI corpus show that incorporating jitter and shimmer measurements into the baseline spectral features decreases the diarization error rate in most of the recordings.
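
    A minimal sketch of the (local) jitter and shimmer definitions used as voice quality features: the mean relative cycle-to-cycle variation of the glottal period and of the peak amplitude. Extraction of the per-cycle periods and amplitudes from the waveform is assumed and not shown.

        import numpy as np

        def local_jitter(periods):
            """Mean absolute difference of consecutive pitch periods, relative to the mean period."""
            periods = np.asarray(periods, dtype=float)
            return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

        def local_shimmer(amplitudes):
            """Mean absolute difference of consecutive peak amplitudes, relative to the mean amplitude."""
            amplitudes = np.asarray(amplitudes, dtype=float)
            return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)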

    Teaching Tonal Discrimination Based on Statistical Properties and Acoustical Characteristics of the Chinese Four Tones: With Regard to the Contrast between Tone-2 and Tone-3

    Based on the statistical properties of the occurrence frequency and transition probability of the Chinese four tones, and on the acoustical characteristics of their phonatory control and perceptual response, a Computer-Assisted Instruction (CAI) system for teaching tonal discrimination to beginners was designed, with special regard to the contrast between Tone-2 and Tone-3. In order to emphasize the difference from the Japanese word accent, intermediate tonal stimuli produced with synthetic speech and a visual display of the change in voice pitch were utilized in perception practice, and efficiency was improved by introducing the CAI algorithm. In phonation practice, the target voice pitch patterns were adapted to the talker's voice register, which is derived from measurement of the fundamental frequency in speaking, and indicated on the screen. The possibility of transferring the system to a self-learning program delivered over the internet was also discussed.

    Jointly Tracking and Separating Speech Sources Using Multiple Features and the generalized labeled multi-Bernoulli Framework

    This paper proposes a novel joint multi-speaker tracking-and-separation method based on the generalized labeled multi-Bernoulli (GLMB) multi-target tracking filter, using sound mixtures recorded by microphones. Standard multi-speaker tracking algorithms usually track only speaker locations, and ambiguity occurs when speakers are spatially close. The proposed multi-feature GLMB tracking filter treats the set of vectors of associated speaker features (location, pitch and sound) as the multi-target multi-feature observation and characterizes the evolving features with corresponding transition models and an overall likelihood function; it thus jointly tracks and separates each multi-feature speaker and addresses the spatial ambiguity problem. Numerical evaluation verifies that the proposed method can correctly track the locations of multiple speakers while separating their speech signals.
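
    The role of the overall likelihood can be illustrated with a hypothetical sketch: assuming the location, pitch and spectral observations are conditionally independent given a speaker's state, the joint likelihood factorizes into per-feature terms. The feature names, noise scales and Gaussian forms below are illustrative assumptions, and the full GLMB machinery (label hypotheses, births/deaths, data association) is not shown.

        import numpy as np
        from scipy.stats import norm

        def joint_log_likelihood(obs, state):
            """obs and state are dicts with 'location', 'pitch' and 'spectrum' entries."""
            # location term: independent Gaussians per coordinate (scale in metres, assumed)
            ll_loc = norm.logpdf(obs["location"], loc=state["location"], scale=0.3).sum()
            # pitch term: Gaussian around the speaker's tracked pitch (scale in Hz, assumed)
            ll_pitch = norm.logpdf(obs["pitch"], loc=state["pitch"], scale=10.0)
            # spectral term: negative squared distance to the speaker's spectral template
            ll_spec = -np.sum((obs["spectrum"] - state["spectrum"]) ** 2)
            return ll_loc + ll_pitch + ll_spec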

    Mechanical and durability performance of lightweight concrete brick with palm oil fuel ash (POFA)

    Lightweight building materials such as precast roof and wall panels have been widely used in the construction industry. This is because lightweight materials can benefit the economy and society in terms of manufacturing, transportation and handling costs. One of the most preferred lightweight materials is Expanded Polystyrene (EPS). EPS consists of 98% air and 2% polystyrene. Therefore, EPS is very low in density, which can contribute to reducing the mass of building materials. An abundance of studies has shown that EPS contributes significantly to the reduction of brick density. EPS has been used as an aggregate replacement in concrete. However, the presence of EPS in concrete reduces its strength performance. Due to this, researchers have extended their work to improving the strength of EPS concrete and bricks with the addition of pozzolanic materials such as fly ash, rice husk ash and silica fume [1-4]. The ability of these pozzolanic materials to enhance the strength of brick or concrete has been proven.

    Kalman tracking of linear predictor and harmonic noise models for noisy speech enhancement

    This paper presents a speech enhancement method based on the tracking and denoising of the formants of a linear prediction (LP) model of the spectral envelope of speech and the parameters of a harmonic noise model (HNM) of its excitation. The main advantages of tracking and denoising the prominent energy contours of speech are the efficient use of the spectral and temporal structures of successive speech frames and a mitigation of the processing artefact known as 'musical noise' or 'musical tones'. The formant-tracking linear prediction (FTLP) model estimation consists of three stages: (a) speech pre-cleaning based on spectral amplitude estimation, (b) formant tracking across successive speech frames using the Viterbi method, and (c) Kalman filtering of the formant trajectories across successive speech frames. The HNM parameters for the excitation signal comprise the voiced/unvoiced decision, the fundamental frequency, the harmonics' amplitudes and the variance of the noise component of the excitation. A frequency-domain pitch extraction method is proposed that searches for the peak signal-to-noise ratios (SNRs) at the harmonics. For each speech frame, several pitch candidates are calculated. An estimate of the pitch trajectory across successive frames is obtained using a Viterbi decoder. The trajectories of the noisy excitation harmonics across successive speech frames are modeled and denoised using Kalman filters. The proposed method is used to deconstruct noisy speech, denoise its model parameters and then reconstitute speech from its cleaned parts. Experimental evaluations show the performance gains of the formant tracking, pitch extraction and noise reduction stages.
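
    An illustrative scalar Kalman filter for denoising one formant (or pitch) trajectory across frames, assuming a simple random-walk state model; the paper's actual state, process and observation noise models are not reproduced here, and the variances below are placeholders.

        import numpy as np

        def kalman_smooth_track(track, process_var=50.0, obs_var=400.0):
            """track: per-frame noisy frequency estimates in Hz (random-walk Kalman filter)."""
            x, p = float(track[0]), obs_var      # initial state estimate and its variance
            filtered = [x]
            for z in track[1:]:
                p = p + process_var              # predict step (random-walk state model)
                k = p / (p + obs_var)            # Kalman gain
                x = x + k * (z - x)              # update with the noisy observation
                p = (1.0 - k) * p
                filtered.append(x)
            return np.array(filtered)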