Probabilistic generative modeling of speech
Speech processing refers to a set of tasks that involve speech analysis and synthesis. Most speech processing algorithms model a subset of speech parameters of interest and blur the rest using signal processing techniques and feature extraction. However, evidence shows that many speech parameters can be more accurately estimated if they are modeled jointly; speech synthesis also benefits from joint modeling.
This thesis proposes a probabilistic generative model for speech called the Probabilistic Acoustic Tube (PAT). The highlights of the model are threefold. First, it is among the very first works to build a complete probabilistic model for speech. Second, it has a well-designed model for the phase spectrum of speech, which is hard to model and often neglected. Third, it models the AM-FM effects in speech, which are perceptually significant but often ignored in frame-based speech processing algorithms. Experiments show that the proposed model has good potential for a number of speech processing tasks.
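PAT's own equations are not given in this abstract, but the AM-FM effect it highlights is the standard sinusoidal picture in which each voiced component carries a slowly varying amplitude envelope a(t) and instantaneous frequency f(t), both of which frame-based methods treat as constant within a frame. A minimal sketch of one such AM-FM component (the modulation rates and depths below are illustrative assumptions, not parameters from the thesis):

```python
import numpy as np

fs = 16000                                   # sample rate (Hz)
t = np.arange(0, 0.5, 1 / fs)                # half a second of signal

a_t = 0.6 + 0.4 * np.cos(2 * np.pi * 3 * t)  # AM: 3 Hz amplitude envelope
f_t = 120 + 5 * np.sin(2 * np.pi * 5 * t)    # FM: pitch wobble around 120 Hz
phase = 2 * np.pi * np.cumsum(f_t) / fs      # integrate f(t) to get the phase

# One AM-FM component; a full voiced-speech model sums many such harmonics.
s = a_t * np.cos(phase)
```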
Detection and Classification of Acoustic Scenes and Events
For intelligent systems to make best use of the audio modality, it is important that they can recognize not just speech and music, which have been researched as specific tasks, but also general sounds in everyday environments. To stimulate research in this field we conducted a public research challenge: the IEEE Audio and Acoustic Signal Processing Technical Committee challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we report on the state of the art in automatically classifying audio scenes, and automatically detecting and classifying audio events. We survey prior work as well as the state of the art represented by the submissions to the challenge from various research groups. We also provide detail on the organization of the challenge, so that our experience as challenge hosts may be useful to those organizing challenges in similar domains. We created new audio datasets and baseline systems for the challenge; these, as well as some submitted systems, are publicly available under open licenses, to serve as benchmarks for further research in general-purpose machine listening.
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The rapid momentum of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from their voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice to produce a robust authentication system.
A novel approach towards speaker identification is developed using wavelet analysis and multiple neural networks, including the Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN) and Radial Basis Function Neural Network (RBF NN), combined with an AND voting scheme, sketched below. This approach is tested on the GRID and VidTIMIT corpora, and comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% compared to classical Mel-Frequency Cepstral Coefficients (MFCC), and reduced the recognition time by 40% compared to the Back Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA).
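The AND voting scheme referred to above reduces to a one-line rule: the ensemble reports an identity only when every network returns the same label. A minimal sketch (the reject-on-disagreement behavior is an assumption; the excerpt does not say how disagreements are resolved):

```python
from typing import Optional, Sequence

def and_vote(labels: Sequence[str]) -> Optional[str]:
    """AND voting: accept a speaker identity only if all classifiers agree."""
    return labels[0] if all(l == labels[0] for l in labels) else None

# e.g. outputs of PNN, GRNN and RBF NN for one test utterance
print(and_vote(["spk07", "spk07", "spk07"]))  # -> "spk07"
print(and_vote(["spk07", "spk12", "spk07"]))  # -> None (rejected)
```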
Another novel approach using vowel formant analysis is implemented using Linear Discriminant Analysis (LDA). Vowel-formant-based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage and time efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme does not require any training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, but the proposed score-based methodology scales almost linearly.
Finally, a novel audio-visual fusion-based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform feature-level fusion in terms of accuracy and error resilience. This result is in line with the distinct nature of the two modalities, which is lost when they are combined at the feature level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice due to its low computational time and high recognition accuracy.
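The LPC-based formant extraction the second approach relies on is a standard recipe: fit an all-pole model to a windowed, pre-emphasized vowel frame, then read the formant frequencies off the angles of the complex roots of the LPC polynomial. A minimal sketch assuming librosa for the LPC fit; the model-order rule of thumb and the bandwidth threshold are common defaults, not values from the thesis:

```python
import numpy as np
from scipy.signal import lfilter
import librosa

def vowel_formants(frame, fs, order=None):
    """Estimate formant frequencies (Hz) of one voiced frame via LPC roots."""
    if order is None:
        order = 2 + fs // 1000                 # rule of thumb: 2 + fs/1000 poles
    x = lfilter([1.0, -0.97], [1.0], frame * np.hamming(len(frame)))
    a = librosa.lpc(x, order=order)            # all-pole coefficients [1, a1, ..., ap]
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]          # keep one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    bw = -(fs / np.pi) * np.log(np.abs(roots)) # pole radius -> resonance bandwidth
    # Keep sharp resonances in the speech band; these are the formants.
    return sorted(f for f, b in zip(freqs, bw) if f > 90 and b < 400)
```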
Histogram of gradients of Time-Frequency Representations for Audio scene detection
This paper addresses the problem of audio scene classification and contributes to the state of the art by proposing a novel feature. We build this feature by computing histograms of gradients (HOG) of the time-frequency representation of an audio scene. Contrary to classical audio features like MFCC, we make the hypothesis that histograms of gradients are able to encode relevant information in a time-frequency representation: namely, the local direction of variation (in time and frequency) of the signal's spectral power. In addition, in order to gain more invariance and robustness, the histograms of gradients are locally pooled. We have evaluated the relevance of the novel feature by comparing its performance with state-of-the-art competitors on several datasets, including a novel one that we provide as part of our contribution. This dataset, which we make publicly available, involves … classes and contains about … minutes of audio scene recordings. We thus believe that it may become the next standard dataset for evaluating audio scene classification algorithms. Our comparison results clearly show that our HOG-based features outperform their competitors.
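The paper's exact pipeline (pooling layout, classifier) is not spelled out here, but the core idea of taking HOG descriptors over a time-frequency image can be sketched with off-the-shelf tools. A minimal sketch assuming librosa for the spectrogram and scikit-image for HOG; the cell and block sizes are illustrative, not the paper's settings:

```python
import librosa
from skimage.feature import hog

def hog_tf_feature(path):
    """HOG descriptor of a log-mel time-frequency image of an audio scene."""
    y, sr = librosa.load(path, sr=22050)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    img = librosa.power_to_db(S)                        # log spectral power
    img = (img - img.min()) / (img.max() - img.min())   # normalize to [0, 1]
    # Gradient orientations over (frequency, time) cells, locally pooled by
    # block normalization; captures local spectro-temporal variation.
    return hog(img, orientations=8, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
```

Because the descriptor length grows with the signal duration, a fixed-length feature would additionally pool the per-cell histograms over time, in the spirit of the local pooling the abstract describes.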
Water Pipeline Leakage Detection Based on Machine Learning and Wireless Sensor Networks
The detection of water pipeline leakage is important to ensure that water supply networks can operate safely and to conserve water resources. To address the lack of intelligence and the low efficiency of conventional leakage detection methods, this paper designs a leakage detection method based on machine learning and wireless sensor networks (WSNs). The system employs wireless sensors installed on pipelines to collect data and utilizes the 4G network to perform remote data transmission. A leakage-triggered networking method is proposed to reduce the wireless sensor network's energy consumption and effectively prolong the system life cycle. To enhance the precision and intelligence of leakage detection, we propose a leakage identification method that employs the intrinsic mode function, approximate entropy, and principal component analysis to construct a signal feature set, and that uses a support vector machine (SVM) as a classifier to perform leakage detection. Simulation analysis and experimental results indicate that the proposed leakage identification method can effectively identify water pipeline leakage and has lower energy consumption than the networking methods used in conventional wireless sensor networks.
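The abstract names the ingredients of the identification step (IMFs from empirical mode decomposition, approximate entropy per IMF, PCA for reduction, SVM for classification) without giving code. A minimal sketch of that pipeline, assuming the PyEMD package for the decomposition; the tolerances, IMF count and component count are illustrative choices, not the paper's:

```python
import numpy as np
from PyEMD import EMD                      # pip install EMD-signal (assumed)
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def approx_entropy(x, m=2, r_frac=0.2):
    """Approximate entropy ApEn(m, r) of a 1-D signal (O(N^2), short frames)."""
    r = r_frac * np.std(x)
    def phi(m):
        emb = np.array([x[i:i + m] for i in range(len(x) - m + 1)])
        dist = np.max(np.abs(emb[:, None] - emb[None, :]), axis=2)
        return np.mean(np.log(np.mean(dist <= r, axis=1)))
    return phi(m) - phi(m + 1)

def leak_features(signal, n_imfs=5):
    """Approximate entropy of the first few IMFs of one acoustic trace."""
    imfs = EMD()(signal)[:n_imfs]
    return np.array([approx_entropy(imf) for imf in imfs])

# X_raw: one acoustic/vibration trace per row; y: leak / no-leak labels.
# feats = np.vstack([leak_features(s) for s in X_raw])
# clf = SVC(kernel="rbf").fit(PCA(n_components=3).fit_transform(feats), y)
```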
Optimizing Stimulation Strategies in Cochlear Implants for Music Listening
Most cochlear implant (CI) strategies are optimized for speech characteristics, while music enjoyment remains significantly below normal hearing performance. In this thesis, electrical stimulation strategies in CIs are analyzed for music input. A simulation chain consisting of two parallel paths, simulating normal hearing conditions and electrical hearing respectively, is utilized. One thesis objective is to configure and develop the sound processor of the CI chain to analyze different compression and channel selection strategies to optimally capture the characteristics of music signals. A new set of knee points (KPs) for the compression function is investigated together with clustering of frequency bands. The N-of-M electrode selection strategy models the effect of a psychoacoustic masking threshold. In order to evaluate the performance of the CI model, the normal hearing model is considered a true reference. Similarity between the resulting neurograms of the respective models is measured using the image analysis method Neurogram Similarity Index Measure (NSIM). The validation and resolution of NSIM is another objective of the thesis. Results indicate that NSIM is sensitive to no-activity regions in the neurograms and has difficulty capturing small CI changes, i.e. compression settings. Further verification of the model setup is suggested, together with investigating an alternative optimal electric hearing reference and/or objective similarity measure.
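The N-of-M strategy the thesis builds on is simple to state: in each analysis frame, only the N of M band envelopes with the largest magnitude are stimulated and the rest are dropped. A minimal sketch of that selection step (the envelope values and the choice N = 8, M = 22 are illustrative; the thesis's psychoacoustic variant would replace the plain magnitude ranking with a masking-threshold criterion):

```python
import numpy as np

def n_of_m_select(envelopes, n):
    """Keep the n largest band envelopes in one frame; zero the rest."""
    out = np.zeros_like(envelopes)
    picked = np.argsort(envelopes)[-n:]      # indices of the n strongest channels
    out[picked] = envelopes[picked]
    return out

# M = 22 filterbank channels, stimulate only N = 8 per frame.
frame = np.abs(np.random.randn(22))          # stand-in band envelope magnitudes
stimulated = n_of_m_select(frame, n=8)
```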