125 research outputs found

    Predicting the Level of Emotion by Means of Indonesian Speech Signal

    Get PDF
    Understanding human emotion is important for building better systems and facilitating smooth interpersonal relations, all the more so because human thinking and behavior are strongly influenced by emotion. In line with these needs, an expert system capable of predicting emotional state would be useful in many practical applications. Speech-based systems of this kind have been widely developed for various languages. This study evaluates the extent to which Mel-Frequency Cepstral Coefficient (MFCC) features, together with the Teager energy feature, derived from Indonesian speech signals relate to four emotion types: happy, sad, angry, and fear. The study uses empirical data of nearly 300 speech signals collected from four amateur actors and actresses speaking 15 prescribed Indonesian sentences. Using a support vector machine classifier, the findings suggest that the Teager energy and the first MFCC coefficient are crucial features, and that prediction accuracy can reach 86%. Accuracy rises quickly with the first few MFCC features; the fourth and subsequent coefficients have a negligible effect on it.
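
    As a rough illustration of the kind of pipeline described above, the sketch below extracts a few time-averaged MFCCs plus a Teager energy value per utterance and feeds them to a support vector machine. It assumes Python with librosa and scikit-learn; the file paths and labels are hypothetical placeholders, not the study's data.

        import numpy as np
        import librosa
        from sklearn.svm import SVC

        def teager_energy(x):
            # Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]
            psi = x[1:-1] ** 2 - x[:-2] * x[2:]
            return float(np.mean(psi))

        def utterance_features(path, n_mfcc=4):
            y, sr = librosa.load(path, sr=16000)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
            # One vector per utterance: time-averaged MFCCs plus Teager energy.
            return np.concatenate([mfcc.mean(axis=1), [teager_energy(y)]])

        # Placeholder file list and emotion labels (four classes as in the study).
        speech_files = ["utt_001.wav", "utt_002.wav"]
        labels = ["happy", "sad"]

        X = np.array([utterance_features(f) for f in speech_files])
        clf = SVC(kernel="rbf").fit(X, labels)
        print(clf.predict(X))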

    Compact and Robust MFCC-based Space-Saving Audio Fingerprint Extraction for Efficient Music Identification on FM Broadcast Monitoring

    Get PDF
    The Myanmar music industry urgently needs an efficient broadcast monitoring system to resolve copyright infringement and illegal benefit-sharing between artists and broadcasting stations. In this paper, a broadcast monitoring system is proposed for Myanmar FM radio stations, built on space-saving audio fingerprint extraction based on the Mel Frequency Cepstral Coefficient (MFCC). This study focuses on reducing the memory required for fingerprint storage while preserving the robustness of the audio fingerprints to common distortions such as compression and noise addition. In this system, a three-second audio clip is represented by a 2,712-bit fingerprint block, a significant memory saving compared to Philips Robust Hashing (PRH), one of the dominant audio fingerprinting methods, where a three-second clip is represented by an 8,192-bit fingerprint block. The proposed system is easy to implement and achieves correct and fast music identification even on noisy and distorted broadcast audio streams. To evaluate the system, we deployed an audio fingerprint database of 7,094 songs and broadcast audio streams from four local FM channels in Myanmar. The experimental results showed that the system achieved reliable performance.
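
    The abstract does not spell out how the 2,712 bits are derived, so the following is only a generic sketch of an MFCC-based binary fingerprint, binarizing the sign of differences across frames and coefficients in the spirit of Philips-style hashing; it assumes librosa, and the parameters are illustrative rather than the paper's.

        import numpy as np
        import librosa

        def mfcc_fingerprint(y, sr, n_mfcc=13):
            m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
            # Sign of the time-then-coefficient difference gives bits that are
            # fairly robust to level and channel changes.
            dt = np.diff(m, axis=1)              # change over time
            bits = (dt[1:, :] - dt[:-1, :]) > 0  # change across coefficients
            return np.packbits(bits.astype(np.uint8).ravel())

        # y, sr = librosa.load("clip_3s.wav", sr=8000)  # hypothetical 3 s clip
        # fp_query = mfcc_fingerprint(y, sr)
        # Matching against a database entry is then a Hamming distance:
        # dist = np.unpackbits(fp_query ^ fp_db).sum()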

    A study on different linear and non-linear filtering techniques of speech and speech recognition

    Get PDF
    In any signal, noise is an undesired quantity; however, nearly every signal gets mixed with noise at different stages of processing and application, distorting the information the signal carries and degrading its usefulness. Speech signals are especially affected by acoustic noises such as babble noise, car noise, and street noise. To remove these noises, researchers have developed various techniques, collectively known as filtering. Not every filtering technique suits every application, so depending on the application, some techniques work better than others. Broadly, filtering techniques fall into two categories: linear and non-linear. This paper presents a study of filtering techniques from both approaches, including adaptive filters based on algorithms such as LMS, NLMS, and RLS; the Kalman filter; ARMA and NARMA time-series models for filtering; and neural networks combined with fuzzy logic, i.e., ANFIS. The paper also covers the application of various features, i.e., MFCC, LPC, PLP, and gamma, for filtering and recognition.
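
    Of the algorithms surveyed, LMS is the simplest to sketch. The snippet below is a textbook LMS noise canceller, assuming a separate noise-reference channel; the step size and filter order are illustrative choices, not values from the paper.

        import numpy as np

        def lms_cancel(d, x, mu=0.01, order=16):
            # d: noisy speech (desired signal); x: noise reference channel.
            # Returns the error signal e, i.e. the denoised speech estimate.
            w = np.zeros(order)
            e = np.zeros(len(d))
            for n in range(order, len(d)):
                u = x[n - order:n][::-1]  # most recent reference samples
                y = w @ u                 # adaptive estimate of the noise
                e[n] = d[n] - y           # subtract the estimated noise
                w += 2 * mu * e[n] * u    # LMS weight update
            return e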

    Detection of Seismic Infrasonic Elephant Rumbles Using Spectrogram-Based Machine Learning

    Full text link
    This paper presents an effective method for identifying elephant rumbles in infrasonic seismic signals. The design and implementation of electronic circuitry to amplify, filter, and digitize the seismic signals captured through geophones are presented. A dataset of seismic infrasonic elephant rumbles was collected at a free-ranging area of an elephant orphanage in Sri Lanka. The seismic rumbles were converted to spectrograms, and several methods were used for spectral feature extraction. Using LazyPredict, the features extracted by each method were fed into the corresponding machine-learning algorithms to train them for automatic seismic rumble identification. Mel frequency cepstral coefficients (MFCCs) together with the Ridge classifier produced the best performance in identifying seismic elephant rumbles. A novel method for denoising the spectrum, which leads to enhanced accuracy in identifying seismic rumbles, is also presented.
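
    A minimal sketch of the best-performing combination reported (MFCC features into a Ridge classifier) is shown below, assuming librosa and scikit-learn. The geophone sample rate and the small FFT/mel sizes are assumptions for the infrasonic setting, not values from the paper; LazyPredict merely benchmarks many scikit-learn models, so the winning Ridge classifier is used directly here.

        import numpy as np
        import librosa
        from sklearn.linear_model import RidgeClassifier

        def rumble_features(y, sr=100, n_mfcc=13):
            # Infrasonic seismic data implies a low sample rate; 100 Hz and the
            # small frame sizes below are illustrative assumptions.
            m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                     n_fft=64, hop_length=32, n_mels=20)
            return m.mean(axis=1)  # one time-averaged vector per segment

        # Placeholder training data: 'segments' and 'is_rumble' are hypothetical.
        # X = np.array([rumble_features(seg) for seg in segments])
        # clf = RidgeClassifier().fit(X, is_rumble)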

    Continuous Authentication for Voice Assistants

    Full text link
    Voice has become an increasingly popular User Interaction (UI) channel, mainly contributing to the ongoing trend of wearables, smart vehicles, and home automation systems. Voice assistants such as Siri, Google Now, and Cortana have become everyday fixtures, especially in scenarios where touch interfaces are inconvenient or even dangerous to use, such as while driving or exercising. Nevertheless, the open nature of the voice channel makes voice assistants difficult to secure and exposes them to various attacks, as demonstrated by security researchers. In this paper, we present VAuth, the first system that provides continuous and usable authentication for voice assistants. We design VAuth to fit in various widely adopted wearable devices, such as eyeglasses, earphones/buds, and necklaces, where it collects the body-surface vibrations of the user and matches them with the speech signal received by the voice assistant's microphone. VAuth guarantees that the voice assistant executes only the commands that originate from the voice of the owner. We evaluated VAuth with 18 users and 30 voice commands and found that it achieves almost perfect matching accuracy with a false positive rate below 0.1%, regardless of VAuth's position on the body and the user's language, accent, or mobility. VAuth successfully thwarts practical attacks such as replay attacks, mangled voice attacks, and impersonation attacks. It also has low energy and latency overheads and is compatible with most existing voice assistants.
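
    The core decision VAuth makes, whether the microphone speech matches the wearer's body-surface vibration, can be caricatured as a correlation test. The sketch below is a generic illustration of that idea rather than VAuth's actual matching algorithm, and the threshold is an arbitrary assumption.

        import numpy as np

        def command_from_owner(vibration, speech, threshold=0.8):
            # Normalize both channels, then check the peak cross-correlation.
            v = (vibration - vibration.mean()) / (vibration.std() + 1e-12)
            s = (speech - speech.mean()) / (speech.std() + 1e-12)
            corr = np.correlate(v, s, mode="full") / min(len(v), len(s))
            return corr.max() >= threshold

        # A command would be forwarded to the assistant only when this check
        # passes, so injected or replayed audio without matching body
        # vibration is rejected.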

    Improving Automatic Speech Recognition on Endangered Languages

    Get PDF
    As the world moves towards a more globalized scenario, it has brought along with it the extinction of several languages. It has been estimated that over the next century, more than half of the world's languages will be extinct, and an alarming 43% of the world's languages are already at some level of endangerment or extinction. The survival of many of these languages depends on the pressure imposed on their dwindling numbers of speakers. There is often a strong correlation between a language's endangerment and the number and quality of its recordings and documentation. But why do we care about preserving these less prevalent languages? The behavior of cultures is often expressed in speech via one's native language. Memories, ideas, major events, practices, cultures, and lessons learnt, both of individuals and of communities, are all communicated to the outside world via language, so language preservation is crucial to understanding the behavior of these communities. Deep learning models have been shown to dramatically improve speech recognition accuracy but require large amounts of labelled data. Unfortunately, resource-constrained languages typically fall short of the data necessary for successful training. To help alleviate the problem, data augmentation techniques fabricate many new samples from each original sample. The aim of this master's thesis is to examine the effect of different augmentation techniques on speech recognition for resource-constrained languages. The augmentation methods experimented with are noise augmentation, pitch augmentation, speed augmentation, and voice transformation augmentation using Generative Adversarial Networks (GANs). The thesis also examines the effectiveness of GANs in voice transformation and their limitations. The information gained from this study will further guide data collection, specifically by clarifying the conditions under which data should be collected so that GANs can perform voice transformation effectively. Training on the original data with the Deep Speech model resulted in a word error rate (WER) of 95.03%. Training the Seneca data on a Deep Speech model pretrained on an English dataset reduced the WER to 70.43%. Adding 15 augmented samples per sample reduced the WER to 68.33%, and adding 25 augmented samples per sample reduced it to 48.23%. Experiments to find the best single augmentation method among noise addition, pitch variation, speed variation, and GAN augmentation revealed that GAN augmentation performed best, reducing the WER to 60.03%.
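
    Three of the four augmentation types examined (noise, pitch, and speed) have direct librosa equivalents, sketched below with illustrative noise level, pitch shift, and stretch rate; the GAN-based voice transformation is omitted because it requires a trained model.

        import numpy as np
        import librosa

        def augment(y, sr):
            noisy   = y + 0.005 * np.random.randn(len(y))               # noise
            pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # pitch
            faster  = librosa.effects.time_stretch(y, rate=1.1)         # speed
            return [noisy, pitched, faster]

        # Applying such transforms repeatedly with varied parameters is how
        # multiple augmented samples per original utterance are produced.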