66,066 research outputs found

    Adaptive DCTNet for Audio Signal Classification

    In this paper, we investigate DCTNet for audio signal classification. Its output feature is related to Cohen's class of time-frequency distributions. We introduce the adaptive DCTNet (A-DCTNet) for audio signal feature extraction. The A-DCTNet applies the idea of the constant-Q transform, with the center frequencies of its filterbanks geometrically spaced. The A-DCTNet adapts to different acoustic scales and captures low-frequency acoustic information, to which human auditory perception is sensitive, better than features such as Mel-frequency spectral coefficients (MFSC). We use the features extracted by the A-DCTNet as input to classifiers. Experimental results show that the A-DCTNet with recurrent neural networks (RNN) achieves a state-of-the-art bird song classification rate and improves artist identification accuracy on music data, demonstrating the A-DCTNet's applicability to signal processing problems.
    Comment: International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, United States, March 201
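
    As a rough illustration of the geometric spacing the abstract refers to, the sketch below (not the authors' implementation) generates constant-Q-style filterbank center frequencies with NumPy; the frequency range and bins-per-octave value are assumptions chosen for readability.

        import numpy as np

        def geometric_center_frequencies(f_min=32.7, f_max=8000.0, bins_per_octave=12):
            """Constant-Q-style center frequencies: each filter sits a fixed
            frequency ratio (2 ** (1 / bins_per_octave)) above the previous one."""
            n_octaves = np.log2(f_max / f_min)
            n_bins = int(np.floor(n_octaves * bins_per_octave)) + 1
            return f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)

        freqs = geometric_center_frequencies()
        print(freqs[:4])   # densely spaced low-frequency filters
        print(freqs[-4:])  # widely spaced high-frequency filters

    The geometric spacing puts proportionally more filters at low frequencies, which is how the A-DCTNet emphasizes the perceptually important low-frequency range.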

    Speech emotion recognition based on SVM and KNN classifications fusion

    Recognizing emotion in speech is one of the most active research topics in speech processing and in human-computer interaction programs. Despite a wide range of studies in this scope, there is still a large gap between the natural feelings of humans and the perception of the computer. In general, a speech emotion recognition system can be divided into three main sections: feature extraction, feature selection, and classification. In this paper, features of fundamental frequency (F0), energy (E), zero-crossing rate (ZCR), Fourier parameters (FP), and various combinations of them are extracted from the data vector. The principal component analysis (PCA) algorithm is then used to reduce the number of features. To evaluate system performance, the classification of each emotional state is performed using a fusion of support vector machine (SVM) and K-nearest neighbor (KNN) classifiers. For comparison, similar experiments were performed on German-language and English-language emotional speech, and significant results were obtained from these comparisons.
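
    A minimal sketch of the kind of pipeline described above, assuming scikit-learn: PCA reduces the feature dimensionality and an SVM/KNN fusion (here a soft-voting combination) performs the classification. The random feature matrix, number of components, and voting rule are illustrative stand-ins, not the paper's exact configuration.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.ensemble import VotingClassifier
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        # X: per-utterance features (e.g. F0, energy, ZCR, Fourier-parameter statistics),
        # y: emotion labels. Random data stands in for the real corpus here.
        rng = np.random.default_rng(0)
        X, y = rng.normal(size=(200, 40)), rng.integers(0, 4, size=200)

        fusion = make_pipeline(
            StandardScaler(),
            PCA(n_components=10),                      # reduce feature dimensionality
            VotingClassifier(
                estimators=[("svm", SVC(probability=True)),
                            ("knn", KNeighborsClassifier(n_neighbors=5))],
                voting="soft"),                        # fuse SVM and KNN decisions
        )
        fusion.fit(X, y)
        print(fusion.score(X, y))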

    Hybrid Mfcc And Lpc For Stuttering Assessment Using Neural Network

    Stuttering is characterized by disfluencies, which disrupt the flow of speech. The traditional way of assessing stuttering is time-consuming, and assessment results are often inconsistent between judges because human perception of stuttering events differs from one individual to another. A stuttering assessment system reduces this tedious manual work and improves the consistency of the assessment results. The objective of this project is to develop a classifier for prolongation and repetition disfluencies in speech using an artificial neural network (ANN). Three feature extraction methods were used in this project: Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), and a hybrid of MFCC and LPC. The flow of the project was: 1) stuttered speech data acquisition; 2) word segmentation and categorization; 3) feature extraction using the three methods; 4) classification using neural pattern recognition in MATLAB. The overall accuracies of the three feature extraction methods were 84.6% (LPC), 84.6% (MFCC), and 88.5% (hybrid MFCC and LPC). The classification accuracies using hybrid MFCC and LPC with respect to the target classes (prolongation, repetition, and fluent) were 66.7%, 92.3%, and 96.3%, respectively. A disfluency classifier has thus been developed with hybrid MFCC and LPC as the feature extraction and an ANN as the classifier; its overall performance is 88.5%.
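
    A hedged sketch of the hybrid feature idea, assuming librosa for MFCC and LPC extraction and scikit-learn's MLPClassifier in place of the MATLAB neural pattern recognition tool; the sampling rate, coefficient counts, and file lists are illustrative assumptions, not the project's settings.

        import librosa
        import numpy as np
        from sklearn.neural_network import MLPClassifier

        def hybrid_features(wav_path, n_mfcc=13, lpc_order=12):
            """Concatenate averaged MFCCs with LPC coefficients for one speech segment."""
            y, sr = librosa.load(wav_path, sr=16000)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
            lpc = librosa.lpc(y, order=lpc_order)[1:]   # drop the leading 1.0 coefficient
            return np.concatenate([mfcc, lpc])

        # X: one hybrid feature vector per segmented word; y: prolongation / repetition / fluent.
        # "word_files" is a placeholder for the segmented stuttered-speech recordings.
        # X = np.stack([hybrid_features(p) for p in word_files])
        # clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X, y)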

    A system for recognizing human emotions based on speech analysis and facial feature extraction: applications to Human-Robot Interaction

    With the advance of Artificial Intelligence, humanoid robots have started to interact with ordinary people based on a growing understanding of psychological processes. Accumulating evidence in Human Robot Interaction (HRI) suggests that research is focusing on establishing emotional communication between human and robot in order to create social perception, cognition, desired interaction and sensation. Furthermore, robots need to perceive human emotion and optimize their behavior to help and interact with a human being in various environments. The most natural way to recognize basic emotions is to extract sets of features from human speech, facial expression and body gesture. A system for recognizing emotions based on speech analysis and facial feature extraction can have interesting applications in Human-Robot Interaction. Thus, the Human-Robot Interaction ontology explains how knowledge from these fundamental sciences is applied in the contexts of physics (sound analysis), mathematics (face detection and perception), philosophical theory (behavior) and robotic science. In this project, we carry out a study to recognize basic emotions (sadness, surprise, happiness, anger, fear and disgust), and we propose a methodology and a software program for classifying emotions based on speech analysis and facial feature extraction. The speech analysis phase investigates the appropriateness of using acoustic (pitch value, pitch peak, pitch range, intensity and formant) and phonetic (speech rate) properties of emotive speech with the freeware program PRAAT, and consists of generating and analyzing a graph of the speech signals. The proposed architecture investigates the appropriateness of analyzing emotive speech with minimal use of signal processing algorithms. Thirty participants in the experiment repeated five sentences in English (with durations typically between 0.40 s and 2.5 s) in order to extract data on pitch (value, range and peak) and rising-falling intonation. Pitch alignments (peak, value and range) have been evaluated and the results have been compared with intensity and speech rate. The facial feature extraction phase uses a mathematical formulation (Bézier curves) and geometric analysis of the facial image, based on measurements of a set of Action Units (AUs), to classify the emotion. The proposed technique consists of three steps: (i) detecting the facial region within the image, (ii) extracting and classifying the facial features, and (iii) recognizing the emotion. The new data are then merged with reference data in order to recognize the basic emotion. Finally, we combine the two proposed algorithms (speech analysis and facial expression) to design a hybrid technique for emotion recognition. This technique has been implemented in a software program, which can be employed in Human-Robot Interaction. The efficiency of the methodology was evaluated by experimental tests on 30 individuals (15 female and 15 male, 20 to 48 years old) from different ethnic groups, namely: (i) ten European adults, (ii) ten Asian (Middle Eastern) adults, and (iii) ten American adults. The proposed technique made it possible to recognize the basic emotion in most of the cases.
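
    The speech-analysis step could be approximated as below, using parselmouth (a Python interface to Praat) as a stand-in for the PRAAT program named in the abstract; the file name and the exact feature definitions are assumptions for illustration, not the thesis's procedure.

        import numpy as np
        import parselmouth  # Python interface to Praat (praat-parselmouth)

        def prosodic_features(wav_path):
            """Pitch value/peak/range and mean intensity for one short utterance."""
            snd = parselmouth.Sound(wav_path)
            pitch = snd.to_pitch()
            f0 = pitch.selected_array["frequency"]
            f0 = f0[f0 > 0]                      # keep voiced frames only
            intensity = snd.to_intensity()
            return {
                "pitch_value": float(np.mean(f0)),
                "pitch_peak": float(np.max(f0)),
                "pitch_range": float(np.max(f0) - np.min(f0)),
                "intensity_mean": float(np.mean(intensity.values)),
            }

        # features = prosodic_features("sentence_01.wav")  # hypothetical recording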

    Gender detection in children’s speech utterances for human-robot interaction

    Human speech essentially includes paralinguistic information that is used in many real-time applications. Detecting gender in children's speech is considered a more challenging task than in adults' speech. In this study, a system for human-robot interaction (HRI) is proposed to detect gender in children's speech utterances without depending on the text. The robot's perception includes three phases. In the feature extraction phase, four formants are measured at each glottal pulse and a median is then calculated across these measurements; from these, three types of features are computed: formant average (AF), formant dispersion (DF), and formant position (PF). In the feature standardization phase, the measured feature dimensions are standardized using the z-score method. In the semantic understanding phase, the children's gender is detected using a logistic regression classifier. At the same time, the action of the robot is specified via a speech response using the text-to-speech (TTS) technique. Experiments are conducted on the Carnegie Mellon University (CMU) Kids dataset to measure the suggested system's performance. The overall accuracy of the suggested system is 98%. The results show a relatively clear improvement in accuracy, of up to 13%, compared to related works that utilized the CMU Kids dataset.
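
    A sketch of the feature pipeline under stated assumptions: AF, DF, and PF are computed from per-utterance median formants using commonly used definitions (mean of F1-F4, average spacing between adjacent formants, and mean of the z-scored formants), which may differ from the paper's exact formulas; random medians stand in for CMU Kids measurements.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        def formant_features(median_formants):
            """median_formants: (n_utterances, 4) per-utterance median F1-F4 in Hz.
            Assumed definitions: AF = mean of the four formants, DF = average spacing
            between adjacent formants, PF = mean of the z-scored formants."""
            af = median_formants.mean(axis=1)
            df = (median_formants[:, 3] - median_formants[:, 0]) / 3.0
            z = (median_formants - median_formants.mean(axis=0)) / median_formants.std(axis=0)
            pf = z.mean(axis=1)
            return np.column_stack([af, df, pf])

        # Random medians stand in for measurements from the CMU Kids recordings.
        rng = np.random.default_rng(1)
        formants = rng.uniform([300, 900, 2300, 3400], [900, 2300, 3400, 4500], size=(100, 4))
        gender = rng.integers(0, 2, size=100)

        model = make_pipeline(StandardScaler(), LogisticRegression())  # z-score, then classify
        model.fit(formant_features(formants), gender)
        print(model.score(formant_features(formants), gender))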

    Spoken affect classification : algorithms and experimental implementation : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science at Massey University, Palmerston North, New Zealand

    Machine-based emotional intelligence is a requirement for natural interaction between humans and computer interfaces and a basic level of accurate emotion perception is needed for computer systems to respond adequately to human emotion. Humans convey emotional information both intentionally and unintentionally via speech patterns. These vocal patterns are perceived and understood by listeners during conversation. This research aims to improve the automatic perception of vocal emotion in two ways. First, we compare two emotional speech data sources: natural, spontaneous emotional speech and acted or portrayed emotional speech. This comparison demonstrates the advantages and disadvantages of both acquisition methods and how these methods affect the end application of vocal emotion recognition. Second, we look at two classification methods which have gone unexplored in this field: stacked generalisation and unweighted vote. We show how these techniques can yield an improvement over traditional classification methods
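
    A brief sketch of the two classification schemes mentioned, using scikit-learn's StackingClassifier and a hard-voting VotingClassifier as generic stand-ins for stacked generalisation and the unweighted vote; the base learners and placeholder data are assumptions, not the thesis setup.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.svm import SVC

        # Placeholder features/labels stand in for a labelled emotional-speech corpus.
        rng = np.random.default_rng(0)
        X, y = rng.normal(size=(150, 20)), rng.integers(0, 3, size=150)

        base = [("svm", SVC()),
                ("rf", RandomForestClassifier()),
                ("lr", LogisticRegression(max_iter=1000))]

        stacked = StackingClassifier(estimators=base,
                                     final_estimator=LogisticRegression(max_iter=1000))
        voted = VotingClassifier(estimators=base, voting="hard")  # unweighted majority vote

        for name, clf in (("stacked generalisation", stacked), ("unweighted vote", voted)):
            clf.fit(X, y)
            print(name, clf.score(X, y))

    Stacked generalisation trains a meta-learner on the base classifiers' outputs, whereas the unweighted vote simply takes the majority decision; both combine the same pool of base models.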

    Multilanguage speech-based gender classification using time-frequency features and SVM classifier

    Speech is the most significant communication mode among human beings and a potential method for human-computer interaction (HCI). Being unparalleled in complexity, human speech is very hard to perceive automatically. One of the most crucial characteristics of speech is the speaker's gender, and pitch is often utilized for gender classification. However, this is not a reliable method, as in numerous cases the pitch of female and male speakers is nearly similar. In this paper, we propose a time-frequency method for the classification of gender based on the speech signal. Various techniques, including framing, the Fast Fourier Transform (FFT), auto-correlation, filtering, power calculations, speech frequency analysis, and feature extraction and formation, are applied to the speech samples. The classification is based on features derived from frequency- and time-domain processing using the Support Vector Machine (SVM) algorithm. The SVM is trained on two speech databases, Berlin Emo-DB and IITKGP-SEHSC, on which a total of 400 speech samples are evaluated. Accuracies of 83% and 81% have been observed for IITKGP-SEHSC and Berlin Emo-DB, respectively.
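
    A rough sketch of the described chain (framing, FFT, auto-correlation, power and spectral measures, SVM classification), with synthetic tones standing in for the speech databases; the frame size, the chosen per-frame features, and the pitch search window are assumptions for illustration.

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        def time_frequency_features(signal, sr=16000, frame_len=1024, hop=512):
            """Frame the signal, take the FFT power per frame, and add an
            autocorrelation-based pitch estimate; average over frames."""
            frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, hop)]
            feats = []
            for frame in frames:
                spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len))) ** 2
                centroid = np.sum(np.arange(len(spectrum)) * spectrum) / (np.sum(spectrum) + 1e-12)
                ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
                lag = np.argmax(ac[50:400]) + 50      # rough pitch-period search window
                feats.append([np.log(np.sum(spectrum) + 1e-12), centroid, sr / lag])
            return np.mean(feats, axis=0)             # power, spectral centroid, pitch estimate

        # Synthetic tones stand in for real speech samples here.
        rng = np.random.default_rng(3)
        t = np.arange(16000) / 16000.0
        signals = [np.sin(2 * np.pi * f0 * t) + 0.01 * rng.normal(size=t.size)
                   for f0 in (110, 130, 210, 230)]
        X = np.stack([time_frequency_features(s) for s in signals])
        y = np.array([0, 0, 1, 1])                    # 0 = lower-pitched, 1 = higher-pitched

        clf = make_pipeline(StandardScaler(), SVC()).fit(X, y)
        print(clf.predict(X))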