679 research outputs found

    Implementation of MFCC and SVM for Voice Command Recognition as Control on Mobile Robot

    The mobile robot is a system that can move according to its function and task; an example is an industrial robot that picks up objects under a remote control system. Mobile robots are generally operated through manual remote control, and many researchers have developed such control methods, including image- and sound-based robot control. In this study, a mobile robot operating in an unobstructed room was controlled using voice commands. The methods used are Mel-Frequency Cepstral Coefficients (MFCC) and a Support Vector Machine (SVM): MFCC characterizes the patterns of voice commands such as “forward”, “backward”, “left”, “right”, and “stop”, and the SVM recognizes each command from its MFCC values. The experiment was carried out 50 times with a success rate of 96%. Overall, the robot can be controlled by voice commands with good movement.
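
    As a rough illustration of the kind of pipeline described above, the sketch below extracts MFCCs with librosa and trains an RBF-kernel SVM with scikit-learn; the library choice, sampling rate, 13-coefficient setting, and frame-averaging step are assumptions for illustration, not the authors' exact configuration.

        # Minimal MFCC + SVM command-recognition sketch (illustrative parameters only).
        import librosa
        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.svm import SVC

        def mfcc_features(path, n_mfcc=13):
            # Load the recording and average MFCC frames into one fixed-length vector.
            y, sr = librosa.load(path, sr=16000)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

        def train_command_recognizer(wav_paths, labels):
            # labels are the command words: "forward", "backward", "left", "right", "stop".
            X = np.stack([mfcc_features(p) for p in wav_paths])
            X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, stratify=labels)
            clf = SVC(kernel="rbf", C=10.0)   # RBF-kernel SVM over the MFCC vectors
            clf.fit(X_tr, y_tr)
            print("held-out accuracy:", clf.score(X_te, y_te))
            return clf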

    Characterization and Decoding of Speech Representations From the Electrocorticogram

    Millions of people worldwide suffer from various neuromuscular disorders such as amyotrophic lateral sclerosis (ALS), brainstem stroke, muscular dystrophy, cerebral palsy, and others, which adversely affect the neural control of muscles or the muscles themselves. The patients who are the most severely affected lose all voluntary muscle control and are completely locked-in, i.e., they are unable to communicate with the outside world in any manner. In the direction of developing neuro-rehabilitation techniques for these patients, several studies have used brain signals related to mental imagery and attention in order to control an external device, a technology known as a brain-computer interface (BCI). Some recent studies have also attempted to decode various aspects of spoken language, imagined language, or perceived speech directly from brain signals. In order to extend research in this direction, this dissertation aims to characterize and decode various speech representations popularly used in speech recognition systems directly from brain activity, specifically the electrocorticogram (ECoG). The speech representations studied in this dissertation range from simple features such as the speech power and the fundamental frequency (pitch), to complex representations such as the linear prediction coding and mel frequency cepstral coefficients. These decoded speech representations may eventually be used to enhance existing speech recognition systems or to reconstruct intended or imagined speech directly from brain activity. This research will ultimately pave the way for an ECoG-based neural speech prosthesis, which will offer a more natural communication channel for individuals who have lost the ability to speak normally

    The soundscape of swarming: Proof of concept for a noninvasive acoustic species identification of swarming Myotis bats

    Bats emit echolocation calls to orientate in their predominantly dark environment. Recording of species‐specific calls can facilitate species identification, especially when mist netting is not feasible. However, some taxa, such as Myotis bats can be hard to distinguish acoustically. In crowded situations where calls of many individuals overlap, the subtle differences between species are additionally attenuated. Here, we sought to noninvasively study the phenology of Myotis bats during autumn swarming at a prominent hibernaculum. To do so, we recorded sequences of overlapping echolocation calls (N = 564) during nights of high swarming activity and extracted spectral parameters (peak frequency, start frequency, spectral centroid) and linear frequency cepstral coefficients (LFCCs), which additionally encompass the timbre (vocal “color”) of calls. We used this parameter combination in a stepwise discriminant function analysis (DFA) to classify the call sequences to species level. A set of previously identified call sequences of single flying Myotis daubentonii and Myotis nattereri, the most common species at our study site, functioned as a training set for the DFA. 90.2% of the call sequences could be assigned to either M. daubentonii or M. nattereri, indicating the predominantly swarming species at the time of recording. We verified our results by correctly classifying the second set of previously identified call sequences with an accuracy of 100%. In addition, our acoustic species classification corresponds well to the existing knowledge on swarming phenology at the hibernaculum. Moreover, we successfully classified call sequences from a different hibernaculum to species level and verified our classification results by capturing swarming bats while we recorded them. Our findings provide a proof of concept for a new noninvasive acoustic monitoring technique that analyses “swarming soundscapes” by combining classical acoustic parameters and LFCCs, instead of analyzing single calls. Our approach for species identification is especially beneficial in situations with multiple calling individuals, such as autumn swarming
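
    As a loose sketch of the classification step under stated assumptions, the code below summarizes a call sequence with a few spectral parameters plus linear-frequency cepstral coefficients (a DCT of the log spectrum over a linear frequency axis) and uses scikit-learn's LinearDiscriminantAnalysis as a stand-in for the stepwise DFA; the start-frequency parameter and the stepwise variable selection are omitted for brevity.

        # Spectral parameters + LFCC-style coefficients, classified with a linear discriminant.
        import numpy as np
        from scipy.fft import dct
        from scipy.signal import spectrogram
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        def sequence_features(signal, fs, n_lfcc=12):
            f, _, Sxx = spectrogram(signal, fs=fs, nperseg=512)
            power = Sxx.mean(axis=1) + 1e-12                   # average spectrum of the whole call sequence
            peak_freq = f[np.argmax(power)]                    # peak frequency
            centroid = np.sum(f * power) / np.sum(power)       # spectral centroid
            lfcc = dct(np.log(power), norm="ortho")[:n_lfcc]   # cepstrum of the linear-frequency spectrum
            return np.concatenate(([peak_freq, centroid], lfcc))

        def classify_sequences(train_X, train_species, test_X):
            lda = LinearDiscriminantAnalysis()                 # stand-in for the stepwise DFA
            lda.fit(train_X, train_species)                    # training set: identified single-species sequences
            return lda.predict(test_X)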

    Investigating Voice as a Biomarker for Leucine-Rich Repeat Kinase 2-Associated Parkinson's Disease

    We investigate the potential association between leucine-rich repeat kinase 2 (LRRK2) mutations and voice. Sustained phonations ('aaah' sounds) were recorded from 7 individuals with LRRK2-associated Parkinson's disease (PD), 17 participants with idiopathic PD (iPD), 20 non-manifesting LRRK2-mutation carriers, 25 related non-carriers, and 26 controls. In distinguishing LRRK2-associated PD and iPD, the mean sensitivity was 95.4% (SD 17.8%) and mean specificity was 89.6% (SD 26.5%). Voice features for non-manifesting carriers, related non-carriers, and controls were much less discriminatory. Vocal deficits in LRRK2-associated PD may be different than those in iPD. These preliminary results warrant longitudinal analyses and replication in larger cohorts

    Bi-LSTM neural network for EEG-based error detection in musicians’ performance

    Electroencephalography (EEG) is a tool that allows us to analyze brain activity with high temporal resolution. These measures, combined with deep learning and digital signal processing, are widely used in neurological disorder detection and in emotion and mental activity recognition. In this paper, a new method for mental activity recognition is presented: instantaneous frequency, spectral entropy and Mel-frequency cepstral coefficients (MFCC) are used to classify EEG signals with bidirectional LSTM neural networks. It is shown that this method can be used for intra-subject or inter-subject analysis and has been applied to error detection in musicians’ performance, reaching compelling accuracy. This work has been funded by Junta de Andalucía in the framework of Proyectos I+D+I en el marco del Programa Operativo FEDER Andalucia 2014–2020 under Project No.: UMA18-FEDERJA-023, Proyectos de I+D+i en el ámbito del Plan Andaluz de Investigación, Desarrollo e Innovación (PAIDI 2020) under Project No.: PY20_00237, and Universidad de Málaga, Campus de Excelencia Internacional Andalucia Tech. Funding for open access charge: Universidad de Málaga/CBU.
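
    A minimal sketch of the classifier side is given below, written in PyTorch: a bidirectional LSTM over per-frame feature vectors (e.g. instantaneous frequency, spectral entropy, and MFCCs) followed by a linear readout. The hidden size, feature dimensionality, and last-time-step readout are illustrative assumptions rather than the paper's exact architecture.

        # Bidirectional LSTM classifier over a sequence of EEG feature frames (sketch).
        import torch
        import torch.nn as nn

        class BiLSTMClassifier(nn.Module):
            def __init__(self, n_features, n_classes, hidden=64):
                super().__init__()
                self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
                self.head = nn.Linear(2 * hidden, n_classes)   # forward + backward hidden states

            def forward(self, x):                              # x: (batch, time, n_features)
                out, _ = self.lstm(x)
                return self.head(out[:, -1, :])                # classify from the final time step

        model = BiLSTMClassifier(n_features=16, n_classes=2)   # e.g. 16 features, correct/error classes
        logits = model(torch.randn(8, 100, 16))                # 8 trials, 100 frames each
        loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))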

    The Use of Low-Cost Sensors and a Convolutional Neural Network to Detect and Classify Mini-Drones

    The increasing commercial availability of mini-drones and quadrotors has led to their greater usage, highlighting the need for detection and classification systems to ensure safe operation. Instances of drones causing serious complications since 2019 alone include shutting down airports [1-2], spying on individuals [3-4], and smuggling drugs and prohibited items across borders and into prisons [5-6]. Some regulatory measures have been taken, such as registration of drones above a specific size and the establishment of no-fly zones in sensitive areas such as airports, military bases, and national parks. While commercial systems exist to detect drones [7-8], they are expensive, unreliable, and often rely on a single sensor. This thesis explores the practicality of using low-cost, commercial-off-the-shelf (COTS) sensors and machine learning to detect and classify drones. A Red, Green, and Blue (RGB) USB camera [9], a FLIR Lepton 3.0 thermal camera [10], a miniDSP UMA-16 acoustic microphone array [11], and a Garmin LIDAR [12] were mounted on a robotic sensor platform and integrated using a Minisforum Z83-F with 4GB RAM and an Intel Atom x5-Z8350 CPU to collect data from drones operating in unstructured, outdoor, real-world environments. Approximately 1,000 unique measurements were taken from three mini-drones – the Parrot Swing, Parrot Quadcopter, and Tello Quadcopter – using the RGB, thermal, and acoustic sensors. Deep Convolutional Neural Networks (CNNs) based on ResNet-50 [13-14], trained to classify the drones, achieved accuracies of 96.6% using the RGB images, 82.9% using the thermal images, and 71.3% using the passive acoustic microphone array.
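
    A hedged sketch of the image-classification stage is shown below: a torchvision ResNet-50 with its final layer replaced for the three drone classes. The dataset path, preprocessing, and single-pass training loop are assumptions for illustration, not the setup used in the thesis.

        # Fine-tune a ResNet-50 on drone images (sketch; "drone_rgb/train" is a hypothetical folder).
        import torch
        import torch.nn as nn
        from torchvision import datasets, models, transforms

        tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
        train_set = datasets.ImageFolder("drone_rgb/train", transform=tfm)
        loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

        model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        model.fc = nn.Linear(model.fc.in_features, 3)          # Parrot Swing, Parrot Quadcopter, Tello

        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        loss_fn = nn.CrossEntropyLoss()
        for images, labels in loader:                          # one epoch shown; repeat as needed
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()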

    Audio-Visual Automatic Speech Recognition Using PZM, MFCC and Statistical Analysis

    Audio-Visual Automatic Speech Recognition (AV-ASR) has become one of the most promising research areas for situations in which the audio signal is corrupted by noise. The main objective of this paper is to select important and discriminative audio and visual speech features for audio-visual speech recognition. The paper proposes Pseudo Zernike Moments (PZM) and a feature selection method for this task. Visual information is captured from the lip contour, from which the moments are computed for lip reading. We extract 19th-order Mel Frequency Cepstral Coefficients (MFCC) as speech features from the audio. Since not all 19 speech features are equally important, feature selection algorithms are used to select the most effective ones. Statistical tests such as Analysis of Variance (ANOVA), Kruskal-Wallis, and the Friedman test are employed to analyze the significance of the features, together with an Incremental Feature Selection (IFS) technique: statistical analysis first establishes the significance of the speech features, and IFS then selects the speech feature subset. Furthermore, multiclass Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naive Bayes (NB) machine learning techniques are used to recognize speech from the audio and visual modalities separately. Based on the recognition rates, a combined decision is taken from the two individual recognition systems. The paper compares the results achieved by the proposed model and an existing model for both audio and visual speech recognition. Zernike Moments (ZM) are compared with PZM, showing that the proposed model using PZM extracts more discriminative features for visual speech recognition. The study also shows that audio feature selection using statistical analysis outperforms methods without any feature selection technique.
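
    The sketch below illustrates the general shape of such a pipeline under assumptions: each MFCC feature is ranked by a Kruskal-Wallis test, and Incremental Feature Selection then grows the feature subset in rank order, keeping the subset with the best cross-validated SVM accuracy. The ranking test, cross-validation setup, and kernel are illustrative choices, not necessarily the paper's.

        # Statistical feature ranking + incremental feature selection (IFS) with an SVM (sketch).
        import numpy as np
        from scipy.stats import kruskal
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVC

        def rank_features(X, y):
            # Per-feature Kruskal-Wallis p-value across classes; smaller = more discriminative.
            pvals = [kruskal(*[X[y == c, j] for c in np.unique(y)]).pvalue for j in range(X.shape[1])]
            return np.argsort(pvals)

        def incremental_feature_selection(X, y):
            order, best_acc, best_subset = rank_features(X, y), 0.0, None
            for k in range(1, X.shape[1] + 1):                 # add one ranked feature at a time
                subset = order[:k]
                acc = cross_val_score(SVC(kernel="rbf"), X[:, subset], y, cv=5).mean()
                if acc > best_acc:
                    best_acc, best_subset = acc, subset
            return best_subset, best_acc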

    Fusion of Audio and Visual Information for Implementing Improved Speech Recognition System

    Speech recognition is a very useful technology because of its potential for developing applications suited to various user needs. This research attempts to enhance the performance of a speech recognition system by combining visual features (lip movement) with audio features. The results were calculated using utterances of numerals collected from both male and female participants. Discrete Cosine Transform (DCT) coefficients were used to compute the visual features and Mel Frequency Cepstral Coefficients (MFCC) were used to compute the audio features. Classification was then carried out using a Support Vector Machine (SVM). The results obtained from the combined/fused system were compared with the recognition rates of the two standalone systems (audio-only and visual-only).
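
    As a rough sketch of feature-level fusion under assumptions (librosa MFCCs for audio, a 2-D DCT of the mean lip image for video, simple truncation in place of zig-zag coefficient selection), the fused vector could be built as below and then passed to an SVM.

        # Concatenate DCT-based visual features with MFCC audio features (illustrative sketch).
        import librosa
        import numpy as np
        from scipy.fft import dctn
        from sklearn.svm import SVC

        def visual_features(lip_frames, n_coeffs=32):
            # lip_frames: array of shape (n_frames, height, width); keep low-order DCT coefficients.
            coeffs = dctn(lip_frames.mean(axis=0), norm="ortho")
            return coeffs.flatten()[:n_coeffs]

        def audio_features(wav_path, n_mfcc=13):
            y, sr = librosa.load(wav_path, sr=None)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

        def fused_vector(lip_frames, wav_path):
            return np.concatenate([visual_features(lip_frames), audio_features(wav_path)])

        # clf = SVC(kernel="rbf").fit(np.stack(train_vectors), digit_labels)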

    Audio-Visual Speaker Identification using the CUAVE Database

    The freely available nature of the CUAVE database allows it to provide a valuable platform to form benchmarks and compare research. This paper shows that the CUAVE database can successfully be used to test speaker identification systems, with performance comparable to existing systems implemented on other databases. Additionally, this research shows that the optimal configuration for decision fusion of an audio-visual speaker identification system relies heavily on the video modality in all but clean speech conditions.
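
    A toy sketch of the decision-fusion idea follows: per-speaker scores from the audio and video classifiers are blended with a weight that shifts toward video as the audio gets noisier. The linear SNR-based weighting is a placeholder assumption, not the configuration reported in the paper.

        # Weighted decision fusion of audio and video speaker scores (toy example).
        import numpy as np

        def fuse_scores(audio_scores, video_scores, audio_snr_db, clean_snr_db=30.0):
            # alpha -> 1 for clean audio, -> 0 as SNR drops, so noisy audio defers to video.
            alpha = float(np.clip(audio_snr_db / clean_snr_db, 0.0, 1.0))
            return alpha * np.asarray(audio_scores) + (1.0 - alpha) * np.asarray(video_scores)

        speaker_id = int(np.argmax(fuse_scores([0.2, 0.5, 0.3], [0.1, 0.7, 0.2], audio_snr_db=10)))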

    Artificial Intelligence for Suicide Assessment using Audiovisual Cues: A Review

    Death by suicide is the seventh leading cause of death worldwide. The recent advancement of Artificial Intelligence (AI), specifically AI applications in image and voice processing, has created a promising opportunity to revolutionize suicide risk assessment. Subsequently, we have witnessed a fast-growing body of research that applies AI to extract audiovisual non-verbal cues for mental illness assessment. However, the majority of recent works focus on depression, despite the evident differences between depression and suicidal behavior in terms of symptoms and non-verbal cues. This paper reviews recent works that study suicide ideation and suicide behavior detection through audiovisual feature analysis, mainly suicidal voice/speech acoustic features and suicidal visual cues. Automatic suicide assessment is a promising research direction that is still in its early stages. Accordingly, there is a lack of large datasets that can be used to train the machine learning and deep learning models proven to be effective in other, similar tasks. Comment: Manuscript submitted to Artificial Intelligence Reviews (2022).