Persian Vowel recognition with MFCC and ANN on PCVC speech dataset
In this paper, a new method is proposed for recognizing consonant-vowel
phoneme combinations on a new Persian speech dataset, PCVC (Persian
Consonant-Vowel Combination), which is used to recognize Persian phonemes.
The PCVC dataset contains 20 sets of audio samples from 10 speakers,
covering combinations of the 23 consonant and 6 vowel phonemes of the
Persian language. Each sample is a combination of one consonant and one
vowel: the consonant is pronounced first, immediately followed by the
vowel. Each sound sample is a 2-second frame of audio, of which an average
of 0.5 seconds is speech and the rest is silence. The proposed method
applies MFCC (Mel Frequency Cepstral Coefficients) extraction to every
partitioned sound sample. Each training sample's MFCC vector is then fed to
a multilayer perceptron feed-forward ANN (Artificial Neural Network) for
training, and the test samples are evaluated on the trained ANN for phoneme
recognition. After the training and testing process, the vowel recognition
results are presented and the average recognition rate for the vowel
phonemes is computed.
Comment: The 5th International Conference of Electrical Engineering, Computer
Science and Information Technology 201
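The MFCC extraction step described above can be sketched in plain NumPy. The sampling rate, frame size, FFT length, and filterbank size below are illustrative assumptions (the abstract does not specify them); a real system would more likely use a library such as librosa.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Frame -> window -> power spectrum -> mel filterbank -> log -> DCT-II.

    Parameter values are assumptions for illustration, not the paper's settings.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=512)) ** 2 / 512
    fb = mel_filterbank(n_filters, 512, sr)
    log_energies = np.log(power @ fb.T + 1e-10)
    # Explicit DCT-II matrix decorrelates the log filterbank energies.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return log_energies @ dct.T  # shape: (n_frames, n_ceps)
```

For a 2-second sample at the assumed 16 kHz rate, this yields a (198, 13) matrix of cepstral coefficients per sample, which would then be fed to the feed-forward ANN.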
The Recognition Of Persian Phonemes Using PPNet
In this paper, a novel approach is proposed for the recognition of Persian
phonemes in the Persian Consonant-Vowel Combination (PCVC) speech dataset.
Nowadays, deep neural networks play a crucial role in classification tasks,
and deep learning techniques show outstanding performance on many such
tasks, including image and document classification, sometimes even
surpassing humans. In speech recognition, however, the best results still
fall short of the human recognition rate. The gap between automatic speech
recognition (ASR) systems and human speech recognition depends largely on
the features of the data fed to the deep neural networks. Methods: In this
research, the sound samples are first cut into 50 ms segments for the exact
extraction of phoneme sounds. The phonemes are then divided into 30 groups:
23 consonants, 6 vowels, and a silence phoneme. Results: The short-time
Fourier transform (STFT) is applied to the segments, and the results are
given to the PPNet classifier (a new deep convolutional neural network
architecture). A total average accuracy of 75.87% is reached, the best
result so far compared with other algorithms on separated Persian phonemes
(as in the PCVC speech dataset). Conclusion: This method can be used not
only for recognizing mono-phonemes but can also be adopted as an input to
the selection of the best words in speech transcription.
Comment: Accepted in "Journal of Medical Signals & Sensors". arXiv admin note:
substantial text overlap with arXiv:1812.0695
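The STFT preprocessing step described in the Methods section can be sketched as follows. The window and hop sizes, and the 16 kHz sampling rate, are illustrative assumptions; the paper's exact settings are not given in the abstract.

```python
import numpy as np

def stft_features(segment, win=128, hop=64):
    """Magnitude STFT of one short phoneme segment.

    Window/hop sizes are assumptions for illustration only.
    """
    w = np.hanning(win)
    n_frames = 1 + (len(segment) - win) // hop
    frames = np.stack([segment[i * hop:i * hop + win] * w
                       for i in range(n_frames)])
    # One-sided magnitude spectrum per frame.
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win // 2 + 1)
```

A 50 ms segment at an assumed 16 kHz sampling rate is 800 samples, giving an 11 x 65 magnitude spectrogram; such spectrograms are the kind of input a convolutional classifier like PPNet would consume.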
Multinomial logistic regression probability ratio-based feature vectors for Malay vowel recognition
Vowel recognition is the part of automatic speech recognition (ASR) systems that classifies speech signals into groups of vowels. The performance of Malay vowel recognition (MVR), like that of any multiclass classification problem, depends largely on the feature vectors (FVs). FVs such as Mel-frequency cepstral coefficients (MFCC) have produced high error rates due to poor phoneme information. Classifier-transformed probabilistic features have proved a better alternative for conveying phoneme information, but their high dimensionality introduces additional complexity that degrades ASR performance. This study aims to improve MVR performance with an algorithm that transforms MFCC FVs into a new set of features using multinomial logistic regression (MLR), reducing the dimensionality of the probabilistic features. The study was carried out in four phases: pre-processing and feature extraction, best regression coefficient generation, feature transformation, and performance evaluation. The speech corpus consists of 1953 samples of the five Malay vowels /a/, /e/, /i/, /o/, and /u/, recorded from students of two public universities in Malaysia. Two algorithms were developed: DBRCs and FELT. The DBRCs algorithm determines the best set of regression coefficients (RCs) from the extracted 39-MFCC FVs through a resampling and data-swapping approach. The FELT algorithm transforms the 39-MFCC FVs into FELT FVs using a logistic transformation method. Vowel recognition rates of the FELT and 39-MFCC FVs were compared using four classification techniques: artificial neural network, MLR, linear discriminant analysis, and k-nearest neighbour. Classification results showed that the FELT FVs surpass the 39-MFCC FVs in MVR: depending on the classifier, FELT achieved an improvement of 1.48%-11.70% over MFCC, and it significantly improved the recognition accuracy of the vowels /o/ and /u/ by 5.13% and 8.04%, respectively. This study contributes two algorithms for determining the best set of RCs and generating FELT FVs from MFCC. The FELT FVs eliminate the need for dimensionality reduction with comparable performance, and they improve MVR for all five vowels, especially /o/ and /u/. The improved MVR performance will spur the development of Malay speech-based systems, especially for the Malaysian community.
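The core logistic transformation step can be sketched as a softmax mapping from 39-dimensional MFCC FVs to 5 class-probability features, one per vowel. In the study the regression coefficients come from the DBRCs resampling procedure; here they are arbitrary placeholders, and the function name is hypothetical.

```python
import numpy as np

def felt_transform(fvs, coef, intercept):
    """Map MFCC feature vectors to K class-probability features via a
    multinomial logistic (softmax) transformation.

    fvs: (n_samples, 39) MFCC feature vectors
    coef: (K, 39) regression coefficients (from DBRCs in the paper;
          arbitrary here), intercept: (K,)
    """
    z = fvs @ coef.T + intercept        # (n_samples, K) linear scores
    z -= z.max(axis=1, keepdims=True)   # subtract row max for stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)
```

The output has one column per vowel class and rows that sum to 1, so it is already low-dimensional (5 features for 5 vowels) and needs no separate dimensionality-reduction step, consistent with the claim above.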
PhantomSound: Black-Box, Query-Efficient Audio Adversarial Attack via Split-Second Phoneme Injection
In this paper, we propose PhantomSound, a query-efficient black-box attack
toward voice assistants. Existing black-box adversarial attacks on voice
assistants either apply substitution models or leverage the intermediate model
output to estimate the gradients for crafting adversarial audio samples.
However, these attack approaches require a large number of queries and a
lengthy training stage. PhantomSound leverages a decision-based attack to
produce effective adversarial audio and reduces the number of queries by
optimizing the gradient estimation. In the experiments, we perform our attack
against 4 different speech-to-text APIs under 3 real-world scenarios to
demonstrate the real-time attack impact. The results show that PhantomSound is
practical and robust in attacking 5 popular commercial voice controllable
devices over the air, and is able to bypass 3 liveness detection mechanisms
with >95% success rate. The benchmark result shows that PhantomSound can
generate adversarial examples and launch the attack in a few minutes. We
significantly enhance the query efficiency and reduce the cost of a successful
untargeted and targeted adversarial attack by 93.1% and 65.5% compared with the
state-of-the-art black-box attacks, using merely ~300 queries (~5 minutes) and
~1,500 queries (~25 minutes), respectively.
Comment: RAID 202
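The query cost the abstract highlights comes from gradient estimation against a hard-label oracle. A minimal sketch of the general idea behind decision-based attacks (not PhantomSound's exact estimator, whose details are not in the abstract) averages random unit directions weighted by the oracle's yes/no answers:

```python
import numpy as np

def estimate_gradient(decision, x, n_queries=50, delta=0.01, seed=0):
    """Monte Carlo gradient-direction estimate from a hard-label oracle.

    decision(x) -> bool: True if the input is classified as adversarial.
    Each oracle call is one query, so n_queries directly sets the cost.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_queries, x.size))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # unit directions
    # +1 if the perturbed input flips the decision, else -1.
    signs = np.array([1.0 if decision(x + delta * ui) else -1.0
                      for ui in u])
    g = (signs[:, None] * u).mean(axis=0)           # Monte Carlo average
    return g / (np.linalg.norm(g) + 1e-12)
```

Reducing the number of such queries per gradient estimate, as PhantomSound does, directly shrinks the attack's query budget and wall-clock time.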