Continuous Action Recognition Based on Sequence Alignment
Continuous action recognition is more challenging than isolated recognition
because classification and segmentation must be simultaneously carried out. We
build on the well-known dynamic time warping (DTW) framework and devise a novel
visual alignment technique, namely dynamic frame warping (DFW), which performs
isolated recognition based on per-frame representation of videos, and on
aligning a test sequence with a model sequence. Moreover, we propose two
extensions that perform recognition concomitantly with segmentation,
namely one-pass DFW and two-pass DFW. These two methods have their roots in the
domain of continuous recognition of speech and, to the best of our knowledge,
their extension to continuous visual action recognition has been overlooked. We
test and illustrate the proposed techniques with a recently released dataset
(RAVEL) and with two public-domain datasets widely used in action recognition
(Hollywood-1 and Hollywood-2). We also compare the performance of the proposed
isolated and continuous recognition algorithms with several recently published
methods.
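As a rough illustration of the alignment idea this abstract builds on, the sketch below implements classic DTW between two sequences of per-frame feature vectors and uses it for isolated recognition by nearest model sequence. It is plain DTW, not the paper's DFW or its one-pass and two-pass extensions; the function names, the Euclidean frame distance, and the toy labels are assumptions made for the example.

```python
from math import dist, inf

def dtw_distance(test_seq, model_seq):
    """Cumulative cost of the best DTW alignment between two frame sequences.

    Each sequence is a list of equal-length feature vectors (one per frame).
    The Euclidean frame distance is an assumption, not taken from the paper.
    """
    n, m = len(test_seq), len(model_seq)
    # cost[i][j]: minimal cost of aligning the first i test frames
    # with the first j model frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(test_seq[i - 1], model_seq[j - 1])
            # Allowed moves: advance the test, advance the model, or both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def classify(test_seq, models):
    """Isolated recognition: return the label whose model aligns most cheaply."""
    return min(models, key=lambda label: dtw_distance(test_seq, models[label]))

# Hypothetical two-dimensional per-frame features for two action models.
models = {
    "wave": [(0, 0), (1, 1), (2, 2)],
    "jump": [(0, 0), (0, 5), (0, 0)],
}
test_sequence = [(0, 0), (0, 0), (1, 1), (2, 2)]  # "wave" with a repeated frame
predicted = classify(test_sequence, models)
```

The repeated first frame in `test_sequence` is absorbed by the warping path, which is why the alignment to the "wave" model costs nothing even though the sequences differ in length.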
Combining Multiple Views for Visual Speech Recognition
Visual speech recognition is a challenging research problem with a particular
practical application of aiding audio speech recognition in noisy scenarios.
Multiple camera setups can be beneficial for the visual speech recognition
systems in terms of improved performance and robustness. In this paper, we
explore this aspect and provide a comprehensive study on combining multiple
views for visual speech recognition. The thorough analysis covers fusion of all
possible view angle combinations both at feature level and decision level. The
employed visual speech recognition system in this study extracts features
through a PCA-based convolutional neural network, followed by an LSTM network.
Finally, these features are processed in a tandem system, being fed into a
GMM-HMM scheme. The decision fusion acts after this point by combining the
Viterbi path log-likelihoods. The results show that the complementary
information contained in recordings from different view angles improves the
results significantly. For example, the sentence correctness on the test set is
increased from 76% for the highest performing single view () to up to
83% when combining this view with the frontal and view angles
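The decision-level fusion described in this abstract combines Viterbi path log-likelihoods from several views. The abstract does not say how they are combined, so the sketch below assumes a weighted sum of per-view log-likelihoods for each candidate sentence; the function name, view names, weights, and scores are all hypothetical.

```python
def fuse_decisions(view_scores, weights=None):
    """Pick the hypothesis with the highest weighted sum of log-likelihoods.

    view_scores maps a view name to {hypothesis: log-likelihood}.
    The weighted-sum rule and the uniform default weights are assumptions.
    """
    views = list(view_scores)
    if weights is None:
        weights = {v: 1.0 / len(views) for v in views}
    # Only hypotheses scored by every view can be fused.
    hypotheses = set.intersection(*(set(view_scores[v]) for v in views))
    fused = {
        h: sum(weights[v] * view_scores[v][h] for v in views)
        for h in hypotheses
    }
    best = max(fused, key=fused.get)
    return best, fused

# Hypothetical log-likelihoods from two camera views for two candidate sentences.
scores = {
    "view_a": {"yes": -10.0, "no": -12.0},
    "view_b": {"yes": -15.0, "no": -11.0},
}
best, fused = fuse_decisions(scores)
```

Here the two views disagree, and the fused score favours the hypothesis that is acceptable to both rather than the one a single view prefers strongly.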
Automatic Speech Recognition for Indonesian using Linear Predictive Coding (LPC) and Hidden Markov Model (HMM)
Speech recognition is an influential area of signal processing in communication technology, allowing software to recognize spoken words. This application was developed using Linear Predictive Coding (LPC) for feature extraction from the speech signal and a Hidden Markov Model (HMM) to model each spoken word. The speech data used for training and testing were produced by 10 speakers (5 men and 5 women), each of whom spoke 10 words 10 times each. The system was tested using 10-fold cross-validation for each pair of LPC order and number of HMM states. System performance is measured as the average test accuracy over the male and female speakers. The test results show that the number of HMM states affects the accuracy of the system; the best accuracy is 94.20%, obtained with LPC order = 13 and 16 HMM states.
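To make the LPC feature-extraction step concrete, here is a minimal sketch that estimates LPC coefficients for one frame. The abstract does not state how the coefficients are computed, so the autocorrelation method with the Levinson-Durbin recursion is an assumption, as are the function name and the synthetic test frame.

```python
def lpc_coefficients(frame, order):
    """Estimate LPC coefficients a[0..order] (with a[0] = 1) for one frame.

    Uses the autocorrelation method and the Levinson-Durbin recursion.
    The predictor is x[n] ~ -(a[1]*x[n-1] + ... + a[order]*x[n-order]).
    """
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = [sum(frame[t] * frame[t + k] for t in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    if err == 0.0:
        return a  # silent frame: nothing to predict
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k  # remaining prediction error energy
    return a

# Synthetic decaying signal x[n] = 0.5**n, which follows x[n] = 0.5 * x[n-1].
frame = [0.5 ** t for t in range(50)]
coeffs = lpc_coefficients(frame, order=2)
```

For this synthetic frame the first coefficient comes out close to -0.5 and the second close to 0, matching the first-order recurrence that generated it. In the setup described above, the same routine would be called with `order=13` on each windowed speech frame before HMM training.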
Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion
Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches for inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and uses speaker-adapted articulatory models derived from acoustically derived weights. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset consisting of speakers with strong individual speaker-dependent inversion performance, the PRSW method is able to attain kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data
Analysing the importance of different visual feature coefficients
A study is presented to determine the relative importance of different visual features for speech recognition, including pixel-based, model-based, contour-based and physical features. Analysis to determine the discriminability of features is performed through F-ratio and J-measures for both static and temporal derivatives, the results of which were found to correlate highly with speech recognition accuracy (r = 0.97). Principal component analysis is then used to combine all visual features into a single feature vector, and further analysis is performed on the resulting basis functions. An optimal feature vector is obtained which outperforms the best individual feature (AAM) with 93.5% word accuracy.
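As an illustration of the discriminability analysis mentioned in this abstract, the sketch below computes an F-ratio for a single scalar feature. The abstract does not give its exact formulation, so the definition used here (variance of the class means divided by the mean within-class variance) is one common form and an assumption, as are the function name and the toy data.

```python
from math import inf
from statistics import fmean, pvariance

def f_ratio(samples_by_class):
    """F-ratio of one scalar feature: between-class over within-class variance.

    samples_by_class maps a class label to the feature values observed
    for that class. Higher values suggest a more discriminative feature.
    """
    class_means = [fmean(values) for values in samples_by_class.values()]
    between = pvariance(class_means)
    within = fmean(pvariance(values) for values in samples_by_class.values())
    if within == 0:
        return inf  # classes have no internal spread
    return between / within

# Hypothetical values of one visual feature for two speech classes.
feature_values = {
    "class_a": [0.0, 2.0],
    "class_b": [10.0, 12.0],
}
ratio = f_ratio(feature_values)
```

In this toy case the class means are far apart relative to the spread inside each class, so the ratio is large; a feature whose classes overlap heavily would score close to zero.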
A Bayesian Network View on Acoustic Model-Based Techniques for Robust Speech Recognition
This article provides a unifying Bayesian network view on various approaches
for acoustic model adaptation, missing feature, and uncertainty decoding that
are well-known in the literature of robust automatic speech recognition. The
representatives of these classes can often be deduced from a Bayesian network
that extends the conventional hidden Markov models used in speech recognition.
These extensions, in turn, can in many cases be motivated from an underlying
observation model that relates clean and distorted feature vectors. By
converting the observation models into a Bayesian network representation, we
formulate the corresponding compensation rules leading to a unified view on
known derivations as well as to new formulations for certain approaches. The
generic Bayesian perspective provided in this contribution thus highlights
structural differences and similarities between the analyzed approaches
- …