Capacity and Complexity of HMM Duration Modeling Techniques
The ability of a standard hidden Markov model (HMM) or expanded state HMM (ESHMM) to accurately model duration distributions of phonemes is compared with specific duration-focused approaches such as semi-Markov models or variable transition probabilities. It is demonstrated that either a three-state ESHMM or a standard HMM with an increased number of states is capable of closely matching both Gamma distributions and duration distributions of phonemes from the TIMIT corpus, as measured by Bhattacharyya distance to the true distributions. Standard HMMs are easily implemented with off-the-shelf tools, whereas duration models require substantial algorithmic development and have higher computational costs when implemented, suggesting that a simple adjustment to HMM topologies is perhaps a more efficient solution to the problem of duration modeling than more complex approaches.
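As a rough illustration of the comparison described above, the sketch below matches the duration distribution of an n-state left-to-right HMM (a negative binomial, since the total duration is a sum of geometric per-state occupancies) against a Gamma target using the Bhattacharyya distance. All parameter values are illustrative assumptions, not those used in the paper.

```python
import numpy as np
from scipy import stats

def hmm_duration_pmf(n_states, self_loop_p, max_dur=200):
    # Total duration of n identical states, each geometric with exit
    # probability (1 - self_loop_p), is negative binomial.
    d = np.arange(1, max_dur + 1)
    return stats.nbinom.pmf(d - n_states, n_states, 1.0 - self_loop_p)

def bhattacharyya(p, q):
    return -np.log(np.sum(np.sqrt(p * q)))

d = np.arange(1, 201)
gamma_pmf = stats.gamma.pdf(d, a=3.0, scale=10.0)  # mean duration of 30 frames
gamma_pmf /= gamma_pmf.sum()                       # discretize to a pmf

for n in (1, 3, 5):
    p_loop = 1.0 - n / 30.0                        # match the 30-frame mean
    print(n, bhattacharyya(hmm_duration_pmf(n, p_loop), gamma_pmf))
```

Increasing the state count narrows the negative binomial around its mean, which is why more states can track the peaked Gamma shapes that a single geometric state cannot.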
Optimal Calibration of PET Crystal Position Maps Using Gaussian Mixture Models
A method is developed for estimating optimal PET gamma-ray detector crystal position maps, for arbitrary crystal configurations, based on a binomial distribution model for scintillation photon arrival. The approach is based on maximum likelihood estimation of Gaussian mixture model parameters using crystal position histogram data, with determination of the position map taken from the posterior probability boundaries between mixtures. This leads to minimum-probability-of-error crystal identification under the assumed model.
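A minimal sketch of the crystal-identification scheme: fit a Gaussian mixture (one component per crystal) to position-histogram events, then label each pixel of the map with the component of highest posterior probability, which is the minimum-probability-of-error rule under the model. The synthetic 8 x 8 crystal block, grid size, and blob spread below are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for position-histogram events from a hypothetical
# 8 x 8 crystal block: a cluster of events around each crystal center.
centers = np.stack(np.meshgrid(np.linspace(40, 216, 8),
                               np.linspace(40, 216, 8)), axis=-1).reshape(-1, 2)
events = np.vstack([c + 6.0 * rng.standard_normal((400, 2)) for c in centers])

gmm = GaussianMixture(n_components=64, covariance_type="full",
                      means_init=centers, random_state=0).fit(events)

# Position map: maximum-posterior (minimum-error) crystal label per pixel.
xs, ys = np.meshgrid(np.arange(256), np.arange(256))
grid = np.column_stack([xs.ravel(), ys.ravel()])
position_map = gmm.predict(grid).reshape(256, 256)
```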
Generalized Perceptual Linear Prediction (gPLP) Features for Animal Vocalization Analysis
A new feature extraction model, generalized perceptual linear prediction (gPLP), is developed to calculate a set of perceptually relevant features for digital signal analysis of animal vocalizations. The gPLP model is a generalized adaptation of the perceptual linear prediction (PLP) model, popular in human speech processing, which incorporates perceptual information such as frequency warping and equal loudness normalization into the feature extraction process. Since such perceptual information is available for a number of animal species, this new approach integrates that information into a generalized model to extract perceptually relevant features for a particular species. To illustrate, qualitative and quantitative comparisons are made between the species-specific gPLP model and the original PLP model using a set of vocalizations collected from captive African elephants (Loxodonta africana) and wild beluga whales (Delphinapterus leucas). The models that incorporate perceptual information outperform the original human-based models in both visualization and classification tasks.
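The species-specific frequency warping that gPLP relies on can be sketched with the Greenwood frequency-position function f = A(10^(ax) - k), whose constants can be fit per species. The constants shown below are the commonly cited human values and serve only as placeholders for a species-specific fit.

```python
import numpy as np

def greenwood_warp(x, A=165.4, a=2.1, k=0.88):
    """Frequency (Hz) at normalized cochlear position x in [0, 1]."""
    return A * (10.0 ** (a * x) - k)

def filterbank_centers(n_filters, A=165.4, a=2.1, k=0.88):
    # Spacing filters uniformly along the cochlea yields center
    # frequencies that are perceptually uniform for the species.
    x = np.linspace(0.0, 1.0, n_filters)
    return greenwood_warp(x, A, a, k)

print(filterbank_centers(10))
```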
Automatic Classification of African Elephant (Loxodonta africana) Follicular and Luteal Rumbles
Recent research in African elephant vocalizations has shown that there is evidence for acoustic differences in the rumbles of females based on the phase of their estrous cycle (1). One reason for these differences might be to attract a male for reproductive purposes. Since rumbles have a fundamental frequency near 10 Hz, they attenuate slowly and can be heard over a distance of several kilometers. This research exploits differences in the rumbles to create an automatic classification system that can determine whether a female rumble was made during the luteal or follicular phase of the ovulatory cycle. This system could be used as the basis for a non-invasive technique to determine the reproductive status of a female African elephant. The classification system is based on current state-of-the-art human speech processing systems. Standard features and models are applied with the necessary modifications to account for the physiological, anatomical, and language differences between humans and African elephants. The long-term goal of this research is to develop a universal analysis framework and robust feature set for animal vocalizations that can be applied to many species. This research represents an application of this framework. The vocalizations used for this study were collected from a group of three female captive elephants. The elephants are fitted with radio-transmitting microphone collars and released into one of three naturalistic yards on a daily basis. Although this data collection setup is good for determining the speaker of each vocalization, it suffers from many potential noise sources such as RF interference, passing vehicles, and the flapping of the elephant's ears against the collar.
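A hedged sketch of the classifier concept, not the study's actual system: cepstral features computed over a lowered frequency band to suit the ~10 Hz fundamental, scored against per-class Gaussian mixture models. The synthetic signals and all settings below are stand-ins for real rumble recordings and the study's features.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

SR = 2000

def rumble_features(y):
    # Cepstral features over a low band instead of the usual speech band.
    return librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13,
                                n_mels=26, fmin=5, fmax=250).T

rng = np.random.default_rng(0)

def fake_rumble(f0):
    # Synthetic stand-in for a recorded rumble (illustrative only).
    t = np.arange(5 * SR) / SR
    return np.sin(2 * np.pi * f0 * t) + 0.1 * rng.standard_normal(t.size)

follicular = np.vstack([rumble_features(fake_rumble(10.0)) for _ in range(4)])
luteal = np.vstack([rumble_features(fake_rumble(12.5)) for _ in range(4)])

gmm_f = GaussianMixture(n_components=4, random_state=0).fit(follicular)
gmm_l = GaussianMixture(n_components=4, random_state=0).fit(luteal)

test = rumble_features(fake_rumble(10.0))
print("follicular" if gmm_f.score(test) > gmm_l.score(test) else "luteal")
```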
Efficient Embedded Speech Recognition for Very Large Vocabulary Mandarin Car-Navigation Systems
Automatic speech recognition (ASR) for a very large vocabulary of isolated words is a difficult task on a resource-limited embedded device. This paper presents a novel fast decoding algorithm for a Mandarin speech recognition system which can simultaneously process hundreds of thousands of items and maintain high recognition accuracy. The proposed algorithm constructs a semi-tree search network based on Mandarin pronunciation rules, to avoid duplicate syllable matching and save redundant memory. Based on a two-stage fixed-width beam-search baseline system, the algorithm employs a variable beam-width pruning strategy and a frame-synchronous word-level pruning strategy to significantly reduce recognition time. This algorithm is aimed at an in-car navigation system in China and simulated on a standard PC workstation. The experimental results show that the proposed method reduces recognition time nearly 6-fold and memory size nearly 2-fold compared to the baseline system, and causes less than 1% accuracy degradation on a 200,000-word recognition task.
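The variable beam-width idea can be sketched as below: a frame-synchronous search whose pruning threshold tightens as decoding proceeds. The linear schedule, the (state, score) hypothesis representation, and the caller-supplied `expand` function are simplifying assumptions, not the paper's actual decoder.

```python
def decode(frames, expand, initial_hyps, max_beam=200.0, min_beam=60.0):
    """Frame-synchronous beam search with a variable beam width.

    expand(hyp, frame) must return the successor (state, score) pairs
    of one hypothesis for one frame of acoustic input.
    """
    hyps = initial_hyps                      # list of (state, score) pairs
    n = max(len(frames), 1)
    for t, frame in enumerate(frames):
        # Variable beam width: tighten linearly over the utterance.
        beam = max_beam - (max_beam - min_beam) * t / n
        new_hyps = [h for hyp in hyps for h in expand(hyp, frame)]
        best = max(score for _, score in new_hyps)
        # Prune hypotheses falling outside the current beam.
        hyps = [(s, sc) for s, sc in new_hyps if sc > best - beam]
    return max(hyps, key=lambda h: h[1])
```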
Tracking Articulator Movements Using Orientation Measurements
This paper introduces a new method to track articulator movements, specifically jaw position and angle, using five-degree-of-freedom (5 DOF) orientation data. The approach uses a quaternion rotation method to track the jaw during speech using a single sensor on the mandibular incisor. Data were collected using the NDI Wave Speech Research System for one pilot subject performing various speech tasks. The degree of jaw rotation from the proposed approach is compared with a traditional geometric calculation. Results show that the quaternion-based method is able to describe the jaw angle trajectory and gives a more accurate and smoother estimate of jaw kinematics.
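A minimal sketch of the quaternion approach, under the assumption that the 5 DOF sensor reports pitch and roll only (yaw unavailable): represent each sample's orientation as a rotation and measure the jaw's rotation relative to a reference (jaw-closed) pose as a single quaternion angle. The axis convention and angle values are illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def jaw_angle_deg(pitch_deg, roll_deg, ref_pitch_deg, ref_roll_deg):
    # Build rotations from pitch and roll; a 5 DOF sensor gives no yaw,
    # so yaw is implicitly zero (an assumption of this sketch).
    q = R.from_euler("xy", [pitch_deg, roll_deg], degrees=True)
    q_ref = R.from_euler("xy", [ref_pitch_deg, ref_roll_deg], degrees=True)
    q_rel = q * q_ref.inv()                 # rotation away from reference pose
    return np.degrees(q_rel.magnitude())    # total rotation angle

print(jaw_angle_deg(pitch_deg=-12.0, roll_deg=1.5,
                    ref_pitch_deg=-2.0, ref_roll_deg=1.0))
```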
Sensorimotor Adaptation of Speech Using Real-time Articulatory Resynthesis
Sensorimotor adaptation is an important focus in the study of motor learning for non-disordered speech, but has yet to be studied substantially for speech rehabilitation. Speech adaptation is typically elicited experimentally using LPC resynthesis to modify the sounds that a speaker hears himself producing. This method requires that the participant be able to produce a robust speech-acoustic signal and is therefore not well-suited for talkers with dysarthria. We have developed a novel technique using electromagnetic articulography (EMA) to drive an articulatory synthesizer. The acoustic output of the articulatory synthesizer can be perturbed experimentally to study auditory feedback effects on sensorimotor learning. This work aims to compare sensorimotor adaptation effects using our articulatory resynthesis method with effects from an established, acoustic-only method. Results suggest that the articulatory resynthesis method can elicit speech adaptation, but that the articulatory effects of the two methods differ.
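The established acoustic-only method can be sketched roughly as LPC analysis, a scaling of the pole angles (which shifts formant frequencies), and resynthesis from the residual. The synthetic frame and the 5% shift below are illustrative assumptions; a real perturbation system streams this per frame with low latency on live speech.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def perturb_frame(frame, order=12, shift=1.05):
    a = librosa.lpc(frame, order=order)
    residual = lfilter(a, [1.0], frame)            # inverse (analysis) filter
    roots = np.roots(a)
    # Scale pole angles to shift formant frequencies; magnitudes are
    # unchanged, so the resynthesis filter stays stable.
    new_roots = np.abs(roots) * np.exp(1j * np.angle(roots) * shift)
    a_new = np.real(np.poly(new_roots))
    return lfilter([1.0], a_new, residual)         # resynthesis

sr = 16000
t = np.arange(512) / sr
frame = np.sin(2 * np.pi * 200 * t) + 0.05 * np.random.randn(512)
shifted = perturb_frame(frame)
```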
Vowel Production in Mandarin Accented English and American English: Kinematic and Acoustic Data from the Marquette University Mandarin Accented English Corpus
Few electromagnetic articulography (EMA) datasets are publicly available, and none have focused systematically on non-native accented speech. We introduce a kinematic-acoustic database of speech from 40 gender- and dialect-balanced participants producing upper-Midwestern American English (AE) L1 or Mandarin Accented English (MAE) L2 (Beijing or Shanghai dialect base). The Marquette University EMA-MAE corpus will be released publicly to help advance research in areas such as pronunciation modeling, acoustic-articulatory inversion, L1-L2 comparisons, pronunciation error detection, and accent modification training. EMA data were collected at a 400 Hz sampling rate with synchronous audio using the NDI Wave System. Articulatory sensors were placed on the midsagittal lips, lower incisors, and tongue blade and dorsum, as well as on the lip corner and lateral tongue body. Sensors provide five degree-of-freedom measurements including three-dimensional sensor position and two-dimensional orientation (pitch and roll). In the current work we analyze kinematic and acoustic variability between L1 and L2 vowels. We address the hypothesis that MAE is characterized by larger differences in the articulation of back vowels than front vowels and smaller vowel spaces compared to AE. The current results provide a seminal comparison of the kinematics and acoustics of vowel production between MAE and AE speakers.
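One common way to quantify the vowel-space part of this hypothesis is the convex-hull area of corner-vowel (F1, F2) means, sketched below with made-up placeholder formant values rather than corpus measurements.

```python
import numpy as np
from scipy.spatial import ConvexHull

# (F1, F2) means in Hz for /i/, /ae/, /a/, /u/ (placeholder numbers only).
ae_vowels = np.array([[300, 2300], [660, 1720], [730, 1090], [320, 870]])
mae_vowels = np.array([[330, 2200], [600, 1700], [700, 1150], [360, 950]])

for name, v in [("AE", ae_vowels), ("MAE", mae_vowels)]:
    # For a 2D hull, .volume is the enclosed area.
    print(name, "vowel space area (Hz^2):", ConvexHull(v).volume)
```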
The Electromagnetic Articulography Mandarin Accented English (EMA-MAE) Corpus of Acoustic and 3D Articulatory Kinematic Data
There is a significant need for more comprehensive electromagnetic articulography (EMA) datasets that can provide matched acoustics and articulatory kinematic data with good spatial and temporal resolution. The Marquette University Electromagnetic Articulography Mandarin Accented English (EMA-MAE) corpus provides kinematic and acoustic data from 40 gender and dialect balanced speakers representing 20 Midwestern standard American English L1 speakers and 20 Mandarin Accented English (MAE) L2 speakers, half with a Beijing region dialect base and half with a Shanghai region dialect base. Three-dimensional EMA data were collected at a 400 Hz sampling rate using the NDI Wave system, with articulatory sensors on the midsagittal lips, lower incisors, tongue blade and dorsum, plus lateral lip corner and tongue body. Sensors provide three-dimensional position data as well as two-dimensional orientation data representing the orientation of the sensor plane. Data have been corrected for head movement relative to a fixed reference sensor and also adjusted using a biteplate calibration system to place the data in an articulatory working space relative to each subject's individual midsagittal and maxillary occlusal planes. Speech materials include isolated words chosen to focus on specific contrasts between the English and Mandarin languages, as well as sentences and paragraphs for continuous speech, totaling approximately 45 minutes of data per subject. A beta version of the EMA-MAE corpus is now available, and the full corpus is in preparation for public release to help advance research in areas such as pronunciation modeling, acoustic-articulatory inversion, L1-L2 comparisons, pronunciation error detection, and accent modification training.
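The head-movement correction step can be sketched as re-expressing each articulator sensor's position in the coordinate frame of the fixed head reference sensor (position plus orientation per sample), so that head motion cancels out. Array names, shapes, and the random stand-in data below are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def correct_head_movement(sensor_xyz, ref_xyz, ref_quat):
    """sensor_xyz, ref_xyz: (T, 3) positions; ref_quat: (T, 4) as (x, y, z, w)."""
    rot = R.from_quat(ref_quat)
    # Rotate the reference-relative displacement into the head frame.
    return rot.inv().apply(sensor_xyz - ref_xyz)

T = 400                                   # one second of data at 400 Hz
sensor = np.random.randn(T, 3)            # stand-in articulator positions
ref = np.random.randn(T, 3)               # stand-in reference positions
quat = R.random(T).as_quat()              # stand-in reference orientations
tongue_in_head_frame = correct_head_movement(sensor, ref, quat)
```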
Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion
Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches for inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and only a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and adapts articulatory models using acoustically derived speaker weights. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset consisting of speakers with strong individual speaker-dependent inversion performance, the PRSW method is able to attain kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data.
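A minimal sketch of the PRSW idea under simplifying assumptions (the distance measure, the exponential weighting, and the flattened model parameterization are stand-ins, not the paper's exact formulation): weight reference speakers by acoustic closeness to the target, then combine their articulatory model parameters with those weights.

```python
import numpy as np

def prsw_weights(target_acoustic_mean, ref_acoustic_means):
    # Closer reference speakers (in acoustic space) get larger weights.
    d = np.linalg.norm(ref_acoustic_means - target_acoustic_mean, axis=1)
    w = np.exp(-d / d.mean())
    return w / w.sum()

def weighted_articulatory_model(ref_model_params, weights):
    # ref_model_params: (n_refs, n_params) stacked per-speaker models.
    return weights @ ref_model_params

n_refs, n_params = 20, 128
refs_acoustic = np.random.randn(n_refs, 39)   # e.g., mean acoustic feature vectors
target_acoustic = np.random.randn(39)          # from the target's adaptation data
models = np.random.randn(n_refs, n_params)     # stand-in per-speaker models

w = prsw_weights(target_acoustic, refs_acoustic)
adapted = weighted_articulatory_model(models, w)
```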