Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion
Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer-aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches for inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and only a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and builds speaker-adapted articulatory models by combining reference speakers' models using acoustically derived weights. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset of speakers with strong individual speaker-dependent inversion performance, the PRSW method attains kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that, given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data.
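The weighting idea described above, combining reference speakers' articulatory models in proportion to their acoustic similarity to the target, can be sketched as follows. This is a minimal illustration only: the similarity measure, feature vectors, and model representation here are placeholders, not the paper's actual PRSW formulation.

```python
import math

def acoustic_similarity(target_feats, ref_feats):
    # Hypothetical similarity: inverse Euclidean distance between mean
    # acoustic feature vectors (the real system's measure may differ).
    dist = math.sqrt(sum((t - r) ** 2 for t, r in zip(target_feats, ref_feats)))
    return 1.0 / (1.0 + dist)

def prsw_weights(target_feats, reference_speakers):
    # Normalize similarities so the per-speaker weights sum to 1.
    sims = [acoustic_similarity(target_feats, s["acoustic"])
            for s in reference_speakers]
    total = sum(sims)
    return [s / total for s in sims]

def adapted_model(target_feats, reference_speakers):
    # Weighted average of the reference speakers' articulatory model
    # parameters, standing in for the paper's model adaptation step.
    w = prsw_weights(target_feats, reference_speakers)
    n_params = len(reference_speakers[0]["articulatory"])
    return [sum(wi * s["articulatory"][j]
                for wi, s in zip(w, reference_speakers))
            for j in range(n_params)]
```

Acoustically closer reference speakers thus contribute more to the adapted articulatory model, which is the hypothesis the paper tests.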
Vowel Recognition from Articulatory Position Time-Series Data
A new approach to recognizing vowels from articulatory position time-series data is proposed and tested in this paper. The approach maps articulatory position time-series data directly to vowels, without extracting articulatory features such as mouth opening. The input time series were time-normalized and sampled to fixed-width vectors of articulatory positions. Three commonly used classifiers (Neural Network, Support Vector Machine, and Decision Tree) were applied, and their performances on these vectors were compared. A single-speaker dataset of eight major English vowels, acquired using an Electromagnetic Articulograph (EMA) AG500, was used. Recognition rates under cross-validation ranged from 76.07% to 91.32% across the three classifiers. In addition, the trained decision trees were consistent with the articulatory features commonly used to distinguish vowels descriptively in classical phonetics. The findings are intended to improve the accuracy and response time of a real-time articulatory-to-acoustics synthesizer.
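The time-normalization step, resampling a variable-length articulatory trajectory to a fixed-width vector so that every classifier input has the same dimensionality, might look like the following minimal sketch. Linear interpolation is an assumption here; the paper does not specify the resampling method.

```python
def time_normalize(trajectory, width=10):
    # Resample a 1-D articulatory position trajectory of arbitrary
    # length to exactly `width` samples by linear interpolation.
    n = len(trajectory)
    if n == 1:
        return [trajectory[0]] * width
    out = []
    for k in range(width):
        pos = k * (n - 1) / (width - 1)   # fractional index into input
        i = int(pos)
        frac = pos - i
        if i + 1 < n:
            out.append(trajectory[i] * (1 - frac) + trajectory[i + 1] * frac)
        else:
            out.append(trajectory[i])     # last sample, no neighbor right
    return out
```

Fixed-width vectors produced this way (one per articulator coordinate, concatenated) can then be fed to any standard classifier.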
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Speechreading, or lipreading, is the technique of understanding speech and extracting phonetic features from a speaker's visual cues, such as the movements of the lips, face, teeth, and tongue. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aids for people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaking user are often available, these multiple video feeds have not been exploited to handle different poses. To this end, this paper presents the world's first multi-view speechreading and reconstruction system. This work pushes the boundaries of multimedia research by putting forth a model that leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speechreading and reconstruction system. The paper further identifies the optimal placement of cameras for maximum intelligibility of the reconstructed speech. Finally, it lays out various innovative applications of the proposed system, focusing on its potential impact not just in the security arena but in many other multimedia analytics problems.
Comment: 2018 ACM Multimedia Conference (MM '18), October 22-26, 2018, Seoul, Republic of Korea
Evaluating a Markerless Method for Studying Articulatory Movements: Application to a Syllable Repetition Task
The analysis of articulatory movements allows investigating the kinematic characteristics of some speech disorders. However, the methodologies most used until now, such as electromagnetic articulography and optoelectronic systems, are expensive and intrusive, which limits their use to specialized laboratories. In this work, we use a completely markerless and low-cost technique to study lip movements during a syllable repetition task. By means of a Kinect-like sensor and an existing face-tracking algorithm, we are able to track the movements of the lower lip, testing the performance against a reference method (a marker-based optoelectronic system). Good results were obtained in terms of RMSE for the tracking of the lower lip during the repetitions. Some kinematic measures, such as opening and closing velocities and accelerations, were also computed. Despite the limitations in terms of image resolution, these results are very promising with a view to developing a new markerless system for studying speech articulation. (Bandini A.; Ouni S.; Orlandi S.; Manfredi C.)
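Comparing a markerless lip trajectory against a marker-based reference via RMSE, and estimating opening/closing velocities by finite differences, can be sketched as below. Synchronized, equally sampled trajectories in the same spatial units are assumed; the study's exact alignment procedure is not specified.

```python
def rmse(markerless, reference):
    # Root-mean-square error between a markerless lip trajectory and a
    # marker-based reference trajectory (same length, same units).
    assert len(markerless) == len(reference)
    return (sum((a - b) ** 2 for a, b in zip(markerless, reference))
            / len(markerless)) ** 0.5

def velocity(trajectory, dt):
    # Central-difference estimate of lip velocity at interior samples;
    # dt is the sampling interval in seconds.
    return [(trajectory[i + 1] - trajectory[i - 1]) / (2 * dt)
            for i in range(1, len(trajectory) - 1)]
```

Accelerations follow by applying the same finite-difference operator to the velocity sequence.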
Correlation between the cephalometric measurements and acoustic properties of /s/ sound in Turkish
Objectives: To evaluate the acoustic properties of the /s/ sound in individuals with different occlusion types and to investigate relationships between these properties and cephalometric measurements. Methodology: Sixty patients were divided into three groups based on malocclusion. Group 1 included 20 patients (mean age: 14.85±2.01 years) with Class I skeletal and dental relationships. Group 2 included 20 patients (mean age: 13.49±1.78 years) with Class II skeletal and dental relationships. Group 3 included 20 patients (mean age: 12.46±2.62 years) with Class III skeletal and dental relationships. Cephalometric tracings were obtained from cephalometric radiographs. All included patients were native speakers of Turkish. The /s/ sound was selected for center-of-gravity analysis. Correlations between cephalometric values and acoustic parameters were also investigated. Results: The center of gravity of the /s/ sound had the lowest value in Group 2 (p<0.05). For the /s/ sound in Group 3, moderate positive correlations were found between the center of gravity and the Sella-Nasion to Gonion-Gnathion angle (p<0.05, r=0.444), Lower incisor to Nasion-B point (p<0.023, r=0.505), and Lower incisor to Nasion-B point angle (p<0.034, r=0.476). No correlations were found for the other cephalometric measurements. Conclusions: The /s/ sound was affected by malocclusion due to the changed place of articulation. Therefore, referral to an orthodontist for malocclusion treatment, especially for patients with Class III malocclusion at an early age, is suggested to support production of an acoustically ideal /s/ sound.
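The spectral center of gravity used in this analysis is the amplitude-weighted mean frequency of the sound's spectrum. A minimal sketch, assuming the magnitude spectrum of an /s/ segment has already been computed (e.g., by an FFT); the study's exact weighting (e.g., power vs. amplitude) is not specified here:

```python
def center_of_gravity(magnitudes, freqs):
    # Amplitude-weighted mean frequency of a spectrum: the "center of
    # gravity" measure commonly used for sibilants such as /s/.
    # `magnitudes[i]` is the spectral magnitude at frequency `freqs[i]` (Hz).
    total = sum(magnitudes)
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total
```

A lower center of gravity, as found for the Class II group, indicates spectral energy concentrated at lower frequencies.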
Lip2AudSpec: Speech reconstruction from silent lip movements video
In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip-movement videos. We use the auditory spectrogram as the spectral representation of speech, together with its corresponding sound-generation method, resulting in more natural-sounding reconstructed speech. Our proposed network consists of an autoencoder that extracts bottleneck features from the auditory spectrogram, which are then used as targets for our main lip-reading network comprising CNN, LSTM, and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with 98% correlation and also improves the quality of speech reconstructed by the main lip-reading network. Our model, trained jointly on different speakers, is able to extract individual speaker characteristics and gives promising results in reconstructing intelligible speech with superior word-recognition accuracy.
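The reported 98% figure can be read as a Pearson correlation between original and reconstructed auditory spectrograms. A minimal sketch, assuming spectrograms are given as lists of frames and are compared after flattening (the paper's exact evaluation protocol may differ):

```python
def pearson(x, y):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spectrogram_correlation(original, reconstructed):
    # Flatten 2-D spectrograms (lists of frames) and correlate them.
    a = [v for frame in original for v in frame]
    b = [v for frame in reconstructed for v in frame]
    return pearson(a, b)
```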
Kinematic and correlational analyses on labial and lingual functions during syllable repetitions in Cantonese dysarthric speakers with Parkinson's disease of varying severity using electromagnetic articulography (EMA)
Articulatory imprecision in patients with Parkinson's disease and hypokinetic dysarthria has been attributed to articulatory undershooting. However, contradictory acoustic and instrumental results have been reported in the literature over the years. The present study aimed to investigate labial and lingual kinematics in dysarthric Cantonese speakers with Parkinson's disease (PD) of varying dysarthria severity during rapid syllable repetitions, and compared the measures with those of healthy age-matched controls using three-dimensional Electromagnetic Articulography (EMA). Dysarthria severity was also correlated with labial and lingual kinematics. Tongue tip, tongue back, upper lip, lower lip, and jaw motion in five PD and six normal participants during repetitions of /pa/, /ta/ and /ka/ were recorded. Participants were also rated perceptually on their dysarthria severity. Compared with the normal group, the PD group showed reduced velocity of lingual movement and reduced distance travelled and velocity of labial movements. Correlational analysis between dysarthria severity and kinematic data revealed a positive correlation for the duration of lingual movement. Negative correlations were identified for the velocity and rate of lingual movement, and for the distance travelled and velocity of labial movement. The present results support the hypothesis of articulatory undershooting as a contributing factor in the articulatory imprecision of hypokinetic dysarthria, while tongue and lip tremor might also contribute to consonant imprecision. A possible differential effect of dopamine deficiency on the different cranial nerves is hypothesized.
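Correlating perceptual severity ratings with kinematic measures is naturally done with a rank correlation, since severity ratings are ordinal. A minimal sketch using the classic Spearman d-squared formula; the no-ties assumption and the choice of statistic are mine, as the study does not specify which correlation it used.

```python
def ranks(values):
    # Rank from 1 (smallest) upward; assumes no tied values for simplicity.
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank_pos, idx in enumerate(order, start=1):
        r[idx] = rank_pos
    return r

def spearman(x, y):
    # Spearman rank correlation via rho = 1 - 6 * sum(d^2) / (n (n^2 - 1)),
    # valid when there are no ties.
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A positive rho would mean a kinematic measure (e.g., movement duration) grows with severity; a negative rho (e.g., for velocity) is consistent with undershooting.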