
    Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion

    Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer-aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches to inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and only a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and uses acoustically derived weights to build speaker-adapted articulatory models. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset of speakers with strong individual speaker-dependent inversion performance, the PRSW method attains kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that, given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data.
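    As a rough illustration of the PRSW idea described above, the sketch below converts acoustic distances between a target speaker and each reference speaker into normalized weights, then forms the target's articulatory model as a weighted combination of the reference speakers' models. The softmax-over-negative-distance weighting, the use of plain feature means, and all names are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def prsw_weights(target_feats, reference_feats, temperature=1.0):
    """Turn acoustic distances between the target speaker and each reference
    speaker into normalized adaptation weights (illustrative scheme only)."""
    # Represent each speaker by the mean of their acoustic feature frames.
    target_mean = target_feats.mean(axis=0)
    dists = np.array([np.linalg.norm(target_mean - r.mean(axis=0))
                      for r in reference_feats])
    weights = np.exp(-dists / temperature)
    return weights / weights.sum()

def adapt_articulatory_model(weights, reference_models):
    """Weighted combination of reference speakers' articulatory model
    parameters (assumes each model is expressed as a parameter vector)."""
    reference_models = np.stack(reference_models)   # (n_refs, n_params)
    return weights @ reference_models               # (n_params,)

# Toy usage: 3 reference speakers, 12-dim acoustic features, 40-dim models.
rng = np.random.default_rng(0)
target = rng.normal(size=(200, 12))
refs_feats = [rng.normal(size=(200, 12)) for _ in range(3)]
refs_models = [rng.normal(size=40) for _ in range(3)]
w = prsw_weights(target, refs_feats)
adapted = adapt_articulatory_model(w, refs_models)
```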

    Vowel Recognition from Articulatory Position Time-Series Data

    A new approach to recognizing vowels from articulatory position time-series data was proposed and tested in this paper. This approach directly mapped articulatory position time-series data to vowels without extracting articulatory features such as mouth opening. The input time-series data were time-normalized and sampled to fixed-width vectors of articulatory positions. Three commonly used classifiers (Neural Network, Support Vector Machine, and Decision Tree) were applied to these vectors and their performances were compared. A single-speaker dataset of eight major English vowels acquired using an Electromagnetic Articulograph (EMA) AG500 was used. Recognition rates using cross-validation ranged from 76.07% to 91.32% across the three classifiers. In addition, the trained decision trees were consistent with articulatory features commonly used to descriptively distinguish vowels in classical phonetics. The findings are intended to improve the accuracy and response time of a real-time articulatory-to-acoustics synthesizer.
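    A minimal sketch of the described pipeline, assuming scikit-learn and toy data: variable-length articulatory trajectories are time-normalized to fixed-width vectors, and the three classifiers are compared with cross-validation. The resampling length, classifier settings, and synthetic data are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def to_fixed_width(trajectory, n_samples=10):
    """Time-normalize a (frames, channels) articulatory trajectory by linear
    interpolation to n_samples frames, then flatten to a single vector."""
    frames, channels = trajectory.shape
    old_t = np.linspace(0.0, 1.0, frames)
    new_t = np.linspace(0.0, 1.0, n_samples)
    resampled = np.column_stack(
        [np.interp(new_t, old_t, trajectory[:, c]) for c in range(channels)]
    )
    return resampled.ravel()

# Toy data: 80 vowel tokens of variable length, 6 articulatory channels, 8 classes.
rng = np.random.default_rng(0)
X = np.array([to_fixed_width(rng.normal(size=(rng.integers(20, 60), 6)))
              for _ in range(80)])
y = np.repeat(np.arange(8), 10)   # balanced labels for stratified CV

classifiers = {
    "neural net": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000),
    "SVM": SVC(kernel="rbf"),
    "decision tree": DecisionTreeClassifier(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.2%} cross-validated accuracy")
```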

    Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

    Speechreading, or lipreading, is the technique of understanding speech and extracting phonetic features from a speaker's visual cues, such as the movement of the lips, face, teeth, and tongue. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and serving as an aid for people with hearing impairments. However, most of the work in speechreading has been limited to text generation from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaker are often available, they have not been exploited to handle these different poses. To this end, this paper presents the world's first multi-view speech reading and reconstruction system. This work pushes the boundaries of multimedia research by putting forth a model that leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speech reading and reconstruction system. The work further shows the optimal placement of cameras that leads to the maximum intelligibility of speech. Next, it lays out various innovative applications for the proposed system, focusing on its potential prodigious impact not just in the security arena but in many other multimedia analytics problems. Comment: 2018 ACM Multimedia Conference (MM '18), October 22-26, 2018, Seoul, Republic of Korea.
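    The abstract does not detail the model, so the following is only a hedged sketch of the general multi-view idea: per-view visual features are encoded separately, concatenated, and decoded into a spectral representation of speech. All layer choices, dimensions, and the simple concatenation fusion are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultiViewSpeechReconstructor(nn.Module):
    """Illustrative fusion of per-view lip-region features: each camera view
    is encoded separately, the encodings are concatenated, and a decoder
    predicts a spectral representation of speech frame by frame."""
    def __init__(self, n_views=2, feat_dim=128, hidden=256, spec_bins=80):
        super().__init__()
        self.view_encoders = nn.ModuleList(
            [nn.GRU(feat_dim, hidden, batch_first=True) for _ in range(n_views)]
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_views * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, spec_bins),
        )

    def forward(self, views):
        # views: list of (batch, time, feat_dim) tensors, one per camera.
        encoded = [enc(v)[0] for enc, v in zip(self.view_encoders, views)]
        fused = torch.cat(encoded, dim=-1)   # (batch, time, n_views * hidden)
        return self.decoder(fused)           # (batch, time, spec_bins)

# Toy forward pass with two synthetic camera views.
model = MultiViewSpeechReconstructor()
views = [torch.randn(4, 50, 128) for _ in range(2)]
spec = model(views)   # (4, 50, 80)
```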

    Evaluating a Markerless Method for Studying Articulatory Movements: Application to a Syllable Repetition Task

    The analysis of articulatory movements allows investigating the kinematic characteristics of some speech disorders. However, the methodologies most used until now, such as electromagnetic articulography and optoelectronic systems, are expensive and intrusive, which limits their use to specialized laboratories. In this work, we use a completely markerless and low-cost technique to study lip movements during a syllable repetition task. By means of a Kinect-like device and an existing face-tracking algorithm, we are able to track the movements of the lower lip, testing the performance against a reference method (a marker-based optoelectronic system). Good results were obtained in terms of RMSE for the tracking of the lower lip during the repetitions. Some kinematic measures, such as opening and closing velocities and accelerations, were also computed. Despite the limitations in terms of image resolution, these results are very promising with a view to developing a new markerless system for studying speech articulation.
    Bandini A.; Ouni S.; Orlandi S.; Manfredi C.
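    A small sketch of how the reported evaluation quantities could be computed: RMSE between the markerless and marker-based lower-lip trajectories, plus velocities and accelerations obtained by numerical differentiation. The sampling rate and synthetic trajectories are assumptions used only to make the example runnable, not the study's data.

```python
import numpy as np

def rmse(markerless, reference):
    """Root-mean-square error between two aligned lower-lip trajectories (mm)."""
    return np.sqrt(np.mean((np.asarray(markerless) - np.asarray(reference)) ** 2))

def kinematics(position, fs):
    """Velocity and acceleration of a 1-D lip trajectory via finite differences.
    `position` in mm, `fs` in Hz (assumed units, for illustration only)."""
    velocity = np.gradient(position, 1.0 / fs)       # mm/s
    acceleration = np.gradient(velocity, 1.0 / fs)   # mm/s^2
    return velocity, acceleration

# Toy example: a synthetic 2 Hz opening/closing cycle sampled at 100 Hz.
fs = 100.0
t = np.arange(0, 2.0, 1.0 / fs)
reference = 5.0 * np.sin(2 * np.pi * 2.0 * t)   # marker-based trajectory
markerless = reference + np.random.default_rng(0).normal(0, 0.3, t.size)
vel, acc = kinematics(markerless, fs)
print(f"RMSE: {rmse(markerless, reference):.2f} mm, "
      f"peak opening velocity: {vel.max():.1f} mm/s")
```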

    Correlation between the cephalometric measurements and acoustic properties of /s/ sound in Turkish

    Objectives: To evaluate the acoustic properties of the /s/ sound in individuals with different occlusion types and to investigate relationships between these properties and cephalometric measurements. Methodology: Sixty patients were divided into three groups based on malocclusion. Group 1 included 20 patients (mean age: 14.85±2.01 years) with Class I skeletal and dental relationships. Group 2 included 20 patients (mean age: 13.49±1.78 years) with Class II skeletal and dental relationships. Group 3 included 20 patients (mean age: 12.46±2.62 years) with Class III skeletal and dental relationships. Cephalometric tracings were obtained from cephalometric radiographs. All included patients were native speakers of Turkish. The /s/ sound was selected for center-of-gravity analysis. Correlations between cephalometric values and acoustic parameters were also investigated. Results: The center of gravity of the /s/ sound had the lowest value in Group 2 (p<0.05). For the /s/ sound in Group 3, moderate positive correlations were found between the center of gravity and the Sella-Nasion to Gonion-Gnathion angle (p<0.05, r=0.444), Lower incisor to Nasion-B point (p<0.023, r=0.505), and Lower incisor to Nasion-B point angle (p<0.034, r=0.476). No correlations were found for the other cephalometric measurements. Conclusions: The /s/ sound was affected by malocclusion due to the changing place of articulation. Therefore, referral to an orthodontist for malocclusion treatment, especially for patients with Class III malocclusion, is suggested in the early period to produce an acoustically ideal sound.
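    As a hedged illustration of the two measurements this study relies on, the sketch below computes the spectral center of gravity of an /s/-like segment and a Pearson correlation between an acoustic measure and a cephalometric value. The windowing choice and all numbers are made-up assumptions, not the study's data or exact procedure.

```python
import numpy as np
from scipy.stats import pearsonr

def center_of_gravity(signal, fs):
    """Spectral center of gravity (first spectral moment) of a windowed /s/
    segment, in Hz: amplitude-weighted mean frequency of the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

# Toy /s/-like segment: broadband noise, 44.1 kHz sampling, 100 ms long.
rng = np.random.default_rng(0)
fs = 44100
noise = rng.normal(size=fs // 10)
print(f"CoG: {center_of_gravity(noise, fs):.0f} Hz")

# Correlating an acoustic measure with a cephalometric angle (made-up values).
cog_values = rng.normal(7000, 800, size=20)
sn_gogn_angle = 0.002 * cog_values + rng.normal(0, 1.5, size=20)
r, p = pearsonr(cog_values, sn_gogn_angle)
print(f"r = {r:.3f}, p = {p:.3f}")
```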

    Lip2AudSpec: Speech reconstruction from silent lip movements video

    In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip-movement videos. We use the auditory spectrogram as the spectral representation of speech, together with its corresponding sound generation method, resulting in more natural-sounding reconstructed speech. Our proposed network consists of an autoencoder that extracts bottleneck features from the auditory spectrogram, which are then used as the target for our main lip-reading network comprising CNN, LSTM, and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with 98% correlation and also improves the quality of the speech reconstructed by the main lip-reading network. Our model, trained jointly on different speakers, is able to extract individual speaker characteristics and gives promising results for reconstructing intelligible speech with superior word recognition accuracy.
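    A hedged PyTorch-style sketch of the two-stage structure described above: an autoencoder compresses the auditory spectrogram into bottleneck features, and a CNN + LSTM + fully connected lip-reading network is trained to predict those bottleneck features from video frames. Every layer size, tensor shape, and the one-to-one pairing of video and spectrogram frames are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    """Compresses each auditory-spectrogram frame into bottleneck features."""
    def __init__(self, spec_bins=128, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(spec_bins, 64), nn.ReLU(),
                                     nn.Linear(64, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(),
                                     nn.Linear(64, spec_bins))

    def forward(self, spec):                 # (batch, time, spec_bins)
        z = self.encoder(spec)
        return self.decoder(z), z

class LipReadingNet(nn.Module):
    """CNN per frame + LSTM across time + fully connected output predicting
    the autoencoder's bottleneck features."""
    def __init__(self, bottleneck=32, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, bottleneck)

    def forward(self, frames):               # (batch, time, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.fc(out)                  # (batch, time, bottleneck)

# Toy forward pass: 4 clips, 25 mouth-region frames of 64x64 pixels each.
ae = SpectrogramAutoencoder()
net = LipReadingNet()
recon, targets = ae(torch.randn(4, 25, 128))    # bottleneck targets
pred = net(torch.randn(4, 25, 1, 64, 64))       # predicted bottleneck features
loss = nn.functional.mse_loss(pred, targets.detach())
```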

    Kinematic and correlational analyses on labial and lingual functions during syllable repetitions in Cantonese dysarthric speakers with Parkinson's disease of varying severity using electromagnetic articulography (EMA)

    Articulatory imprecision in patients with Parkinson's disease and hypokinetic dysarthria has been attributed to articulatory undershooting. However, contradictory results from acoustic and instrumental investigations have been reported in the literature throughout the years. The present study aimed to investigate labial and lingual kinematics in dysarthric Cantonese speakers with Parkinson's disease (PD) of varying dysarthria severity during rapid syllable repetitions and compared the measures with those of healthy age-matched controls using 3-dimensional Electromagnetic Articulography (EMA). Dysarthria severity was also correlated with labial and lingual kinematics. Tongue tip, tongue back, upper lip, lower lip, and jaw motion in five PD and six normal participants during repetitions of /pa/, /ta/ and /ka/ were recorded. Participants were also rated perceptually on their dysarthria severity. When compared to the normal group, the PD group showed reduced velocity of lingual movement and reduced distance travelled and velocity of labial movements. Correlational analysis between dysarthria severity and the kinematic data revealed a positive correlation with the duration of lingual movement. Negative correlations were identified for the velocity and rate of lingual movement, and for the distance travelled and velocity of labial movement. The present results support the hypothesis of articulatory undershooting as a contributing factor to articulatory imprecision in hypokinetic dysarthria, while tongue and lip tremor might also cause such consonant imprecision. A possible differential effect of dopamine deficiency on the different cranial nerves has been hypothesized.
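    A sketch of the kind of kinematic and correlational analysis described, under assumed interfaces: distance travelled, peak speed, and duration are computed from an EMA sensor trajectory during syllable repetitions, and peak speed is correlated with perceptual severity ratings using Spearman's rank correlation. The sampling rate and toy trajectories are assumptions, not the study's data.

```python
import numpy as np
from scipy.stats import spearmanr

def movement_measures(coords, fs):
    """Distance travelled (mm), peak speed (mm/s), and duration (s) for an
    EMA sensor trajectory `coords` of shape (frames, 3), sampled at `fs` Hz."""
    steps = np.diff(coords, axis=0)
    step_len = np.linalg.norm(steps, axis=1)
    distance = step_len.sum()
    peak_speed = (step_len * fs).max()
    duration = len(coords) / fs
    return distance, peak_speed, duration

# Toy example: severity ratings vs. peak lingual speed for five speakers.
fs = 200.0
t = np.arange(0, 1.0, 1.0 / fs)
speeds, severities = [], [1, 2, 2, 3, 4]
for sev in severities:
    amp = 8.0 / sev   # assume more severe dysarthria -> smaller movements
    traj = np.column_stack([amp * np.sin(2 * np.pi * 3 * t),
                            np.zeros_like(t), np.zeros_like(t)])
    speeds.append(movement_measures(traj, fs)[1])
rho, p = spearmanr(severities, speeds)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```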