
    Beyond the edge: Markerless pose estimation of speech articulators from ultrasound and camera images using DeepLabCut

    Automatic feature extraction from images of speech articulators is currently achieved by detecting edges. Here, we investigate the use of pose estimation deep neural nets with transfer learning to perform markerless estimation of speech articulator keypoints, using only a few hundred hand-labelled images as training input. Midsagittal ultrasound images of the tongue, jaw, and hyoid and camera images of the lips were hand-labelled with keypoints; models were trained using DeepLabCut and evaluated on unseen speakers and systems. Tongue surface contours interpolated from estimated and hand-labelled keypoints produced an average mean sum of distances (MSD) of 0.93, s.d. 0.46 mm, compared with 0.96, s.d. 0.39 mm, for two human labellers, and 2.3, s.d. 1.5 mm, for the best-performing edge detection algorithm. A pilot set of simultaneous electromagnetic articulography (EMA) and ultrasound recordings demonstrated partial correlation between three physical sensor positions and the corresponding estimated keypoints, which requires further investigation. The accuracy of estimating lip aperture from camera video was high, with a mean MSD of 0.70, s.d. 0.56 mm, compared with 0.57, s.d. 0.48 mm, for two human labellers. DeepLabCut was found to be a fast, accurate, and fully automatic method of providing unique kinematic data for the tongue, hyoid, jaw, and lips. https://doi.org/10.3390/s22031133
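    The mean sum of distances (MSD) metric reported above can be understood as a symmetric nearest-neighbour distance between two contours. A minimal sketch follows; the function name and symmetrization details are assumptions for illustration, not the paper's actual implementation:

```python
# Illustrative sketch of a mean sum of distances (MSD) metric between an
# estimated contour and a hand-labelled one: for each point on one curve,
# take the distance to the nearest point on the other, then average the
# two directed means. Names and details are assumptions, not the paper's code.
import numpy as np

def mean_sum_of_distances(contour_a, contour_b):
    """Symmetric mean nearest-neighbour distance between two (N, 2) point sets."""
    a = np.asarray(contour_a, dtype=float)
    b = np.asarray(contour_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Average the two directed mean nearest-neighbour distances.
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

    With this convention, two identical contours score 0, and two parallel horizontal lines 1 mm apart score 1.0 mm.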

    Reproducibility of extra-oral tongue motion in a normal cohort using a 3D motion capture system

    Aim: The aim of this study was to determine the magnitude of reproducible tongue motion in a normal cohort of healthy individuals using a 3D motion capture system. Design: Single-centre, case-controlled study. Materials and methods: Thirty volunteers, comprising 15 female and 15 male staff and students at Birmingham Dental Hospital, were recruited, with an age range of 21 to 44 years and a mean age of 27.5 years. Volunteers had to meet inclusion and exclusion criteria, namely to be aged 18-60 and medically fit and well. Subjects were imaged using a markerless, high-fidelity 3D facial motion capture system. Two sets of motions were captured per subject, up-down and right-left tongue movement, at two different time points, T1 and T2, at least 30 minutes apart. Following capture, all 3D images were re-orientated to the principal planes. Four stabilising landmarks were placed on the forehead, and one tracked landmark on the tip of the tongue was used. T1 and T2 sequences were superimposed onto one another, and dynamic time warping was used to account for variability in speed. Mean and absolute mean differences between the maximum tongue tip positions in the x, y and z directions were calculated for each of the two movements at both time points and subsequently analysed. Results: The up-and-down range of motion (ROM) of the tongue was 48.3 ± 10.0mm (95% CI 45.7mm to 51.0mm) and the right-to-left ROM was 66.4 ± 7.8mm (95% CI 63.4mm to 69.4mm). Based on a paired t-test, the mean displacement of the tongue tip in the x, y and z-directions was not statistically significantly different between T1 and T2 for any of the tongue movements. The mean absolute differences of the tongue tip in the x, y and z-directions, at T1 and T2, were all statistically significantly less than 5.0mm, apart from during tongue tip elevation in the z-direction, which was not statistically significantly different from 5.0mm.
Conclusions: This study has shown that ROM of the tongue is reproducible in the x, y and z-directions for right-to-left tongue movement, with all differences in mean and mean absolute measurements being less than 5.0mm. However, for up-and-down tongue movement, whilst ROM of the tongue was reproducible in the x, y and z-directions, the upper limit of the 95% confidence interval for the mean absolute difference was greater than 5.0mm when the tongue was in its most elevated position in the z-direction. The average ROM in the up-and-down direction was 48.3mm and from right to left was 66.4mm. For interventional studies, differences in tongue tip position in the order of 6-7mm are likely to be due to a lack of reproducibility rather than a treatment effect. This should be taken into account when designing future studies. It is vital to rehearse tongue motion to reduce the magnitude of the reproducibility error.
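The dynamic time warping step mentioned in the methods aligns the T1 and T2 trajectories so that differences in movement speed do not inflate positional differences. A minimal sketch of classic DTW on a 1-D trajectory follows; the function name and local cost are illustrative assumptions, not the study's implementation:

```python
# Minimal dynamic time warping (DTW) sketch for aligning two 1-D tongue-tip
# trajectories sampled at different speeds. Classic O(n*m) dynamic program
# with absolute difference as the local cost. Illustrative only.
import numpy as np

def dtw_distance(s, t):
    """Cumulative DTW alignment cost between sequences s and t."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

A trajectory compared with a slowed-down copy of itself (repeated samples) yields zero cost, which is exactly the speed-invariance the study relies on before comparing positions.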

    Fully-automated tongue detection in ultrasound images

    Tracking the tongue in ultrasound images provides information about its shape and kinematics during speech. In this thesis, we propose engineering solutions to better exploit existing frameworks and deploy them to convert a semi-automatic tongue contour tracking system into a fully automatic one. Current methods for detecting/tracking the tongue require manual initialization or training using large amounts of labeled images. This work introduces a new method for extracting tongue contours in ultrasound images that requires no training or manual intervention. The method consists of: (1) application of a phase symmetry filter to highlight regions possibly containing the tongue contour; (2) adaptive thresholding and rank ordering of grayscale intensities to select regions that include or are near the tongue contour; (3) skeletonization of these regions to extract a curve close to the tongue contour; and (4) initialization of an accurate active contour from this curve. Two novel quality measures were also developed that predict the reliability of the method, so that optimal frames can be chosen to confidently initialize fully automated tongue tracking. This is achieved by automatically generating and choosing a set of points that can replace the manually segmented points of a semi-automated tracking approach. To improve the accuracy of tracking, this work also incorporates two criteria to reset the tracking from time to time, so that the entire tracking result does not depend on human refinements. Experiments were run on 16 free-speech ultrasound recordings from healthy subjects and subjects with articulatory impairments due to Steinert's disease. Fully automated and semi-automated methods result in mean sum of distances errors of 1.01mm ± 0.57mm and 1.05mm ± 0.63mm, respectively, showing that the proposed automatic initialization does not significantly alter accuracy.
Moreover, the experiments show that accuracy would improve with the proposed re-initialization (mean sum of distances error of 0.63mm ± 0.35mm).
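The rank-ordering and curve-extraction idea in steps (2)-(3) of the pipeline above can be sketched as follows. This is a loose illustration under strong assumptions: the phase symmetry filter of step (1) and the active contour of step (4) are omitted, and proper skeletonization is replaced by a per-column brightest-pixel pick, which only approximates a skeleton for a mostly horizontal contour. All names are hypothetical:

```python
# Loose sketch of steps (2)-(3): rank-order the grayscale intensities of an
# already-filtered image, keep only the brightest fraction, then reduce the
# retained pixels to a thin candidate curve by taking the brightest retained
# pixel in each column (a stand-in for skeletonization; illustrative only).
import numpy as np

def extract_contour_points(filtered, keep_fraction=0.05):
    """Return (row, col) candidate contour points from a filtered image."""
    # Adaptive threshold by rank order: keep the top `keep_fraction` intensities.
    thresh = np.quantile(filtered, 1.0 - keep_fraction)
    mask = filtered >= thresh
    points = []
    for col in range(filtered.shape[1]):
        rows = np.flatnonzero(mask[:, col])
        if rows.size:
            # Among retained pixels in this column, keep the brightest one.
            best = rows[np.argmax(filtered[rows, col])]
            points.append((int(best), col))
    return points
```

In the actual pipeline, a curve like this would seed the active contour of step (4) rather than serve as the final result.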