Field Operation Planning for Agricultural Vehicles: A Hierarchical Modeling Framework
Rosana G. Moreira, Editor-in-Chief, Texas A&M University. This is a paper from the International Commission of Agricultural Engineering (CIGR, Commission Internationale du Génie Rural) E-Journal, Volume 9 (2007): Field Operation Planning for Agricultural Vehicles: A Hierarchical Modeling Framework. Manuscript PM 06 021. Vol. IX. February 2007.
Video-driven speech reconstruction using generative adversarial networks
Speech is a means of communication which relies on both audio and visual information. The absence of one modality can often lead to confusion or misinterpretation of information. In this paper we present an end-to-end temporal model capable of directly synthesising audio from silent video, without needing to transform to and from intermediate features. Our proposed approach, based on GANs, is capable of producing natural-sounding, intelligible speech which is synchronised with the video. The performance of our model is evaluated on the GRID dataset for both speaker-dependent and speaker-independent scenarios. To the best of our knowledge, this is the first method that maps video directly to raw audio and the first to produce intelligible speech when tested on previously unseen speakers. We evaluate the synthesised audio not only on sound quality but also on the accuracy of the spoken words.
Optimal Dynamic Motion Sequence Generation for Multiple Harvesters
Rosana G. Moreira, Editor-in-Chief, Texas A&M University. This is a paper from the International Commission of Agricultural Engineering (CIGR, Commission Internationale du Génie Rural) E-Journal, Volume 9 (2007): Optimal Dynamic Motion Sequence Generation for Multiple Harvesters. Manuscript ATOE 07 001. Vol. IX. July 2007.
Realistic speech-driven facial animation with GANs
Speech-driven facial animation is the process that automatically synthesizes talking characters based on speech signals. The majority of work in this domain creates a mapping from audio features to visual features. This approach often requires post-processing using computer graphics techniques to produce realistic, albeit subject-dependent, results. We present an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features. Our method generates videos which have (a) lip movements that are in sync with the audio and (b) natural facial expressions such as blinks and eyebrow movements. Our temporal GAN uses 3 discriminators focused on achieving detailed frames, audio-visual synchronization, and realistic expressions. We quantify the contribution of each component in our model using an ablation study and we provide insights into the latent representation of the model. The generated videos are evaluated on sharpness, reconstruction quality, lip-reading accuracy, and synchronization, as well as their ability to generate natural blinks.
Speech-driven facial animations improve speech-in-noise comprehension of humans
Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker's face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have made it possible to synthesize photorealistic talking faces from a speech recording and a still image of a person's face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN), and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although the natural facial motions yield an even higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments.
Mucopolysaccharidosis VI
Mucopolysaccharidosis VI (MPS VI) is a lysosomal storage disease with progressive multisystem involvement, associated with a deficiency of arylsulfatase B leading to the accumulation of dermatan sulfate. Birth prevalence is between 1 in 43,261 and 1 in 1,505,160 live births. The disorder shows a wide spectrum of symptoms from slowly to rapidly progressing forms. The characteristic skeletal dysplasia includes short stature, dysostosis multiplex and degenerative joint disease. Rapidly progressing forms may have onset from birth, elevated urinary glycosaminoglycans (generally >100 μg/mg creatinine), severe dysostosis multiplex, short stature, and death before the 2nd or 3rd decade. A more slowly progressing form has been described as having later onset, mildly elevated glycosaminoglycans (generally <100 μg/mg creatinine), mild dysostosis multiplex, and death in the 4th or 5th decade. Other clinical findings may include cardiac valve disease, reduced pulmonary function, hepatosplenomegaly, sinusitis, otitis media, hearing loss, sleep apnea, corneal clouding, carpal tunnel disease, and inguinal or umbilical hernia. Although intellectual deficit is generally absent in MPS VI, central nervous system findings may include cervical cord compression caused by cervical spinal instability, meningeal thickening and/or bony stenosis, communicating hydrocephalus, optic nerve atrophy and blindness. The disorder is transmitted in an autosomal recessive manner and is caused by mutations in the ARSB gene, located on chromosome 5 (5q13-5q14). Over 130 ARSB mutations have been reported, causing absent or reduced arylsulfatase B (N-acetylgalactosamine 4-sulfatase) activity and interrupted dermatan sulfate and chondroitin sulfate degradation.
Diagnosis generally requires evidence of the clinical phenotype, arylsulfatase B enzyme activity <10% of the lower limit of normal in cultured fibroblasts or isolated leukocytes, and demonstration of normal activity of a different sulfatase enzyme (to exclude multiple sulfatase deficiency). The finding of elevated urinary dermatan sulfate with the absence of heparan sulfate is supportive. In addition to multiple sulfatase deficiency, the differential diagnosis should also include other forms of MPS (MPS I, II, IVA, VII), sialidosis and mucolipidosis. Before enzyme replacement therapy (ERT) with galsulfase (Naglazyme®), clinical management was limited to supportive care and hematopoietic stem cell transplantation. Galsulfase is now widely available and is a specific therapy providing improved endurance with an acceptable safety profile. Prognosis is variable depending on the age of onset, rate of disease progression, age at initiation of ERT, and the quality of the medical care provided.
Row-sensing templates: A generic 3D sensor-based approach to robot localization with respect to orchard row centerlines
Accurate robot localization relative to orchard row centerlines is essential for autonomous guidance where satellite signals are often obstructed by foliage. Existing sensor-based approaches rely on various features extracted from images and point clouds. However, no selected features are available consistently, because the visual and geometrical characteristics of orchard rows change drastically when tree types, growth stages, canopy management practices, seasons, and weather conditions change. In this study, we introduce a novel localization method that does not rely on features; instead, it relies on the concept of a row-sensing template, which is the expected observation of a 3D sensor traveling in an orchard row when the sensor is anywhere on the centerline and perfectly aligned with it. First, the template is built from a few measurements, provided that the sensor's true pose with respect to the centerline is available. Then, during navigation, the best pose estimate (and its confidence) is computed by maximizing the match between the template and the sensed point cloud using particle filtering. The method can adapt to various orchards and conditions by rebuilding the template. Experiments were performed in a vineyard and in an orchard in different seasons. Results showed that the lateral mean absolute error (MAE) was less than 3.6% of the row width, and the heading MAE was less than 1.72°. Localization was robust, as errors did not increase even when up to 75% of the measurement points were missing. The results indicate that template-based localization can provide a generic approach for accurate and robust localization in real-world orchards.
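The core idea of the abstract above, matching a pose-dependent expected observation ("template") against sensed data with a particle filter, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the rows are idealised as two parallel lines, the "template" is the set of ranges a planar beam sensor would measure from a given lateral offset and heading, and a single batch of weighted pose hypotheses stands in for a full particle filter with motion updates.

```python
import math
import random

ROW_HALF_WIDTH = 1.5  # assumed half row-width in metres (illustrative value)
BEAM_ANGLES = [-1.2, -0.9, -0.6, -0.3, 0.3, 0.6, 0.9, 1.2]  # beam angles (rad) rel. to sensor axis

def expected_ranges(lateral, heading):
    """Row-sensing 'template': the range each beam would measure to the
    row lines x = +/-ROW_HALF_WIDTH for a sensor at offset `lateral`
    from the centerline, rotated by `heading` radians."""
    ranges = []
    for a in BEAM_ANGLES:
        s = math.sin(a + heading)
        wall = ROW_HALF_WIDTH if s > 0 else -ROW_HALF_WIDTH
        ranges.append((wall - lateral) / s)  # positive range to the hit row line
    return ranges

def localize(observed, n_particles=3000, sigma=0.1, seed=1):
    """Score random pose hypotheses by how well their template matches the
    observed ranges; return the likelihood-weighted mean pose."""
    rng = random.Random(seed)
    total_w = est_lat = est_head = 0.0
    for _ in range(n_particles):
        lat = rng.uniform(-1.0, 1.0)
        head = rng.uniform(-0.3, 0.3)
        err = sum((o - e) ** 2
                  for o, e in zip(observed, expected_ranges(lat, head)))
        w = math.exp(-err / (2 * sigma ** 2))  # Gaussian measurement likelihood
        total_w += w
        est_lat += w * lat
        est_head += w * head
    return est_lat / total_w, est_head / total_w

# Simulate a sensor 0.4 m right of the centerline, yawed 0.05 rad (noise-free for brevity).
lat, head = localize(expected_ranges(0.4, 0.05))
```

With a sharp likelihood the weighted mean concentrates on the hypotheses whose templates best match the observation, which is the "maximize the match" step described in the abstract; the real method does this against full 3D point clouds and propagates particles over time.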
Decomposition of Agricultural Tasks into Robotic Behaviours
Rosana G. Moreira, Editor-in-Chief; Texas A&M UniversityThis is a paper from International Commission of Agricultural Engineering (CIGR, Commission Internationale du Genie Rural) E-Journal Volume 9 (2007): Decomposition of Agricultural Tasks into Robotic Behaviours. Manuscript PM 07 006. Vol. IX. October, 2007
End-to-End Speech-Driven Facial Animation with Temporal GANs
Speech-driven facial animation is the process that uses speech signals to automatically synthesize a talking character. The majority of work in this domain creates a mapping from audio features to visual features. This often requires post-processing using computer graphics techniques to produce realistic, albeit subject-dependent, results. We present a system for generating videos of a talking head, using a still image of a person and an audio clip containing speech, that does not rely on any handcrafted intermediate features. To the best of our knowledge, this is the first method capable of generating subject-independent realistic videos directly from raw audio. Our method can generate videos which have (a) lip movements that are in sync with the audio and (b) natural facial expressions such as blinks and eyebrow movements. We achieve this by using a temporal GAN with 2 discriminators, which are capable of capturing different aspects of the video. The effect of each component in our system is quantified through an ablation study. The generated videos are evaluated on their sharpness, reconstruction quality, and lip-reading accuracy. Finally, a user study is conducted, confirming that temporal GANs lead to more natural sequences than a static GAN-based approach.
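The two-discriminator arrangement described above, one critic judging individual frames and one judging the audio-video sequence jointly, can be sketched in PyTorch. This is not the paper's architecture: the tiny linear/GRU networks, the feature dimensions, and the Wasserstein-style scores are all placeholder assumptions chosen only to show how two critics feed one generator loss.

```python
import torch
import torch.nn as nn

# Illustrative shapes (assumptions): a "video" is (batch, frames, features),
# and the audio is encoded to the same (batch, frames, features) layout.
B, T, F = 2, 8, 16

generator = nn.GRU(input_size=F, hidden_size=F, batch_first=True)  # audio -> frame features
frame_disc = nn.Sequential(nn.Linear(F, 32), nn.ReLU(), nn.Linear(32, 1))  # judges single frames
seq_disc = nn.GRU(input_size=2 * F, hidden_size=32, batch_first=True)  # judges audio-visual pairs
seq_head = nn.Linear(32, 1)

audio = torch.randn(B, T, F)
fake_video, _ = generator(audio)

# Frame discriminator scores each frame independently (frame realism / sharpness).
frame_score = frame_disc(fake_video.reshape(B * T, F)).mean()

# Sequence discriminator sees audio and video together over time (synchronisation,
# temporal coherence); the final GRU state summarises the whole sequence.
_, h = seq_disc(torch.cat([audio, fake_video], dim=-1))
sync_score = seq_head(h[-1]).mean()

# The generator is trained to fool both critics at once, so the two aspects
# (per-frame realism and audio-visual sync) are optimised jointly.
gen_loss = -(frame_score + sync_score)
```

In training, each discriminator would also be updated on real/fake pairs with its own loss; the point of the sketch is only that the generator's objective sums the scores of critics that look at different aspects of the output.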