Unconstrained face mask and face-hand interaction datasets: building a computer vision system to help prevent the transmission of COVID-19
Health organizations advise social distancing, wearing face masks, and avoiding touching the face to prevent the spread of coronavirus. Based on these protective measures, we developed a computer vision system to help prevent the transmission of COVID-19. Specifically, the developed system performs face mask detection, face-hand interaction detection, and social distance measurement. To train and evaluate the developed system, we collected and annotated images that represent face mask usage and face-hand interaction in the real world. Besides assessing the performance of the developed system on our own datasets, we also tested it on existing datasets in the literature without any adaptation. In addition, we proposed a module to track social distance between people. Experimental results indicate that our datasets represent real-world diversity well. The proposed system achieved high performance and generalization capacity for face mask usage detection, face-hand interaction detection, and social distance measurement on unseen real-world data. The datasets are available at https://github.com/iremeyiokur/COVID-19-Preventions-Control-System
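To illustrate how the three checks described above could be combined per video frame, the following is a minimal sketch; the data structure, threshold values, and pixel-distance proxy are assumptions for illustration, not the released implementation.

```python
# Hypothetical sketch: a face detector upstream produces one FaceObservation
# per person; this function summarizes mask, face-hand, and distance violations.
from dataclasses import dataclass
from itertools import combinations
from math import dist

@dataclass
class FaceObservation:
    box: tuple            # (x1, y1, x2, y2) bounding box in pixels
    mask_prob: float      # mask classifier score in [0, 1]
    hand_on_face: bool    # face-hand interaction detector output

def analyze_frame(faces, mask_threshold=0.5, min_pixels_apart=200):
    """Summarize mask, face-hand, and distance violations for one frame."""
    report = {
        "no_mask": [f for f in faces if f.mask_prob < mask_threshold],
        "face_touch": [f for f in faces if f.hand_on_face],
        "too_close": [],
    }
    # Crude social-distance proxy: pixel distance between face-box centers.
    centers = [((f.box[0] + f.box[2]) / 2, (f.box[1] + f.box[3]) / 2) for f in faces]
    for (i, a), (j, b) in combinations(enumerate(centers), 2):
        if dist(a, b) < min_pixels_apart:
            report["too_close"].append((i, j))
    return report
```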
Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos
In this paper, we propose a neural end-to-end system for voice-preserving,
lip-synchronous translation of videos. The system combines multiple component
models to produce a video of the original speaker speaking
in the target language that is lip-synchronous with the target speech, yet
maintains the speech emphases, voice characteristics, and face video of the original
speaker. The pipeline starts with automatic speech recognition including
emphasis detection, followed by a translation model. The translated text is
then synthesized by a Text-to-Speech model that recreates the original emphases
mapped from the original sentence. The resulting synthetic voice is then mapped
back to the original speaker's voice using a voice conversion model. Finally,
to synchronize the lips of the speaker with the translated audio, a conditional
generative adversarial network-based model generates frames of adapted lip
movements with respect to the input face image as well as the output of the
voice conversion model. In the end, the system combines the generated video
with the converted audio to produce the final output. The result is a video of
the original speaker appearing to speak another language without actually
knowing it. To evaluate
our design, we present a user study of the complete system as well as separate
evaluations of the individual components. Since no existing dataset is suitable
for evaluating the whole system, we collect our own test set. The results
indicate that our system is able to generate
convincing videos of the original speaker speaking the target language while
preserving the original speaker's characteristics. The collected dataset will be shared.
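To make the described pipeline concrete, the sketch below chains the stages in the order given in the abstract; all component names, signatures, and the word-mapping interface are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical composition of the Face-Dubbing++ stages: ASR with emphasis
# detection -> translation -> emphasis-aware TTS -> voice conversion ->
# GAN-based lip sync -> muxing generated frames with converted audio.
def dub_video(video, asr, translate, tts, voice_conv, lip_sync, mux):
    """Chain the stages described in the abstract (all callables are placeholders)."""
    audio, frames = video.audio, video.frames

    transcript, emphases = asr(audio)              # speech recognition + emphasis detection
    target_text, word_map = translate(transcript)  # translation + source-to-target word mapping
    target_emphases = [word_map[e] for e in emphases if e in word_map]

    synth_audio = tts(target_text, emphases=target_emphases)   # recreate emphases in target speech
    converted = voice_conv(synth_audio, reference_voice=audio)  # map back to the speaker's voice

    # Conditional GAN adapts lip movements to the converted audio, frame by frame.
    dubbed_frames = [lip_sync(frame, converted) for frame in frames]
    return mux(dubbed_frames, converted)           # combine generated video with converted audio
```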