
    Visual to Sound: Generating Natural Sound for Videos in the Wild

    As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.
    Project page: http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm
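    The abstract describes a learning-based mapping from video frames to raw waveform samples. As a rough, hedged illustration of that general idea (not the authors' actual architecture; every module name, layer size, and the frame-to-sample alignment below are assumptions), one could condition an autoregressive waveform decoder on per-frame visual features:

    # Sketch only: a CNN encodes each video frame, and a recurrent decoder
    # emits raw waveform samples conditioned on the temporally aligned frame
    # features. All shapes and names here are illustrative assumptions.
    import torch
    import torch.nn as nn

    class Frame2Wave(nn.Module):
        def __init__(self, feat_dim=128, hidden=256):
            super().__init__()
            # Per-frame visual encoder (assumed: a small CNN over RGB frames).
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim),
            )
            # Waveform decoder: a GRU that predicts the next sample from the
            # previous sample plus the feature of the aligned frame.
            self.rnn = nn.GRU(feat_dim + 1, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)

        def forward(self, frames, prev_samples):
            # frames: (B, T_frames, 3, H, W); prev_samples: (B, T_audio, 1),
            # where T_audio is assumed to be an exact multiple of T_frames.
            B, Tf = frames.shape[:2]
            feats = self.encoder(frames.flatten(0, 1)).view(B, Tf, -1)
            # Upsample frame features to the audio rate by repetition.
            ratio = prev_samples.shape[1] // Tf
            feats = feats.repeat_interleave(ratio, dim=1)
            h, _ = self.rnn(torch.cat([feats, prev_samples], dim=-1))
            return self.out(h)  # predicted waveform samples, (B, T_audio, 1)

    In practice such a model would be trained to predict each waveform sample from the preceding samples and the aligned visual features; the project page above documents the authors' real setup.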

    A Study of the Assimilative Behavior of the Voiced Labio-Dental Fricative in American English

    Gradation is one of the main features of colloquial speech. It implies the presence of certain phonological processes that ease the transition between phonemes with different articulatory features. In English, one such process is assimilation, in which the articulation of a segment is modified into that of another segment already existing in the system. Our study takes up Gimson's (1994) suggestion that /v/ assimilates into /m/ when it is followed by the bilabial nasal. After observing and describing different cases of assimilation, we propose further explanations for this phenomenon and additional assimilative behaviors of /v/. We then conduct an experiment with six American English L1 speakers who evaluate sentences whose articulation includes our proposed realizations. The results show Gimson's theory to be less accurate than expected. Furthermore, we show that /v/ can assimilate into /b/, /ʔ/ and /d/ when it is followed by bilabial, velar and alveolar phonemes.
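    To make the rule-based claim concrete, here is a toy sketch (not from the study; the phoneme classes and rewrite table are simplified assumptions) expressing the discussed assimilations of /v/ as a context-sensitive rewrite over a phoneme sequence:

    # Toy illustration of the assimilation rules discussed above: /v/ maps to
    # a different segment depending on the following phoneme. The classes and
    # rule ordering are simplified assumptions, not the study's full analysis.
    BILABIAL = {"m", "p", "b"}
    VELAR = {"k", "g"}
    ALVEOLAR = {"t", "d", "n", "s", "z"}

    def assimilate_v(phonemes):
        """Rewrite /v/ according to the place of the following segment."""
        out = list(phonemes)
        for i, seg in enumerate(out[:-1]):
            if seg != "v":
                continue
            nxt = out[i + 1]
            if nxt == "m":          # Gimson (1994): /v/ -> /m/ before the bilabial nasal
                out[i] = "m"
            elif nxt in BILABIAL:   # proposed: /v/ -> /b/ before other bilabials
                out[i] = "b"
            elif nxt in VELAR:      # proposed: /v/ -> glottal stop before velars
                out[i] = "ʔ"
            elif nxt in ALVEOLAR:   # proposed: /v/ -> /d/ before alveolars
                out[i] = "d"
        return out

    # e.g. "give me" /ɡɪv mi/ -> /ɡɪm mi/ under Gimson's rule
    print(assimilate_v(["ɡ", "ɪ", "v", "m", "i"]))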

    On the human evaluation of universal audio adversarial perturbations

    Human-machine interaction is increasingly dependent on speech communication, mainly due to the remarkable performance of machine learning models in speech recognition tasks. However, these models can be fooled by adversarial examples, which are inputs intentionally perturbed to produce a wrong prediction without the changes being noticeable to humans. While much research has focused on developing new techniques to generate adversarial perturbations, less attention has been given to the aspects that determine whether and how the perturbations are noticed by humans. This question is relevant, since the high fooling rates of proposed adversarial perturbation strategies are only valuable if the perturbations are not detectable. In this paper we investigate to what extent the distortion metrics proposed in the literature for audio adversarial examples, which are commonly applied to evaluate the effectiveness of methods for generating these attacks, are a reliable measure of the human perception of the perturbations. Using an analytical framework, and an experiment in which 36 subjects evaluate audio adversarial examples according to different factors, we demonstrate that the metrics employed by convention are not a reliable measure of the perceptual similarity of adversarial examples in the audio domain.
    This work was supported by the Basque Government (PRE_2019_1_0128 predoctoral grant, IT1244-19, and project KK-2020/00049 through the ELKARTEK program); the Spanish Ministry of Economy and Competitiveness MINECO (projects TIN2016-78365-R and PID2019-104966GB-I00); and the Spanish Ministry of Science, Innovation and Universities (FPU19/03231 predoctoral grant). The authors would also like to thank the Intelligent Systems Group (University of the Basque Country UPV/EHU, Spain) for providing the computational resources needed to develop the project, as well as all the participants who took part in the experiments.
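    As a concrete example of the kind of distortion metric under scrutiny (an illustration of one conventional metric, not the paper's experimental code), the signal-to-noise ratio between a clean waveform and its adversarial counterpart is often reported in decibels:

    # Sketch of a conventional distortion metric for audio adversarial
    # examples: SNR in dB between the clean signal and the perturbation.
    # Higher SNR means a smaller (nominally less audible) perturbation.
    import numpy as np

    def snr_db(clean: np.ndarray, adversarial: np.ndarray) -> float:
        """SNR (dB) of the clean signal relative to the added perturbation."""
        perturbation = adversarial - clean
        signal_power = np.mean(clean ** 2)
        noise_power = np.mean(perturbation ** 2)
        return 10.0 * np.log10(signal_power / noise_power)

    # Example: a tiny random perturbation on a 1 s, 440 Hz tone at 16 kHz.
    t = np.linspace(0, 1, 16000, endpoint=False)
    x = np.sin(2 * np.pi * 440 * t)
    x_adv = x + 0.001 * np.random.randn(x.size)
    print(f"SNR: {snr_db(x, x_adv):.1f} dB")

    The paper's finding is precisely that a favorable value on such a conventional metric does not guarantee that listeners fail to notice the perturbation.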