
    Audio-Visual Materials


    Self-Supervised Audio-Visual Co-Segmentation

    Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds. Here, we introduce a learning approach to disentangle concepts in the neural network, and we assign semantic categories to network feature channels to enable independent image segmentation and sound source separation after audio-visual training on videos. Our evaluations show that the disentangled model outperforms several baselines in semantic segmentation and sound source separation. Comment: Accepted to ICASSP 2019.
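
    The abstract gives no implementation detail, but the core idea can be illustrated with a minimal sketch: after disentanglement, each feature channel is tied to one semantic category, and the same channel index drives both tasks, a visual activation map thresholded into a segmentation mask and a spectrogram mask applied to the audio mixture. All shapes, the channel-to-category table, and both helper functions below are hypothetical stand-ins, not the paper's architecture.

        import numpy as np

        # Hypothetical shapes: K shared feature channels, each assigned one
        # semantic category after disentanglement. Real features would come
        # from trained vision/audio networks; random arrays stand in here.
        K, H, W = 8, 14, 14            # visual feature maps, one per channel
        F, T = 256, 100                # spectrogram bins x frames

        rng = np.random.default_rng(0)
        visual_feats = rng.random((K, H, W))    # stand-in for CNN activations
        channel_masks = rng.random((K, F, T))   # stand-in for per-channel audio masks
        mixture_spec = rng.random((F, T))       # magnitude spectrogram of the mixture

        categories = {0: "guitar", 3: "piano"}  # hypothetical channel -> category map

        def segment(ch):
            """Binarize one channel's activation map into an object mask."""
            act = visual_feats[ch]
            return act > act.mean()             # crude threshold, for illustration only

        def separate(ch):
            """Apply one channel's spectrogram mask to the mixture."""
            return channel_masks[ch] * mixture_spec

        for ch, name in categories.items():
            print(name, segment(ch).sum(), separate(ch).shape)

    The point is only that one channel indexes both modalities; once channels align with categories, segmentation and separation can each be run independently after training.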

    Contributions of temporal encodings of voicing, voicelessness, fundamental frequency, and amplitude variation to audiovisual and auditory speech perception

    Auditory and audio-visual speech perception was investigated using auditory signals of invariant spectral envelope that temporally encoded the presence of voiced and voiceless excitation, variations in amplitude envelope, and F0. In experiment 1, the contribution of the timing of voicing to consonant identification was compared with the additional effects of variations in F0 and the amplitude of voiced speech. In audio-visual conditions only, amplitude variation slightly increased accuracy globally and for manner features. F0 variation slightly increased overall accuracy and manner perception in both auditory and audio-visual conditions. Experiment 2 examined consonant information derived from the presence and amplitude variation of voiceless speech in addition to that from voicing, F0, and voiced speech amplitude. Binary indication of voiceless excitation improved accuracy overall and for voicing and manner. The amplitude variation of voiceless speech produced only a small increment in place-of-articulation scores. A final experiment examined audio-visual sentence perception using encodings of voiceless excitation and amplitude variation added to a signal representing voicing and F0. There was a contribution of amplitude variation to sentence perception, but not of voiceless excitation. The timing of voiced and voiceless excitation appears to provide the major temporal cues to consonant identity. © 1999 Acoustical Society of America.
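
    As a loose illustration of the kind of stimulus described here (not the authors' actual synthesis procedure), the sketch below builds a signal whose spectral envelope stays fixed while only the temporal cues vary: a periodic carrier at F0 during voiced frames, a noise carrier during voiceless frames, and a per-frame amplitude envelope. The sample rate, frame length, and square-wave carrier are all assumptions.

        import numpy as np

        SR = 16000          # assumed sample rate
        HOP = 0.01          # assumed 10 ms frames

        def encode(frames):
            """frames: iterable of (voiced, f0_hz, amp) tuples, one per frame.
            Returns a signal carrying only temporal cues (excitation type, F0,
            amplitude envelope) over a fixed, flat spectral envelope."""
            rng = np.random.default_rng(0)
            n = int(HOP * SR)
            out, phase = [], 0.0
            for voiced, f0, amp in frames:
                if voiced:
                    inc = 2 * np.pi * f0 / SR
                    seg = np.sign(np.sin(phase + inc * np.arange(n)))  # carrier at F0
                    phase += inc * n             # keep phase continuous across frames
                else:
                    seg = rng.standard_normal(n)  # noise carrier, voiceless excitation
                out.append(amp * seg / (np.max(np.abs(seg)) + 1e-9))
            return np.concatenate(out)

        # 100 ms voiceless burst, then 300 ms of voicing at 120 Hz
        signal = encode([(False, 0.0, 0.3)] * 10 + [(True, 120.0, 1.0)] * 30)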

    A Means to an End, A Means to an End, A Means to an End: Repetition through two filmmakers and four films

    Wong Kar-wai is an auteur whose body of work comprises more than 20 short and feature films. Wong is a Hong Kong director with a unique audio-visual style, one that works to enhance and complement his films' themes of loss, love, and memory. His films can be compared to those of other filmmakers and examined under many different critical-theory lenses. This essay analyzes Wong Kar-wai's films through the lens of psychoanalysis, focusing on the unofficial Wong Kar-wai trilogy: Days of Being Wild (1990), In the Mood for Love (2000), and 2046 (2004). Ultimately, this essay accompanies Ellipsis, a short film that seeks to recreate the mood and tones of Wong Kar-wai's work, using his audio-visual style as a jumping-off point for developing my own style.

    Audio-visual speech recognition with background music using single-channel source separation

    In this paper, we consider audio-visual speech recognition with background music. The proposed algorithm integrates audio-visual speech recognition with single-channel source separation (SCSS), and we apply it to recognize speech that is mixed with music signals. First, the SCSS algorithm, based on nonnegative matrix factorization (NMF) and spectral masks, is used to separate the speech signal from the background music in the magnitude spectral domain. After the speech is separated from the music, regular audio-visual speech recognition (AVSR) is performed using multi-stream hidden Markov models. By employing the two approaches together, we aim to improve recognition accuracy both by processing the audio signal with SCSS and by supporting the recognition task with visual information. Experimental results show that combining audio-visual speech recognition with source separation yields remarkable improvements in the accuracy of the speech recognition system.
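
    The NMF-plus-spectral-mask stage has a well-known generic form, sketched below with hand-rolled multiplicative updates rather than whatever implementation the authors used: learn a spectral dictionary per source, decompose the mixture over the fixed concatenated dictionaries, and build a Wiener-style soft mask. All data, component counts, and iteration counts are placeholder assumptions.

        import numpy as np

        def nmf(V, k, iters=200, W=None, seed=0):
            """Multiplicative-update NMF (Euclidean loss). If W is given, it
            stays fixed and only the activations H are estimated."""
            rng = np.random.default_rng(seed)
            fixed_W = W is not None
            W = rng.random((V.shape[0], k)) if W is None else W
            H = rng.random((k, V.shape[1]))
            eps = 1e-9
            for _ in range(iters):
                H *= (W.T @ V) / (W.T @ W @ H + eps)
                if not fixed_W:
                    W *= (V @ H.T) / (W @ H @ H.T + eps)
            return W, H

        # Placeholder magnitude spectrograms (freq x time); real ones would come
        # from STFTs of speech training data, music training data, and the mixture.
        rng = np.random.default_rng(1)
        V_speech, V_music, V_mix = rng.random((3, 257, 100))

        # 1) Learn a spectral dictionary per source from training material.
        W_s, _ = nmf(V_speech, k=32)
        W_m, _ = nmf(V_music, k=32)

        # 2) Decompose the mixture over the fixed, concatenated dictionaries.
        W_all = np.hstack([W_s, W_m])
        _, H = nmf(V_mix, k=64, W=W_all)

        # 3) Wiener-style soft mask from each source's reconstruction.
        S_hat, M_hat = W_s @ H[:32], W_m @ H[32:]
        mask = S_hat / (S_hat + M_hat + 1e-9)
        speech_mag = mask * V_mix   # masked speech magnitude, passed on to AVSR

    The masked magnitude would then be recombined with the mixture phase to resynthesize speech, or turned into acoustic features for the multi-stream HMM recognizer alongside the visual stream.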