214 research outputs found

    Visually Guided Sound Source Separation using Cascaded Opponent Filter Network

    The objective of this paper is to recover the original component signals from an audio mixture with the aid of visual cues of the sound sources. Such a task is usually referred to as visually guided sound source separation. The proposed Cascaded Opponent Filter (COF) framework consists of multiple stages, which recursively refine the source separation. A key element in COF is a novel opponent filter module that identifies and relocates residual components between sources. The system is guided by the appearance and motion of the source, and, for this purpose, we study different representations based on video frames, optical flows, dynamic images, and their combinations. Finally, we propose a Sound Source Location Masking (SSLM) technique, which, together with COF, produces a pixel-level mask of the source location. The entire system is trained end-to-end using a large set of unlabelled videos. We compare COF with recent baselines and obtain state-of-the-art performance on three challenging datasets (MUSIC, A-MUSIC, and A-NATURAL). Project page: https://ly-zhu.github.io/cof-net. Comment: main paper 14 pages, references 3 pages, supplementary 7 pages. Revised argument in Section 3 and
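    The cascaded, residual-relocating structure described above can be pictured with a short sketch. The code below is a loose PyTorch illustration, not the authors' COF implementation: a hypothetical OpponentFilterStage predicts a mask that moves residual time-frequency components from one source estimate to the other (one direction only, for brevity), conditioned on per-source visual embeddings, and the stages are applied recursively. All module names, dimensions, and the single-direction simplification are assumptions.

```python
# Hypothetical sketch of the cascaded "opponent filter" idea (not the authors' code).
import torch
import torch.nn as nn

class OpponentFilterStage(nn.Module):
    def __init__(self, channels=32, visual_dim=64):
        super().__init__()
        # Fuse the two source estimates plus visual conditioning into a transfer mask.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 + 2 * visual_dim, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, spec_a, spec_b, vis_a, vis_b):
        # spec_*: (B, 1, F, T) magnitude spectrogram estimates
        # vis_*:  (B, visual_dim) appearance/motion embeddings, broadcast over F and T
        B, _, F, T = spec_a.shape
        v = torch.cat([vis_a, vis_b], dim=1)[:, :, None, None].expand(B, -1, F, T)
        m = self.fuse(torch.cat([spec_a, spec_b, v], dim=1))  # transfer mask in [0, 1]
        residual = m * spec_b          # components of b that arguably belong to a
        return spec_a + residual, spec_b - residual

class CascadedSeparator(nn.Module):
    def __init__(self, num_stages=2):
        super().__init__()
        self.stages = nn.ModuleList([OpponentFilterStage() for _ in range(num_stages)])

    def forward(self, spec_a, spec_b, vis_a, vis_b):
        # Each stage recursively refines both source estimates.
        for stage in self.stages:
            spec_a, spec_b = stage(spec_a, spec_b, vis_a, vis_b)
        return spec_a, spec_b

sep = CascadedSeparator()
a, b = sep(torch.rand(2, 1, 256, 64), torch.rand(2, 1, 256, 64),
           torch.randn(2, 64), torch.randn(2, 64))
```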

    Teaching Development Project: Gene Expression Prediction With Deep Learning

    Histone modifications play an important role in gene regulation. In this thesis, a Convolutional Recurrent Neural Network is proposed and applied to predict gene expression levels from histone modification signals. Its two simplified variants, a Convolutional Neural Network and a Recurrent Neural Network, and one state-of-the-art baseline, DeepChrome, are also discussed in this work. Their performance is evaluated on gene expression data derived from the Roadmap Epigenomics Mapping Consortium database using the Receiver Operating Characteristic, the Area Under the Curve, and statistical analysis. The Convolutional Recurrent Neural Network achieves the best performance among the compared models. For the teaching development of a pattern recognition and machine learning course at Tampere University of Technology, an approach that integrates theory and practice is used. Video recordings, weekly exercises, and a competition serve as auxiliary parts of the lectures, helping students gain a better understanding of the theoretical material and learn how to solve different kinds of practical problems. We used histone modification data for a competition on this course, and this competition is discussed in detail in this thesis. For the competition, we motivated the students to develop machine learning algorithms to accurately predict gene expression levels from five core histone modification marks. During the course, the Gene Expression Prediction competition received 888 entries submitted by 105 teams with 184 players. This thesis includes an analysis and summary of the outcomes of the competition. Additionally, the learning assessment is also discussed.
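    As a rough illustration of the model family discussed in the thesis, the sketch below shows one plausible Convolutional Recurrent Neural Network for this task: a 1-D convolution over binned histone-modification signals followed by a recurrent layer and a binary high/low expression head. The input shape (5 marks x 100 bins, as in the DeepChrome setting) and all layer sizes are assumptions, not the thesis configuration.

```python
# A minimal CRNN sketch for histone-signal-based expression classification (illustrative only).
import torch
import torch.nn as nn

class ConvRecurrentNet(nn.Module):
    def __init__(self, n_marks=5, hidden=64):
        super().__init__()
        # Convolution over the bin axis extracts local histone-mark patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(n_marks, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Recurrent layer aggregates the pattern sequence along the gene region.
        self.rnn = nn.GRU(32, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):
        # x: (B, n_marks, n_bins)
        h = self.conv(x)                      # (B, 32, n_bins // 2)
        out, _ = self.rnn(h.transpose(1, 2))  # (B, n_bins // 2, 2 * hidden)
        return self.head(out[:, -1])          # logit for high vs. low expression

model = ConvRecurrentNet()
logit = model(torch.randn(8, 5, 100))  # 8 genes, 5 marks, 100 genomic bins
```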

    Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations

    In this paper, we perform audio-visual sound source separation, i.e., separating component audios from a mixture based on the videos of the sound sources. Moreover, we aim to pinpoint the source location in the input video sequence. Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type (e.g., a human playing an instrument) and pre-trained motion detectors (e.g., keypoints or optical flows). However, at the same time, these models are limited to a certain application domain. In this paper, we address these limitations and make the following contributions: i) we propose a two-stage architecture, called the Appearance and Motion network (AM-net), where the stages specialise in appearance and motion cues, respectively, and the entire system is trained in a self-supervised manner; ii) we introduce an Audio-Motion Embedding (AME) framework to explicitly represent the motions that are related to sound; iii) we propose an audio-motion transformer architecture for audio and motion feature fusion; iv) we demonstrate state-of-the-art performance on two challenging datasets (MUSIC-21 and AVE) despite not using any pre-trained keypoint detectors or optical flow estimators. Project page: https://lyzhu.github.io/self-supervised-motion-representations
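    As an illustration of the fusion step, the snippet below sketches one generic way to combine audio and motion tokens with cross-attention. It is not the AM-net or audio-motion transformer implementation; module names and dimensions are assumptions.

```python
# Generic cross-attention fusion of audio and motion tokens (illustrative sketch only).
import torch
import torch.nn as nn

class AudioMotionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Audio tokens query the motion tokens; a small MLP then refines the fused tokens.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_tokens, motion_tokens):
        # audio_tokens:  (B, Ta, dim) spectrogram feature tokens (queries)
        # motion_tokens: (B, Tm, dim) motion feature tokens from the video (keys / values)
        fused, _ = self.attn(audio_tokens, motion_tokens, motion_tokens)
        x = self.norm1(audio_tokens + fused)
        return self.norm2(x + self.mlp(x))

fusion = AudioMotionFusion()
out = fusion(torch.randn(2, 64, 256), torch.randn(2, 32, 256))  # -> (2, 64, 256)
```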

    Visually Guided Sound Source Separation Using Cascaded Opponent Filter Network

    The objective of this paper is to recover the original component signals from an audio mixture with the aid of visual cues of the sound sources. Such a task is usually referred to as visually guided sound source separation. The proposed Cascaded Opponent Filter (COF) framework consists of multiple stages, which recursively refine the source separation. A key element in COF is a novel opponent filter module that identifies and relocates residual components between sources. The system is guided by the appearance and motion of the source, and, for this purpose, we study different representations based on video frames, optical flows, dynamic images, and their combinations. Finally, we propose a Sound Source Location Masking (SSLM) technique, which, together with COF, produces a pixel-level mask of the source location. The entire system is trained in an end-to-end manner using a large set of unlabelled videos. We compare COF with recent baselines and obtain state-of-the-art performance on three challenging datasets (MUSIC, A-MUSIC, and A-NATURAL).

    Leveraging Category Information for Single-Frame Visual Sound Source Separation

    Visual sound source separation aims at identifying sound components from a given sound mixture in the presence of visual cues. Prior works have demonstrated impressive results, but at the expense of large multi-stage architectures and complex data representations (e.g., optical flow trajectories). In contrast, we study simple yet efficient models for visual sound separation using only a single video frame. Furthermore, our models are able to exploit information about the sound source category in the separation process. To this end, we propose two models in which we assume that i) the category labels are available at training time, or ii) we know whether the training sample pairs are from the same or different categories. Experiments with the MUSIC dataset show that our models obtain comparable or better performance than several recent baseline methods. The code is available at https://github.com/ly-zhu/Leveraging-Category-Information-for-Single-Frame-Visual-Sound-Source-Separation.
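    To make the two assumptions concrete, the sketch below shows one way category information could enter a training objective: (i) an auxiliary classification loss when category labels are available, and (ii) a pairwise loss that only uses whether two sources share a category. This is a hypothetical illustration, not the released code; all names and values are assumptions.

```python
# Illustrative category-aware loss terms for visual sound separation (not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def category_label_loss(source_feats, category_head, labels):
    # (i) labels available: classify each separated source's feature into its category.
    return F.cross_entropy(category_head(source_feats), labels)

def same_different_loss(feat_a, feat_b, same_category, margin=0.5):
    # (ii) only same/different known: pull same-category pairs together,
    # push different-category pairs at least `margin` apart (cosine distance).
    d = 1.0 - F.cosine_similarity(feat_a, feat_b)
    return torch.where(same_category, d, F.relu(margin - d)).mean()

# Example usage with random features for a batch of 4 mixed source pairs.
head = nn.Linear(128, 21)  # e.g. 21 instrument categories (hypothetical)
feats = torch.randn(4, 128)
print(category_label_loss(feats, head, torch.randint(0, 21, (4,))))
print(same_different_loss(torch.randn(4, 128), torch.randn(4, 128),
                          torch.tensor([True, False, True, False])))
```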

    V-SlowFast Network for Efficient Visual Sound Separation

    The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new lightweight yet efficient three-stream framework, V-SlowFast, that operates on a Visual frame, a Slow spectrogram, and a Fast spectrogram, where the Slow spectrogram captures coarse temporal resolution while the Fast spectrogram contains fine-grained temporal resolution; iii) we introduce two contrastive objectives to encourage the network to learn discriminative visual features for separating sounds; iv) we propose an audio-visual global attention module for audio and visual feature fusion; v) the introduced V-SlowFast model outperforms the previous state-of-the-art in single-frame based visual sound separation on small- and large-scale datasets: MUSIC-21, AVE, and VGG-Sound. We also propose a small V-SlowFast architecture variant, which achieves a 74.2% reduction in the number of model parameters and an 81.4% reduction in GMACs compared to previous multi-stage models. Project page: https://ly-zhu.github.io/V-SlowFast
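    The Slow/Fast spectrogram idea can be illustrated by computing two spectrograms of the same waveform with different STFT hop lengths, as in the sketch below. The hop lengths and FFT size are assumptions, not the paper's settings.

```python
# Two-resolution ("slow"/"fast") spectrograms from one waveform (illustrative parameters).
import torch

def slow_fast_spectrograms(wav, n_fft=1022, slow_hop=512, fast_hop=128):
    # wav: (B, samples) mono waveform
    window = torch.hann_window(n_fft)
    slow = torch.stft(wav, n_fft, hop_length=slow_hop, window=window,
                      return_complex=True).abs()   # (B, F, T_slow)  coarse time axis
    fast = torch.stft(wav, n_fft, hop_length=fast_hop, window=window,
                      return_complex=True).abs()   # (B, F, T_fast)  fine time axis
    return slow, fast

slow, fast = slow_fast_spectrograms(torch.randn(2, 65536))
print(slow.shape, fast.shape)  # the fast spectrogram has ~4x more time frames
```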

    Effect Of Type And Quantity Of Inherent Alkali Cations On Alkali-silica Reaction

    In this study, the macroscopic expansion induced by the alkali-silica reaction (ASR) and the corresponding ASR products are investigated using ordinary Portland cement (OPC) mortar specimens with a gradient of boosted alkalis. Experimental results show that the expansion increases with the concentration of inherent alkalis. Sodium-boosted samples expand approximately three times as much as potassium-boosted samples. ASR gels present in aggregate veins are calcium-free and amorphous; the atomic ratios of the ASR gels are nearly independent of the type and quantity of alkali cations. Exudation of ASR gel from aggregates occurs in high-sodium (≥2.5 %) cases and produces a potential Na-shlykovite. Crystalline and amorphous calcium-containing ASR products are present in the vicinity of aggregates in both Na- and K-boosted samples. The higher hydrophilicity of the Na-gel in aggregate veins accounts for the larger expansion. Boosted alkali cations are more effective in forming ASR products than alkalis supplied by the exposure solution. A new observation that NaOH exposure inhibits ASR in K-boosted samples (zero expansion) is reported.