
    Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision

    The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.
    Comment: Under submission as a conference paper.
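    To make the objective concrete, below is a minimal sketch assuming a PyTorch-style setup; the function and variable names are illustrative and not taken from the paper's released code. The cross-modal term matches each audio embedding to its paired video embedding, while the added intra-modal terms push apart different items within each modality.

        import torch
        import torch.nn.functional as F

        def discriminative_cross_modal_loss(audio_emb, video_emb, temperature=0.07):
            # audio_emb, video_emb: (B, D) embeddings of B positive
            # audio-video pairs; other items in the batch act as negatives.
            a = F.normalize(audio_emb, dim=1)
            v = F.normalize(video_emb, dim=1)
            targets = torch.arange(a.size(0), device=a.device)

            # Cross-modal term: each audio should retrieve its own video.
            loss_cross = F.cross_entropy(a @ v.t() / temperature, targets)

            # Intra-modal repulsion: penalise high similarity between
            # *different* items within the same modality, keeping the
            # features discriminative for uni-modal downstream tasks.
            off_diag = ~torch.eye(a.size(0), dtype=torch.bool, device=a.device)
            loss_intra = ((a @ a.t() / temperature)[off_diag].exp().mean().log()
                          + (v @ v.t() / temperature)[off_diag].exp().mean().log())

            return loss_cross + loss_intra

    The intra-modal terms are one plausible way to realise "feature separation within each modality"; the paper's exact formulation may differ.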

    Perfect match: Improved cross-modal embeddings for audio-visual synchronisation

    This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronisation. Here, we set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant audio segment given a short video clip. The method builds on recent advances in learning representations from cross-modal self-supervision. The main contributions of this paper are as follows: (1) we propose a new learning strategy where the embeddings are learnt via a multi-way matching problem, as opposed to a binary classification (matching or non-matching) problem as proposed by recent papers; (2) we demonstrate that the performance of this method far exceeds the existing baselines on the synchronisation task; (3) we use the learnt embeddings for visual speech recognition in self-supervision, and show that the performance matches that of representations learnt end-to-end in a fully-supervised manner.
    Comment: Preprint. Work in progress.
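    As a rough illustration of the multi-way formulation, the sketch below (assumed tensor shapes and names, not the released implementation) scores one video feature against N candidate audio segments, only one of which is synchronised, and treats selecting the correct candidate as an N-way classification problem.

        import torch
        import torch.nn.functional as F

        def multiway_matching_loss(video_feat, audio_feats, true_idx):
            # video_feat:  (B, D)    feature of the short video clip
            # audio_feats: (B, N, D) N candidate audio segments per clip,
            #                        e.g. temporally shifted versions of
            #                        the same audio track
            # true_idx:    (B,)      index of the synchronised candidate
            # Score candidates by negative Euclidean distance, so the
            # closest segment receives the highest logit.
            dists = torch.cdist(video_feat.unsqueeze(1), audio_feats).squeeze(1)
            return F.cross_entropy(-dists, true_idx)

    Compared with a binary match/non-match loss, each gradient step here compares all candidates jointly rather than one pair at a time, which is the contrast the paper draws.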

    An Energy-Aware Protocol for Self-Organizing Heterogeneous LTE Systems

    This paper studies the problem of self-organizing heterogeneous LTE systems. We propose a model that jointly considers several important characteristics of heterogeneous LTE systems, including the usage of orthogonal frequency division multiple access (OFDMA), the frequency-selective fading for each link, the interference among different links, and the different transmission capabilities of different types of base stations. We also consider the cost of energy by taking into account the power consumption of base stations, both for wireless transmission and for operation, and the price of energy. Based on this model, we aim to propose a distributed protocol that improves the spectrum efficiency of the system, measured in terms of the weighted proportional fairness among the throughputs of clients, and reduces the cost of energy. We identify several important components involved in this problem and propose distributed strategies for each of them. Each of the proposed strategies requires only small computational and communication overheads. Moreover, the interactions between components are also considered in the proposed strategies. Hence, these strategies result in a solution that jointly considers all factors of heterogeneous LTE systems. Simulation results also show that our proposed strategies achieve much better performance than existing ones.
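    The trade-off being optimised can be written down compactly. The sketch below is an assumption-laden illustration (the names, units, and the linear energy-cost model are not from the paper): the utility is the weighted sum of the logarithms of client throughputs, which is what weighted proportional fairness maximises, minus the monetary cost of the energy the base stations consume.

        import math

        def network_utility(throughputs, weights, bs_powers, price, period):
            # throughputs: per-client throughput in bps (must be > 0)
            # weights:     per-client fairness weights
            # bs_powers:   per-base-station power draw in watts, covering
            #              both wireless transmission and operation
            # price:       energy price in currency units per joule
            # period:      length of the scheduling period in seconds
            fairness = sum(w * math.log(t) for w, t in zip(weights, throughputs))
            energy_cost = price * period * sum(bs_powers)
            return fairness - energy_cost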

    FaceFilter: Audio-visual speech separation using still images

    The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network. Unlike previous works that used lip movements from video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker. In this task, the conditional feature is obtained from facial appearance through a cross-modal biometric task, where audio and visual identity representations share a latent space. Identities learnt from facial images force the network to isolate the matched speaker and extract the voice from the mixed speech. This solves the permutation problem caused by swapped channel outputs, which frequently occurs in speech separation tasks. The proposed method is far more practical than video-based speech separation, since user profile images are readily available on many platforms. Also, unlike speaker-aware separation methods, it is applicable to separation with unseen speakers who have never been enrolled. We show strong qualitative and quantitative results on challenging real-world examples.
    Comment: Under submission as a conference paper. Video examples: https://youtu.be/ku9xoLh62
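    A rough sketch of the conditioning idea follows; the module and dimension names are hypothetical, not the paper's architecture. The identity embedding extracted from the still face image is tiled across time, concatenated with the mixture spectrogram, and the network predicts a mask that keeps the matching voice.

        import torch
        import torch.nn as nn

        class FaceConditionedSeparator(nn.Module):
            def __init__(self, n_freq=257, id_dim=512, hidden=600):
                super().__init__()
                self.rnn = nn.LSTM(n_freq + id_dim, hidden, batch_first=True)
                self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

            def forward(self, mix_spec, face_id):
                # mix_spec: (B, T, F) magnitude spectrogram of the mixture
                # face_id:  (B, id_dim) identity embedding from one face image
                cond = face_id.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
                h, _ = self.rnn(torch.cat([mix_spec, cond], dim=-1))
                return mix_spec * self.mask(h)  # spectrogram of the target voice

    Because the identity condition fixes which speaker is the target, the output channel assignment is unambiguous, which is how this kind of conditioning sidesteps the permutation problem.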

    Post-Crisis Financial Reform in Korea: A Critical Appraisal

    In the aftermath of the economic crisis of 1997-98, South Korea undertook a number of financial reforms under IMF auspices. One such reform was in financial supervision, which created the Financial Supervisory Commission and the Financial Supervisory Service. In spite of these reforms, Korea has recently experienced costly financial instability relating to credit-card companies and household debts. Korea’s success in bringing about rapid economic recovery from the crisis may have lessened, as suggested by the World Bank, the urgency for full financial reform. This paper, however, argues that the newly created supervisory agencies, although established as independent agencies, have not in fact functioned as such and have thus failed to carry out proper supervision over credit-card companies. It is argued that those agencies have not been able to function independently due to institutional constraints imposed on them by other extant institutions in Korea, formal as well as informal.
    Keywords: Financial reform, financial supervision, institutions