Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
significant margin.
Comment: Under submission as a conference paper.
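The two objectives can be illustrated with a short sketch: a cross-modal contrastive term that pulls paired audio/video embeddings together, plus an intra-modal term that pushes apart embeddings within each modality. This is a minimal sketch under assumed conventions (InfoNCE-style losses, L2-normalised embeddings, a hypothetical weighting alpha), not the paper's exact formulation, which enforces separation at the class level.

```python
import torch
import torch.nn.functional as F

def cross_modal_loss(audio_emb, video_emb, temperature=0.07):
    """Cross-modal matching: each clip's audio should retrieve its paired video.
    audio_emb, video_emb: (N, D) L2-normalised embeddings of N paired clips."""
    logits = audio_emb @ video_emb.t() / temperature          # (N, N) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy over both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def intra_modal_separation(emb, temperature=0.07):
    """Illustrative separation term: penalise similarity between different
    clips within one modality (the paper separates features by class)."""
    n = emb.size(0)
    logits = emb @ emb.t() / temperature
    off_diag = logits[~torch.eye(n, dtype=torch.bool, device=emb.device)]
    return torch.logsumexp(off_diag.view(n, n - 1), dim=1).mean()

def total_loss(audio_emb, video_emb, alpha=0.5):  # alpha is a hypothetical weight
    return (cross_modal_loss(audio_emb, video_emb)
            + alpha * (intra_modal_separation(audio_emb)
                       + intra_modal_separation(video_emb)))
```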
Perfect match: Improved cross-modal embeddings for audio-visual synchronisation
This paper proposes a new strategy for learning powerful cross-modal
embeddings for audio-to-video synchronization. Here, we set up the problem as
one of cross-modal retrieval, where the objective is to find the most relevant
audio segment given a short video clip. The method builds on the recent
advances in learning representations from cross-modal self-supervision.
The main contributions of this paper are as follows: (1) we propose a new
learning strategy where the embeddings are learnt via a multi-way matching
problem, as opposed to a binary classification (matching or non-matching)
problem as proposed by recent papers; (2) we demonstrate that the performance of
this method far exceeds the existing baselines on the synchronization task; (3)
we apply the learnt embeddings to visual speech recognition in a self-supervised
setting, and show that the performance matches that of representations learnt
end-to-end in a fully-supervised manner.
Comment: Preprint. Work in progress.
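The multi-way matching objective can be sketched as an N-way classification over candidate audio segments, in place of a binary matching/non-matching decision. A minimal sketch with assumed shapes; the choice of similarity (dot product here) and the convention that the temporally aligned segment sits at index 0 are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def multiway_sync_loss(video_feat, audio_feats, temperature=1.0):
    """N-way matching for audio-visual synchronisation.
    video_feat:  (B, D)    feature of the reference video clip
    audio_feats: (B, N, D) N candidate audio segments at different time offsets;
                           the aligned segment is assumed to sit at index 0."""
    # Similarity between the video feature and each candidate audio segment.
    logits = torch.einsum('bd,bnd->bn', video_feat, audio_feats) / temperature
    targets = torch.zeros(video_feat.size(0), dtype=torch.long,
                          device=video_feat.device)
    return F.cross_entropy(logits, targets)
```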
An Energy-Aware Protocol for Self-Organizing Heterogeneous LTE Systems
This paper studies the problem of self-organizing heterogeneous LTE systems.
We propose a model that jointly considers several important characteristics of
heterogeneous LTE systems, including the use of orthogonal frequency division
multiple access (OFDMA), the frequency-selective fading for each link, the
interference among different links, and the different transmission capabilities
of different types of base stations. We also consider the cost of energy by
taking into account both the power consumed by base stations, for wireless
transmission as well as for operation, and the price of energy. Based on this
model, we propose a distributed protocol that improves the spectrum efficiency
of the system, measured in terms of weighted proportional fairness among client
throughputs, and reduces the cost of energy. We identify several important
components of this problem and propose distributed strategies for each of them.
Each of the proposed strategies requires only small computational and
communication overhead. The strategies also account for the interactions
between components, and hence jointly consider all factors of heterogeneous LTE
systems. Simulation results show that our proposed strategies achieve much
better performance than existing ones.
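Weighted proportional fairness is commonly formalised as the utility U = sum_i w_i * log(T_i) over client throughputs T_i with weights w_i; assuming the paper uses this standard form, the toy computation below shows why maximising it favours balanced allocations over lopsided ones of equal total throughput.

```python
import math

def weighted_pf_utility(throughputs, weights):
    """Weighted proportional fairness: U = sum_i w_i * log(T_i)."""
    return sum(w * math.log(t) for w, t in zip(weights, throughputs))

# Two allocations with the same total throughput (20 Mbps), equal weights:
print(weighted_pf_utility([10.0, 10.0], [1.0, 1.0]))  # ~4.61 (balanced wins)
print(weighted_pf_utility([19.0, 1.0], [1.0, 1.0]))   # ~2.94
```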
FaceFilter: Audio-visual speech separation using still images
The objective of this paper is to separate a target speaker's speech from a
mixture of two speakers using a deep audio-visual speech separation network.
Unlike previous works that used lip movement on video clips or pre-enrolled
speaker information as an auxiliary conditional feature, we use a single face
image of the target speaker. In this task, the conditional feature is obtained
from facial appearance in a cross-modal biometric task, where audio and visual
identity representations are shared in a latent space. Identities learnt from
facial images force the network to isolate the matched speaker and extract that
speaker's voice from the mixed speech. This resolves the permutation problem
caused by swapped channel outputs, which frequently occurs in speech separation
tasks. The proposed
method is far more practical than video-based speech separation since user
profile images are readily available on many platforms. Also, unlike
speaker-aware separation methods, it is applicable to separation with unseen
speakers who have never been enrolled before. We show strong qualitative and
quantitative results on challenging real-world examples.
Comment: Under submission as a conference paper. Video examples: https://youtu.be/ku9xoLh62
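Conditioning a mask-based separator on a face-identity embedding can be sketched as follows. The architecture is hypothetical (layer types, dimensions, and the concatenation-based conditioning are all illustrative assumptions), intended only to show how a single identity vector, rather than a lip-movement stream, can steer the separation.

```python
import torch
import torch.nn as nn

class ConditionedSeparator(nn.Module):
    """Sketch: mask-based separation conditioned on a face-identity embedding."""
    def __init__(self, n_freq=257, id_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + id_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, id_emb):
        # mix_spec: (B, T, F) magnitude spectrogram of the two-speaker mixture
        # id_emb:   (B, id_dim) target speaker's identity embedding, taken from
        #           a face encoder trained for cross-modal biometric matching
        cond = id_emb.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_spec, cond], dim=-1))
        return mix_spec * self.mask(h)  # masked spectrogram of the target voice
```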
Post-Crisis Financial Reform in Korea: A Critical Appraisal
In the aftermath of the economic crisis of 1997-98, South Korea undertook a number of financial reforms under IMF auspices. One such reform was in financial supervision, which created the Financial Supervisory Commission and the Financial Supervisory Service. In spite of these reforms, Korea has recently experienced costly financial instability relating to credit-card companies and household debt. Korea's success in bringing about rapid economic recovery from the crisis may have lessened, as suggested by the World Bank, the urgency for full financial reform. This paper, however, argues that the newly created supervisory agencies, although established as independent bodies, have not in fact functioned as such and thus failed to carry out proper supervision of credit-card companies. It is argued that these agencies have been unable to function independently due to institutional constraints imposed on them by other extant institutions in Korea, formal as well as informal.
Keywords: financial reform, financial supervision, institutions