In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Large-scale noisy web image-text datasets have proven effective for
learning robust vision-language models. However, when transferring them to
the task of video retrieval, models still need to be fine-tuned on hand-curated
paired text-video data to adapt to the diverse styles of video descriptions. To
address this problem without the need for hand-annotated pairs, we propose a
new setting, text-video retrieval with uncurated & unpaired data, in which
training utilizes only text queries together with uncurated web videos without
any paired text-video data. To this end, we propose an approach, In-Style, that
learns the style of the text queries and transfers it to uncurated web videos.
Moreover, to improve generalization, we show that one model can be trained with
multiple text styles. To this end, we introduce a multi-style contrastive
training procedure that improves the generalizability over several datasets
simultaneously. We evaluate our model on retrieval performance over multiple
datasets to demonstrate the advantages of our style transfer framework on the
new task of uncurated & unpaired text-video retrieval, and we improve the
state of the art in zero-shot text-video retrieval.
Comment: Published at ICCV 2023, code: https://github.com/ninatu/in_styl
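The retrieval models the abstract builds on are commonly trained with a symmetric contrastive objective over (pseudo-)paired text and video embeddings. Below is a minimal sketch of such a symmetric InfoNCE-style loss; the function name, temperature value, and setup are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def info_nce(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    text_emb, video_emb: (N, D) arrays where row i of each forms a
    (pseudo-)pair. Matched pairs sit on the diagonal of the similarity
    matrix; all other entries serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = t @ v.T / temperature  # (N, N) scaled similarities
    # log-softmax over rows (text->video) and columns (video->text)
    log_p_t2v = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    log_p_v2t = sim - np.log(np.exp(sim).sum(axis=0, keepdims=True))
    return -(np.diag(log_p_t2v).mean() + np.diag(log_p_v2t).mean()) / 2
```

In the paper's setting the "pairs" would come from style-transferred pseudo-captions of uncurated web videos rather than hand-annotated data.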
Learning by Sorting: Self-supervised Learning with Group Ordering Constraints
Contrastive learning has become an important tool in learning representations
from unlabeled data mainly relying on the idea of minimizing distance between
positive data pairs, e.g., views from the same images, and maximizing distance
between negative data pairs, e.g., views from different images. This paper
proposes a new variation of the contrastive learning objective, Group Ordering
Constraints (GroCo), that leverages the idea of sorting the distances of
positive and negative pairs and computing the respective loss based on how many
positive pairs have a larger distance than the negative pairs, and thus are not
ordered correctly. To this end, the GroCo loss is based on differentiable
sorting networks, which enable training with sorting supervision by matching a
differentiable permutation matrix, which is produced by sorting a given set of
scores, to a respective ground truth permutation matrix. Applying this idea to
groupwise pre-ordered inputs of multiple positive and negative pairs allows
introducing the GroCo loss with implicit emphasis on strong positives and
negatives, leading to better optimization of the local neighborhood. We
evaluate the proposed formulation on various self-supervised learning
benchmarks and show that it not only leads to improved results compared to
vanilla contrastive learning but also shows competitive performance to
comparable methods in linear probing and outperforms current methods in k-NN
performance.
Comment: Published at ICCV 2023, code: https://github.com/ninatu/learning_by_sortin
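The core idea, penalizing positive-pair distances that exceed negative-pair distances, can be sketched with a simple sigmoid relaxation of the violation count. This is a simplified stand-in with assumed names; the paper instead realizes the ordering constraint with differentiable sorting networks:

```python
import numpy as np

def soft_group_ordering_loss(pos_dist, neg_dist, tau=0.1):
    """Soft count of ordering violations for one anchor's group of pairs.

    pos_dist: (P,) distances of positive pairs; neg_dist: (M,) distances
    of negative pairs. A pair (i, j) is violated when
    pos_dist[i] >= neg_dist[j]; the sigmoid relaxes the 0/1 violation
    indicator so the count is differentiable.
    """
    diff = pos_dist[:, None] - neg_dist[None, :]    # (P, M) pairwise gaps
    violations = 1.0 / (1.0 + np.exp(-diff / tau))  # ~1 when misordered
    return violations.mean()
```

The loss is near 0 when every positive is closer than every negative and near 1 when the ordering is fully reversed, which matches the "how many positive pairs are not ordered correctly" intuition in the abstract.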
Preserving Modality Structure Improves Multi-Modal Learning
Self-supervised learning on large-scale multi-modal datasets allows learning
semantically meaningful embeddings in a joint multi-modal representation space
without relying on human annotations. These joint embeddings enable zero-shot
cross-modal tasks like retrieval and classification. However, these methods
often struggle to generalize well on out-of-domain data as they ignore the
semantic structure present in modality-specific embeddings. In this context, we
propose a novel Semantic-Structure-Preserving Consistency approach to improve
generalizability by preserving the modality-specific relationships in the joint
embedding space. To capture modality-specific semantic relationships between
samples, we propose to learn multiple anchors and represent the multifaceted
relationship between samples with respect to their relationship with these
anchors. To assign multiple anchors to each sample, we propose a novel
Multi-Assignment Sinkhorn-Knopp algorithm. Our experimentation demonstrates
that our proposed approach learns semantically meaningful anchors in a
self-supervised manner. Furthermore, our evaluation on MSR-VTT and YouCook2
datasets demonstrates that our proposed multi-anchor assignment-based
solution achieves state-of-the-art performance and generalizes to both in-
and out-of-domain datasets. Code: https://github.com/Swetha5/Multi_Sinkhorn_Knopp
Comment: Accepted at ICCV 202
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
Large scale Vision-Language (VL) models have shown tremendous success in
aligning representations between visual and text modalities. This enables
remarkable progress in zero-shot recognition, image generation & editing, and
many other exciting tasks. However, VL models tend to over-represent objects
while paying much less attention to verbs, and require additional tuning on
video data for best zero-shot action recognition performance. While previous
work relied on large-scale, fully-annotated data, in this work we propose an
unsupervised approach. We adapt a VL model for zero-shot and few-shot action
recognition using a collection of unlabeled videos and an unpaired action
dictionary. Based on that, we leverage Large Language Models and VL models to
build a text bag for each unlabeled video via matching, text expansion and
captioning. We use those bags in a Multiple Instance Learning setup to adapt an
image-text backbone to video data. Although finetuned on unlabeled video data,
our resulting models demonstrate high transferability to numerous unseen
zero-shot downstream tasks, improving the base VL model performance by up to
14\%, and even comparing favorably to fully-supervised baselines in both
zero-shot and few-shot video recognition transfer. The code will be released
later at \url{https://github.com/wlin-at/MAXI}.
Comment: Accepted at ICCV 202
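The Multiple Instance Learning setup, where each unlabeled video should match at least one text in its bag, can be sketched with a MIL-NCE-style objective. This is a simplified illustration with assumed names, not the released code:

```python
import numpy as np

def mil_nce_loss(video_emb, bag_embs, neg_embs, temperature=0.07):
    """One video versus its text bag: at least one bag text should match.

    video_emb: (D,) video embedding; bag_embs: (B, D) texts in this
    video's bag; neg_embs: (M, D) negative texts. Positives are pooled
    with a sum of exponentials (a soft max over the bag), so the loss is
    low as soon as any single bag text aligns with the video.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    v, bag, neg = norm(video_emb), norm(bag_embs), norm(neg_embs)
    pos = np.exp(bag @ v / temperature).sum()
    negs = np.exp(neg @ v / temperature).sum()
    return -np.log(pos / (pos + negs))
```

The soft-max pooling over the bag is what makes the objective robust to noisy bag members: one good match from matching, expansion, or captioning is enough to drive the loss down.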
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Multilingual text-video retrieval methods have improved significantly in
recent years, but the performance for other languages lags behind English. We
propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve
multilingual text-video retrieval. Inspired by the fact that English text-video
retrieval outperforms other languages, we train a student model using input
text in different languages to match the cross-modal predictions from teacher
models using input text in English. We propose a cross entropy based objective
which forces the distribution over the student's text-video similarity scores
to be similar to those of the teacher models. We introduce a new multilingual
video dataset, Multi-YouCook2, by translating the English captions in the
YouCook2 video dataset to 8 other languages. Our method improves multilingual
text-video retrieval performance on Multi-YouCook2 and several other datasets
such as Multi-MSRVTT and VATEX. We also conduct an analysis of the
effectiveness of different multilingual text models as teachers.
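The distillation objective, forcing the student's text-video similarity distribution toward the teacher's, can be sketched as a cross-entropy between softmax-normalized similarity scores. The names and temperature are illustrative assumptions:

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def similarity_distillation_loss(student_sims, teacher_sims, temperature=1.0):
    """Cross-entropy between teacher and student similarity distributions.

    student_sims: (N,) student similarities of a non-English query to N
    candidate videos; teacher_sims: (N,) teacher similarities of the
    English query to the same videos. Minimized when the student's
    distribution matches the teacher's.
    """
    p_teacher = softmax(teacher_sims, temperature)
    log_p_student = np.log(softmax(student_sims, temperature))
    return -(p_teacher * log_p_student).sum()
```

Matching distributions over similarity scores, rather than the raw scores themselves, means the student only needs to reproduce the teacher's ranking behavior, not its exact embedding scale.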
Genesis and Evolution of the Concept of "Investment"
The paper shows the evolution of scientific views on defining the essence of 'investment' and its role in the development of the economy. The main regularities of the historical process of forming and developing scientific knowledge about investment are examined. The author ascertains the interrelation between scientific theoretical conceptions and their impact on the transformation of economic processes. Four stages in the formation of scientific knowledge about the essence and meaning of the category 'investment' are identified:

Stage 1: from the first commercial relations to the great geographical discoveries. At this stage the concept of 'investment' had not yet been introduced into science and was used merely to denote the principal part of a loan, i.e., the credit body.

Stage 2: the 17th-19th centuries (the Mercantilist and Physiocrat schools). The category of 'investment' was identified with money and wealth, which can be augmented through commerce, and subsequently with the concept of 'capital' to be channeled into export-oriented production.

Stage 3: the 19th century to the second half of the 20th century (the Classical School, marginalism, Marxism, the Neoclassical School). This period saw the formation of a comprehensive theory of investment: an understanding of the investment process was gained, a comprehensive model of capital-market performance was developed, and the factors influencing the saving and investment processes were identified. The interrelation of the categories 'saving', 'interest' and 'investment' was substantiated.

Stage 4: the second half of the 20th century to the present (Keynesianism, institutionalism, neo-Keynesian theory). At this stage of the development of theoretical conceptions of investment, the methodology for analyzing the investment process and evaluating the efficiency of investing, as well as the effects of investment in addressing social, political and other challenges, was further developed.