7 research outputs found

    H.R. 2863 (117th Congress) – First-Time Homebuyer Act of 2021

    Get PDF

    In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

    Full text link
    Large-scale noisy web image-text datasets have proven effective for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setting, text-video retrieval with uncurated & unpaired data, in which training uses only text queries together with uncurated web videos, without any paired text-video data. To this end, we propose an approach, In-Style, that learns the style of the text queries and transfers it to uncurated web videos. Moreover, to improve generalization, we show that one model can be trained with multiple text styles, and we introduce a multi-style contrastive training procedure that improves generalizability over several datasets simultaneously. We evaluate our model's retrieval performance on multiple datasets to demonstrate the advantages of our style transfer framework on the new task of uncurated & unpaired text-video retrieval, and we improve the state of the art in zero-shot text-video retrieval. Comment: Published at ICCV 2023, code: https://github.com/ninatu/in_styl
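
    The multi-style contrastive training described above can be illustrated with a minimal sketch: a symmetric InfoNCE loss over pseudo-paired text and video embeddings, averaged over several caption styles. This is an illustration under stated assumptions, not the paper's implementation; the style-transfer step that produces the captions is abstracted away, and all names (info_nce, multi_style_loss, text_encoder, temperature) are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb, video_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched text-video pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each text matches the video at the same batch index, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_style_loss(video_emb, captions_per_style, text_encoder):
    """Average the contrastive loss over several caption styles (assumed:
    captions_per_style holds one list of pseudo-captions per text style)."""
    losses = [info_nce(text_encoder(caps), video_emb)
              for caps in captions_per_style]
    return torch.stack(losses).mean()
```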

    Learning by Sorting: Self-supervised Learning with Group Ordering Constraints

    Full text link
    Contrastive learning has become an important tool for learning representations from unlabeled data. It mainly relies on minimizing the distance between positive data pairs (e.g., views of the same image) and maximizing the distance between negative data pairs (e.g., views of different images). This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints (GroCo), which sorts the distances of positive and negative pairs and computes the loss based on how many positive pairs have a larger distance than the negative pairs and are thus incorrectly ordered. To this end, the GroCo loss builds on differentiable sorting networks, which enable training with sorting supervision by matching a differentiable permutation matrix, produced by sorting a given set of scores, to a ground-truth permutation matrix. Applying this idea to group-wise pre-ordered inputs of multiple positive and negative pairs yields the GroCo loss with an implicit emphasis on strong positives and negatives, leading to better optimization of the local neighborhood. We evaluate the proposed formulation on various self-supervised learning benchmarks and show that it not only improves over vanilla contrastive learning but is also competitive with comparable methods in linear probing and outperforms current methods in k-NN performance. Comment: Published at ICCV 2023, Code @ https://github.com/ninatu/learning_by_sortin
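
    The ordering constraint can be sketched without the full differentiable sorting machinery: every (positive, negative) pair whose distances are inverted contributes to the loss. The sketch below is an assumption-laden simplification that swaps the paper's differentiable sorting networks for a sigmoid relaxation of pairwise ordering; the names (pos_dist, neg_dist, tau) are illustrative.

```python
import torch

def group_ordering_loss(pos_dist, neg_dist, tau=0.1):
    """pos_dist: (P,) distances to positives; neg_dist: (N,) distances
    to negatives, all for one anchor/group.

    A perfectly ordered group has every positive closer than every
    negative, so all margins (neg - pos) are positive and the loss -> 0.
    """
    margins = neg_dist.unsqueeze(0) - pos_dist.unsqueeze(1)  # (P, N)
    # Soft count of inverted pairs: sigmoid(-margin/tau) -> 1 when pos > neg.
    inversions = torch.sigmoid(-margins / tau)
    return inversions.mean()
```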

    Preserving Modality Structure Improves Multi-Modal Learning

    Full text link
    Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data, as they ignore the semantic structure present in modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space. To capture modality-specific semantic relationships between samples, we propose to learn multiple anchors and to represent the multifaceted relationship between samples in terms of their relationships to these anchors. To assign multiple anchors to each sample, we propose a novel Multi-Assignment Sinkhorn-Knopp algorithm. Our experiments demonstrate that the proposed approach learns semantically meaningful anchors in a self-supervised manner. Furthermore, our evaluation on the MSR-VTT and YouCook2 datasets demonstrates that our multi-anchor assignment solution achieves state-of-the-art performance and generalizes to both in- and out-of-domain datasets. Code: https://github.com/Swetha5/Multi_Sinkhorn_Knopp Comment: Accepted at ICCV 2023
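
    The assignment step can be pictured as a Sinkhorn-Knopp-style alternating rescaling, assuming each sample receives a budget of roughly k anchors while anchor usage stays balanced across the batch. This is only a sketch of that idea; the constraints in the paper's Multi-Assignment Sinkhorn-Knopp algorithm may differ, and all names (scores, k, eps) are assumptions.

```python
import torch

def multi_assignment_sinkhorn(scores, k=2, n_iters=3, eps=0.05):
    """scores: (B, A) similarities between B samples and A learned anchors.
    Returns a soft assignment matrix whose rows sum to ~k (each sample
    picks about k anchors) and whose columns are balanced across anchors."""
    q = torch.exp(scores / eps)  # soft assignment proposals
    for _ in range(n_iters):
        # Row step: give every sample a total assignment mass of k.
        q = q * (k / q.sum(dim=1, keepdim=True))
        # Column step: spread the total mass B*k evenly over A anchors.
        target_col = k * q.size(0) / q.size(1)
        q = q * (target_col / q.sum(dim=0, keepdim=True))
    return q
```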

    MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

    Full text link
    Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between the visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation and editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and they require additional tuning on video data for the best zero-shot action recognition performance. While previous work relied on large-scale, fully annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion, and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at https://github.com/wlin-at/MAXI. Comment: Accepted at ICCV 2023
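
    The Multiple Instance Learning setup can be sketched as follows: each video is paired with a bag of candidate texts, and the loss only requires that at least one text in the bag matches the video. The logsumexp pooling over the bag below follows the common MIL-NCE pattern and is an assumption; the paper's exact objective may differ, and the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(video_emb, bag_text_emb, temperature=0.07):
    """video_emb: (B, D); bag_text_emb: (B, M, D), M candidate texts per video."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(bag_text_emb, dim=-1)
    # Similarity of every video b to every text m in every bag c: (B, B, M).
    sims = torch.einsum('bd,cmd->bcm', v, t) / temperature
    # Pool over each bag: a video should match *some* text in its own bag.
    bag_logits = torch.logsumexp(sims, dim=-1)  # (B, B)
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(bag_logits, targets)
```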

    C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

    Full text link
    Multilingual text-video retrieval methods have improved significantly in recent years, but performance for languages other than English still lags behind. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model that receives input text in different languages to match the cross-modal predictions of teacher models that receive the corresponding input text in English. We propose a cross-entropy-based objective that forces the distribution over the student's text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions of the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets, such as Multi-MSRVTT and VATEX. We also provide an analysis of the effectiveness of different multilingual text models as teachers.
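
    A minimal sketch of the stated objective: the student sees a non-English query, the teacher sees its English counterpart, and the student's text-video similarity distribution is pulled toward the teacher's with a cross-entropy term. The function and argument names are assumptions; only the form of the loss follows the abstract.

```python
import torch
import torch.nn.functional as F

def c2kd_loss(student_sims, teacher_sims, temperature=1.0):
    """student_sims, teacher_sims: (B, V) text-video similarity scores for
    the same B queries (non-English for the student, English for the
    teacher) against the same V videos."""
    teacher_probs = F.softmax(teacher_sims / temperature, dim=-1).detach()
    student_logp = F.log_softmax(student_sims / temperature, dim=-1)
    # Cross-entropy between teacher and student score distributions.
    return -(teacher_probs * student_logp).sum(dim=-1).mean()
```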

    Genesis and Evolution of the Concept of "Investment"

    No full text
    The paper traces the evolution of scientific views on the essence of the concept of 'investment' and its role in the development of the economy. The main regularities of the historical process of forming and developing scientific knowledge about investment are examined, and the author establishes the interrelation between scientific-theoretical conceptions and their impact on the transformation of economic processes. Four stages in the formation of scientific knowledge about the essence and meaning of the category 'investment' are identified:
    Stage 1: from the first commercial relations to the great geographical discoveries. At this stage the concept of 'investment' had not yet been introduced into science and was used merely to denote the principal part of a loan, i.e., the credit body.
    Stage 2: the 17th to 19th centuries (the Mercantilist and Physiocrat schools). The category of 'investment' was identified with money and wealth, which could be augmented through commerce, and subsequently with the concept of 'capital' to be channeled into export-oriented production.
    Stage 3: the 19th century to the second half of the 20th century (the Classical school, marginalism, Marxism, the Neoclassical school). A comprehensive theory of investment was formed: an understanding of the investment process was gained, a comprehensive capital market performance model was developed, and the factors influencing the saving and investment processes were identified. The interrelation of the categories 'saving', 'interest', and 'investment' was substantiated.
    Stage 4: the second half of the 20th century to the present (Keynesianism, institutionalism, neo-Keynesian theory). At this stage the methodology for analyzing the investment process and for evaluating the efficiency of investment and its effects in addressing social, political, and other challenges was further developed.