677 research outputs found

    Open-Set Image Tagging with Multi-Grained Text Supervision

    Full text link
    In this paper, we introduce the Recognize Anything Plus Model (RAM++), an open-set image tagging model effectively leveraging multi-grained text supervision. Previous approaches (e.g., CLIP) primarily utilize global text supervision paired with images, leading to sub-optimal performance in recognizing multiple individual semantic tags. In contrast, RAM++ seamlessly integrates individual tag supervision with global text supervision, all within a unified alignment framework. This integration not only ensures efficient recognition of predefined tag categories, but also enhances generalization capabilities for diverse open-set categories. Furthermore, RAM++ employs large language models (LLMs) to convert semantically constrained tag supervision into more expansive tag description supervision, thereby enriching the scope of open-set visual description concepts. Comprehensive evaluations on various image recognition benchmarks demonstrate RAM++ exceeds existing state-of-the-art (SOTA) open-set image tagging models on most aspects. Specifically, for predefined commonly used tag categories, RAM++ showcases 10.2 mAP and 15.4 mAP enhancements over CLIP on OpenImages and ImageNet. For open-set categories beyond predefined, RAM++ records improvements of 5.0 mAP and 6.4 mAP over CLIP and RAM respectively on OpenImages. For diverse human-object interaction phrases, RAM++ achieves 7.8 mAP and 4.7 mAP improvements on the HICO benchmark. Code, datasets and pre-trained models are available at \url{https://github.com/xinyu1205/recognize-anything}.Comment: Homepage: https://github.com/xinyu1205/recognize-anythin

    The call-by-value Lambda-Calculus with generalized applications

    Get PDF
    The lambda-calculus with generalized applications is the Curry-Howard counterpart to the system of natural deduction with generalized elimination rules for intuitionistic implicational logic. In this paper we identify a call-by-value variant of the system and prove confluence, strong normalization, and standardization. In the end, we show that the cbn and cbv variants of the system simulate each other via mappings based on extensions of the "protecting-by-a-lambda" compilation technique.FCT -Fundação para a Ciência e a Tecnologia(UID/MAT/00013/2013

    As firm as their foundations: can open-sourced foundation models be used to create adversarial examples for downstream tasks?

    Get PDF
    Foundation models pre-trained on web-scale vision-language data, such as CLIP, are widely used as cornerstones of powerful machine learning systems. While pre-training offers clear advantages for downstream learning, it also endows downstream models with shared adversarial vulnerabilities that can be easily identified through the open-sourced foundation model. In this work, we expose such vulnerabilities among CLIP’s downstream models and show that foundation models can serve as a basis for attacking their downstream systems. In particular, we propose a simple yet alarmingly effective adversarial attack strategy termed Patch Representation Misalignment (PRM). Solely based on open-sourced CLIP vision encoders, this method can produce highly effective adversaries that simultaneously fool more than 20 downstream models spanning 4 common vision-language tasks (semantic segmentation, object detection, image captioning and visual question-answering). Our findings highlight the concerning safety risks introduced by the extensive usage of publicly available foundational models in the development of downstream systems, calling for extra caution in these scenarios

    MEWL: Few-shot multimodal word learning with referential uncertainty

    Full text link
    Without explicit feedback, humans can rapidly learn the meaning of words. Children can acquire a new word after just a few passive exposures, a process known as fast mapping. This word learning capability is believed to be the most fundamental building block of multimodal understanding and reasoning. Despite recent advancements in multimodal learning, a systematic and rigorous evaluation is still missing for human-like word learning in machines. To fill in this gap, we introduce the MachinE Word Learning (MEWL) benchmark to assess how machines learn word meaning in grounded visual scenes. MEWL covers human's core cognitive toolkits in word learning: cross-situational reasoning, bootstrapping, and pragmatic learning. Specifically, MEWL is a few-shot benchmark suite consisting of nine tasks for probing various word learning capabilities. These tasks are carefully designed to be aligned with the children's core abilities in word learning and echo the theories in the developmental literature. By evaluating multimodal and unimodal agents' performance with a comparative analysis of human performance, we notice a sharp divergence in human and machine word learning. We further discuss these differences between humans and machines and call for human-like few-shot word learning in machines.Comment: Accepted at ICML 202

    A Comparative Study of Pairwise Learning Methods based on Kernel Ridge Regression

    Full text link
    Many machine learning problems can be formulated as predicting labels for a pair of objects. Problems of that kind are often referred to as pairwise learning, dyadic prediction or network inference problems. During the last decade kernel methods have played a dominant role in pairwise learning. They still obtain a state-of-the-art predictive performance, but a theoretical analysis of their behavior has been underexplored in the machine learning literature. In this work we review and unify existing kernel-based algorithms that are commonly used in different pairwise learning settings, ranging from matrix filtering to zero-shot learning. To this end, we focus on closed-form efficient instantiations of Kronecker kernel ridge regression. We show that independent task kernel ridge regression, two-step kernel ridge regression and a linear matrix filter arise naturally as a special case of Kronecker kernel ridge regression, implying that all these methods implicitly minimize a squared loss. In addition, we analyze universality, consistency and spectral filtering properties. Our theoretical results provide valuable insights in assessing the advantages and limitations of existing pairwise learning methods.Comment: arXiv admin note: text overlap with arXiv:1606.0427

    Audio-Visual Egocentric Action Recognition

    Get PDF

    Evaluating the transferability of personalised exercise recognition models.

    Get PDF
    Exercise Recognition (ExR) is relevant in many high impact domains, from health care to recreational activities to sports sciences. Like Human Activity Recognition (HAR), ExR faces many challenges when deployed in the real-world. For instance, typical lab performances of Machine Learning models, are hard to replicate, due to differences in personal nuances, traits and ambulatory rhythms. Thus effective transferability of a trained ExR model, depends on its ability to adapt and personalise to new users or user groups. This calls for new experimental design strategies that are also person-aware, and able to organise train and test data differently from standard ML practice. Speciffically, we look at person-agnostic and person-aware methods of train-test data creation, and compare them to identify best practices on a comparative study of personalised ExR model transfer. Our findings show that ExR when compared to results with other HAR tasks, to be a far more challenging personalisation problem and also confirms the utility of metric learning algorithms for personalised model transfer
    • …
    corecore