677 research outputs found
Open-Set Image Tagging with Multi-Grained Text Supervision
In this paper, we introduce the Recognize Anything Plus Model (RAM++), an
open-set image tagging model effectively leveraging multi-grained text
supervision. Previous approaches (e.g., CLIP) primarily utilize global text
supervision paired with images, leading to sub-optimal performance in
recognizing multiple individual semantic tags. In contrast, RAM++ seamlessly
integrates individual tag supervision with global text supervision, all within
a unified alignment framework. This integration not only ensures efficient
recognition of predefined tag categories, but also enhances generalization
capabilities for diverse open-set categories. Furthermore, RAM++ employs large
language models (LLMs) to convert semantically constrained tag supervision into
more expansive tag description supervision, thereby enriching the scope of
open-set visual description concepts. Comprehensive evaluations on various
image recognition benchmarks demonstrate that RAM++ exceeds existing
state-of-the-art (SOTA) open-set image tagging models in most respects.
Specifically, for predefined, commonly used tag categories, RAM++ achieves
improvements of 10.2 mAP and 15.4 mAP over CLIP on OpenImages and ImageNet
respectively. For open-set categories beyond the predefined set, RAM++ records
improvements of 5.0 mAP and 6.4 mAP over CLIP and RAM respectively on
OpenImages. For diverse human-object interaction phrases, RAM++ achieves
improvements of 7.8 mAP and 4.7 mAP on the HICO benchmark. Code, datasets and
pre-trained models are
available at \url{https://github.com/xinyu1205/recognize-anything}.
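Setting the model details aside, the description-based scoring idea in the abstract — aligning an image with LLM-generated tag descriptions rather than a single tag name — can be sketched in a few lines. This toy `tag_scores` function, the max-pooling over descriptions, and the plain cosine similarity are illustrative assumptions, not RAM++'s actual learned alignment head:

```python
import numpy as np

def tag_scores(image_emb, desc_embs_per_tag):
    """Hypothetical sketch: score each tag by aligning the image
    embedding with that tag's description embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    scores = []
    for descs in desc_embs_per_tag:            # one (num_descs, dim) array per tag
        descs = descs / np.linalg.norm(descs, axis=1, keepdims=True)
        # A tag fires if its best-matching description aligns with the image
        scores.append(float(np.max(descs @ image_emb)))
    return scores
```

Max-pooling over descriptions captures the intuition that only one expansive description (e.g. "a striped big cat" for "tiger") needs to match for the tag to apply.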
The call-by-value Lambda-Calculus with generalized applications
The lambda-calculus with generalized applications is the Curry-Howard counterpart to the system of natural deduction with generalized elimination rules for intuitionistic implicational logic. In this paper we identify a call-by-value variant of the system and prove confluence, strong normalization, and standardization. In the end, we show that the call-by-name (cbn) and call-by-value (cbv) variants of the system simulate each other via mappings based on extensions of the "protecting-by-a-lambda" compilation technique.

FCT - Fundação para a Ciência e a Tecnologia (UID/MAT/00013/2013)
As firm as their foundations: can open-sourced foundation models be used to create adversarial examples for downstream tasks?
Foundation models pre-trained on web-scale vision-language data, such as CLIP, are widely used as cornerstones of powerful machine learning systems. While pre-training offers clear advantages for downstream learning, it also endows downstream models with shared adversarial vulnerabilities that can be easily identified through the open-sourced foundation model. In this work, we expose such vulnerabilities among CLIP's downstream models and show that foundation models can serve as a basis for attacking their downstream systems. In particular, we propose a simple yet alarmingly effective adversarial attack strategy termed Patch Representation Misalignment (PRM). Solely based on open-sourced CLIP vision encoders, this method can produce highly effective adversaries that simultaneously fool more than 20 downstream models spanning 4 common vision-language tasks (semantic segmentation, object detection, image captioning and visual question-answering). Our findings highlight the concerning safety risks introduced by the extensive usage of publicly available foundation models in the development of downstream systems, calling for extra caution in these scenarios.
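The abstract does not spell out the attack's objective, but the general shape of such an encoder-level attack can be sketched as a PGD-style loop that pushes patch features away from their clean values. The function name `prm_attack`, the cosine objective, and the L-infinity budget are assumptions for illustration, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def prm_attack(encoder, images, eps=8/255, alpha=2/255, steps=10):
    """Hypothetical PGD-style sketch of patch-feature misalignment,
    assuming `encoder` returns patch-level features (B, patches, dim)."""
    with torch.no_grad():
        clean = encoder(images)                  # clean patch features
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        sim = F.cosine_similarity(encoder(images + delta), clean, dim=-1)
        sim.mean().backward()                    # similarity to clean features
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend: push features apart
            delta.clamp_(-eps, eps)              # stay inside the L-inf budget
        delta.grad.zero_()
    return (images + delta).clamp(0, 1).detach()
```

Because only the open-sourced encoder is needed, any downstream model built on its features inherits whatever misalignment the perturbation induces — which is the transferability concern the paper raises.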
MEWL: Few-shot multimodal word learning with referential uncertainty
Without explicit feedback, humans can rapidly learn the meaning of words.
Children can acquire a new word after just a few passive exposures, a process
known as fast mapping. This word learning capability is believed to be the most
fundamental building block of multimodal understanding and reasoning. Despite
recent advancements in multimodal learning, a systematic and rigorous
evaluation of human-like word learning in machines is still missing. To fill
this gap, we introduce the MachinE Word Learning (MEWL) benchmark to assess
how machines learn word meaning in grounded visual scenes. MEWL covers the
core cognitive toolkits humans use in word learning: cross-situational reasoning,
bootstrapping, and pragmatic learning. Specifically, MEWL is a few-shot
benchmark suite consisting of nine tasks for probing various word learning
capabilities. These tasks are carefully designed to align with children's core
abilities in word learning and to echo theories in the developmental
literature. By evaluating multimodal and unimodal agents and comparing their
performance with that of humans, we observe a sharp divergence between human
and machine word learning. We further discuss these differences and call for
human-like few-shot word learning in machines.

Comment: Accepted at ICML 202
A Comparative Study of Pairwise Learning Methods based on Kernel Ridge Regression
Many machine learning problems can be formulated as predicting labels for a
pair of objects. Problems of that kind are often referred to as pairwise
learning, dyadic prediction or network inference problems. During the last
decade, kernel methods have played a dominant role in pairwise learning. They
still obtain state-of-the-art predictive performance, but a theoretical
analysis of their behavior has been underexplored in the machine learning
literature.
In this work we review and unify existing kernel-based algorithms that are
commonly used in different pairwise learning settings, ranging from matrix
filtering to zero-shot learning. To this end, we focus on closed-form efficient
instantiations of Kronecker kernel ridge regression. We show that independent
task kernel ridge regression, two-step kernel ridge regression and a linear
matrix filter arise naturally as special cases of Kronecker kernel ridge
regression, implying that all these methods implicitly minimize a squared loss.
In addition, we analyze universality, consistency and spectral filtering
properties. Our theoretical results provide valuable insights for assessing the
advantages and limitations of existing pairwise learning methods.

Comment: arXiv admin note: text overlap with arXiv:1606.0427
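The closed-form efficiency the abstract alludes to rests on a standard eigen-trick: with eigendecompositions of the object and task kernels, Kronecker kernel ridge regression never materialises the full Kronecker system. A minimal NumPy sketch (the function name and interface are ours, not the paper's):

```python
import numpy as np

def kronecker_krr(K, G, Y, lam):
    """Solve (G kron K + lam*I) vec(A) = vec(Y) without forming the
    Kronecker product, via eigendecompositions of K and G."""
    lk, U = np.linalg.eigh(K)          # object-kernel eigenpairs
    lg, V = np.linalg.eigh(G)          # task-kernel eigenpairs
    Yt = U.T @ Y @ V                   # labels in the joint eigenbasis
    A = Yt / (np.outer(lk, lg) + lam)  # per-coefficient spectral shrinkage
    return U @ A @ V.T                 # dual coefficients; fits are K @ A @ G
```

The element-wise division line makes the spectral-filtering view explicit: each eigen-coefficient is shrunk by its Kronecker eigenvalue, which is exactly the squared-loss regularisation the special cases above inherit.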
Evaluating the transferability of personalised exercise recognition models.
Exercise Recognition (ExR) is relevant in many high-impact domains, from health care to recreational activities to sports sciences. Like Human Activity Recognition (HAR), ExR faces many challenges when deployed in the real world. For instance, typical lab performances of machine learning models are hard to replicate, due to differences in personal nuances, traits and ambulatory rhythms. Effective transferability of a trained ExR model therefore depends on its ability to adapt and personalise to new users or user groups. This calls for new experimental design strategies that are person-aware and able to organise train and test data differently from standard ML practice. Specifically, we look at person-agnostic and person-aware methods of train-test data creation, and compare them to identify best practices in a comparative study of personalised ExR model transfer. Our findings show that ExR, compared to other HAR tasks, is a far more challenging personalisation problem, and also confirm the utility of metric learning algorithms for personalised model transfer.
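The person-agnostic versus person-aware split strategies contrasted above can be sketched directly. These two helpers are illustrative, not the paper's evaluation code; the key difference is whether a person's data may leak across the train/test boundary:

```python
import random

def person_agnostic_split(samples, test_frac=0.3, seed=0):
    """Standard ML practice: shuffle all windows, so the same person
    can appear in both train and test (tends to overestimate transfer)."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    return idx[:cut], idx[cut:]

def person_aware_split(samples, test_persons):
    """Hold out whole persons: test users are never seen in training,
    so the score reflects transfer to genuinely new users."""
    train = [i for i, s in enumerate(samples) if s["person"] not in test_persons]
    test = [i for i, s in enumerate(samples) if s["person"] in test_persons]
    return train, test
```

Under the person-aware split, any personalisation (e.g. metric learning over a new user's few labelled windows) must happen at test time, which is the transfer setting the study evaluates.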