MaPLe: Multi-modal Prompt Learning
Pre-trained vision-language (V-L) models such as CLIP have shown excellent
generalization ability to downstream tasks. However, they are sensitive to the
choice of input text prompts and require careful selection of prompt templates
to perform well. Inspired by the Natural Language Processing (NLP) literature,
recent CLIP adaptation approaches learn prompts as the textual inputs to
fine-tune CLIP for downstream tasks. We note that using prompting to adapt
representations in a single branch of CLIP (language or vision) is sub-optimal
since it does not allow the flexibility to dynamically adjust both
representation spaces on a downstream task. In this work, we propose
Multi-modal Prompt Learning (MaPLe) for both vision and language branches to
improve alignment between the vision and language representations. Our design
promotes strong coupling between the vision-language prompts to ensure mutual
synergy and discourages learning independent uni-modal solutions. Further, we
learn separate prompts across different early stages to progressively model the
stage-wise feature relationships to allow rich context learning. We evaluate
the effectiveness of our approach on three representative tasks of
generalization to novel classes, new target datasets and unseen domain shifts.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable
performance and achieves an absolute gain of 3.45% on novel classes and 2.72%
on overall harmonic-mean, averaged over 11 diverse image recognition datasets.
Our code and pre-trained models are available at
https://github.com/muzairkhattak/multimodal-prompt-learning.
Comment: Accepted at CVPR 2023
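To make the coupling idea concrete, below is a minimal sketch assuming PyTorch; the module name MultiModalPromptCoupler and the dimensions are illustrative assumptions, not the released implementation. Learnable text-side prompts at several early stages are each projected into the vision branch by a stage-wise linear coupling function.

```python
import torch
import torch.nn as nn

class MultiModalPromptCoupler(nn.Module):
    """Toy sketch of coupled vision-language prompts; names and dims are illustrative."""

    def __init__(self, n_stages=3, n_prompts=2, text_dim=512, vision_dim=768):
        super().__init__()
        # one set of learnable text-side prompts per early transformer stage
        self.text_prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(n_prompts, text_dim) * 0.02) for _ in range(n_stages)]
        )
        # stage-wise coupling functions that map text prompts to vision prompts
        self.couplers = nn.ModuleList(
            [nn.Linear(text_dim, vision_dim) for _ in range(n_stages)]
        )

    def forward(self):
        # return the (text, vision) prompt pair for every stage
        text = list(self.text_prompts)
        vision = [couple(p) for couple, p in zip(self.couplers, text)]
        return text, vision


if __name__ == "__main__":
    text_prompts, vision_prompts = MultiModalPromptCoupler()()
    print(text_prompts[0].shape, vision_prompts[0].shape)  # (2, 512) (2, 768)
```

Because the vision prompts are functions of the text prompts, gradients from both encoders update a shared set of parameters, which is what discourages independent uni-modal solutions.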
Fine-tuned CLIP Models are Efficient Video Learners
Large-scale multi-modal training with image-text pairs imparts strong
generalization ability to the CLIP model. Since training on a similar scale for videos is
infeasible, recent approaches focus on the effective transfer of image-based
CLIP to the video domain. In this pursuit, new parametric modules are added to
learn temporal information and inter-frame relationships which require
meticulous design efforts. Furthermore, when the resulting models are trained
on videos, they tend to overfit to the given task distribution and lose their
generalization ability. This raises the question: how can image-level CLIP
representations be effectively transferred to videos? In this work, we show that
a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to
bridge the domain gap from images to videos. Our qualitative analysis
illustrates that the frame-level processing from CLIP image-encoder followed by
feature pooling and similarity matching with corresponding text embeddings
helps in implicitly modeling the temporal cues within ViFi-CLIP. Such
fine-tuning helps the model to focus on scene dynamics, moving objects and
inter-object relationships. For low-data regimes where full fine-tuning is not
viable, we propose a 'bridge and prompt' approach that first uses fine-tuning
to bridge the domain gap and then learns prompts on language and vision side to
adapt CLIP representations. We extensively evaluate this simple yet strong
baseline on zero-shot, base-to-novel generalization, few-shot and fully
supervised settings across five video benchmarks. Our code is available at
https://github.com/muzairkhattak/ViFi-CLIP.
Comment: Accepted at CVPR 2023
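A minimal sketch of this frame-level pipeline follows, assuming PyTorch; the function name vifi_clip_style_score and the tensor layouts are illustrative assumptions rather than the official code. Frames are encoded independently by the image encoder, average-pooled over time, and matched to class text embeddings by cosine similarity.

```python
import torch
import torch.nn.functional as F

def vifi_clip_style_score(image_encoder, text_features, video):
    """Frame-level encoding, temporal pooling, and text matching (illustrative layouts).

    image_encoder: maps a batch of frames [N, 3, H, W] to embeddings [N, D]
    text_features: one pre-computed text embedding per class, [C, D]
    video:         a batch of clips, [B, T, 3, H, W]
    """
    b, t = video.shape[:2]
    frames = video.flatten(0, 1)               # [B*T, 3, H, W]
    frame_feats = image_encoder(frames)        # [B*T, D]
    frame_feats = frame_feats.reshape(b, t, -1)
    video_feats = frame_feats.mean(dim=1)      # temporal average pooling -> [B, D]
    video_feats = F.normalize(video_feats, dim=-1)
    text_feats = F.normalize(text_features, dim=-1)
    return video_feats @ text_feats.t()        # cosine-similarity logits [B, C]


if __name__ == "__main__":
    # stand-in encoder: global-average-pool RGB, then project to 512-d
    proj = torch.nn.Linear(3, 512)
    encoder = lambda x: proj(x.mean(dim=(2, 3)))
    logits = vifi_clip_style_score(encoder, torch.randn(5, 512),
                                   torch.randn(2, 8, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 5])
```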
Learning to Prompt with Text Only Supervision for Vision-Language Models
Foundational vision-language models such as CLIP are becoming a new paradigm
in vision, due to their excellent generalization abilities. However, adapting
these models for downstream tasks while maintaining their generalization
remains a challenge. In the literature, one branch of methods adapts CLIP by
learning prompts using visual information. While effective, most of these works
require labeled data, which is not always practical, and they often struggle to
generalize to new datasets because they overfit the source data. An alternative
line of work resorts to training-free methods that generate class descriptions
with large language models (LLMs) and perform prompt ensembling. However, these
methods often produce class-specific prompts that cannot be transferred to
other classes, and they incur higher costs because LLM descriptions must be
generated for each class separately. In this work, we propose to combine the
strengths of both streams of methods by learning prompts using only text data
derived from LLMs. Since supervised training of prompts is non-trivial in the
absence of images, we develop a training approach that allows prompts to extract
rich contextual knowledge from LLM data. Moreover, because the LLM contextual
data is mapped into the learned prompts, our method enables zero-shot transfer
of prompts to new classes and datasets, potentially cutting the LLM
prompt-engineering cost. To the best of our knowledge, this is the first work
that learns generalized prompts using text-only data. We perform extensive
evaluations on 4 benchmarks, where our method improves over prior ensembling
works while remaining competitive with those that utilize labeled images. Our
code and pre-trained models are
available at https://github.com/muzairkhattak/ProText.
Comment: Project Page: https://muzairkhattak.github.io/ProText
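The sketch below, assuming PyTorch, shows one way such text-only training could look; the function text_only_prompt_loss, the cosine-distance loss, and the input layouts are illustrative assumptions rather than the exact ProText objective. A shared learnable prompt is prepended to each class's tokens and trained so that the resulting embedding matches a frozen embedding of that class's LLM description, transferring LLM context into class-agnostic prompt vectors.

```python
import torch
import torch.nn.functional as F

def text_only_prompt_loss(text_encoder, learnable_prompt, class_token_embs, llm_desc_embs):
    """Illustrative text-only objective (an assumption, not the exact ProText loss).

    learnable_prompt: shared prompt vectors, [P, D_tok]
    class_token_embs: token embeddings of each class-name template, [C, L, D_tok]
    llm_desc_embs:    frozen embeddings of LLM descriptions per class, [C, D]
    text_encoder:     maps [C, P+L, D_tok] prompted sequences to embeddings [C, D]
    """
    c = class_token_embs.shape[0]
    # prepend the shared learnable prompt to every class's token sequence
    prompted = torch.cat([learnable_prompt.expand(c, -1, -1), class_token_embs], dim=1)
    pred = F.normalize(text_encoder(prompted), dim=-1)
    target = F.normalize(llm_desc_embs, dim=-1)
    # pull the prompted class embedding towards its LLM description embedding
    return (1.0 - (pred * target).sum(dim=-1)).mean()
```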
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Recent video recognition models utilize Transformer models for long-range
spatio-temporal context modeling. Video transformer designs are based on
self-attention that can model global context at a high computational cost. In
comparison, convolutional designs for videos offer an efficient alternative but
lack long-range dependency modeling. Towards achieving the best of both
designs, this work proposes Video-FocalNet, an effective and efficient
architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that
reverses the interaction and aggregation steps of self-attention for better
efficiency. Further, the aggregation step and the interaction step are both
implemented using efficient convolution and element-wise multiplication
operations that are computationally less expensive than their self-attention
counterparts on video representations. We extensively explore the design space
of focal modulation-based spatio-temporal context modeling and demonstrate our
parallel spatial and temporal encoding design to be the optimal choice.
Video-FocalNets perform favorably against the state-of-the-art
transformer-based models for video recognition on five large-scale datasets
(Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower
computational cost. Our code/models are released at
https://github.com/TalalWasim/Video-FocalNets.
Comment: Accepted to ICCV-2023. Camera-ready version. Project page: https://TalalWasim.github.io/Video-FocalNets
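As an illustration of reversing the interaction and aggregation steps, here is a single-level sketch in PyTorch; the module name, dimensions, and kernel sizes are assumptions and the hierarchical focal levels of the full architecture are omitted. Context is first aggregated with cheap depthwise convolutions along space and time in parallel, then injected into each query by element-wise multiplication.

```python
import torch
import torch.nn as nn

class SpatioTemporalFocalModulation(nn.Module):
    """Single-level sketch of focal modulation for video; dims/kernels are illustrative."""

    def __init__(self, dim=96):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # depthwise convs aggregate local context along space and time separately
        self.spatial_ctx = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.temporal_ctx = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: [B, T, H, W, C]
        b, t, h, w, c = x.shape
        q, v = self.q(x), self.v(x)

        # spatial aggregation: depthwise conv over (H, W) for every frame
        vs = v.permute(0, 1, 4, 2, 3).reshape(b * t, c, h, w)
        s_ctx = self.spatial_ctx(vs).reshape(b, t, c, h, w).permute(0, 1, 3, 4, 2)

        # temporal aggregation: depthwise conv over T for every spatial location
        vt = v.permute(0, 2, 3, 4, 1).reshape(b * h * w, c, t)
        t_ctx = self.temporal_ctx(vt).reshape(b, h, w, c, t).permute(0, 4, 1, 2, 3)

        # interaction: element-wise modulation of the query by the aggregated context
        return self.proj(q * (s_ctx + t_ctx))


if __name__ == "__main__":
    block = SpatioTemporalFocalModulation(dim=96)
    out = block(torch.randn(2, 8, 14, 14, 96))
    print(out.shape)  # torch.Size([2, 8, 14, 14, 96])
```

The parallel spatial and temporal branches mirror the design the paper reports as optimal, and both use only convolutions and element-wise products rather than attention.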
Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization
The promising zero-shot generalization of vision-language models such as CLIP
has led to their adoption using prompt learning for numerous downstream tasks.
Previous works have shown test-time prompt tuning using entropy minimization to
adapt text prompts for unseen domains. While effective, this overlooks the key
cause for performance degradation to unseen domains -- distribution shift. In
this work, we explicitly handle this problem by aligning the
out-of-distribution (OOD) test sample statistics to those of the source data
using prompt tuning. We use a single test sample to adapt multi-modal prompts
at test time by minimizing the feature distribution shift to bridge the gap in
the test domain. Evaluating against the domain generalization benchmark, our
method improves zero-shot top-1 accuracy beyond existing prompt-learning
techniques, with a 3.08% improvement over the baseline MaPLe. In cross-dataset
generalization with unseen categories across 10 datasets, our method improves
consistently across all datasets compared to the existing state-of-the-art. Our
source code and models are available at
https://jameelhassan.github.io/promptalign.
Comment: Accepted to NeurIPS 2023
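A minimal sketch of such a test-time objective is shown below, assuming PyTorch; the function prompt_align_losses and the use of mean/variance L1 matching are illustrative assumptions about the alignment term, which would be combined with the usual entropy of the averaged prediction to update the multi-modal prompts for a single test sample.

```python
import torch
import torch.nn.functional as F

def prompt_align_losses(token_feats, src_mean, src_var, logits):
    """Sketch of a test-time alignment objective; the exact statistics are assumed.

    token_feats: intermediate token features of one (augmented) test sample, [N, D]
    src_mean, src_var: pre-computed source-dataset statistics, each [D]
    logits: class logits for the augmented views, [V, C]
    """
    # distribution alignment: match first- and second-order token statistics
    align = (token_feats.mean(dim=0) - src_mean).abs().mean() \
          + (token_feats.var(dim=0) - src_var).abs().mean()
    # entropy of the averaged prediction, as in prior test-time prompt tuning
    probs = F.softmax(logits, dim=-1).mean(dim=0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    return align, entropy


if __name__ == "__main__":
    a, e = prompt_align_losses(torch.randn(197, 512), torch.zeros(512),
                               torch.ones(512), torch.randn(64, 1000))
    print(a.item(), e.item())
```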
Effects of Neem (Azadirachta indica) seed and Turmeric (Curcuma longa) rhizome extracts on aphids control, plant growth and yield in okra
The use of synthetic pesticides to control pests and increase crop yield is a common practice, but they cause several environmental and health problems. Therefore, there is a need to explore alternative approaches to reduce the sole dependence on synthetic pesticides. The present study was conducted to screen extracts of Neem seed and Turmeric rhizome for pesticidal activity against okra pests (aphids). Experiments were conducted in the field with four plots: one plot was kept as a control (unsprayed), one was sprayed with synthetic pesticides, one with Neem seed extract and one with Turmeric rhizome extract. The effect on the number of pests, plant growth and yield was observed at regular intervals. A significant reduction in pests was recorded in all treatments compared to the control. Neem seed extract was more effective than Turmeric rhizome extract, as revealed by a 73% decrease in aphids with Neem extract compared to 54% with Turmeric extract after the last application. Both extracts were found to be more effective than the synthetic pesticides in controlling okra pests, and both had stimulatory effects on okra growth and yield. For example, the total yield of the plots sprayed with Neem extract (53.3 kg plot-1) and Turmeric extract (47.7 kg plot-1) was higher than the yield of the control plot (33.8 kg plot-1) and the plot sprayed with synthetic pesticides (39 kg plot-1). It is concluded that Neem and Turmeric extracts can be used as alternatives to synthetic pesticides for controlling pest attacks in okra.
Public perception and willingness towards bystander cardiopulmonary resuscitation (CPR) training and performance in Pakistan
Background/Aim: Bystander cardiopulmonary resuscitation (CPR) during out-of-hospital cardiac arrest increases both survival rates and neurological recovery, but in Pakistan an alarmingly low 2.3 % of these individuals receive bystander CPR. This study was designed to identify the factors that affect the perception and willingness of the public toward bystander CPR training and performance in Lahore, Pakistan. Methods: A CPR master trainer from the USA visited various organisations from 1 December 2022 to 31 January 2023 to conduct training sessions. Before and after the training, a questionnaire was distributed to respondents to complete. The subjects were asked to answer questions about their perception of and willingness to perform bystander CPR. Results: Out of 401 participants, 240 completed the survey, giving a response rate of 59.85 %. The majority were males [146 (60.8 %)], 215 (89.6 %) were below the age of 40, 107 (44.6 %) were graduates and 182 (75.8 %) had never participated in any CPR training, mainly because they were unaware of the importance of bystander CPR (52.8 %); 152 (63.3 %) were eager to participate in a CPR training course. Furthermore, the leading barrier to providing bystander CPR was lack of technique or fear of causing harm that could prove fatal (48.8 %), followed by concerns about involvement in legal proceedings (10.0 %). Conclusions: Bystander CPR is still uncommon in Pakistan. Participants were reluctant to perform bystander CPR because of various concerns and fears, chiefly lack of proper skill and fear of causing additional harm. Hence, these findings must be considered when improving CPR training and public education.
Impact of opioid-free analgesia on pain severity and patient satisfaction after discharge from surgery: multispecialty, prospective cohort study in 25 countries
Background: Balancing opioid stewardship and the need for adequate analgesia following discharge after surgery is challenging. This study aimed to compare the outcomes for patients discharged with opioid versus opioid-free analgesia after common surgical procedures. Methods: This international, multicentre, prospective cohort study collected data from patients undergoing common acute and elective general surgical, urological, gynaecological, and orthopaedic procedures. The primary outcomes were patient-reported time in severe pain measured on a numerical analogue scale from 0 to 100% and patient-reported satisfaction with pain relief during the first week following discharge. Data were collected by in-hospital chart review and patient telephone interview 1 week after discharge. Results: The study recruited 4273 patients from 144 centres in 25 countries; 1311 patients (30.7%) were prescribed opioid analgesia at discharge. Patients reported being in severe pain for 10 (i.q.r. 1-30)% of the first week after discharge and rated satisfaction with analgesia as 90 (i.q.r. 80-100) of 100. After adjustment for confounders, opioid analgesia on discharge was independently associated with increased pain severity (risk ratio 1.52, 95% c.i. 1.31 to 1.76; P < 0.001) and re-presentation to healthcare providers owing to side-effects of medication (OR 2.38, 95% c.i. 1.36 to 4.17; P = 0.004), but not with satisfaction with analgesia (beta coefficient 0.92, 95% c.i. -1.52 to 3.36; P = 0.468) compared with opioid-free analgesia. Although opioid prescribing varied greatly between high-income and low- and middle-income countries, patient-reported outcomes did not. Conclusion: Opioid analgesia prescription on surgical discharge is associated with a higher risk of re-presentation owing to side-effects of medication and increased patient-reported pain, but not with changes in patient-reported satisfaction. Opioid-free discharge analgesia should be adopted routinely.
Mortality from gastrointestinal congenital anomalies at 264 hospitals in 74 low-income, middle-income, and high-income countries: a multicentre, international, prospective cohort study
Summary
Background Congenital anomalies are the fifth leading cause of mortality in children younger than 5 years globally.
Many gastrointestinal congenital anomalies are fatal without timely access to neonatal surgical care, but few studies
have been done on these conditions in low-income and middle-income countries (LMICs). We compared outcomes of
the seven most common gastrointestinal congenital anomalies in low-income, middle-income, and high-income
countries globally, and identified factors associated with mortality.
Methods We did a multicentre, international prospective cohort study of patients younger than 16 years, presenting to
hospital for the first time with oesophageal atresia, congenital diaphragmatic hernia, intestinal atresia, gastroschisis,
exomphalos, anorectal malformation, and Hirschsprung’s disease. Recruitment was of consecutive patients for a
minimum of 1 month between October, 2018, and April, 2019. We collected data on patient demographics, clinical
status, interventions, and outcomes using the REDCap platform. Patients were followed up for 30 days after primary
intervention, or 30 days after admission if they did not receive an intervention. The primary outcome was all-cause,
in-hospital mortality for all conditions combined and each condition individually, stratified by country income status.
We did a complete case analysis.
Findings We included 3849 patients with 3975 study conditions (560 with oesophageal atresia, 448 with congenital
diaphragmatic hernia, 681 with intestinal atresia, 453 with gastroschisis, 325 with exomphalos, 991 with anorectal
malformation, and 517 with Hirschsprung’s disease) from 264 hospitals (89 in high-income countries, 166 in middle-income
countries, and nine in low-income countries) in 74 countries. Of the 3849 patients, 2231 (58·0%) were male.
Median gestational age at birth was 38 weeks (IQR 36–39) and median bodyweight at presentation was 2·8 kg (2·3–3·3).
Mortality among all patients was 37 (39·8%) of 93 in low-income countries, 583 (20·4%) of 2860 in middle-income
countries, and 50 (5·6%) of 896 in high-income countries (p<0·0001 between all country income groups).
Gastroschisis had the greatest difference in mortality between country income strata (nine [90·0%] of ten in low-income
countries, 97 [31·9%] of 304 in middle-income countries, and two [1·4%] of 139 in high-income countries;
p≤0·0001 between all country income groups). Factors significantly associated with higher mortality for all patients
combined included country income status (low-income vs high-income countries, risk ratio 2·78 [95% CI 1·88–4·11],
p<0·0001; middle-income vs high-income countries, 2·11 [1·59–2·79], p<0·0001), sepsis at presentation (1·20
[1·04–1·40], p=0·016), higher American Society of Anesthesiologists (ASA) score at primary intervention
(ASA 4–5 vs ASA 1–2, 1·82 [1·40–2·35], p<0·0001; ASA 3 vs ASA 1–2, 1·58, [1·30–1·92], p<0·0001]), surgical safety
checklist not used (1·39 [1·02–1·90], p=0·035), and ventilation or parenteral nutrition unavailable when needed
(ventilation 1·96, [1·41–2·71], p=0·0001; parenteral nutrition 1·35, [1·05–1·74], p=0·018). Administration of
parenteral nutrition (0·61, [0·47–0·79], p=0·0002) and use of a peripherally inserted central catheter (0·65
[0·50–0·86], p=0·0024) or percutaneous central line (0·69 [0·48–1·00], p=0·049) were associated with lower mortality.
Interpretation Unacceptable differences in mortality exist for gastrointestinal congenital anomalies between low-income,
middle-income, and high-income countries. Improving access to quality neonatal surgical care in LMICs will
be vital to achieve Sustainable Development Goal 3.2 of ending preventable deaths in neonates and children younger
than 5 years by 2030
Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
Existing open-vocabulary object detectors typically enlarge their vocabulary
sizes by leveraging different forms of weak supervision. This helps generalize
to novel objects at inference. Two popular forms of weak supervision used in
open-vocabulary detection (OVD) are a pretrained CLIP model and image-level
supervision. We note that neither mode of supervision is optimally
aligned for the detection task: CLIP is trained with image-text pairs and lacks
precise localization of objects while the image-level supervision has been used
with heuristics that do not accurately specify local object regions. In this
work, we propose to address this problem by performing object-centric alignment
of the language embeddings from the CLIP model. Furthermore, we visually ground
the objects with only image-level supervision using a pseudo-labeling process
that provides high-quality object proposals and helps expand the vocabulary
during training. We establish a bridge between the above two object-alignment
strategies via a novel weight transfer function that aggregates their
complementary strengths. In essence, the proposed model seeks to minimize the
gap between object and image-centric representations in the OVD setting. On the
COCO benchmark, our proposed approach achieves 40.3 AP50 on novel classes, an
absolute 11.9 gain over the previous best performance. For LVIS, we surpass the
state-of-the-art ViLD model by 5.0 mask AP for rare categories and 3.4 overall.
Code: https://bit.ly/3byZoQp
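As a rough illustration of a weight transfer function, the sketch below (PyTorch, with assumed names and shapes such as WeightTransfer and a 512-d embedding) transforms the projection weights of the region-level alignment head and reuses them as the image-level head, so that the two forms of weak supervision share parameters instead of being learned independently; it is not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightTransfer(nn.Module):
    """Toy weight transfer between two alignment heads; names and shapes are assumed."""

    def __init__(self, dim=512, hidden=256):
        super().__init__()
        # small MLP that transforms the region head's weights row by row
        self.transfer = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, region_head_weight, image_feats):
        # region_head_weight: [C, D] projection learned by the object-centric branch
        transferred = self.transfer(region_head_weight)    # [C, D]
        # reuse the transferred weights as the image-level projection (no separate head)
        return F.linear(image_feats, transferred)          # [B, C] logits


if __name__ == "__main__":
    wt = WeightTransfer(dim=512)
    logits = wt(torch.randn(80, 512), torch.randn(4, 512))
    print(logits.shape)  # torch.Size([4, 80])
```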