On the Adversarial Robustness of Vision Transformers
Following their success in natural language processing and understanding,
transformers are expected to bring revolutionary changes to computer vision.
This work provides the first comprehensive study of the robustness of vision
transformers (ViTs) against adversarial perturbations. Across various
white-box and transfer attack settings, we find that ViTs
possess better adversarial robustness than convolutional neural
possess better adversarial robustness when compared with convolutional neural
networks (CNNs). This observation also holds for certified robustness. We
summarize the following main observations contributing to the improved
robustness of ViTs:
1) Features learned by ViTs contain less low-level information and are more
generalizable, which contributes to superior robustness against adversarial
perturbations.
2) Introducing convolutional or tokens-to-token blocks for learning low-level
features in ViTs can improve classification accuracy but at the cost of
adversarial robustness.
3) Increasing the proportion of transformers in the model structure (when the
model consists of both transformer and CNN blocks) leads to better robustness.
But for a pure transformer model, simply increasing the size or adding layers
cannot guarantee a similar effect.
4) Pre-training on larger datasets does not significantly improve adversarial
robustness, though it is critical for training ViTs.
5) Adversarial training is also applicable to ViT for training robust models.
Furthermore, feature visualization and frequency analysis are conducted for
explanation. The results show that ViTs are less sensitive to high-frequency
perturbations than CNNs, and that there is a high correlation between how well
the model learns low-level features and its robustness against different
frequency-based perturbations.
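As a concrete illustration of the kind of white-box evaluation described above, the following is a minimal L-infinity PGD robustness check in PyTorch. It is a generic sketch of this attack family, not the paper's exact configuration; model, loader, and the attack hyperparameters are placeholders.

```python
# Minimal L_inf PGD robustness evaluation (generic sketch; epsilon,
# step size, and step count are illustrative, not the paper's values).
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Iteratively step along the sign of the loss gradient, projecting
    back into the eps-ball around the clean input after each step."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # stay within the eps-ball
        x_adv = x_adv.clamp(0, 1)                 # stay a valid image
    return x_adv.detach()

def robust_accuracy(model, loader, device="cuda"):
    """Fraction of adversarially perturbed inputs classified correctly."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```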
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Despite recent progress towards scaling up multimodal vision-language models,
these models are still known to struggle on compositional generalization
benchmarks such as Winoground. We find that a critical component lacking from
current vision-language models is relation-level alignment: the ability to
match directional semantic relations in text (e.g., "mug in grass") with
spatial relationships in the image (e.g., the position of the mug relative to
the grass). To tackle this problem, we show that relation alignment can be
enforced by encouraging the directed language attention from 'mug' to 'grass'
(capturing the semantic relation 'in') to match the directed visual attention
from the mug to the grass. Tokens and their corresponding objects are softly
identified using the cross-modal attention. We prove that this notion of soft
relation alignment is equivalent to enforcing congruence between vision and
language attention matrices under a 'change of basis' provided by the
cross-modal attention matrix. Intuitively, our approach projects visual
attention into the language attention space to calculate its divergence from
the actual language attention, and vice versa. We apply our Cross-modal
Attention Congruence Regularization (CACR) loss to UNITER and improve on the
state-of-the-art approach to Winoground.
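To make the 'change of basis' intuition concrete, here is a minimal sketch of a congruence loss between the two attention matrices. Variable names, shapes, and the choice of a row-wise softmax with KL divergence are assumptions for illustration, not the exact CACR loss.

```python
# Sketch of attention congruence under a cross-modal change of basis
# (illustrative; not the paper's exact CACR implementation).
import torch
import torch.nn.functional as F

def congruence_loss(A_lang, A_vis, C):
    """A_lang: (T, T) language self-attention.
       A_vis:  (V, V) visual self-attention.
       C:      (T, V) cross-modal attention from text tokens to regions."""
    # Project visual attention into the language attention space.
    proj = C @ A_vis @ C.transpose(-1, -2)            # (T, T)
    log_p = F.log_softmax(proj, dim=-1)               # projected rows as distributions
    q = F.softmax(A_lang, dim=-1)                     # actual language attention
    return F.kl_div(log_p, q, reduction="batchmean")  # divergence between the two
```

The symmetric term (language attention projected into the visual space, the "vice versa" direction) would be computed analogously with C transposed.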
MPCFormer: fast, performant and private Transformer inference with MPC
Enabling private inference is crucial for many cloud inference services that
are based on Transformer models. However, existing private inference solutions
for Transformers can increase the inference latency by more than 60x or
significantly compromise the quality of inference results. In this paper, we
design the framework MPCFormer using secure multi-party computation (MPC) and
knowledge distillation (KD). It can be used in tandem with many specifically
designed MPC-friendly approximations and trained Transformer models. MPCFormer
significantly speeds up Transformer model inference in MPC settings while
achieving ML performance similar to that of the input model. We evaluate
MPCFormer under various MPC settings. On the IMDb dataset, we achieve
performance similar to BERT-Base while being 5.3x faster. On the GLUE
benchmark, we achieve 97% of BERT-Base's performance with a 2.2x speedup. We
show that MPCFormer remains effective with different trained Transformer
weights such as RoBERTa-Base and with larger models including BERT-Large. In
particular, we achieve performance similar to BERT-Large while being 5.93x
faster on the IMDb dataset.
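The flavor of the MPC-friendly approximations can be illustrated with quadratic surrogates for the Transformer's nonlinearities, which avoid the exponentials and error functions that are expensive under secret sharing. The constants below follow commonly cited 'Quad'-style forms and are illustrative rather than the released implementation; KD then recovers accuracy lost to the swap.

```python
# Illustrative MPC-friendly quadratic surrogates for Transformer
# nonlinearities (constants are commonly cited 'Quad'-style forms and
# may differ from MPCFormer's released code).
import torch

def quad_gelu(x):
    # Quadratic stand-in for GELU: multiplications are far cheaper
    # than erf/exp in secret-shared arithmetic.
    return 0.125 * x**2 + 0.25 * x + 0.5

def quad_softmax(x, c=5.0, dim=-1):
    # Replace exp with a shifted square, then normalize row-wise.
    z = (x + c) ** 2
    return z / z.sum(dim=dim, keepdim=True)
```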
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for
evaluation of instruction-following vision-language models for real-world use.
Our starting point is curating 70 'instruction families' that we envision
instruction-tuned vision-language models should be able to address. Extending
beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to
game playing and creative generation. Following curation, our dataset comprises
592 test queries, each with a human-authored instruction-conditioned caption.
These descriptions surface instruction-specific factors, e.g., for an
instruction asking about the accessibility of a storefront for wheelchair
users, the instruction-conditioned caption describes ramps/potential obstacles.
These descriptions enable 1) collecting human-verified reference outputs for
each instance; and 2) automatic evaluation of candidate multimodal generations
using a text-only LLM, aligning with human judgment. We quantify quality gaps
between models and references using both human and automatic evaluations; e.g.,
the top-performing instruction-following model wins against the GPT-4 reference
in just 27% of comparisons. VisIT-Bench is dynamic to participate in:
practitioners simply submit their model's responses on the project website.
Data, code, and the leaderboard are available at visit-bench.github.io.
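The text-only automatic evaluation can be sketched as follows: the judge model never sees the image, only the instruction and the human-authored instruction-conditioned caption. The llm callable and the prompt wording are hypothetical placeholders, not the benchmark's actual judging prompt.

```python
# Sketch of text-only LLM judging (the `llm` callable and the prompt
# are hypothetical placeholders, not VisIT-Bench's actual prompt).
def judge(llm, instruction, caption, response_a, response_b):
    """Return 'A' or 'B' according to the judge's stated preference."""
    prompt = (
        f"Image description: {caption}\n"
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response follows the instruction better? Answer 'A' or 'B'."
    )
    verdict = llm(prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```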
Efficacy and safety of low-dose IL-2 in the treatment of systemic lupus erythematosus: A randomised, double-blind, placebo-controlled trial
Objectives Open-labelled clinical trials suggested that
low-dose IL-2 might be effective in treatment of systemic
lupus erythematosus (SLE). A double-blind and placebo-controlled trial is required to formally evaluate the safety and efficacy of low-dose IL-2 therapy.
Methods A randomised, double-blind and placebo-controlled
clinical trial was designed to treat 60 patients
with active SLE. These patients received either IL-2
(n=30) or placebo (n=30) with standard treatment
for 12 weeks, and were followed up for additional 12
weeks. IL-2 at a dose of 1 million IU or placebo was
administered subcutaneously every other day for 2 weeks
and followed by a 2-week break as one treatment cycle.
The primary endpoint was the SLE Responder Index-4
(SRI-4) at week 12. The secondary endpoints were other
clinical responses, safety and dynamics of immune cell
subsets.
Results At week 12, the SRI-4 response rates were
55.17% and 30.00% for IL-2 and placebo, respectively
(p=0.052). At week 24, the SRI-4 response rate of IL-2
group was 65.52%, compared with 36.67% of the
placebo group (p=0.027). The primary endpoint was not
met at week 12. Low-dose IL-2 treatment resulted in
53.85% (7/13) complete remission in patients with lupus
nephritis, compared with 16.67% (2/12) in the placebo
group (p=0.036). No serious infections were observed
in the IL-2 group, but two occurred in the placebo group. Besides
expansion of regulatory T cells, low-dose IL-2 may also
sustain cellular immunity with enhanced natural killer
cells.
Conclusions Low-dose IL-2 might be effective and tolerated in the treatment of SLE.
Funding: The work was supported by the National Natural Science Foundation
of China (31530020, 31570880, 81471601, 81601417 and 81701598); the
Peking-Tsinghua Center for Life Sciences (to ZG Li); Beijing Sci-Tech Committee
(Z171100000417007); the Clinical Medicine Plus X-Young Scholars Project of Peking
University (PKU2019LCXQ013), supported by the Fundamental Research Funds for
the Central Universities; Beijing Nova Program (Z171100001117025); the National Key
Research and Development Program of China (2017YFC0909003, to DY); a Bellberry-Viertel Senior Medical Research Fellowship (to DY); and Beijing SL PHARM.
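For readers who want to sanity-check the week-24 comparison, the sketch below recomputes the contrast in Python; the counts 19/29 (IL-2) and 11/30 (placebo) are inferred from the reported percentages, and the trial's actual statistical test may differ from the ones shown.

```python
# Recomputing the week-24 SRI-4 contrast (sketch; counts inferred from
# the reported rates, and the trial's actual test may differ).
from scipy.stats import chi2_contingency, fisher_exact

table = [[19, 29 - 19],   # IL-2: responders, non-responders
         [11, 30 - 11]]   # placebo: responders, non-responders

odds_ratio, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, _ = chi2_contingency(table)  # Yates-corrected for 2x2
print(f"Fisher exact p = {p_fisher:.3f}, chi-square p = {p_chi2:.3f}")
```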
Tubeless video-assisted thoracic surgery for pulmonary ground-glass nodules: expert consensus and protocol (Guangzhou)
Stochastic Channel-Based Federated Learning With Neural Network Pruning for Medical Data Privacy Preservation: Model Development and Experimental Validation
Background: Artificial neural networks have achieved unprecedented success in the medical domain. This success depends on the availability of massive and representative datasets. However, data collection is often prevented by privacy concerns, and people want to retain control over their sensitive information during both training and use.

Objective: To address these security and privacy issues, we propose a privacy-preserving method for the analysis of distributed medical data. The proposed method, termed stochastic channel-based federated learning (SCBFL), enables participants to train a high-performance model cooperatively and in a distributed manner without sharing their inputs.

Methods: We designed, implemented, and evaluated a channel-based update algorithm for a central server in a distributed system. The update algorithm selects the channels corresponding to the most active features in a training loop and uploads them as the information learned from local datasets. A pruning process, which serves as a model accelerator, was further applied to the algorithm based on the validation set.

Results: We constructed a distributed system consisting of 5 clients and 1 server. Our trials showed that the SCBFL method can achieve an area under the receiver operating characteristic curve (AUC-ROC) of 0.9776 and an area under the precision-recall curve (AUC-PR) of 0.9695 with only 10% of channels shared with the server. Compared with the federated averaging algorithm, the proposed SCBFL method achieved a 0.05388 higher AUC-ROC and a 0.09695 higher AUC-PR. In addition, our experiment showed that the pruning process saved 57% of the time, at the cost of a 0.0047 reduction in AUC-ROC and a 0.0068 reduction in AUC-PR.

Conclusions: In this experiment, our model demonstrated better performance and a faster saturation speed than the federated averaging method, which reveals all of the parameters of local models to the server. The performance saturation rate can be raised by introducing a pruning process, and further improvement can be achieved by tuning the pruning rate.
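A minimal sketch of the channel-based update idea follows; the selection criterion (mean absolute activation), the treatment of a 'channel' as a row of a 2-D weight matrix, and the server-side averaging are assumptions for illustration, not the paper's released code.

```python
# Sketch of stochastic channel-based federated updates (selection
# criterion and data layout are illustrative assumptions).
import torch

def select_active_channels(activations, share_ratio=0.10):
    """activations: (batch, channels, ...) from a local training loop.
    Rank channels by mean absolute activation; keep the top fraction."""
    dims = tuple(d for d in range(activations.dim()) if d != 1)
    scores = activations.abs().mean(dim=dims)
    k = max(1, int(share_ratio * scores.numel()))
    return torch.topk(scores, k).indices  # channel ids to upload

def server_update(global_weight, uploads):
    """uploads: list of (channel_ids, rows) per client, where a channel
    is one row of a 2-D weight matrix. Each uploaded row is averaged
    over the clients that sent it; unsent rows stay unchanged."""
    sums = torch.zeros_like(global_weight)
    counts = torch.zeros(global_weight.shape[0])
    for ids, rows in uploads:
        sums[ids] += rows
        counts[ids] += 1
    sent = counts > 0
    global_weight[sent] = sums[sent] / counts[sent].unsqueeze(-1)
    return global_weight
```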