12 research outputs found

    On the Adversarial Robustness of Vision Transformers

    Full text link
    Following their success in natural language processing and understanding, transformers are expected to bring revolutionary changes to computer vision. This work provides the first comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations. Tested under various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs). This observation also holds for certified robustness. We summarize the following main observations contributing to the improved robustness of ViTs: 1) Features learned by ViTs contain less low-level information and are more generalizable, which contributes to superior robustness against adversarial perturbations. 2) Introducing convolutional or tokens-to-token blocks for learning low-level features in ViTs can improve classification accuracy, but at the cost of adversarial robustness. 3) Increasing the proportion of transformer blocks in the model structure (when the model consists of both transformer and CNN blocks) leads to better robustness, but for a pure transformer model, simply increasing the size or adding layers does not guarantee a similar effect. 4) Pre-training on larger datasets does not significantly improve adversarial robustness, though it is critical for training ViTs. 5) Adversarial training is also applicable to ViTs for training robust models. Furthermore, feature visualization and frequency analysis are conducted to explain these findings. The results show that ViTs are less sensitive to high-frequency perturbations than CNNs, and there is a high correlation between how well a model learns low-level features and its robustness against different frequency-based perturbations.
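
    As a minimal sketch of the frequency analysis mentioned in this abstract (our own illustration, not the paper's code; the classifier, image tensors, and the L-infinity budget `eps` are assumptions), one can restrict random noise to a radial frequency band and compare how accuracy degrades under low- versus high-frequency perturbations:

```python
# Illustrative sketch: probe a classifier's sensitivity to frequency-banded noise.
import torch
import torch.fft

def band_limited_noise(shape, low_frac, high_frac, eps=8 / 255):
    """Random noise whose spectrum lies in the radial band [low_frac, high_frac)."""
    noise = torch.randn(shape)
    spec = torch.fft.fftshift(torch.fft.fft2(noise), dim=(-2, -1))
    h, w = shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    radius = torch.sqrt(yy**2 + xx**2) / torch.sqrt(torch.tensor(2.0))  # in [0, 1]
    mask = (radius >= low_frac) & (radius < high_frac)
    spec = torch.where(mask, spec, torch.zeros_like(spec))
    filtered = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
    return eps * filtered / filtered.abs().amax()  # rescale to the L_inf budget

@torch.no_grad()
def accuracy_under_noise(model, images, labels, low_frac, high_frac):
    """Accuracy after adding noise restricted to one frequency band."""
    perturbed = (images + band_limited_noise(images.shape, low_frac, high_frac)).clamp(0, 1)
    return (model(perturbed).argmax(dim=-1) == labels).float().mean().item()

# Usage (hypothetical models): compare accuracy_under_noise(vit, x, y, 0.5, 1.0)
# against accuracy_under_noise(cnn, x, y, 0.5, 1.0) for the high-frequency band.
```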

    Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment

    Full text link
    Despite recent progress towards scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground. We find that a critical component lacking from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., "mug in grass") with spatial relationships in the image (e.g., the position of the mug relative to the grass). To tackle this problem, we show that relation alignment can be enforced by encouraging the directed language attention from 'mug' to 'grass' (capturing the semantic relation 'in') to match the directed visual attention from the mug to the grass. Tokens and their corresponding objects are softly identified using the cross-modal attention. We prove that this notion of soft relation alignment is equivalent to enforcing congruence between the vision and language attention matrices under a 'change of basis' provided by the cross-modal attention matrix. Intuitively, our approach projects visual attention into the language attention space to calculate its divergence from the actual language attention, and vice versa. We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.
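
    The 'change of basis' can be written compactly: with cross-modal attention C, visual self-attention V is carried into the language token space as C V Cᵀ and compared against the language self-attention L. The sketch below is our reading of that idea, not the official CACR implementation; the KL divergence and the row renormalization are our illustrative choices:

```python
# Minimal sketch of congruence between attention matrices under a
# cross-modal change of basis (illustrative, not the paper's exact loss).
import torch
import torch.nn.functional as F

def cacr_loss(lang_attn, vis_attn, cross_attn, eps=1e-8):
    """
    lang_attn:  (T, T) language-to-language attention, rows sum to 1
    vis_attn:   (R, R) region-to-region visual attention
    cross_attn: (T, R) language-to-vision attention, rows sum to 1
    """
    # 'Change of basis': project visual attention into the language token space.
    projected = cross_attn @ vis_attn @ cross_attn.transpose(-1, -2)  # (T, T)
    projected = projected / (projected.sum(-1, keepdim=True) + eps)   # renormalize rows
    # Divergence between the projected visual attention and the language attention.
    return F.kl_div(projected.clamp_min(eps).log(), lang_attn, reduction="batchmean")
```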

    MPCFormer: fast, performant and private Transformer inference with MPC

    Full text link
    Enabling private inference is crucial for many cloud inference services that are based on Transformer models. However, existing private inference solutions for Transformers can increase the inference latency by more than 60x or significantly compromise the quality of inference results. In this paper, we design the framework MPCFormer using secure multi-party computation (MPC) and knowledge distillation (KD). It can be used in tandem with many specifically designed MPC-friendly approximations and trained Transformer models. MPCFormer significantly speeds up Transformer model inference in MPC settings while achieving similar ML performance to the input model. We evaluate MPCFormer in various MPC settings. On the IMDb dataset, we achieve similar performance to BERT-Base while being 5.3x faster. On the GLUE benchmark, we achieve 97% of the performance of BERT-Base with a 2.2x speedup. We show that MPCFormer remains effective with different trained Transformer weights such as RoBERTa-Base and larger models including BERT-Large. In particular, we achieve similar performance to BERT-Large while being 5.93x faster on the IMDb dataset.
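
    To make the "MPC-friendly approximation" idea concrete: nonlinearities such as exp and GeLU are expensive to evaluate under MPC, so they are typically replaced with low-degree polynomials, and KD then recovers the lost accuracy. The snippet below sketches this pattern; the specific constants are assumptions and may differ from MPCFormer's actual approximations:

```python
# Hedged sketch of MPC-friendly polynomial stand-ins for Transformer
# nonlinearities (illustrative constants, not MPCFormer's exact choices).
import torch

def quad_softmax(scores, c=5.0):
    """Softmax with exp(x) replaced by the cheap quadratic (x + c)^2."""
    q = (scores + c) ** 2
    return q / q.sum(dim=-1, keepdim=True)

def quad_gelu(x):
    """A quadratic stand-in for GeLU (assumed coefficients)."""
    return 0.125 * x**2 + 0.25 * x + 0.5

# Usage in an attention layer: probs = quad_softmax(q @ k.transpose(-1, -2) / d**0.5)
# The distilled (KD) student is trained with these cheap ops in place.
```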

    VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

    Full text link
    We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluating instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors; e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps and potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of comparisons. VisIT-Bench is dynamic: to participate, practitioners simply submit their model's responses on the project website. Data, code, and the leaderboard are available at visit-bench.github.io.
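
    The automatic evaluation works because the instruction-conditioned caption lets a text-only LLM stand in for a judge who can see the image. Below is a sketch of how such a judging prompt might be assembled; the wording and field names are our assumptions, not the benchmark's exact code:

```python
# Illustrative prompt builder for text-only LLM judging of visual
# instruction following (assumed wording, not VisIT-Bench's actual prompt).
def build_judge_prompt(instruction, conditioned_caption, candidate, reference):
    return (
        "You are judging responses to a visual instruction. You cannot see the "
        "image; rely on this description of it instead:\n"
        f"Image description: {conditioned_caption}\n\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {candidate}\n"
        f"Response B: {reference}\n\n"
        "Which response follows the instruction better? Answer 'A' or 'B'."
    )
```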

    Efficacy and safety of low-dose IL-2 in the treatment of systemic lupus erythematosus: A randomised, double-blind, placebo-controlled trial

    Get PDF
    Objectives: Open-label clinical trials suggested that low-dose IL-2 might be effective in the treatment of systemic lupus erythematosus (SLE). A double-blind, placebo-controlled trial is required to formally evaluate the safety and efficacy of low-dose IL-2 therapy. Methods: A randomised, double-blind, placebo-controlled clinical trial was designed to treat 60 patients with active SLE. These patients received either IL-2 (n=30) or placebo (n=30) with standard treatment for 12 weeks, and were followed up for an additional 12 weeks. IL-2 at a dose of 1 million IU or placebo was administered subcutaneously every other day for 2 weeks, followed by a 2-week break, as one treatment cycle. The primary endpoint was the SLE Responder Index-4 (SRI-4) at week 12. The secondary endpoints were other clinical responses, safety and dynamics of immune cell subsets. Results: At week 12, the SRI-4 response rates were 55.17% and 30.00% for IL-2 and placebo, respectively (p=0.052). At week 24, the SRI-4 response rate of the IL-2 group was 65.52%, compared with 36.67% in the placebo group (p=0.027). The primary endpoint was not met at week 12. Low-dose IL-2 treatment resulted in 53.85% (7/13) complete remission in patients with lupus nephritis, compared with 16.67% (2/12) in the placebo group (p=0.036). No serious infection was observed in the IL-2 group, but two occurred in the placebo group. Besides expansion of regulatory T cells, low-dose IL-2 may also sustain cellular immunity with enhanced natural killer cells. Conclusions: Low-dose IL-2 might be effective and well tolerated in the treatment of SLE. The work was supported by the National Natural Science Foundation of China (31530020, 31570880, 81471601, 81601417 and 81701598), the Peking-Tsinghua Center for Life Sciences (to ZG Li), Beijing Sci-Tech Committee (Z171100000417007), the Clinical Medicine Plus X-Young Scholars Project of Peking University (PKU2019LCXQ013) supported by the Fundamental Research Funds for the Central Universities, Beijing Nova Program (Z171100001117025), the National Key Research and Development Program of China (2017YFC0909003 to DY), a Bellberry-Viertel Senior Medical Research Fellowship (to DY) and Beijing SL PHARM.
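
    As a quick arithmetic check of the reported response rates (our reconstruction, not the trial's analysis code): the percentages imply 16/29 vs 9/30 responders at week 12 and 19/29 vs 11/30 at week 24, and a chi-squared test on those counts lands close to the reported p-values. Any small discrepancy may reflect the trial's exact statistical method, which the abstract does not state:

```python
# Reconstructed contingency tables from the reported SRI-4 percentages
# (55.17% of 29 = 16; 30.00% of 30 = 9; 65.52% of 29 = 19; 36.67% of 30 = 11).
from scipy.stats import chi2_contingency

week12 = [[16, 29 - 16], [9, 30 - 9]]    # IL-2 vs placebo: responders, non-responders
week24 = [[19, 29 - 19], [11, 30 - 11]]

for label, table in [("week 12", week12), ("week 24", week24)]:
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    print(f"{label}: chi2={chi2:.2f}, p={p:.3f}")
# Prints p≈0.050 and p≈0.027, consistent with the reported 0.052 and 0.027.
```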

    Tubeless video-assisted thoracic surgery for pulmonary ground-glass nodules: expert consensus and protocol (Guangzhou)

    Get PDF

    Stochastic Channel-Based Federated Learning With Neural Network Pruning for Medical Data Privacy Preservation: Model Development and Experimental Validation

    No full text
    Background: Artificial neural networks have achieved unprecedented success in the medical domain. This success depends on the availability of massive and representative datasets. However, data collection is often prevented by privacy concerns, and people want to retain control over their sensitive information during both training and use. Objective: To address security and privacy issues, we propose a privacy-preserving method for the analysis of distributed medical data. The proposed method, termed stochastic channel-based federated learning (SCBFL), enables participants to train a high-performance model cooperatively and in a distributed manner without sharing their inputs. Methods: We designed, implemented, and evaluated a channel-based update algorithm for a central server in a distributed system. The update algorithm selects the channels corresponding to the most active features in a training loop and uploads them as the information learned from local datasets. A pruning process, which serves as a model accelerator, was further applied to the algorithm based on the validation set. Results: We constructed a distributed system consisting of 5 clients and 1 server. Our trials showed that the SCBFL method can achieve an area under the receiver operating characteristic curve (AUC-ROC) of 0.9776 and an area under the precision-recall curve (AUC-PR) of 0.9695 with only 10% of channels shared with the server. Compared with the federated averaging algorithm, the proposed SCBFL method achieved a 0.05388 higher AUC-ROC and a 0.09695 higher AUC-PR. In addition, our experiment showed that the pruning process saved 57% of training time, with only a 0.0047 reduction in AUC-ROC and a 0.0068 reduction in AUC-PR. Conclusions: In this experiment, our model demonstrated better performance and a higher saturating speed than the federated averaging method, which reveals all of the parameters of local models to the server. The saturation rate of performance can be increased by introducing a pruning process, and further improvement can be achieved by tuning the pruning rate.
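
    A minimal sketch of the channel-selective update as we read it from the description above (the activity score, `share_frac`, and the update format are our assumptions, not the authors' implementation): each client scores channels by how active they were during the training loop and uploads only the top fraction of them.

```python
# Illustrative sketch of a channel-selective federated update: only the
# most active channels' weights leave the client (assumed scoring rule).
import torch

def select_active_channels(activations, share_frac=0.10):
    """Pick the fraction of channels with the highest mean |activation|."""
    # activations: (batch, channels, ...) collected over one training loop
    reduce_dims = [d for d in range(activations.dim()) if d != 1]
    scores = activations.abs().mean(dim=reduce_dims)          # one score per channel
    k = max(1, int(share_frac * scores.numel()))
    return torch.topk(scores, k).indices                      # channel ids to upload

def build_update(layer_weight, channel_ids):
    """Sparse update containing only the selected output channels."""
    return {"channels": channel_ids, "weights": layer_weight[channel_ids].clone()}

# Usage: ids = select_active_channels(acts); update = build_update(conv.weight, ids)
# The server then merges only these rows into the global model.
```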