
    Implicit Temporal Modeling with Learnable Alignment for Video Recognition

    Contrastive language-image pretraining (CLIP) has demonstrated remarkable success on various image tasks. However, how to extend CLIP with effective temporal modeling is still an open and crucial problem. Existing factorized or joint spatial-temporal modeling trades off between efficiency and performance. While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already provides enough essence without temporal attention. To this end, in this paper we propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving remarkably high performance. Specifically, for a frame pair, an interactive point is predicted in each frame, serving as a region rich in mutual information. By enhancing the features around the interactive point, the two frames are implicitly aligned. The aligned features are then pooled into a single token, which is leveraged in the subsequent spatial self-attention. Our method allows eliminating the costly and often insufficient temporal self-attention in video. Extensive experiments on benchmarks demonstrate the superiority and generality of our module. In particular, the proposed ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with far fewer FLOPs compared with Swin-L and ViViT-H. Code is released at https://github.com/Francis-Rings/ILA.
    Comment: ICCV 2023 oral. 14 pages, 7 figures. Code released at https://github.com/Francis-Rings/ILA
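
    The alignment step described above can be pictured with a small sketch: for a pair of frame features, a soft "interactive point" is predicted, the features around it are enhanced, and the aligned pair is pooled into a single token. The module below is an illustrative assumption (hypothetical names and shapes), not the released ILA implementation; see the repository above for the real code.

```python
import torch
import torch.nn as nn


class ImplicitAlignSketch(nn.Module):
    """Toy pairwise implicit alignment: predict a soft per-patch weight map
    for a frame pair, enhance features around the high-weight region, and
    pool the aligned pair into one token for later spatial self-attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.point_head = nn.Linear(2 * dim, 1)  # scores the "interactive point"
        self.proj = nn.Linear(dim, dim)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # f1, f2: (B, N, C) patch tokens of two neighboring frames.
        pair = torch.cat([f1, f2], dim=-1)         # (B, N, 2C)
        w = self.point_head(pair).softmax(dim=1)   # (B, N, 1) soft interactive point
        aligned1 = f1 * (1.0 + w)                  # enhance the region around the point
        aligned2 = f2 * (1.0 + w)
        # Pool the aligned pair into a single alignment token.
        return self.proj((aligned1 + aligned2).mean(dim=1))  # (B, C)


# Usage: two frames' patch embeddings -> one alignment token per pair.
token = ImplicitAlignSketch(768)(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
print(token.shape)  # torch.Size([2, 768])
```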

    MotionEditor: Editing Video Motion via Content-Aware Diffusion

    Existing diffusion-based video editing models have made impressive advances in editing the attributes of a source video over time, but they struggle to manipulate motion information while preserving the original protagonist's appearance and background. To address this, we propose MotionEditor, a diffusion model for video motion editing. MotionEditor incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence. While ControlNet enables direct generation based on skeleton poses, it encounters challenges when modifying the source motion in the inverted noise due to contradictory signals between the noise (source) and the condition (reference). Our adapter complements ControlNet by involving source content to transfer adapted control signals seamlessly. Furthermore, we build a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism that facilitates branch interaction. This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner, so the editing branch retains the original background and protagonist appearance. We also propose a skeleton alignment algorithm to address discrepancies in pose size and position. Experiments demonstrate the promising motion editing ability of MotionEditor, both qualitatively and quantitatively.
    Comment: 18 pages, 15 figures. Project page at https://francis-rings.github.io/MotionEditor
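
    The attention injection mechanism can be illustrated with a toy cross-branch attention layer in which the editing branch forms the queries while the reconstruction branch supplies keys and values. This is a hedged sketch with assumed shapes, not MotionEditor's actual module.

```python
import torch
import torch.nn as nn


class CrossBranchAttentionSketch(nn.Module):
    """Toy attention injection: the editing branch attends to keys/values taken
    from the reconstruction branch, so background and appearance information
    from the source can be carried into the edited result."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, edit_feat: torch.Tensor, recon_feat: torch.Tensor) -> torch.Tensor:
        # edit_feat, recon_feat: (B, N, C) token features of the two branches.
        out, _ = self.attn(query=edit_feat, key=recon_feat, value=recon_feat)
        return edit_feat + out  # residual update of the editing branch


x_edit, x_recon = torch.randn(1, 64, 320), torch.randn(1, 64, 320)
print(CrossBranchAttentionSketch(320)(x_edit, x_recon).shape)  # torch.Size([1, 64, 320])
```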

    SVFormer: Semi-supervised Video Transformer for Action Recognition

    Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotations. Existing approaches mainly use convolutional neural networks, while the recent, revolutionary vision transformer models have been less explored. In this paper, we investigate the use of transformer models under the SSL setting for action recognition. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (i.e., EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have been shown effective for semi-supervised image classification, they generally produce limited results for video recognition. We therefore introduce a novel augmentation strategy, Tube TokenMix, tailored for video data, where video clips are mixed via a mask with consistent masked tokens over the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos, which stretches selected frames to various temporal durations in the clip. Extensive experiments on three datasets (Kinetics-400, UCF-101, and HMDB-51) verify the advantage of SVFormer. In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400. Our method can hopefully serve as a strong benchmark and encourage future research on semi-supervised action recognition with Transformer networks.
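
    Tube TokenMix mixes the tokens of two clips with a spatial mask that is shared across all frames, so the mixed regions form temporally consistent "tubes". The function below is a minimal sketch under an assumed (B, T, N, C) token layout; it is not the SVFormer implementation.

```python
import torch


def tube_token_mix_sketch(clip_a: torch.Tensor, clip_b: torch.Tensor, mix_ratio: float = 0.5):
    """Mix two token sequences (B, T, N, C) with a spatial token mask that is
    identical for every frame, i.e. a temporal tube of swapped tokens."""
    B, T, N, C = clip_a.shape
    n_mix = int(N * mix_ratio)
    # Choose the same spatial token indices for every frame of each sample.
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_mix]              # (B, n_mix)
    mask = torch.zeros(B, N, dtype=torch.bool).scatter_(1, idx, True)
    mask = mask[:, None, :, None]                                 # broadcast over time and channels
    mixed = torch.where(mask, clip_b, clip_a)
    lam = 1.0 - n_mix / N                                         # label weight for clip_a
    return mixed, lam


a, b = torch.randn(2, 8, 196, 768), torch.randn(2, 8, 196, 768)
mixed, lam = tube_token_mix_sketch(a, b, mix_ratio=0.3)
print(mixed.shape, lam)  # torch.Size([2, 8, 196, 768]) ~0.70
```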

    ResFormer: Scaling ViTs with Multi-Resolution Training

    Vision Transformers (ViTs) have achieved overwhelming success, yet they suffer from vulnerable resolution scalability, i.e., the performance drops drastically when presented with input resolutions that are unseen during training. We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of, mostly unseen, testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across different scales. More importantly, to alternate among varying resolutions effectively, especially novel ones at test time, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes. We conduct extensive experiments on image classification on ImageNet. The results provide strong quantitative evidence that ResFormer scales well to a wide range of resolutions. For instance, ResFormer-B-MR achieves a Top-1 accuracy of 75.86% and 81.72% when evaluated on relatively low and high resolutions (i.e., 96 and 640), which are 48% and 7.49% better than DeiT-B, respectively. We also demonstrate that ResFormer is flexible and can be easily extended to semantic segmentation, object detection, and video action recognition. Code is available at https://github.com/ruitian12/resformer.
    Comment: CVPR 2023
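
    The multi-resolution training idea can be sketched as a single training step that replicates a batch at several resolutions, sums the classification losses, and adds a consistency term pulling the predictions at different scales together. The resolutions, the KL-based consistency term, and the assumption that the model accepts arbitrary input sizes are illustrative choices, not ResFormer's exact recipe.

```python
import torch
import torch.nn.functional as F


def multi_res_step_sketch(model, images, labels, sizes=(96, 224, 384)):
    """One toy multi-resolution step: per-scale cross-entropy plus a scale
    consistency loss that matches smaller-scale predictions to the largest."""
    logits, loss = [], 0.0
    for s in sizes:
        x = F.interpolate(images, size=(s, s), mode="bilinear", align_corners=False)
        out = model(x)                                  # assumes a size-agnostic model
        logits.append(out)
        loss = loss + F.cross_entropy(out, labels)
    target = logits[-1].detach().softmax(dim=-1)        # highest resolution as anchor
    for out in logits[:-1]:
        loss = loss + F.kl_div(out.log_softmax(dim=-1), target, reduction="batchmean")
    return loss


# Usage: loss = multi_res_step_sketch(vit, images, labels); loss.backward()
```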

    VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models

    Diffusion models have achieved significant success in image and video generation. This motivates a growing interest in video editing tasks, where videos are edited according to provided text descriptions. However, most existing approaches only focus on video editing for short clips and rely on time-consuming tuning or inference. We are the first to propose Video Instruction Diffusion (VIDiff), a unified foundation model designed for a wide range of video tasks. These tasks encompass both understanding tasks (such as language-guided video object segmentation) and generative tasks (video editing and enhancement). Our model can edit and translate videos into the desired results within seconds based on user instructions. Moreover, we design an iterative auto-regressive method to ensure consistency in editing and enhancing long videos. We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively. More examples can be found at our website https://ChenHsing.github.io/VIDiff
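
    The iterative auto-regressive treatment of long videos can be pictured as editing overlapping chunks in sequence, with each chunk conditioned on the previously edited frames. The function below is a schematic sketch around a hypothetical edit_fn, not VIDiff's actual pipeline.

```python
def edit_long_video_sketch(frames, edit_fn, chunk=16, overlap=4):
    """Edit a long list of frames chunk by chunk, carrying the last edited
    frames forward as conditioning so consecutive chunks stay consistent.

    edit_fn: hypothetical callable (chunk_frames, context_frames) -> edited frames
    """
    edited, context = [], []
    step = chunk - overlap
    for start in range(0, len(frames), step):
        piece = frames[start:start + chunk]
        out = edit_fn(piece, context)                         # condition on earlier output
        edited.extend(out if start == 0 else out[overlap:])   # drop the re-edited overlap
        context = out[-overlap:]                              # context for the next chunk
    return edited


# Usage with a dummy editor that leaves frames unchanged:
print(len(edit_long_video_sketch(list(range(40)), lambda p, c: p)))  # 40
```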

    Surgical Incision Induces Anxiety-Like Behavior and Amygdala Sensitization: Effects of Morphine and Gabapentin

    The role of the affective dimension in postoperative pain is still poorly understood. The present study investigated the development of anxiety-like behavior and amygdala sensitization in incisional pain. Using the hind-paw incision model in rats, we showed that surgical incision induced anxiety-like behavior as determined by elevated plus-maze and open-field tests. Intraperitoneal (IP) morphine administration reversed mechanical allodynia and anxiety-like behavior in a dose-dependent manner. Gabapentin also partially reduced incision-evoked mechanical allodynia and anxiety-like behavior in a dose-dependent manner. After incision, the expression of phosphorylated cAMP response element-binding protein (p-CREB) was transiently upregulated in the central and basolateral nuclei of the bilateral amygdala. This upregulation of p-CREB was inhibited by morphine and gabapentin. The present study suggests that surgical incision can induce anxiety and amygdala sensitization that are inhibited by morphine and gabapentin. Thus, treating surgery-induced affective disturbances with morphine and gabapentin may be an important adjunct in postoperative pain management.

    d-wave Superconductivity, Pseudogap, and the Phase Diagram of the t-t'-J Model at Finite Temperature

    Recently, a robust d-wave superconductivity has been unveiled in the ground state of the 2D t-t'-J model -- with both nearest-neighbor (t) and next-nearest-neighbor (t') hoppings -- through density matrix renormalization group calculations. In this study, we exploit the state-of-the-art thermal tensor network approach to accurately simulate the finite-temperature electron states of the t-t'-J model on cylinders with widths up to W=6. Our analysis suggests that in the dome-like superconducting phase, the d-wave pairing susceptibility exhibits a divergent behavior, χ_SC ∝ 1/T^α, below the onset temperature T_c^*. Near optimal doping, T_c^* reaches its highest value of about 0.05t (≡ 0.15J). Above T_c^* yet below a higher crossover temperature T^*, the magnetic susceptibility is suppressed and the Fermi surface exhibits a node-antinode structure, resembling the pseudogap behaviors observed in cuprates. Our unbiased and accurate thermal tensor network calculations yield the phase diagram of the t-t'-J model with t'/t > 0, shedding light on the d-wave superconducting and pseudogap phases in the enigmatic cuprate phase diagram.
    Comment: 7+5 pages, 4+8 figures
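
    The quoted divergence χ_SC ∝ 1/T^α is a straight line in log-log coordinates, so the exponent can be extracted with a linear fit. The snippet below does this on synthetic data purely as an illustration; it is not the thermal tensor network calculation.

```python
import numpy as np

# Synthetic susceptibility data following chi = A / T**alpha (illustration only).
T = np.linspace(0.02, 0.2, 20)
alpha_true, A = 1.3, 0.1
chi = A / T**alpha_true

# In log space the power law is linear: log chi = log A - alpha * log T.
slope, intercept = np.polyfit(np.log(T), np.log(chi), 1)
print(f"fitted alpha = {-slope:.2f}, A = {np.exp(intercept):.3f}")  # ~1.30, ~0.100
```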

    Adjuvant treatment for patients with incidentally resected limited disease small cell lung cancer: a retrospective study.

    Background: With the exception of very early-stage small cell lung cancer (SCLC), surgery is not typically recommended for this disease; however, incidental resection still occurs. After incidental resection, adjuvant salvage therapy is widely offered, but the evidence supporting its use is limited. This study aimed to explore appropriate adjuvant therapy for these incidentally resected SCLC cases. Methods: Patients incidentally diagnosed with SCLC after surgery at the Shanghai Pulmonary Hospital in China from January 2005 to December 2014 were included in this study. The primary outcome was overall survival. Patients were classified into different groups according to the type of adjuvant therapy they received and stratified by their pathological lymph node status. Survival was analyzed using Kaplan-Meier and Cox regression analyses. Results: A total of 161 patients were included in this study. The overall 5-year survival rate was 36.5%. For pathological N0 (pN0) cases (n=70), multivariable analysis revealed that adjuvant chemotherapy (ad-chemo) was associated with a reduced risk of death [hazard ratio (HR): 0.373; 95% confidence interval (CI): 0.141-0.985, P=0.047] compared with omission of adjuvant therapy. For pathological N1 or N2 (pN1/2) cases (n=91), with cases receiving no adjuvant therapy as the reference, multivariable analysis showed that ad-chemo was not associated with a lower risk of death (HR: 0.869; 95% CI: 0.459-1.645, P=0.666), whereas adjuvant chemo-radiotherapy (ad-CRT) was (HR: 0.279; 95% CI: 0.102-0.761, P=0.013). Conclusions: Patients who incidentally undergo surgical resection and are diagnosed with limited disease SCLC afterwards should be offered adjuvant therapy as a salvage treatment: ad-chemo should be considered for pN0 cases, and ad-CRT for pN1/2 cases.
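
    The survival analysis described above (Kaplan-Meier estimation and Cox regression reporting hazard ratios) can be outlined with the lifelines library. The data frame below is entirely synthetic and the column names are placeholders; it only illustrates the shape of such an analysis, not the study's data or code.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

rng = np.random.default_rng(0)
n = 160  # roughly the size of the reported cohort
df = pd.DataFrame({
    "months":   np.round(rng.exponential(36, n) + 1, 1),  # follow-up time
    "death":    rng.integers(0, 2, n),                    # 1 = died, 0 = censored
    "ad_chemo": rng.integers(0, 2, n),                    # adjuvant chemotherapy
    "ad_crt":   rng.integers(0, 2, n),                    # adjuvant chemo-radiotherapy
    "pn1_2":    rng.integers(0, 2, n),                    # pN1/2 vs. pN0
})

# Kaplan-Meier estimate of overall survival.
km = KaplanMeierFitter().fit(df["months"], df["death"])
print(km.median_survival_time_)

# Cox model: hazard ratios (with 95% CIs) for the treatment and nodal-status indicators.
cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="death")
cph.print_summary()
```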