
    Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

    Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize to novel camera viewpoints. Recently, vision architectures have shifted toward convolution-free visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL are a pseudo-depth estimator and a learned camera matrix that impose geometric transformations on the tokens, enabling 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks, including image classification, multi-view video alignment, and action recognition. Models with 3DTRL outperform their backbone Transformers in all tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html
    Comment: Pre-print, 20 pages
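
    To make the mechanism concrete, here is a minimal PyTorch sketch of a 3DTRL-style layer: a per-token pseudo-depth head plus a regressed camera lifts 2D patch centers into 3D points, whose embedding is added back to the tokens. The module name, the unconstrained rotation, and the assumption of patch-only tokens (no CLS token) are illustrative simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Token3DLayer(nn.Module):
    """Sketch of a 3DTRL-style layer (illustrative, not the paper's code)."""
    def __init__(self, dim: int, num_patches_side: int):
        super().__init__()
        # Pseudo-depth estimator: one scalar depth per token.
        self.depth_head = nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(),
                                        nn.Linear(dim // 2, 1))
        # Learned camera: 3x3 "rotation" (unconstrained here) + translation,
        # regressed from a global summary of the tokens.
        self.camera_head = nn.Linear(dim, 12)
        # Embed recovered 3D coordinates back into token space.
        self.pos_embed_3d = nn.Linear(3, dim)
        # Fixed 2D grid of patch-center coordinates in [-1, 1].
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, num_patches_side),
            torch.linspace(-1, 1, num_patches_side), indexing="ij")
        self.register_buffer("grid", torch.stack([xs, ys], dim=-1).reshape(-1, 2))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim), one token per image patch
        B, N, _ = tokens.shape
        depth = self.depth_head(tokens)                       # (B, N, 1)
        cam = self.camera_head(tokens.mean(dim=1))            # (B, 12)
        R = cam[:, :9].reshape(B, 3, 3)
        t = cam[:, 9:].reshape(B, 1, 3)
        # Pinhole-style unprojection with identity intrinsics: (x*z, y*z, z).
        pts = torch.cat([self.grid.expand(B, N, 2) * depth, depth], dim=-1)
        pts_world = pts @ R.transpose(1, 2) + t               # camera transform
        # Inject the recovered 3D positional information into the tokens.
        return tokens + self.pos_embed_3d(pts_world)
```

    Dropped after the early encoder blocks of a ViT, such a layer leaves the token shape unchanged, which matches the abstract's claim that 3DTRL is easily plugged into a Transformer.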

    An extensive program of periodic alternative splicing linked to cell cycle progression

    Progression through the mitotic cell cycle requires periodic regulation of gene function at the levels of transcription, translation, protein-protein interactions, post-translational modification and degradation. However, the role of alternative splicing (AS) in the temporal control of the cell cycle is not well understood. By sequencing the human transcriptome through two continuous cell cycles, we identify ~1300 genes with cell cycle-dependent AS changes. These genes are significantly enriched in functions linked to cell cycle control, yet they do not significantly overlap genes subject to periodic changes in steady-state transcript levels. Many of the periodically spliced genes are controlled by the SR protein kinase CLK1, whose level undergoes cell cycle-dependent fluctuations via an auto-inhibitory circuit. Disruption of CLK1 causes pleiotropic cell cycle defects and loss of proliferation, whereas CLK1 over-expression is associated with various cancers. These results thus reveal a large program of CLK1-regulated periodic AS intimately associated with cell cycle control.
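
    The core computation here, flagging splicing events whose inclusion levels oscillate with the cell cycle, can be illustrated with a toy periodicity test. The sketch below assumes per-event "percent spliced in" (PSI) values sampled at evenly spaced timepoints over two cycles; the scoring function is an illustration, not the study's statistical pipeline.

```python
import numpy as np

def periodicity_score(psi: np.ndarray, timepoints_per_cycle: int) -> float:
    """Fraction of a PSI series' energy at the cell-cycle frequency (0..~1)."""
    n = len(psi)
    t = np.arange(n)
    omega = 2 * np.pi / timepoints_per_cycle
    x = psi - psi.mean()
    # Project onto the cell-cycle harmonic; a large projection means a
    # strong periodic component at that frequency.
    a = np.dot(x, np.cos(omega * t))
    b = np.dot(x, np.sin(omega * t))
    return float(np.hypot(a, b) / (np.linalg.norm(x) * np.sqrt(n / 2) + 1e-12))

# Toy example: 16 timepoints spanning two cycles (8 per cycle).
rng = np.random.default_rng(0)
t = np.arange(16)
periodic = 0.5 + 0.2 * np.sin(2 * np.pi * t / 8) + 0.02 * rng.standard_normal(16)
flat = 0.5 + 0.02 * rng.standard_normal(16)
print(periodicity_score(periodic, 8), periodicity_score(flat, 8))  # high vs. low
```

    A pure sinusoid at the cell-cycle frequency scores near 1 and uncorrelated noise scores near 0, so ranking events by this score and thresholding would be one simple way to shortlist candidates like the ~1300 genes reported.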

    Adaptive Split-Fusion Transformer

    Neural networks for visual content understanding have recently evolved from convolutional networks (CNNs) to transformers. The former (CNNs) rely on small-window kernels to capture regional clues, demonstrating strong local expressiveness; the latter (transformers) establish long-range global connections between localities for holistic learning. Inspired by this complementary nature, there is growing interest in designing hybrid models that make the best of each technique. Current hybrids merely use convolutions as simple approximations of linear projection, or juxtapose a convolution branch with attention, without weighing the relative importance of local and global modeling. To tackle this, we propose a new hybrid, the Adaptive Split-Fusion Transformer (ASF-former), which treats the convolutional and attention branches differently with adaptive weights. Specifically, an ASF-former encoder splits the feature channels equally in half to feed dual-path inputs. The outputs of the two paths are then fused with weighting scalars calculated from visual cues. We also design the convolutional path compactly for efficiency. Extensive experiments on standard benchmarks such as ImageNet-1K, CIFAR-10, and CIFAR-100 show that ASF-former outperforms its CNN and transformer counterparts, as well as hybrid pilot models, in terms of accuracy (83.9% on ImageNet-1K) under similar conditions (12.9G MACs / 56.7M params, without large-scale pre-training). The code is available at: https://github.com/szx503045266/ASF-former
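
    The split-then-fuse idea can be sketched in a few lines of PyTorch: split channels in half, run a compact convolutional path and an attention path in parallel, and re-weight each path with scalars predicted from a global summary of the input. The block below assumes dim is divisible by 2*heads and a known h-by-w token grid; the gating design is a simplified stand-in for the ASF-former encoder, not the released code.

```python
import torch
import torch.nn as nn

class SplitFusionBlock(nn.Module):
    """Sketch of an adaptive split-fusion encoder block (illustrative)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        half = dim // 2
        # Convolutional path on one half of the channels (local cues);
        # depthwise + pointwise keeps it compact, echoing the efficiency aim.
        self.conv = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half),
            nn.GELU(),
            nn.Conv2d(half, half, 1))
        # Attention path on the other half (global connections).
        self.attn = nn.MultiheadAttention(half, heads, batch_first=True)
        # Adaptive fusion: one weighting scalar per path from pooled visual cues.
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, dim) token features with N == h * w
        B, N, dim = x.shape
        xa, xc = x.chunk(2, dim=-1)                        # equal channel split
        # Conv path operates on a (B, C, H, W) layout, then returns to tokens.
        yc = self.conv(xc.transpose(1, 2).reshape(B, dim // 2, h, w))
        yc = yc.flatten(2).transpose(1, 2)                 # (B, N, dim//2)
        ya, _ = self.attn(xa, xa, xa)
        g = self.gate(x.mean(dim=1))                       # (B, 2) path weights
        # Re-weight each path by its scalar and concatenate back to full width.
        out = torch.cat([g[:, 0:1, None] * ya, g[:, 1:2, None] * yc], dim=-1)
        return x + out                                     # residual connection
```

    The softmax gate lets each input decide how much to lean on local versus global modeling, which is the adaptive behavior the abstract contrasts with fixed juxtaposition hybrids.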

    LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING

    The tremendous growth in video data, both on the internet and in real life, has encouraged the development of intelligent systems that can automatically analyze video content and understand human actions. Video understanding has therefore been one of the fundamental research topics in computer vision. Encouraged by the success of deep neural networks on image classification, many efforts have been made in recent years to extend deep networks to video understanding. However, new challenges arise when the temporal characteristics of videos are taken into account. In this dissertation, we study two long-standing problems that play important roles in effective temporal modeling of videos: (1) how to extract motion information from raw video frames, and (2) how to capture long-range dependencies in time and model their temporal dynamics. To address these issues, we first introduce hierarchical contrastive motion learning, a novel self-supervised learning framework that extracts effective motion representations from raw video frames. Our approach progressively learns a hierarchy of motion features, from low-level pixel movements to higher-level semantic dynamics, in a fully self-supervised manner. Next, we investigate the self-attention mechanism for long-range temporal modeling and demonstrate that entangled modeling of spatio-temporal information fails to capture temporal relationships among frames explicitly. To this end, we propose Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. Unlike conventional self-attention, which computes an instance-specific attention matrix, GTA directly learns a global attention matrix intended to encode temporal structures that generalize across different samples. While these methods significantly improve video action recognition, they are still restricted to modeling temporal information within short clips. To overcome this limitation, we introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration. The proposed framework is end-to-end trainable and significantly improves the accuracy of video classification with negligible computational overhead. Finally, we present a spatio-temporal progressive learning framework (STEP) for spatio-temporal action detection. Our approach performs a multi-step optimization process that progressively refines initial proposals towards the final solution, making effective use of long-term temporal information by handling the spatial displacement problem in long action tubes.
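
    Of the methods above, GTA's key distinction (a learned, instance-agnostic temporal attention matrix rather than query-key attention computed per sample) is easy to show in code. The sketch below is a minimal PyTorch illustration of that idea under the assumption of a fixed number of frames; it is not the dissertation's exact layer.

```python
import torch
import torch.nn as nn

class GlobalTemporalAttention(nn.Module):
    """Sketch of a globally learned (T x T) temporal attention (illustrative)."""
    def __init__(self, num_frames: int, dim: int):
        super().__init__()
        # One attention matrix over time, shared by every sample in the dataset,
        # instead of being recomputed from each instance's queries and keys.
        self.logits = nn.Parameter(torch.zeros(num_frames, num_frames))
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, dim) frame-level features, e.g. after spatial attention
        attn = self.logits.softmax(dim=-1)          # (T, T), instance-agnostic
        v = self.value(x)
        # Mix information across frames using the learned temporal structure.
        return torch.einsum("ts,bsnd->btnd", attn, v)
```

    Because the matrix is shared across samples, it is pushed to encode temporal structure that generalizes, which is exactly the contrast with conventional instance-specific self-attention drawn in the abstract.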

    Large-scale Data Analysis and Deep Learning Using Distributed Cyberinfrastructures and High Performance Computing

    Data in many research fields continues to grow in both size and complexity. For instance, recent technological advances have increased data throughput in various biology-related endeavors, such as DNA sequencing, molecular simulations, and medical imaging. In addition, the variety of data types (textual, signal, image, etc.) adds further complexity to analysis. As such, there is a need for applications developed specifically for each type of data. Several considerations must be made when creating a tool for a particular dataset. First, we must consider the type of algorithm required to analyze the data. Next, since the size and complexity of the data impose high computation and memory requirements, it is important to select a proper hardware environment on which to build the application. By carefully developing the algorithm and selecting the hardware, we can provide an effective environment in which to analyze huge amounts of highly complex data at large scale. In this dissertation, I detail my applications of big data and deep learning techniques to the analysis of large, complex data. I investigate how big data frameworks, such as Hadoop, can be applied to problems such as large-scale molecular dynamics simulations. Following this, many popular deep learning frameworks are evaluated and compared to find those that suit certain hardware setups and deep learning models. Then, we explore an application of deep learning to a biomedical problem, namely ADHD diagnosis from fMRI data. Lastly, I demonstrate a framework for real-time, fine-grained vehicle detection and classification. In each of these works, a unique large-scale analysis algorithm or deep learning model is implemented that is tailored to the problem and leverages specialized computing resources.
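
    As a rough illustration of the Hadoop-based analysis mentioned above, the sketch below is a Hadoop Streaming mapper/reducer pair in Python that averages a per-frame quantity over a large trajectory. The record format ("frame_id value" per line) and the averaged quantity are assumptions made for illustration, not the dissertation's actual pipeline.

```python
import sys

def mapper():
    # Emit (frame_id, value) pairs; Hadoop shuffles and sorts by key
    # between the map and reduce stages.
    for line in sys.stdin:
        frame_id, value = line.split()
        print(f"{frame_id}\t{value}")

def reducer():
    # Input arrives grouped by key; accumulate and emit one mean per frame.
    current, total, count = None, 0.0, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total / count}")
            current, total, count = key, 0.0, 0
        total += float(value)
        count += 1
    if current is not None:
        print(f"{current}\t{total / count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

    Such a script would be launched through the hadoop-streaming jar, passing the same file as both -mapper "python3 mdjob.py map" and -reducer "python3 mdjob.py reduce", which lets a plain Python analysis scale out across the cluster without MapReduce-specific code in Java.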