
    Object Detection in Videos with Tubelet Proposal Networks

    Object detection in videos has drawn increasing attention recently with the introduction of the large-scale ImageNet VID dataset. Unlike object detection in static images, object detection in videos relies heavily on temporal information. To fully utilize temporal information, state-of-the-art methods are based on spatiotemporal tubelets, which are essentially sequences of associated bounding boxes across time. However, existing methods have major limitations in generating tubelets in terms of both quality and efficiency. Motion-based methods obtain dense tubelets efficiently, but the tubelets generally span only a few frames, which is not optimal for incorporating long-term temporal information. Appearance-based methods, usually involving generic object tracking, can generate long tubelets but tend to be computationally expensive. In this work, we propose a framework for object detection in videos that consists of a novel tubelet proposal network, which efficiently generates spatiotemporal proposals, and a Long Short-Term Memory (LSTM) network that incorporates temporal information from tubelet proposals to achieve high object detection accuracy in videos. Experiments on the large-scale ImageNet VID dataset demonstrate the effectiveness of the proposed framework. Comment: CVPR 2017
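A minimal, hedged sketch of the second component described above, not the authors' implementation: it classifies a tubelet, i.e. a sequence of per-frame box features, with an LSTM so that each frame's prediction can draw on temporal context. The feature dimension, hidden size, and class count (30 VID classes plus background) are placeholder choices.

```python
# Hedged sketch, not the paper's code: an LSTM head over tubelet features.
import torch
import torch.nn as nn

class TubeletClassifier(nn.Module):
    """Consumes per-frame features pooled from a tubelet's boxes and emits
    per-frame class scores, so temporal context informs every frame."""
    def __init__(self, feat_dim=1024, hidden_dim=512, num_classes=31):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.cls = nn.Linear(hidden_dim, num_classes)

    def forward(self, tubelet_feats):            # (batch, frames, feat_dim)
        h, _ = self.lstm(tubelet_feats)          # temporal context per frame
        return self.cls(h)                       # (batch, frames, num_classes)

# Usage: 2 tubelets, 20 frames each, 1024-d pooled features per frame.
scores = TubeletClassifier()(torch.randn(2, 20, 1024))
print(scores.shape)  # torch.Size([2, 20, 31])
```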

    Minimal Diagnosis and Diagnosability of Discrete-Event Systems Modeled by Automata

    In the last several decades, model-based diagnosis of discrete-event systems (DESs) has become an increasingly active research topic in both control engineering and artificial intelligence. However, in contrast with the widely applied minimal diagnosis of static systems, most approaches to the diagnosis of DESs compute all possible candidate diagnoses, including nonminimal candidates, which may cause intractable complexity when the number of nonminimal diagnoses is very large. According to the principle of parsimony and the principle of joint-probability distribution, the minimal diagnosis of a DES is generally preferable to a nonminimal one. To generate more likely diagnoses, the notion of minimal diagnosis of DESs is presented, supported by a minimal diagnoser that generates minimal diagnoses. Moreover, to decide, either strongly or weakly, whether a minimal set of faulty events has definitely occurred, two notions of minimal diagnosability are proposed. Necessary and sufficient conditions for determining the minimal diagnosability of DESs are proven, and the relationships between the two types of minimal diagnosability and classical diagnosability are analysed in depth.
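As a small, hedged illustration of the parsimony principle invoked above (this is not the paper's diagnoser construction, which operates on automata), the snippet below simply filters a set of candidate diagnoses down to the subset-minimal ones.

```python
# Hedged illustration: keep only subset-minimal candidate diagnoses.
def minimal_diagnoses(candidates):
    """candidates: iterable of sets of fault events consistent with the
    observation; returns the subset-minimal candidates (most parsimonious)."""
    cands = {frozenset(c) for c in candidates}
    return [c for c in cands
            if not any(other < c for other in cands)]   # drop strict supersets

# {f1} subsumes {f1, f2}, so only {f1} and {f3} remain.
print(minimal_diagnoses([{"f1"}, {"f1", "f2"}, {"f3"}]))
```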

    Spherical contact mechanical analysis of roller cone drill bits journal bearing

    The Fang contact model is introduced to analyze the stress state of the spherical fixed-ring journal bearing. Calculation programs were developed in MATLAB to compute the contact characteristics of the spherical fixed-ring journal bearing of roller cone drill bits. In addition, the effects of external load, radial clearance, and material parameters on the mechanical performance were investigated. The results show that the external load has a direct, pronounced effect on the contact characteristics of the journal bearing. Contact pressure is positively correlated with external load, radial clearance, and the Young's modulus of the material, whereas the contact radius of the journal bearing is negatively correlated with radial clearance and Young's modulus. The smaller the radial clearance of the journal bearing, the more concentrated the contact region and the higher the corresponding contact pressure. From the perspective of reducing friction and wear, materials with high strength and good toughness should be selected; this not only improves wear resistance but also effectively decreases the contact pressure, thereby prolonging the service life of the roller cone drill bit journal bearing.
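To make the reported trends concrete, below is a hedged back-of-the-envelope sketch using classical Hertz theory for a ball in a slightly larger spherical socket. It is not the Fang contact model used in the paper (Hertz theory loses accuracy precisely in the highly conforming, small-clearance regime the paper targets), and the load, radius, clearance, and material values are illustrative only.

```python
# Hedged Hertzian estimate (NOT the Fang model): shows how load, radial
# clearance, and Young's modulus drive contact radius and peak pressure.
import math

def hertz_ball_in_socket(F, r_ball, clearance, E, nu):
    """F: load [N]; r_ball: ball radius [m]; clearance: socket radius minus
    ball radius [m]; E, nu: Young's modulus [Pa] and Poisson's ratio (both
    bodies assumed identical). Returns (contact radius [m], peak pressure [Pa])."""
    r_socket = r_ball + clearance
    R_eff = r_ball * r_socket / (r_socket - r_ball)        # effective radius
    E_star = E / (2.0 * (1.0 - nu ** 2))                   # contact modulus
    a = (3.0 * F * R_eff / (4.0 * E_star)) ** (1.0 / 3.0)  # contact radius
    p0 = 3.0 * F / (2.0 * math.pi * a ** 2)                # peak pressure
    return a, p0

# Larger clearance -> smaller contact patch -> higher peak pressure,
# matching the correlations reported in the abstract.
for c in (5e-6, 20e-6, 50e-6):
    a, p0 = hertz_ball_in_socket(F=5e3, r_ball=0.02, clearance=c, E=210e9, nu=0.3)
    print(f"clearance {c * 1e6:4.0f} um: a = {a * 1e3:.2f} mm, p0 = {p0 / 1e6:.1f} MPa")
```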

    Frozen CLIP Model is An Efficient Point Cloud Backbone

    The pretraining-finetuning paradigm has demonstrated great success in NLP and 2D image fields because of the high-quality representation ability and transferability of pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field, since training data are limited and point cloud collection is expensive. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP model. EPCL connects the 2D and 3D modalities by semantically aligning 2D features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a sequence of tokens and fed directly into the frozen CLIP model to learn point cloud representations. Furthermore, we design a task token to narrow the gap between 2D images and 3D point clouds. Comprehensive experiments on 3D detection, semantic segmentation, classification, and few-shot learning demonstrate that the 2D CLIP model can be an efficient point cloud backbone, and our method achieves state-of-the-art accuracy on both real-world and synthetic downstream tasks. Code will be available. Comment: Technical report
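A minimal sketch of the recipe described above, not the released code: point-cloud patches are embedded into tokens, a learnable task token is prepended, and the sequence passes through a frozen transformer standing in for the CLIP image encoder, so only the tokenizer, task token, and task head receive gradients. All module sizes here are placeholders.

```python
# Hedged sketch: frozen transformer backbone over point-cloud tokens.
import torch
import torch.nn as nn

class FrozenBackbonePointLearner(nn.Module):
    def __init__(self, points_per_patch=32, dim=512, num_classes=40):
        super().__init__()
        self.embed = nn.Linear(points_per_patch * 3, dim)        # trainable tokenizer
        self.task_token = nn.Parameter(torch.zeros(1, 1, dim))   # trainable task token
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        for p in self.backbone.parameters():                     # frozen, as with CLIP
            p.requires_grad = False
        self.head = nn.Linear(dim, num_classes)                  # trainable head

    def forward(self, patches):          # (batch, num_patches, points_per_patch, 3)
        b = patches.shape[0]
        tokens = self.embed(patches.flatten(2))                  # (batch, P, dim)
        tokens = torch.cat([self.task_token.expand(b, -1, -1), tokens], dim=1)
        feats = self.backbone(tokens)
        return self.head(feats[:, 0])                            # classify via task token

logits = FrozenBackbonePointLearner()(torch.randn(2, 64, 32, 3))
print(logits.shape)  # torch.Size([2, 40])
```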

    Ponder: Point Cloud Pre-training via Neural Rendering

    We propose a novel approach to self-supervised learning of point cloud representations via differentiable neural rendering. Motivated by the fact that informative point cloud features should encode rich geometry and appearance cues and render realistic images, we train a point-cloud encoder within a devised point-based neural renderer by comparing the rendered images with real images on massive RGB-D data. The learned point-cloud encoder can be easily integrated into various downstream tasks, including not only high-level tasks such as 3D detection and segmentation, but also low-level tasks such as 3D reconstruction and image synthesis. Extensive experiments on various tasks demonstrate the superiority of our approach compared to existing pre-training methods. Comment: Project page: https://dihuang.me/ponder
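A heavily simplified, hedged sketch of the supervision signal only, not the paper's point-based neural renderer: a toy encoder predicts a color for every point, the points are projected with known camera intrinsics, and a photometric loss against the real RGB image provides the gradients that pretrain the encoder. The camera intrinsics, image size, and network shapes are made up for illustration.

```python
# Hedged toy version of rendering-based supervision for a point encoder.
import torch
import torch.nn as nn

# Toy encoder: predicts an RGB color for every 3D point.
encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3), nn.Sigmoid())

# Points in front of a pinhole camera with made-up intrinsics K.
xyz = torch.rand(2048, 3) + torch.tensor([0.0, 0.0, 1.0])
K = torch.tensor([[64.0, 0.0, 32.0], [0.0, 64.0, 32.0], [0.0, 0.0, 1.0]])
target = torch.rand(64, 64, 3)                 # stand-in for the real RGB frame

# Project each point into the image and look up the pixel it lands on.
uvw = (K @ xyz.T).T
u = (uvw[:, 0] / uvw[:, 2]).long().clamp(0, 63)
v = (uvw[:, 1] / uvw[:, 2]).long().clamp(0, 63)

# Photometric loss: predicted point colors vs. observed image colors.
loss = (encoder(xyz) - target[v, u]).abs().mean()
loss.backward()                                # these gradients pretrain the encoder
print(float(loss))
```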

    MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency

    Masked Modeling (MM) has demonstrated widespread success in various vision challenges by reconstructing masked visual patches. Yet applying MM to large-scale 3D scenes remains an open problem due to data sparsity and scene complexity. The conventional random masking paradigm used in 2D images often causes a high risk of ambiguity when recovering the masked regions of 3D scenes. To this end, we propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve representative structured points, effectively enhancing the pretext masking task for 3D scene understanding. Integrated with a progressive reconstruction manner, our method can concentrate on modeling regional geometry and suffers less ambiguity in masked reconstruction. Besides, scenes with progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, which requires learning consistent representations from unmasked areas. By elegantly combining informative-preserved reconstruction on masked areas with consistency self-distillation from unmasked areas, we obtain a unified framework called MM-3DScene. We conduct comprehensive experiments on a host of downstream tasks. The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrates the superiority of our approach.
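A hedged sketch of the masking idea only (the paper defines its own local statistics; the mean k-nearest-neighbour distance below is just a stand-in proxy): each point is scored by a local statistic, the highest-scoring points are preserved as visible, and the rest are masked for reconstruction.

```python
# Hedged proxy for informative-preserved masking: score points by a local
# statistic and keep the top-scoring ones visible instead of masking at random.
import torch

def informative_preserved_mask(xyz, keep_ratio=0.4, k=16):
    """xyz: (N, 3) scene points. Returns a boolean (N,) mask, True = kept visible."""
    d = torch.cdist(xyz, xyz)                      # (N, N) pairwise distances
    knn_d, _ = d.topk(k + 1, largest=False)        # self plus k nearest neighbours
    score = knn_d[:, 1:].mean(dim=1)               # proxy "informativeness" per point
    n_keep = int(keep_ratio * xyz.shape[0])
    mask = torch.zeros(xyz.shape[0], dtype=torch.bool)
    mask[score.topk(n_keep).indices] = True        # preserve the top scorers
    return mask

visible = informative_preserved_mask(torch.rand(4096, 3))
print(int(visible.sum()), "of 4096 points kept visible; the rest are masked")
```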

    Experts Weights Averaging: A New General Training Scheme for Vision Transformers

    Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs) that achieves performance improvement without increasing inference cost. As Vision Transformers (ViTs) gradually surpass CNNs in various visual tasks, one may ask: does a training scheme specifically for ViTs exist that can also improve performance without increasing inference cost? Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we answer these questions affirmatively with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and we perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE back into an FFN by averaging its experts, transforming the model back into the original ViT for inference. We further provide a theoretical analysis to show why and how this works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. Our training scheme can also be applied to improve performance when fine-tuning ViTs. Last but not least, the proposed EWA technique can significantly improve the effectiveness of naive MoE on various small 2D visual datasets and 3D visual tasks. Comment: 12 pages, 2 figures
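A hedged sketch of the training-time mechanics described above, not the authors' implementation: tokens are routed to experts by a random uniform partition, an EWA step pulls every expert toward the experts' mean after each iteration (the coefficient alpha is an assumed placeholder), and after training the MoE collapses back into a single FFN by averaging the experts.

```python
# Hedged sketch of random-partition MoE training with Experts Weights Averaging.
import torch
import torch.nn as nn

class RandomPartitionMoE(nn.Module):
    def __init__(self, dim=192, hidden=768, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)])

    def forward(self, x):                                   # x: (num_tokens, dim)
        assign = torch.randint(len(self.experts), (x.shape[0],))  # uniform partition
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            gate = (assign == i).float().unsqueeze(-1)      # hard 0/1 routing
            out = out + gate * expert(x)   # a real implementation dispatches only routed tokens
        return out

    @torch.no_grad()
    def average_experts(self, alpha=0.99):
        """EWA step: pull every expert toward the mean of all experts."""
        for params in zip(*(e.parameters() for e in self.experts)):
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.mul_(alpha).add_(mean, alpha=1 - alpha)

    @torch.no_grad()
    def to_ffn(self):
        """After training: collapse the MoE into one FFN by averaging the experts."""
        for params in zip(*(e.parameters() for e in self.experts)):
            params[0].data.copy_(torch.stack([p.data for p in params]).mean(dim=0))
        return self.experts[0]

moe = RandomPartitionMoE()
out = moe(torch.randn(197, 192))    # training-time forward on one image's tokens
moe.average_experts()               # EWA after each optimizer step
ffn = moe.to_ffn()                  # inference model contains a plain FFN again
```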