
    Self-Feedback DETR for Temporal Action Detection

    Temporal Action Detection (TAD) is challenging but fundamental for real-world video applications. Recently, DETR-based models have been devised for TAD but have not yet performed well. In this paper, we identify a problem in the self-attention of DETR for TAD: the attention modules focus on only a few key elements, which we call the temporal collapse problem. It degrades the capability of the encoder and decoder since their self-attention modules play no role. To solve the problem, we propose a novel framework, Self-DETR, which utilizes cross-attention maps of the decoder to reactivate the self-attention modules. We recover the relationships between encoder features by simple matrix multiplication of the cross-attention map and its transpose. Likewise, we obtain the relationships within decoder queries. By guiding the collapsed self-attention maps with the calculated guidance maps, we resolve the temporal collapse of the self-attention modules in the encoder and decoder. Our extensive experiments demonstrate that Self-DETR resolves the temporal collapse problem by maintaining high diversity of attention over all layers.
    Comment: Accepted to ICCV 2023
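
    The matrix-multiplication step described above can be sketched in a few lines. A minimal PyTorch sketch under assumptions: cross_attn is a softmax-normalized decoder cross-attention map, and all function names are illustrative rather than taken from the paper's code.

        import torch
        import torch.nn.functional as F

        def guidance_maps(cross_attn):
            # cross_attn: (B, Q, T) decoder cross-attention over T encoder time
            # steps for Q queries; each row sums to 1
            enc_guide = cross_attn.transpose(1, 2) @ cross_attn  # (B, T, T) feature-feature relations
            dec_guide = cross_attn @ cross_attn.transpose(1, 2)  # (B, Q, Q) query-query relations
            # renormalize rows so each map is comparable to a self-attention map
            enc_guide = enc_guide / enc_guide.sum(-1, keepdim=True).clamp_min(1e-6)
            dec_guide = dec_guide / dec_guide.sum(-1, keepdim=True).clamp_min(1e-6)
            return enc_guide, dec_guide

        def guidance_loss(self_attn, guide):
            # pull a (possibly collapsed) self-attention map toward the guidance map
            return F.kl_div(self_attn.clamp_min(1e-6).log(), guide, reduction="batchmean")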

    Minority-Oriented Vicinity Expansion with Attentive Aggregation for Video Long-Tailed Recognition

    A dramatic increase in real-world video volume with extremely diverse and emerging topics naturally forms a long-tailed video distribution in terms of categories, spotlighting the need for Video Long-Tailed Recognition (VLTR). In this work, we summarize the challenges in VLTR and explore how to overcome them. The challenges are: (1) it is impractical to re-train the whole model for high-quality features, (2) acquiring frame-wise labels requires extensive cost, and (3) long-tailed data triggers biased training. Yet, most existing works for VLTR unavoidably utilize image-level features extracted from pretrained models, which are task-irrelevant, and learn from video-level labels. Therefore, to deal with such (1) task-irrelevant features and (2) video-level labels, we introduce two complementary learnable feature aggregators. Learnable layers in each aggregator produce task-relevant representations, and each aggregator assembles the snippet-wise knowledge into a video representative. Then, we propose Minority-Oriented Vicinity Expansion (MOVE), which explicitly leverages the class frequency in approximating the vicinity distributions to alleviate (3) biased training. By combining these solutions, our approach achieves state-of-the-art results on the large-scale VideoLT and the synthetically induced Imbalanced-MiniKinetics200. With VideoLT features from ResNet-50, it attains 18% and 58% relative improvements on head and tail classes over the previous state-of-the-art method, respectively.
    Comment: Accepted to AAAI 2023. Code is available at https://github.com/wjun0830/MOVE
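
    As a rough illustration of frequency-aware vicinity expansion, the sketch below widens the mixing distribution for minority classes. This is a simplified stand-in for MOVE's actual formulation; the Beta parameterization and all names are assumptions.

        import torch

        def minority_oriented_mixing(feats, labels, class_freq, base_alpha=0.2):
            # feats: (B, D) video representations; labels: (B,); class_freq: (C,) counts
            inv = 1.0 / class_freq.float().clamp_min(1)
            rarity = inv / inv.max()                     # ~1 for the rarest class, ~0 for head classes
            alpha = base_alpha * (1.0 + rarity[labels])  # minority samples get a wider vicinity
            lam = torch.distributions.Beta(alpha, alpha).sample()  # (B,) mixing ratios
            perm = torch.randperm(feats.size(0))
            mixed = lam.unsqueeze(1) * feats + (1.0 - lam).unsqueeze(1) * feats[perm]
            return mixed, labels, labels[perm], lam      # train with lam-weighted targets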

    VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting

    Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and then counting. However, this sequentially designed two-stage process remains vulnerable to error propagation. In this work, a one-stage baseline, Visual-Language Baseline (VLBase), exploiting the implicit association between the semantic and patch embeddings of CLIP, is proposed. Subsequently, VLBase is extended to Visual-Language Counter (VLCounter) by incorporating three modules devised to tailor VLBase to object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, a Learnable Affine Transformation (LAT) is employed to translate the semantic-patch similarity map into a form appropriate for the counting task. Lastly, the layer-wise encoded features are transferred to the decoder through Segment-aware Skip Connections (SaSC) to retain generalization capability for unseen classes. Extensive experiments on FSC147, CARPK, and PUCPR+ demonstrate the benefits of the end-to-end framework, VLCounter.
    Comment: Accepted to AAAI 2024. Code is available at https://github.com/Seunggu0305/VLCounter
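
    The similarity-map translation can be pictured as cosine similarities between CLIP patch and text embeddings passed through a learnable affine map. A minimal sketch; the scalar scale/shift parameterization and the class names are assumptions, not the paper's exact LAT.

        import torch
        import torch.nn as nn

        class AffineSimilarity(nn.Module):
            def __init__(self):
                super().__init__()
                self.scale = nn.Parameter(torch.ones(1))  # learnable affine parameters
                self.shift = nn.Parameter(torch.zeros(1))

            def forward(self, patch_emb, text_emb):
                # patch_emb: (B, N, D) L2-normalized CLIP patch embeddings
                # text_emb:  (B, D)    L2-normalized embedding of the class prompt
                sim = torch.einsum("bnd,bd->bn", patch_emb, text_emb)  # per-patch cosine similarity
                return self.scale * sim + self.shift      # similarity map adapted for counting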

    Leveraging Hidden Positives for Unsupervised Semantic Segmentation

    The dramatic demand for manpower to produce pixel-level annotations triggered the advent of unsupervised semantic segmentation. Although recent work employing the vision transformer (ViT) backbone shows exceptional performance, there is still a lack of consideration for task-specific training guidance and local semantic consistency. To tackle these issues, we leverage contrastive learning by excavating hidden positives to learn rich semantic relationships and to ensure semantic consistency in local regions. Specifically, we first discover two types of global hidden positives for each anchor, task-agnostic and task-specific ones, based on the feature similarities defined by a fixed pre-trained backbone and a segmentation head in training, respectively. A gradual increase in the contribution of the latter induces the model to capture task-specific semantic features. In addition, we introduce a gradient propagation strategy to learn semantic consistency between adjacent patches, under the inherent premise that nearby patches are highly likely to share the same semantics. Specifically, we propagate the loss to local hidden positives, semantically similar nearby patches, in proportion to the predefined similarity scores. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results on the COCO-Stuff, Cityscapes, and Potsdam-3 datasets. Our code is available at: https://github.com/hynnsk/HP.
    Comment: Accepted to CVPR 2023
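
    The propagation step amounts to a similarity-weighted sum of patch losses. A minimal sketch, assuming per-patch contrastive losses and fixed similarity scores are already computed; the thresholding rule is illustrative, not the paper's exact criterion.

        import torch

        def propagate_to_local_positives(per_patch_loss, sim, tau=0.3):
            # per_patch_loss: (N,) contrastive loss of each anchor patch
            # sim: (N, N) fixed similarities between adjacent patches (0 for non-neighbors)
            w = (sim * (sim > tau)).detach()  # keep only confident local hidden positives
            # each anchor's loss is additionally propagated to its local positives,
            # weighted in proportion to the predefined similarity scores
            return per_patch_loss.sum() + (w @ per_patch_loss).sum()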

    VLSH: Voronoi-based Locality Sensitive Hashing

    We present a fast yet accurate k-nearest neighbor search algorithm for high-dimensional sampling-based motion planners. Our technique is built on top of Locality Sensitive Hashing (LSH), but is extended to support the arbitrary distance metrics used in motion planning problems and to adapt to the irregular distributions of samples generated in the configuration space. To enable these novel characteristics, our method embeds samples generated in the configuration space into a simple l2-norm space by using pivot points. We then implicitly define Voronoi regions and use local LSHs with varying quantization factors for those Voronoi regions. We have applied our method and other prior techniques to high-dimensional motion planning problems. Our method shows a performance improvement of up to a factor of three, even with higher accuracy, over prior approximate nearest neighbor search techniques.
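
    A toy version of the pipeline, with pivot-based embedding and one hash table per Voronoi region. This minimal sketch substitutes random-hyperplane LSH for the paper's per-region quantization factors; all names are assumptions.

        import numpy as np

        class VoronoiLSH:
            # one LSH table per Voronoi region, where a sample's region is the
            # index of its nearest pivot
            def __init__(self, pivots, dist, n_bits=8, seed=0):
                rng = np.random.default_rng(seed)
                self.pivots, self.dist = pivots, dist
                k = len(pivots)
                self.planes = rng.normal(size=(k, n_bits, k))  # region-specific projections
                self.tables = [{} for _ in range(k)]

            def _embed(self, sample):
                # embed a configuration-space sample into an l2 space via pivot distances
                return np.array([self.dist(sample, p) for p in self.pivots])

            def _key(self, region, emb):
                return tuple((self.planes[region] @ emb > 0).astype(int))

            def insert(self, sample):
                emb = self._embed(sample)
                r = int(np.argmin(emb))  # nearest pivot = Voronoi region
                self.tables[r].setdefault(self._key(r, emb), []).append(sample)

            def query(self, sample):
                emb = self._embed(sample)
                r = int(np.argmin(emb))
                return self.tables[r].get(self._key(r, emb), [])  # candidate neighbors

    Candidates returned by query would then be ranked by the true metric dist to yield the final k nearest neighbors.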

    Effects of different creep feed types on pre-weaning and post-weaning performance and gut development

    Objective: This experiment was carried out to determine the effects of different creep feed types on suckling performance and the subsequent adjustment to solid feed after weaning. Methods: A total of 24 multiparous sows and their litters were allotted to one of three treatment groups, provided until weaning with: i) highly digestible creep feed (Creep), ii) a pig weaning diet (Weaner), or iii) sow feed (Sow) as creep feed. After weaning, a total of 96 piglets were selected for evaluation of post-weaning performance. Results: For pre-weaning performance, the Creep treatment led to significantly higher feed intake from 14 to 28 d (p < 0.05) and higher body weight gain from 21 to 28 d than the other diets. After weaning, however, the Weaner treatment yielded significantly higher feed intake and average daily gain than the other treatments from 0 to 14 d (p < 0.05), and the Creep treatment tended to produce lower villus heights in the duodenum than the other treatments (p = 0.07). Conclusion: Highly digestible creep feed improved pre-weaning performance, but feed familiarity and grain-based creep feed improved post-weaning performance.

    Stochastic Particle Flow for Nonlinear High-Dimensional Filtering Problems

    A series of novel filters for probabilistic inference, called particle flow filters, which propose an alternative way of performing Bayesian updates, have been attracting recent interest. These filters provide approximate solutions to nonlinear filtering problems by defining a continuum of densities between the prior probability density and the posterior, i.e., the filtering density. Building on these methods' successes, we propose a novel filter that aims to address the shortcomings of sequential Monte Carlo methods when applied to important nonlinear high-dimensional filtering problems. The novel filter uses equally weighted samples, each of which is associated with a local solution of the Fokker-Planck equation. This hybrid of Monte Carlo and local parametric approximation gives rise to a global approximation of the filtering density of interest. We show that, compared with state-of-the-art methods, the Gaussian-mixture implementation of the new filtering technique, which we call Stochastic Particle Flow, has utility on benchmark nonlinear high-dimensional filtering problems. In addition, we extend the original particle flow filters to multi-target multi-sensor tracking problems to enable a comparison with the new filter.
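
    For context, the continuum of densities between prior and posterior is commonly written as a log-homotopy, as in the Daum-Huang family of particle flow filters. A sketch of that standard formulation, assuming a constant diffusion matrix Q (not necessarily this paper's exact notation):

        \log p(x, \lambda) = \log g(x) + \lambda \log h(x) - \log Z(\lambda), \qquad \lambda \in [0, 1]

        \frac{\partial p}{\partial \lambda} = -\nabla \cdot \big( f(x, \lambda)\, p \big) + \tfrac{1}{2}\, \nabla \cdot \big( Q\, \nabla p \big)

    Here g is the prior, h the likelihood, and Z(\lambda) a normalizer; each particle moves as dx = f\, d\lambda (plus diffusion noise for stochastic flows), with the drift f chosen so that the Fokker-Planck equation above transports the ensemble from g at \lambda = 0 to the posterior at \lambda = 1.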

    Progressive Few-Shot Adaptation of Generative Model with Align-Free Spatial Correlation

    In few-shot generative model adaptation, the model for the target domain is prone to mode collapse. Recent studies attempted to mitigate the problem by matching the relationships among samples generated from the same latent codes in the source and target domains. The objective was further extended to the image patch level to transfer the spatial correlation within an instance. However, the patch-level approach assumes consistency of spatial structure between the source and target domains; for example, that the positions of eyes in the two domains are almost identical. It can therefore introduce visual artifacts if the source and target domain images are not well aligned. In this paper, we propose a few-shot generative model adaptation method free from such an assumption, motivated by the observation that generative models adapt progressively from the source domain to the target domain. Such progressive changes allow us to identify semantically coherent image regions between instances generated by models at neighboring training iterations, and thereby to account for spatial correlation. We also propose an importance-based patch selection strategy to reduce the complexity of patch-level correlation matching. Our method shows state-of-the-art few-shot domain adaptation performance in both qualitative and quantitative evaluations.
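
    The iteration-to-iteration correlation matching might look like the sketch below, which scores patch importance by feature norm; the importance measure and the matching objective here are assumptions, not the paper's exact design.

        import torch
        import torch.nn.functional as F

        def patch_correlation_loss(feat_prev, feat_curr, k=64):
            # feat_prev / feat_curr: (B, C, H, W) features of images generated from the
            # same latent codes by models at two neighboring training iterations
            B, C, H, W = feat_prev.shape
            prev = feat_prev.flatten(2).transpose(1, 2)      # (B, H*W, C)
            curr = feat_curr.flatten(2).transpose(1, 2)
            idx = prev.norm(dim=-1).topk(k, dim=1).indices   # importance-based patch selection
            idx = idx.unsqueeze(-1).expand(-1, -1, C)
            p_prev = F.normalize(torch.gather(prev, 1, idx), dim=-1)
            p_curr = F.normalize(torch.gather(curr, 1, idx), dim=-1)
            # match the patch-to-patch similarity structure across iterations
            sim_prev = p_prev @ p_prev.transpose(1, 2)       # (B, k, k)
            sim_curr = p_curr @ p_curr.transpose(1, 2)
            return F.mse_loss(sim_curr, sim_prev.detach())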

    Task Discrepancy Maximization for Fine-grained Few-Shot Classification

    Recognizing discriminative details such as eyes and beaks is important for distinguishing fine-grained classes, since such classes share similar overall appearances. In this regard, we introduce Task Discrepancy Maximization (TDM), a simple module for fine-grained few-shot classification. Our objective is to localize the class-wise discriminative regions by highlighting channels that encode distinct information about each class. Specifically, TDM learns task-specific channel weights based on two novel components: the Support Attention Module (SAM) and the Query Attention Module (QAM). SAM produces a support weight representing the channel-wise discriminative power of each class. However, since SAM relies only on the labeled support set, it can be vulnerable to bias toward that set. We therefore propose QAM, which complements SAM by yielding a query weight that grants more weight to object-relevant channels for a given query image. By combining these two weights, a class-wise, task-specific channel weight is defined. The weights are then applied to produce task-adaptive feature maps that focus more on the discriminative details. Our experiments validate the effectiveness of TDM and its complementary benefits with prior methods in fine-grained few-shot classification.
    Comment: Accepted to CVPR 2022 as an oral presentation. Code is available at https://github.com/leesb7426/CVPR2022-Task-Discrepancy-Maximization-for-Fine-grained-Few-Shot-Classification
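
    The combination of the two channel weights can be sketched roughly as follows; SAM and QAM are the paper's own designs, so the MLP heads, mean pooling, and additive combination here are assumptions.

        import torch
        import torch.nn as nn

        class ChannelWeights(nn.Module):
            def __init__(self, channels):
                super().__init__()
                mlp = lambda: nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                            nn.Linear(channels, channels), nn.Softmax(dim=-1))
                self.sam, self.qam = mlp(), mlp()

            def forward(self, support, query):
                # support: (n_way, n_shot, C) pooled features; query: (n_query, C)
                s_w = self.sam(support.mean(dim=1))      # (n_way, C) per-class support weights
                q_w = self.qam(query)                    # (n_query, C) per-query weights
                w = s_w.unsqueeze(0) + q_w.unsqueeze(1)  # (n_query, n_way, C) combined weights
                prototypes = support.mean(dim=1).unsqueeze(0)
                return w * prototypes                    # task-adaptive class prototypes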

    Cross-Loss Pseudo Labeling for Semi-Supervised Segmentation

    Training semantic segmentation models requires pixel-level annotations, leading to a significant labeling cost in dataset creation. To alleviate this issue, recent research has focused on semi-supervised learning, which utilizes only a small amount of annotation. In this context, pseudo labeling techniques are frequently employed to assign labels to unlabeled data based on the model's predictions. However, there are fundamental limitations to the widespread application of pseudo labeling. Since pseudo labels are generally determined by the model's predictions, they can be overconfidently assigned even to erroneous predictions, especially when the model has a confirmation bias. We observed that the overconfident prediction tendency of the cross-entropy loss exacerbates this issue, and to address it, we find that the focal loss, known for enabling more reliable confidence estimation, can complement the cross-entropy loss. The cross-entropy loss produces abundant pseudo labels because it tends to be overconfident; the focal loss, in contrast, yields more conservative confidence and therefore produces fewer pseudo labels than the cross-entropy. Based on the complementary mechanisms of these two loss functions, we propose a simple yet effective pseudo labeling technique, Cross-Loss Pseudo Labeling (CLP), that alleviates both the confirmation bias and the shortage of pseudo labels. Intuitively, we mitigate the overconfidence of the cross-entropy with the conservative predictions of the focal loss, while increasing the number of pseudo labels marked by the focal loss based on the cross-entropy. Additionally, CLP improves the performance of tail classes in class-imbalanced datasets through the class-bias mitigation effect of the focal loss. In our experiments, CLP improves mIoU by up to +10.4%p over a supervised model when only 1/32 of the true labels are available on PASCAL VOC 2012, and it surpasses the performance of state-of-the-art methods.
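
    The cross-loss interplay can be sketched as below, where each branch's pseudo labels are filtered by the other's confidence. The focal loss definition is standard, but the masking rule is an illustrative simplification, not the paper's exact criterion.

        import torch
        import torch.nn.functional as F

        def focal_loss(logits, target, gamma=2.0):
            # down-weights confidently predicted pixels, yielding conservative confidence
            logp = F.log_softmax(logits, dim=1)
            return F.nll_loss((1 - logp.exp()) ** gamma * logp, target)

        def cross_loss_pseudo_labels(ce_logits, fl_logits, tau=0.95):
            # ce_logits: cross-entropy branch (overconfident, abundant labels)
            # fl_logits: focal branch (conservative, fewer labels)
            ce_conf, ce_lbl = F.softmax(ce_logits, dim=1).max(dim=1)
            fl_conf, fl_lbl = F.softmax(fl_logits, dim=1).max(dim=1)
            agree = ce_lbl == fl_lbl
            # keep a pseudo label where both branches agree and either is confident:
            # the focal branch tempers CE overconfidence, while the CE branch rescues
            # labels the focal branch marks too conservatively
            mask = agree & ((ce_conf > tau) | (fl_conf > tau))
            return ce_lbl, mask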