Self-Feedback DETR for Temporal Action Detection
Temporal Action Detection (TAD) is challenging but fundamental for real-world video applications. Recently, DETR-based models have been devised for TAD but have not yet performed well. In this paper, we point out a problem in the self-attention of DETR for TAD: the attention modules focus on only a few key elements, which we call the temporal collapse problem. It degrades the capability of the encoder and decoder since their self-attention modules play no role. To solve the problem, we propose a novel framework, Self-DETR, which utilizes the cross-attention maps of the decoder to reactivate the self-attention modules. We recover the relationships among encoder features by simple matrix multiplication of the cross-attention map and its transpose. Likewise, we obtain the relationships among decoder queries. By guiding the collapsed self-attention maps with the calculated guidance maps, we resolve the temporal collapse in the self-attention modules of the encoder and decoder. Our extensive experiments demonstrate that Self-DETR resolves the temporal collapse problem by maintaining high diversity of attention over all layers.
Comment: Accepted to ICCV 202
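The guidance-map computation described above is concrete enough to sketch. Below is a minimal, illustrative version: the decoder cross-attention map multiplied by its transpose recovers relations among encoder features (and among decoder queries), and a simple divergence term, assumed here to be a KL loss, pulls a collapsed self-attention map toward that guidance. The normalisation and the loss form are assumptions for illustration, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def guidance_maps(cross_attn):
        # cross_attn: [Q, T] decoder cross-attention (queries over encoder positions).
        enc_guide = cross_attn.t() @ cross_attn   # [T, T] relations among encoder features
        dec_guide = cross_attn @ cross_attn.t()   # [Q, Q] relations among decoder queries
        # Row-normalise so each guidance row behaves like an attention distribution.
        enc_guide = enc_guide / enc_guide.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        dec_guide = dec_guide / dec_guide.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        return enc_guide, dec_guide

    def feedback_loss(self_attn, guide):
        # Pull a (possibly collapsed) self-attention map toward the guidance map.
        # The KL form is an assumption for illustration.
        return F.kl_div(self_attn.clamp_min(1e-6).log(), guide, reduction="batchmean")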
Minority-Oriented Vicinity Expansion with Attentive Aggregation for Video Long-Tailed Recognition
A dramatic increase in real-world video volume with extremely diverse and emerging topics naturally forms a long-tailed video distribution in terms of categories, and it spotlights the need for Video Long-Tailed Recognition (VLTR). In this work, we summarize the challenges in VLTR and explore how to overcome them. The challenges are: (1) it is impractical to re-train the whole model for high-quality features, (2) acquiring frame-wise labels requires extensive cost, and (3) long-tailed data triggers biased training. Yet, most existing works for VLTR unavoidably utilize image-level features extracted from pretrained models, which are task-irrelevant, and learn from video-level labels. Therefore, to deal with such (1) task-irrelevant features and (2) video-level labels, we introduce two complementary learnable feature aggregators. The learnable layers in each aggregator produce task-relevant representations, and each aggregator assembles the snippet-wise knowledge into a video-level representation. Then, we propose Minority-Oriented Vicinity Expansion (MOVE), which explicitly leverages the class frequency when approximating the vicinity distributions to alleviate (3) biased training. By combining these solutions, our approach achieves state-of-the-art results on large-scale VideoLT and the synthetically induced Imbalanced-MiniKinetics200. With VideoLT features from ResNet-50, it attains 18% and 58% relative improvements on head and tail classes over the previous state-of-the-art method, respectively.
Comment: Accepted to AAAI 2023. Code is available at
https://github.com/wjun0830/MOV
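As a rough illustration of how class frequency could steer vicinity expansion, the toy sketch below mixes each video feature with a random partner, with a mixing strength that grows for rarer classes. The single-label setup, the mixing rule, and all names are assumptions for illustration only; MOVE's actual vicinity distributions and label handling differ.

    import torch

    def minority_vicinity_expand(feats, labels, class_freq, base_eps=0.5):
        # feats: [N, D] video features, labels: [N] class indices,
        # class_freq: [num_classes] training label counts. Toy sketch only.
        perm = torch.randperm(feats.size(0))
        rarity = 1.0 / class_freq[labels].float()      # rarer class -> larger value
        rarity = rarity / rarity.max()                 # normalise to (0, 1]
        eps = (base_eps * rarity).unsqueeze(1)         # per-sample expansion strength
        mixed_feats = (1.0 - eps) * feats + eps * feats[perm]
        # Return both labels plus the mixing weight so a loss can interpolate targets.
        return mixed_feats, (labels, labels[perm], eps.squeeze(1))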
VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting
Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and then counting. However, this sequentially designed two-stage process remains vulnerable to error propagation. In this work, a one-stage baseline, Visual-Language Baseline (VLBase), which explores the implicit association between the semantic and patch embeddings of CLIP, is proposed. Subsequently, VLBase is extended to the Visual-language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, a Learnable Affine Transformation (LAT) is employed to translate the semantic-patch similarity map into a form appropriate for the counting task. Lastly, the layer-wise encoded features are transferred to the decoder through Segment-aware Skip Connections (SaSC) to retain generalization capability for unseen classes. Through extensive experiments on FSC147, CARPK, and PUCPR+, the benefits of the end-to-end framework, VLCounter, are demonstrated.
Comment: Accepted to AAAI 2024. Code is available at
https://github.com/Seunggu0305/VLCounte
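Of the three modules, the Learnable Affine Transformation is the most self-contained; a minimal sketch follows, assuming a simple learnable per-channel scale-and-shift of the similarity map. The exact parameterisation in VLCounter may differ.

    import torch
    import torch.nn as nn

    class LearnableAffine(nn.Module):
        # Sketch of the LAT idea: rescale and shift the CLIP semantic-patch similarity
        # map so it better suits counting. Per-channel parameters are an assumption.
        def __init__(self, num_channels=1):
            super().__init__()
            self.gamma = nn.Parameter(torch.ones(num_channels, 1, 1))
            self.beta = nn.Parameter(torch.zeros(num_channels, 1, 1))

        def forward(self, sim_map):
            # sim_map: [B, C, H, W] similarity between the text embedding and patch embeddings.
            return self.gamma * sim_map + self.beta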
Leveraging Hidden Positives for Unsupervised Semantic Segmentation
The dramatic demand for manpower to produce pixel-level annotations has triggered the advent of unsupervised semantic segmentation. Although recent work employing the vision transformer (ViT) backbone shows exceptional performance, there is still a lack of consideration for task-specific training guidance and local semantic consistency. To tackle these issues, we leverage contrastive learning by excavating hidden positives to learn rich semantic relationships and to ensure semantic consistency in local regions. Specifically, we first discover two types of global hidden positives, task-agnostic and task-specific ones, for each anchor based on the feature similarities defined by a fixed pre-trained backbone and a segmentation head in training, respectively. A gradual increase in the contribution of the latter induces the model to capture task-specific semantic features. In addition, we introduce a gradient propagation strategy to learn semantic consistency between adjacent patches, under the inherent premise that nearby patches are highly likely to possess the same semantics. Specifically, we propagate the loss to local hidden positives, i.e., semantically similar nearby patches, in proportion to the predefined similarity scores. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results on the COCO-stuff, Cityscapes, and Potsdam-3 datasets. Our code is available at: https://github.com/hynnsk/HP.
Comment: Accepted to CVPR 202
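A minimal sketch of the global hidden-positive selection, assuming a cosine-similarity threshold and a linear schedule alpha that mixes the task-agnostic (frozen backbone) and task-specific (segmentation head in training) similarities. Both the threshold and the schedule are illustrative assumptions rather than the paper's exact rule.

    import torch.nn.functional as F

    def hidden_positive_mask(frozen_feats, head_feats, tau=0.9, alpha=0.0):
        # frozen_feats, head_feats: [N, D] patch features from the fixed backbone and
        # the segmentation head in training. Returns an [N, N] hidden-positive mask
        # for a contrastive loss; alpha grows during training to favor the head.
        sim_agnostic = F.normalize(frozen_feats, dim=-1) @ F.normalize(frozen_feats, dim=-1).t()
        sim_specific = F.normalize(head_feats, dim=-1) @ F.normalize(head_feats, dim=-1).t()
        sim = (1.0 - alpha) * sim_agnostic + alpha * sim_specific
        return (sim > tau).float()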
VLSH: Voronoi-based Locality Sensitive Hashing
We present a fast yet accurate k-nearest neighbor search algorithm for high-dimensional sampling-based motion planners. Our technique is built on top of Locality Sensitive Hashing (LSH) but is extended to support the arbitrary distance metrics used in motion planning problems and to adapt to the irregular distributions of samples generated in the configuration space. To enable these novel characteristics, our method embeds samples generated in the configuration space into a simple l2-norm space by using pivot points. We then implicitly define Voronoi regions and use local LSHs with varying quantization factors for those Voronoi regions. We have applied our method and other prior techniques to high-dimensional motion planning problems. Our method shows a performance improvement of up to a factor of three, with even higher accuracy, over prior approximate nearest neighbor search techniques.
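The pivot-embedding step can be sketched directly: each configuration is represented by its distances to a set of pivot configurations, which places the samples in an l2 space where standard Euclidean LSH applies, and the nearest pivot defines a sample's Voronoi region. Pivot selection, the per-region quantization factors, and the hash tables themselves are omitted; this is only an illustrative sketch.

    import numpy as np

    def pivot_embed(samples, pivots, dist):
        # Map configuration-space samples into l2 space: coordinate j of a sample's
        # embedding is its planner-metric distance to pivot j.
        return np.array([[dist(x, p) for p in pivots] for x in samples])

    def voronoi_region(embedded):
        # Each embedding coordinate is already a distance to a pivot, so the Voronoi
        # region of a sample is simply the index of its nearest pivot.
        return np.asarray(embedded).argmin(axis=1)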
Effects of different creep feed types on pre-weaning and post-weaning performance and gut development
Objective: This experiment was carried out to determine the effects of different creep feed types on suckling performance and on the subsequent adjustment to solid feed after weaning.
Methods: A total of 24 multiparous sows and their litters were allotted to one of three treatment groups: i) highly digestible creep feed (Creep), ii) a pig weaning diet (Weaner), and iii) sow feed (Sow), each provided as creep feed until weaning. After weaning, a total of 96 piglets were selected for evaluation of post-weaning performance.
Results: For pre-weaning performance, the Creep treatment led to significantly higher feed intake from 14 to 28 d (p<0.05) and higher body weight gain from 21 to 28 d than the other diets. After weaning, however, the Weaner treatment yielded significantly higher feed intake and average daily gain than the other treatments from 0 to 14 d after weaning (p<0.05); the Creep treatment tended to produce lower villus heights in the duodenum than the other treatments (p = 0.07).
Conclusion: Highly digestible creep feed improved pre-weaning performance, but feed familiarity and grain-based creep feed improved post-weaning performance.
Stochastic Particle Flow for Nonlinear High-Dimensional Filtering Problems
A series of novel filters for probabilistic inference, called particle flow filters, which propose an alternative way of performing Bayesian updates, has been attracting recent interest. These filters provide approximate solutions to nonlinear filtering problems. They do so by defining a continuum of densities between the prior probability density and the posterior, i.e. the filtering density. Building on these methods' successes, we propose a novel filter. The new filter aims to address the shortcomings of sequential Monte Carlo methods when applied to important nonlinear high-dimensional filtering problems. The novel filter uses equally weighted samples, each of which is associated with a local solution of the Fokker-Planck equation. This hybrid of Monte Carlo and local parametric approximation gives rise to a global approximation of the filtering density of interest. We show that, when compared with state-of-the-art methods, the Gaussian-mixture implementation of the new filtering technique, which we call Stochastic Particle Flow, has utility in the context of benchmark nonlinear high-dimensional filtering problems. In addition, we extend the original particle flow filters for tackling multi-target multi-sensor tracking problems to enable a comparison with the new filter.
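For orientation, the sketch below shows the generic exact (Daum-Huang-style) flow for a linear-Gaussian measurement, i.e., the kind of homotopy update that particle flow filters build on; it is not the paper's Stochastic Particle Flow or its Gaussian-mixture implementation.

    import numpy as np

    def exact_flow_update(particles, x_bar, P, H, R, z, n_steps=20):
        # Migrate prior samples toward the posterior by integrating dx/dlambda = A x + b
        # over lambda in [0, 1], for prior mean x_bar, prior covariance P, and a
        # linear measurement z = H x + noise with covariance R.
        d = particles.shape[1]
        lam, dlam = 0.0, 1.0 / n_steps
        for _ in range(n_steps):
            lam += dlam
            S = lam * H @ P @ H.T + R
            A = -0.5 * P @ H.T @ np.linalg.solve(S, H)
            b = (np.eye(d) + 2 * lam * A) @ (
                (np.eye(d) + lam * A) @ P @ H.T @ np.linalg.solve(R, z) + A @ x_bar
            )
            particles = particles + dlam * (particles @ A.T + b)
        return particles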
Progressive Few-Shot Adaptation of Generative Model with Align-Free Spatial Correlation
In few-shot generative model adaptation, the model for the target domain is prone to mode collapse. Recent studies attempted to mitigate the problem by matching the relationships among samples generated from the same latent codes in the source and target domains. The objective has been further extended to the image patch level to transfer the spatial correlation within an instance. However, the patch-level approach assumes consistency of spatial structure between the source and target domains; for example, that the positions of eyes in the two domains are almost identical. Thus, it can introduce visual artifacts if source and target domain images are not well aligned. In this paper, we propose a few-shot generative model adaptation method free from such an assumption, based on the observation that generative models adapt progressively from the source domain to the target domain. Such progressive changes allow us to identify semantically coherent image regions between instances generated by models at neighboring training iterations, and thereby to consider the spatial correlation. We also propose an importance-based patch selection strategy to reduce the complexity of patch-level correlation matching. Our method shows state-of-the-art few-shot domain adaptation performance in both qualitative and quantitative evaluations.
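A rough sketch of the cross-iteration correlation matching: patch features from images generated by the current model and by a snapshot from a neighboring iteration (same latent codes) are compared through their cosine-similarity matrices, with a stand-in importance rule (largest activation norm) for patch selection. Everything here is an illustrative assumption rather than the paper's exact loss.

    import torch.nn.functional as F

    def patch_correlation_loss(feats_now, feats_prev, topk=64):
        # feats_now, feats_prev: [P, D] patch features from the current model and the
        # neighboring-iteration snapshot, generated from the same latent code.
        def corr(f):
            f = F.normalize(f, dim=-1)
            return f @ f.t()                       # [P', P'] patch correlation matrix
        # Importance-based selection stand-in: keep patches with the largest activation norm.
        idx = feats_now.norm(dim=-1).topk(topk).indices
        return F.mse_loss(corr(feats_now[idx]), corr(feats_prev[idx]))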
Task Discrepancy Maximization for Fine-grained Few-Shot Classification
Recognizing discriminative details such as eyes and beaks is important for distinguishing fine-grained classes since they have similar overall appearances. In this regard, we introduce Task Discrepancy Maximization (TDM), a simple module for fine-grained few-shot classification. Our objective is to localize the class-wise discriminative regions by highlighting channels that encode distinct information of each class. Specifically, TDM learns task-specific channel weights based on two novel components: the Support Attention Module (SAM) and the Query Attention Module (QAM). SAM produces a support weight that represents the channel-wise discriminative power for each class. However, since SAM is based only on the labeled support set, it can be vulnerable to bias toward that support set. Therefore, we propose QAM, which complements SAM by yielding a query weight that grants more weight to object-relevant channels for a given query image. By combining these two weights, a class-wise task-specific channel weight is defined. The weights are then applied to produce task-adaptive feature maps that focus more on the discriminative details. Our experiments validate the effectiveness of TDM and its complementary benefits with prior methods in fine-grained few-shot classification.
Comment: Accepted to CVPR 2022 as an oral presentation. Code is available at
https://github.com/leesb7426/CVPR2022-Task-Discrepancy-Maximization-for-Fine-grained-Few-Shot-Classificatio
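The final reweighting step can be sketched as follows, assuming the support weight from SAM and the query weight from QAM are simply averaged into a class-wise channel weight; the averaging rule is an assumption, and how SAM and QAM actually produce the weights is not shown.

    def task_specific_reweight(support_w, query_w, feat_maps):
        # support_w: [num_classes, C] channel weights from SAM (one per class),
        # query_w:   [C] channel weight from QAM for the given query,
        # feat_maps: [num_classes, C, H, W] features to be re-weighted per class.
        w = 0.5 * (support_w + query_w.unsqueeze(0))      # combine into class-wise channel weights
        return feat_maps * w.unsqueeze(-1).unsqueeze(-1)  # channel-wise reweighting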
Cross-Loss Pseudo Labeling for Semi-Supervised Segmentation
Training semantic segmentation models requires pixel-level annotations, leading to a significant labeling cost in dataset creation. To alleviate this issue, recent research has focused on semi-supervised learning, which utilizes only a small amount of annotation. In this context, pseudo labeling techniques are frequently employed to assign labels to unlabeled data based on the model's predictions. However, there are fundamental limitations to the widespread application of pseudo labeling. Since pseudo labels are generally determined by the model's predictions, these labels can be overconfidently assigned even to erroneous predictions, especially when the model has a confirmation bias. We observed that the overconfident prediction tendency of the cross-entropy loss exacerbates this issue, and to address it, we discover that the focal loss, known for enabling more reliable confidence estimation, can complement the cross-entropy loss. The cross-entropy loss produces abundant pseudo labels since it tends to be overconfident; the focal loss, on the other hand, provides more conservative confidence and therefore produces fewer pseudo labels than the cross-entropy loss. Based on this complementary behavior of the two loss functions, we propose a simple yet effective pseudo labeling technique, Cross-Loss Pseudo Labeling (CLP), that alleviates both the confirmation bias and the shortage of pseudo labels. Intuitively, we can mitigate the overconfidence of the cross-entropy loss with the conservative predictions of the focal loss, while increasing the number of pseudo labels marked by the focal loss based on the cross-entropy loss. Additionally, CLP also improves the performance of tail classes in class-imbalanced datasets through the class-bias mitigation effect of the focal loss. In experimental results, our simple CLP improves mIoU by up to +10.4%p over a supervised model when only 1/32 of the true labels are available on PASCAL VOC 2012, and it surpasses the performance of state-of-the-art methods.
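A minimal sketch of the cross-checking idea, assuming two prediction branches, one trained with cross-entropy and one with focal loss: CE pseudo labels are kept only where the conservative focal branch agrees, while confident CE predictions let the focal branch contribute labels at a lower threshold. The thresholds, the two-branch setup, and the ignore index are assumptions, not CLP's exact formulation.

    import torch
    import torch.nn.functional as F

    def cross_loss_pseudo_labels(logits_ce, logits_focal, t_ce=0.95, t_focal=0.7):
        # logits_ce, logits_focal: [B, C, H, W] predictions from the CE- and focal-trained branches.
        prob_ce, pl_ce = F.softmax(logits_ce, dim=1).max(dim=1)      # confidences and labels, [B, H, W]
        prob_f, pl_f = F.softmax(logits_focal, dim=1).max(dim=1)
        # CE pseudo labels: temper overconfidence by requiring agreement with the focal branch.
        keep_ce = (prob_ce > t_ce) & (pl_ce == pl_f)
        # Focal pseudo labels: accept at a lower threshold when the CE branch is also confident.
        keep_f = (prob_f > t_focal) & (prob_ce > t_ce)
        pseudo = torch.full_like(pl_ce, -1)                          # -1 = ignore index
        pseudo[keep_f] = pl_f[keep_f]
        pseudo[keep_ce] = pl_ce[keep_ce]
        return pseudo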