UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View
In 3D object detection for autonomous driving, the sensor portfolio is diverse
and complex, spanning both multi-modality and single-modality setups.
Multi-modal methods incur high system complexity, while single-modal ones
deliver relatively low accuracy, so trading off between them is difficult. In
this work, we propose a universal cross-modality knowledge
distillation framework (UniDistill) to improve the performance of
single-modality detectors. Specifically, during training, UniDistill projects
the features of both the teacher and the student detector into Bird's-Eye-View
(BEV), which is a friendly representation for different modalities. Then, three
distillation losses are calculated to sparsely align the foreground features,
helping the student learn from the teacher without introducing additional cost
during inference. Taking advantage of the similar detection paradigm of
different detectors in BEV, UniDistill easily supports LiDAR-to-camera,
camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths.
Furthermore, the three distillation losses filter out the effect of misaligned
background information and balance the contributions of objects of different sizes,
improving the distillation effectiveness. Extensive experiments on nuScenes
demonstrate that UniDistill effectively improves the mAP and NDS of student
detectors by 2.0%~3.2%.
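To make the sparse foreground alignment concrete, here is a minimal PyTorch sketch of a foreground-masked BEV feature distillation loss. The function name, mask convention, and normalization are illustrative assumptions, not UniDistill's exact formulation.

```python
import torch
import torch.nn.functional as F

def bev_feature_distill(student_bev, teacher_bev, fg_mask, eps=1e-6):
    """Align student BEV features to the teacher only at foreground cells.

    student_bev, teacher_bev: (B, C, H, W) BEV feature maps (hypothetical
    shapes; any detector that projects to BEV fits this interface).
    fg_mask: (B, 1, H, W) binary mask marking ground-truth object regions.
    """
    diff = F.mse_loss(student_bev, teacher_bev, reduction="none")  # (B, C, H, W)
    diff = diff.mean(dim=1, keepdim=True)                          # (B, 1, H, W)
    # Averaging over foreground cells keeps misaligned background out of
    # the loss and stops large objects from dominating small ones.
    return (diff * fg_mask).sum() / (fg_mask.sum() + eps)

# Because both detectors expose same-sized BEV maps, the same loss works for
# LiDAR-to-camera, camera-to-LiDAR, and fusion-to-X teacher/student pairs.
student = torch.randn(2, 64, 128, 128)
teacher = torch.randn(2, 64, 128, 128)
mask = (torch.rand(2, 1, 128, 128) > 0.95).float()
loss = bev_feature_distill(student, teacher, mask)
```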
Collaborative Neural Rendering using Anime Character Sheets
Drawing images of characters with desired poses is an essential but laborious
task in anime production. Assisting artists in this creation process has been a
research hotspot in recent years. In this paper, we present the Collaborative
Neural Rendering (CoNR) method, which creates new images for specified poses
from a few reference images (also known as character sheets). In general, the
diverse hairstyles and garments of anime characters defy the employment of
universal body models such as SMPL, which fit most unclothed human body shapes.
To overcome this, CoNR uses a
compact and easy-to-obtain landmark encoding to avoid creating a unified UV
mapping in the pipeline. In addition, the performance of CoNR can be
significantly improved when referring to multiple reference images, thanks to
feature space cross-view warping in a carefully designed neural network.
Moreover, we have collected a character sheet dataset containing over 700,000
hand-drawn and synthesized images of diverse poses to facilitate research in
this area. Our code and demo are available at
https://github.com/megvii-research/IJCAI2023-CoNR.
Comment: The first three authors contributed equally. In the Arts and Creativity Track of IJCAI2023.
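The feature-space cross-view warping that CoNR benefits from can be pictured as a standard flow-based warp: features from each reference image are resampled toward the target pose before fusion. The PyTorch sketch below uses `grid_sample` for this; the function name and the pixel-offset flow convention are assumptions for illustration, not CoNR's actual implementation.

```python
import torch
import torch.nn.functional as F

def warp_reference_features(ref_feat, flow):
    """Warp one reference view's features with a predicted 2D offset field.

    ref_feat: (B, C, H, W) features from a character-sheet image.
    flow: (B, 2, H, W) per-pixel (x, y) offsets in pixels.
    """
    b, _, h, w = ref_feat.shape
    # Base sampling grid in [-1, 1], as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel offsets to normalized coordinates and displace the grid.
    delta = torch.stack((flow[:, 0] / (w / 2), flow[:, 1] / (h / 2)), dim=-1)
    return F.grid_sample(ref_feat, base + delta, align_corners=True)

# Warped features from several references can then be fused (e.g., averaged
# or attended over) to render the character in the new pose.
feat = torch.randn(1, 32, 64, 64)
flow = torch.zeros(1, 2, 64, 64)  # zero flow: identity warp
out = warp_reference_features(feat, flow)
```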
Scale-Adaptive Feature Aggregation for Efficient Space-Time Video Super-Resolution
The Space-Time Video Super-Resolution (STVSR) task aims to enhance the visual
quality of videos by simultaneously performing video frame interpolation (VFI)
and video super-resolution (VSR). However, faced with the challenges of the
additional temporal dimension and scale inconsistency, most existing STVSR
methods are complex and inflexible when dynamically modeling different motion
amplitudes. In this work, we find that choosing an appropriate processing scale
yields remarkable benefits in flow-based feature propagation. We propose a
novel Scale-Adaptive Feature Aggregation (SAFA) network that adaptively selects
sub-networks with different processing scales for individual samples.
Experiments on four public STVSR benchmarks demonstrate that SAFA achieves
state-of-the-art performance. Our SAFA network outperforms recent
state-of-the-art methods such as TMNet and VideoINR by an average of over
0.5 dB in PSNR, while requiring less than half the parameters and only one
third of the computational cost.
Comment: WACV2024, 16 pages.
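As a rough picture of per-sample scale selection, the PyTorch sketch below routes each sample to one of a few processing scales using a score computed from a global feature descriptor. The module name, the candidate scales, and the hard argmax routing are assumptions made for brevity; SAFA's actual selector is trained end-to-end as part of the network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleSelector(nn.Module):
    """Choose a processing scale per sample from a fixed candidate set."""

    def __init__(self, channels, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.head = nn.Linear(channels, len(scales))  # one score per scale

    def forward(self, feat):
        desc = feat.mean(dim=(2, 3))         # (B, C) global descriptor
        idx = self.head(desc).argmax(dim=1)  # (B,) hard choice per sample
        outputs = []
        for b in range(feat.size(0)):
            s = self.scales[int(idx[b])]
            x = feat[b:b + 1]
            if s != 1.0:
                # Small motions can be handled at lower resolution, cutting
                # the cost of flow-based propagation for those samples.
                x = F.interpolate(x, scale_factor=s, mode="bilinear",
                                  align_corners=False)
            outputs.append(x)
        return outputs, idx
```

A hard argmax is not differentiable; a real implementation would train the selector with a soft relaxation (e.g., Gumbel-softmax) or a policy-style objective.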
Fairness in Recommendation: Foundations, Methods and Applications
As one of the most pervasive applications of machine learning, recommender
systems play an important role in assisting human decision making. The
satisfaction of users and the interests of platforms are closely related to the
quality of the generated recommendation results. However, as highly
data-driven systems, recommender systems can be affected by data or
algorithmic bias and thus generate unfair results, which could weaken users'
reliance on them. As a result, it is crucial to address the potential unfairness
problems in recommendation settings. Recently, there has been growing attention
to fairness considerations in recommender systems, with more and more
literature on approaches to promote fairness in recommendation. However, these
studies are rather fragmented and lack a systematic organization, making the
domain difficult for new researchers to penetrate. This motivates us to provide a
systematic survey of existing works on fairness in recommendation. This survey
focuses on the foundations for fairness in recommendation literature. It first
presents a brief introduction to fairness in basic machine learning tasks
such as classification and ranking in order to provide a general overview of
fairness research, as well as introduce the more complex situations and
challenges that need to be considered when studying fairness in recommender
systems. After that, the survey introduces fairness in recommendation with
a focus on the taxonomies of current fairness definitions, the typical
techniques for improving fairness, as well as the datasets for fairness studies
in recommendation. The survey also talks about the challenges and opportunities
in fairness research with the hope of promoting the fair recommendation
research area and beyond.
Comment: Accepted by ACM Transactions on Intelligent Systems and Technology (TIST).
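To give one concrete flavor of the fairness definitions the survey organizes, the short Python example below measures an item-side exposure gap: how unevenly top-k recommendation slots are distributed across provider groups. The function and the gap statistic are illustrative only; the survey covers many alternative user-side and item-side definitions.

```python
from collections import Counter

def exposure_parity_gap(topk_lists, item_group):
    """Compare the share of top-k exposure each provider group receives.

    topk_lists: one recommendation list per user.
    item_group: mapping from item id to provider-group label.
    Returns the max-minus-min group share (0.0 means equal exposure).
    """
    counts = Counter(item_group[i] for rec in topk_lists for i in rec)
    total = sum(counts.values())
    shares = {g: c / total for g, c in counts.items()}
    return max(shares.values()) - min(shares.values()), shares

recs = [["a", "b", "c"], ["a", "c", "d"], ["a", "b", "d"]]
groups = {"a": "large", "b": "large", "c": "indie", "d": "indie"}
gap, shares = exposure_parity_gap(recs, groups)
print(gap, shares)  # large providers get 5/9 of the slots, indie 4/9
```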
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
Open-vocabulary learning has emerged as a cutting-edge research area,
particularly in light of the widespread adoption of vision-based foundational
models. Its primary objective is to comprehend novel concepts that are not
encompassed within a predefined vocabulary. One key facet of this endeavor is
Visual Grounding, which entails locating a specific region within an image
based on a corresponding language description. While current foundational
models excel at a variety of vision-language tasks, there is a noticeable
absence of models specifically tailored for open-vocabulary visual grounding.
This work introduces two novel and challenging OV tasks, namely
Open-Vocabulary Visual Grounding and Open-Vocabulary Phrase Localization. The
overarching aim is to establish connections between language descriptions and
the localization of novel objects. To facilitate this, we have curated a
comprehensive annotated benchmark, encompassing 7,272 OV-VG images and 1,000
OV-PL images. In our pursuit of addressing these challenges, we delved into
various baseline methodologies rooted in existing open-vocabulary object
detection, VG, and phrase localization frameworks. Surprisingly, we discovered
that state-of-the-art methods often falter in diverse scenarios. Consequently,
we developed a novel framework that integrates two critical components:
Text-Image Query Selection and Language-Guided Feature Attention. These modules
are designed to bolster the recognition of novel categories and enhance the
alignment between visual and linguistic information. Extensive experiments
demonstrate the efficacy of our proposed framework, which consistently attains
SOTA performance on the OV-VG task. Additionally, ablation studies provide
further evidence of the effectiveness of our innovative models. Codes and
datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG
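The Language-Guided Feature Attention component can be pictured as cross-attention from visual tokens to text tokens, so that region features are modulated by the description before grounding. The PyTorch sketch below is a minimal stand-in under that reading; the module and argument names are hypothetical, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LanguageGuidedAttention(nn.Module):
    """Cross-attend visual tokens to an encoded language description."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N, D) flattened image features
        # text_tokens:   (B, T, D) token embeddings of the description
        attended, _ = self.attn(query=visual_tokens, key=text_tokens,
                                value=text_tokens)
        # Residual + norm keeps the visual stream intact while injecting
        # language cues that highlight the described (possibly novel) region.
        return self.norm(visual_tokens + attended)

module = LanguageGuidedAttention(dim=256)
v = torch.randn(2, 400, 256)
t = torch.randn(2, 16, 256)
out = module(v, t)  # (2, 400, 256)
```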
MMOTU: A Multi-Modality Ovarian Tumor Ultrasound Image Dataset for Unsupervised Cross-Domain Semantic Segmentation
Ovarian cancer is one of the most harmful gynecological diseases. Detecting
ovarian tumors at an early stage with computer-aided techniques can effectively
decrease the mortality rate. As medical treatment standards improve, ultrasound
images are widely applied in clinical practice. However, recent notable methods
mainly focus on single-modality ultrasound ovarian tumor segmentation or
recognition; research exploring the representation capability of
multi-modality ultrasound ovarian tumor images is still lacking. To solve this
problem, we propose a Multi-Modality Ovarian Tumor Ultrasound (MMOTU) image
dataset containing 1,469 2D ultrasound images and 170 contrast-enhanced
ultrasonography (CEUS) images with pixel-wise and global-wise
annotations. Based on MMOTU, we mainly focus on the unsupervised cross-domain
semantic segmentation task. To solve the domain shift problem, we propose a
feature alignment based architecture named Dual-Scheme Domain-Selected Network
(DS2Net). Specifically, we first design a source encoder and a target encoder
to extract two-style features of the source and target images. Then, we propose
Domain-Distinct Selected Module (DDSM) and Domain-Universal Selected Module
(DUSM) to extract the distinct and universal features in two styles
(source-style or target-style). Finally, we fuse these two kinds of features
and feed them into the source-decoder and target-decoder to generate final
predictions. Extensive comparison experiments and analysis on MMOTU image
dataset show that DS2Net can boost the segmentation performance for
bidirectional cross-domain adaptation of 2D ultrasound images and CEUS images.
Our proposed dataset and code are all available at
https://github.com/cv516Buaa/MMOTU_DS2Net.
Comment: code: https://github.com/cv516Buaa/MMOTU_DS2Net; paper: 18 pages, 12 figures, 11 tables, 16 formulas.
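One simple way to picture the domain-selection idea behind DDSM and DUSM is a learned channel gate that mixes source-style and target-style features. The PyTorch sketch below implements such a gate; the module name and gating design are assumptions for illustration and are simpler than DS2Net's actual modules.

```python
import torch
import torch.nn as nn

class DomainSelectGate(nn.Module):
    """Per-channel gate deciding how much to keep from each feature style."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global context
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_src, feat_tgt):
        # feat_src, feat_tgt: (B, C, H, W) features from the two encoders.
        g = self.gate(torch.cat([feat_src, feat_tgt], dim=1))  # (B, C, 1, 1)
        # g -> 1 keeps source-style channels; g -> 0 keeps target-style ones.
        return g * feat_src + (1 - g) * feat_tgt
```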
WEA-DINO: An Improved DINO With Word Embedding Alignment for Remote Scene Zero-Shot Object Detection
Remote sensing scene zero-shot object detection (ZSD) aims to detect and recognize both seen and unseen categories of landscape elements with the guidance of word embeddings. This task poses two primary challenges. First, there is considerable variability within categories of landscape elements, causing a misalignment between visual features and word embeddings that is particularly noticeable for unseen categories. Second, existing detection models struggle to provide accurate localization predictions, greatly impacting overall performance. To address these two issues, we propose Word Embedding Alignment DINO (WEA-DINO). Based on the original DINO structure, our WEA-DINO-Head is specifically designed to align the hidden features of 'matching queries' with word embedding features, effectively addressing the misalignment between visual features and word embeddings. Furthermore, aligning the hidden features of 'denoising queries' with word embedding features enables the transfer of localization capabilities from known categories to previously unseen ones. Through extensive experiments on the DIOR benchmark dataset, our method demonstrates state-of-the-art (SOTA) performance. The code is available at https://github.com/cv516Buaa/WEA-DINO
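The alignment step at the heart of WEA-DINO can be illustrated as scoring query hidden features against class word embeddings with cosine similarity; recognizing an unseen class then amounts to adding its embedding to the table at test time. The sketch below is a hedged PyTorch illustration; the function name and temperature are assumptions, not the paper's exact head.

```python
import torch
import torch.nn.functional as F

def embedding_alignment_logits(query_feats, word_embeds, tau=0.07):
    """Score decoder queries against class word embeddings.

    query_feats: (B, Q, D) hidden features of matching/denoising queries.
    word_embeds: (K, D) embeddings of seen (and, at test time, unseen)
    class names.
    """
    q = F.normalize(query_feats, dim=-1)
    w = F.normalize(word_embeds, dim=-1)
    return q @ w.t() / tau  # (B, Q, K) classification logits

logits = embedding_alignment_logits(torch.randn(2, 300, 256),
                                    torch.randn(20, 256))
```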
Multi-Task Recommendations with Reinforcement Learning
In recent years, Multi-task Learning (MTL) has yielded immense success in
Recommender System (RS) applications. However, current MTL-based recommendation
models tend to disregard the session-wise patterns of user-item interactions
because they are predominantly constructed based on item-wise datasets.
Moreover, balancing multiple objectives has always been a challenge in this
field, which is typically avoided via linear estimations in existing works. To
address these issues, in this paper, we propose a Reinforcement Learning (RL)
enhanced MTL framework, namely RMTL, to combine the losses of different
recommendation tasks using dynamic weights. To be specific, the RMTL structure
can address the two aforementioned issues by (i) constructing an MTL
environment from session-wise interactions, (ii) training a multi-task
actor-critic network structure that is compatible with most existing
MTL-based recommendation models, and (iii) optimizing and fine-tuning the MTL
loss function using the weights generated by the critic networks. Experiments on
two real-world public datasets demonstrate the effectiveness of RMTL, which
achieves higher AUC than state-of-the-art MTL-based recommendation models.
Additionally, we evaluate and validate RMTL's compatibility and transferability
across various MTL models.
Comment: TheWebConf2023.
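The central mechanism, combining task losses with dynamic weights derived from critic networks, can be sketched in a few lines of PyTorch. The specific weighting rule below (a softmax over negated critic values, so tasks the critics rate poorly get larger weights) is an illustrative assumption, not RMTL's exact update.

```python
import torch

def critic_weighted_loss(task_losses, critic_values):
    """Blend per-task losses using weights produced by critic estimates.

    task_losses: list of scalar loss tensors, one per recommendation task.
    critic_values: (T,) tensor of critic scores, detached so the weights
    do not backpropagate into the critic networks.
    """
    weights = torch.softmax(-critic_values.detach(), dim=0)
    return sum(w * l for w, l in zip(weights, task_losses))

ctr_loss = torch.tensor(0.9, requires_grad=True)
ctcvr_loss = torch.tensor(0.3, requires_grad=True)
values = torch.tensor([0.2, 0.8])  # critics rate the second task healthier
total = critic_weighted_loss([ctr_loss, ctcvr_loss], values)
total.backward()
```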