Visual Recognition with Deep Nearest Centroids
We devise deep nearest centroids (DNC), a conceptually elegant yet
surprisingly effective network for large-scale visual recognition, by
revisiting Nearest Centroids, one of the most classic and simple classifiers.
Current deep models learn the classifier in a fully parametric manner, ignoring
the latent data structure and lacking simplicity and explainability. DNC
instead conducts nonparametric, case-based reasoning; it utilizes sub-centroids
of training samples to describe class distributions and clearly explains the
classification as the proximity of test data and the class sub-centroids in the
feature space. Due to the distance-based nature, the network output
dimensionality is flexible, and all the learnable parameters are only for data
embedding. That means all the knowledge learnt for ImageNet classification can
be completely transferred for pixel recognition learning, under the
"pre-training and fine-tuning" paradigm. Apart from its nested simplicity and
intuitive decision-making mechanism, DNC can even possess ad-hoc explainability
when the sub-centroids are selected as actual training images that humans can
view and inspect. Compared with parametric counterparts, DNC performs better on
image classification (CIFAR-10, ImageNet) and greatly boosts pixel recognition
(ADE20K, Cityscapes), with improved transparency and fewer learnable
parameters, using various network architectures (ResNet, Swin) and segmentation
models (FCN, DeepLabV3, Swin). We feel this work brings fundamental insights
into related fields. Comment: 23 pages, 8 figures
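The core decision rule of DNC can be sketched quite compactly: classify an embedded sample by the class of its nearest sub-centroid in feature space. The following minimal NumPy sketch illustrates that rule only (the function name and toy data are illustrative; the actual model learns the embedding and sub-centroids end-to-end):

```python
import numpy as np

def nearest_subcentroid_classify(embeddings, subcentroids, subcentroid_labels):
    """Assign each embedded sample the class of its nearest sub-centroid.

    embeddings:         (N, D) embedded test samples
    subcentroids:       (M, D) several sub-centroids per class
    subcentroid_labels: (M,)   class index of each sub-centroid
    """
    # Pairwise squared Euclidean distances between samples and sub-centroids.
    d2 = ((embeddings[:, None, :] - subcentroids[None, :, :]) ** 2).sum(-1)
    # Each sample takes the class of its closest sub-centroid.
    return subcentroid_labels[d2.argmin(axis=1)]

# Toy example: two classes, two sub-centroids each.
subcentroids = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
x = np.array([[0.1, 0.4], [4.9, 5.6]])
print(nearest_subcentroid_classify(x, subcentroids, labels))  # → [0 1]
```

Because the class decision lives entirely in the distance computation, the learnable parameters are confined to the embedding, which is what makes the "pre-training and fine-tuning" transfer to pixel recognition straightforward.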
Bird's-Eye-View Scene Graph for Vision-Language Navigation
Vision-language navigation (VLN), which requires an agent to navigate 3D
environments following human instructions, has seen great advances. However,
current agents are built upon panoramic observations, which hinder their
ability to perceive 3D scene geometry and easily lead to ambiguous selection
among panoramic views. To address these limitations, we present a BEV Scene Graph
(BSG), which leverages multi-step BEV representations to encode scene layouts
and geometric cues of indoor environments under the supervision of 3D detection.
During navigation, BSG builds a local BEV representation at each step and
maintains a BEV-based global scene map, which stores and organizes all the
online collected local BEV representations according to their topological
relations. Based on BSG, the agent predicts a local BEV grid-level decision
score and a global graph-level decision score, combined with a sub-view
selection score on panoramic views, for more accurate action prediction. Our
approach significantly outperforms state-of-the-art methods on REVERIE, R2R,
and R4R, showing the potential of BEV perception in VLN. Comment: Accepted at ICCV 2023; Project page:
https://github.com/DefaultRui/BEV-Scene-Grap
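The final action prediction combines three per-candidate scores: the local BEV grid-level score, the global graph-level score, and the panoramic sub-view selection score. The real fusion is learned; the sketch below is only a hypothetical weighted-sum-plus-softmax stand-in to make the mechanism concrete:

```python
import numpy as np

def fuse_action_scores(grid_score, graph_score, subview_score,
                       weights=(1.0, 1.0, 1.0)):
    """Combine per-candidate decision scores into one action distribution.

    A hypothetical fusion: weighted sum of the three score vectors,
    normalized with a numerically stable softmax.
    """
    w_grid, w_graph, w_view = weights
    logits = w_grid * grid_score + w_graph * graph_score + w_view * subview_score
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = fuse_action_scores(
    np.array([0.2, 1.0, 0.1]),   # local BEV grid-level scores
    np.array([0.5, 0.8, 0.0]),   # global graph-level scores
    np.array([0.1, 0.6, 0.3]),   # panoramic sub-view scores
)
```

The candidate with the highest fused score becomes the predicted action; in the learned model the relative weighting of the three cues is itself trained rather than fixed.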
Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation
Audio-visual navigation is an audio-targeted wayfinding task in which a robot
agent must travel through a never-before-seen 3D environment towards the
sounding source. In this article, we present ORAN, an omnidirectional
audio-visual navigator based on cross-task navigation skill transfer. In
particular, ORAN sharpens its two basic abilities for such a challenging task,
namely wayfinding and audio-visual information gathering. First, ORAN is
trained with a confidence-aware cross-task policy distillation (CCPD) strategy.
CCPD transfers the fundamental, point-to-point wayfinding skill that is well
trained on the large-scale PointGoal task to ORAN, helping it better
master audio-visual navigation with far fewer training samples. To improve the
efficiency of knowledge transfer and address the domain gap, CCPD is made to be
adaptive to the decision confidence of the teacher policy. Second, ORAN is
equipped with an omnidirectional information gathering (OIG) mechanism, i.e.,
gleaning visual-acoustic observations from different directions before
decision-making. As a result, ORAN yields more robust navigation behaviour.
Taking CCPD and OIG together, ORAN significantly outperforms previous
competitors. With model ensembling, ORAN won 1st place in the Soundspaces
Challenge 2022, improving SPL and SR by 53% and 35%, respectively. Comment: ICCV 202
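The key idea of CCPD is that each step's distillation term is modulated by how confident the teacher policy is. The sketch below uses the teacher's maximum action probability as that confidence and scales a per-step KL term by it; this is a simplified stand-in for CCPD's adaptive rule, and all names are illustrative:

```python
import numpy as np

def confidence_weighted_distillation(teacher_probs, student_log_probs):
    """Policy distillation loss weighted by teacher decision confidence.

    Per step, KL(teacher || student) is scaled by the teacher's maximum
    action probability, so uncertain teacher decisions contribute less.
    teacher_probs:     (B, A) teacher action distributions
    student_log_probs: (B, A) student log-probabilities
    """
    eps = 1e-12
    kl = (teacher_probs * (np.log(teacher_probs + eps) - student_log_probs)).sum(-1)
    confidence = teacher_probs.max(-1)   # (B,) teacher certainty per step
    return (confidence * kl).mean()      # scalar distillation loss

teacher = np.array([[0.9, 0.1]])                 # confident teacher step
student_log = np.log(np.array([[0.5, 0.5]]))     # uninformed student
loss = confidence_weighted_distillation(teacher, student_log)
```

When the student matches the teacher the KL term vanishes and the loss goes to zero; confident teacher steps pull the student harder, which is the intended way to bridge the PointGoal-to-audio-visual domain gap with fewer samples.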
Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing
To address the challenging task of instance-aware human part parsing, a new
bottom-up regime is proposed to learn category-level human semantic
segmentation as well as multi-person pose estimation in a joint and end-to-end
manner. It is a compact, efficient and powerful framework that exploits
structural information over different human granularities and eases the
difficulty of person partitioning. Specifically, a dense-to-sparse projection
field, which allows explicitly associating dense human semantics with sparse
keypoints, is learnt and progressively improved over the network feature
pyramid for robustness. Then, the difficult pixel grouping problem is cast as
an easier, multi-person joint assembling task. By formulating joint association
as maximum-weight bipartite matching, a differentiable solution is developed to
exploit projected gradient descent and Dykstra's cyclic projection algorithm.
This makes our method end-to-end trainable and allows back-propagating the
grouping error to directly supervise multi-granularity human representation
learning. This is distinguished from current bottom-up human parsers or pose
estimators which require sophisticated post-processing or heuristic greedy
algorithms. Experiments on three instance-aware human parsing datasets show
that our model outperforms other bottom-up alternatives with much more
efficient inference. Comment: CVPR 2021 (Oral). Code: https://github.com/tfzhou/MG-HumanParsin
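The objective being made differentiable here is maximum-weight bipartite matching between joints and persons. The paper solves a relaxation with projected gradient descent and Dykstra's cyclic projections; the brute-force reference below is not that solver, it merely defines the matching objective on a tiny instance (names and data are illustrative):

```python
import itertools
import numpy as np

def max_weight_matching(weights):
    """Exact maximum-weight bipartite matching by enumeration (reference only).

    weights: (n, n) score of assigning row item i to column item j.
    Returns the column assigned to each row and the total matching score.
    """
    n = weights.shape[0]
    best_perm, best_score = None, -np.inf
    for perm in itertools.permutations(range(n)):
        score = sum(weights[i, j] for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score

W = np.array([[3.0, 1.0],
              [1.0, 2.0]])
print(max_weight_matching(W))  # → ([0, 1], 5.0)
```

The differentiable relaxation replaces the hard permutation with a doubly-stochastic assignment and projects back onto the feasible set each step, which is what lets the grouping error back-propagate into the representation.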
ClusterFormer: Clustering As A Universal Visual Learner
This paper presents CLUSTERFORMER, a universal vision model that is based on
the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: 1.
recurrent cross-attention clustering, which reformulates the cross-attention
mechanism in Transformer and enables recursive updates of cluster centers to
facilitate strong representation learning; and 2. feature dispatching, which
uses the updated cluster centers to redistribute image features through
similarity-based metrics, resulting in a transparent pipeline. This elegant
design streamlines an explainable and transferable workflow, capable of
tackling heterogeneous vision tasks (i.e., image classification, object
detection, and image segmentation) with varying levels of clustering
granularity (i.e., image-, box-, and pixel-level). Empirical results
demonstrate that CLUSTERFORMER outperforms various well-known specialized
architectures, achieving 83.41% top-1 accuracy on ImageNet-1K for image
classification, 54.2% and 47.0% mAP on MS COCO for object detection and
instance segmentation, 52.4% mIoU on ADE20K for semantic segmentation, and
55.8% PQ on COCO Panoptic for panoptic segmentation. Given its efficacy, we
hope our work can catalyze a paradigm shift in universal models in computer
vision.
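The recurrent clustering step can be pictured as soft k-means expressed in attention form: assignment weights come from a softmax over feature-center similarities, and each center is refreshed as the weighted mean of its features. The sketch below captures that spirit only; it uses a distance-based similarity kernel for stability and omits the learned projections of the real model:

```python
import numpy as np

def recurrent_cluster_update(features, centers, iters=3, tau=1.0):
    """Recursively update cluster centers via cross-attention over features.

    features: (N, D) image features; centers: (K, D) cluster centers.
    Returns updated centers and the final soft assignments.
    """
    attn = None
    for _ in range(iters):
        # Attention logits: (negative squared) distance between features and centers.
        sim = -((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1) / tau
        attn = np.exp(sim - sim.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)           # soft assignments
        # "Feature dispatching": centers become similarity-weighted means.
        centers = (attn.T @ features) / attn.sum(axis=0)[:, None]
    return centers, attn

features = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init_centers = np.array([[1.0, 1.0], [9.0, 9.0]])
centers, attn = recurrent_cluster_update(features, init_centers)
```

Running the update a few times converges the centers onto the two feature groups, and the soft-assignment matrix itself is what makes the pipeline transparent: it directly says which features each center explains.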
Target-Driven Structured Transformer Planner for Vision-Language Navigation
Vision-language navigation is the task of directing an embodied agent to
navigate in 3D scenes with natural language instructions. For the agent,
inferring the long-term navigation target from visual-linguistic clues is
crucial for reliable path planning, which, however, has rarely been studied
in the literature. In this article, we propose a Target-Driven Structured
Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware
navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism
for explicit estimation of the long-term target (even located in unexplored
environments). In addition, we design a Structured Transformer Planner which
elegantly incorporates the explored room layout into a neural attention
architecture for structured and global planning. Experimental results
demonstrate that our TD-STP substantially improves previous best methods'
success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks,
respectively. Our code is available at https://github.com/YushengZhao/TD-STP
Real-time superpixel segmentation by DBSCAN clustering algorithm
In this paper, we propose a real-time image superpixel segmentation method running at 50 frames/s, based on the density-based spatial clustering of applications with noise (DBSCAN) algorithm. To decrease the computational cost of superpixel algorithms, we adopt a fast two-step framework. In the first clustering stage, the DBSCAN algorithm with color-similarity and geometric restrictions is used to rapidly cluster the pixels; in the second merging stage, small clusters are merged into superpixels with their neighbors through a distance measure defined over color and spatial features. A robust and simple distance function is defined for obtaining better superpixels in these two steps. The experimental results demonstrate that our real-time superpixel algorithm (50 frames/s) based on DBSCAN clustering outperforms state-of-the-art superpixel segmentation methods in terms of both accuracy and efficiency.
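The first clustering stage grows pixel clusters under a joint color-similarity and spatial restriction. The sketch below is a drastically simplified, DBSCAN-like raster pass (the function, threshold, and 4-adjacency restriction are illustrative assumptions, not the paper's exact algorithm):

```python
import numpy as np

def fast_color_clustering(image, color_thresh=30.0):
    """First-stage pixel clustering: a simplified, DBSCAN-like raster pass.

    Each pixel joins the cluster of its left or top neighbor when their
    color distance is under `color_thresh` (the geometric restriction here
    is simply 4-adjacency); otherwise it seeds a new cluster. The second
    stage would then merge small clusters into superpixels.
    image: (H, W, 3) float array of colors.
    """
    H, W, _ = image.shape
    labels = -np.ones((H, W), dtype=int)
    next_label = 0
    for y in range(H):
        for x in range(W):
            for ny, nx in ((y, x - 1), (y - 1, x)):   # left / top neighbors
                if ny >= 0 and nx >= 0 and \
                        np.linalg.norm(image[y, x] - image[ny, nx]) < color_thresh:
                    labels[y, x] = labels[ny, nx]
                    break
            if labels[y, x] < 0:                       # no similar neighbor
                labels[y, x] = next_label
                next_label += 1
    return labels

img = np.zeros((2, 4, 3))
img[:, 2:] = 255.0                                     # right half is white
print(fast_color_clustering(img).tolist())  # → [[0, 0, 1, 1], [0, 0, 1, 1]]
```

A single raster pass like this is what keeps the per-frame cost low enough for the 50 frames/s budget; the accuracy then comes from the second, color-and-spatial merging stage.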
Organic-Inorganic Perovskite Light-Emitting Electrochemical Cells with a Large Capacitance
Perovskite light-emitting diodes, typically made with a high work function anode and a low work function cathode, have recently gained intense interest. Here, perovskite light-emitting devices with two high work function electrodes are demonstrated, showing several interesting features. First, electroluminescence is easily obtained under both forward and reverse bias. Second, impedance spectroscopy indicates that the ionic conductivity in the iodide perovskite (CH3NH3PbI3) is large, with a value of ≈10-8 S cm-1. Third, the shift of the emission spectrum in mixed halide perovskite (CH3NH3PbI3-xBrx) light-emitting devices indicates that I- ions are mobile in the perovskites. Fourth, the accumulated ions at the interfaces result in a large capacitance (≈100 μF cm-2). These results conclusively show that organic-inorganic halide perovskites are solid electrolytes with mixed ionic and electronic conductivity and that the light-emitting device is a light-emitting electrochemical cell (LEC). The work also suggests that organic-inorganic halide perovskites are potential energy-storage materials, applicable to solid-state supercapacitors and batteries.