Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation
Unsupervised video object segmentation (VOS) is a task that aims to detect
the most salient object in a video without external guidance about the object.
To leverage the property that salient objects usually have distinctive
movements compared to the background, recent methods collaboratively use motion
cues extracted from optical flow maps with appearance cues extracted from RGB
images. However, as optical flow maps usually correlate strongly with
segmentation masks, the network easily becomes overly dependent on motion cues
during training. As a result, such two-stream approaches are vulnerable to
confusing motion cues, making their predictions unstable. To relieve this
issue, we design a novel motion-as-option network by treating motion cues as
optional. During network training, RGB images are randomly provided to the
motion encoder instead of optical flow maps, to implicitly reduce motion
dependency of the network. As the learned motion encoder can deal with both RGB
images and optical flow maps, two different predictions can be generated
depending on which source information is used as motion input. In order to
fully exploit this property, we also propose an adaptive output selection
algorithm to adopt the optimal prediction result at test time. Our proposed
approach affords state-of-the-art performance on all public benchmark datasets,
even while maintaining real-time inference speed.
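As a hedged illustration of the training scheme described above, the sketch
below randomly substitutes the RGB image for the optical flow map at the motion
encoder's input. The module names and the substitution probability p are
assumptions, not the authors' published code.

```python
import random

def forward_motion_as_option(appearance_encoder, motion_encoder, decoder,
                             rgb, flow, p=0.5, training=True):
    """With probability p during training, feed the RGB image to the motion
    encoder instead of the optical flow map, so the network cannot rely
    exclusively on motion cues."""
    app_feat = appearance_encoder(rgb)
    motion_input = rgb if (training and random.random() < p) else flow
    mot_feat = motion_encoder(motion_input)
    return decoder(app_feat, mot_feat)
```

Because the same motion encoder accepts either source at test time, two
predictions can be generated per frame, which is what the adaptive output
selection algorithm chooses between.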
Occluded Person Re-Identification via Relational Adaptive Feature Correction Learning
Occluded person re-identification (Re-ID) in images captured by multiple
cameras is challenging because the target person is occluded by pedestrians or
objects, especially in crowded scenes. In addition to the processes performed
during holistic person Re-ID, occluded person Re-ID involves the removal of
obstacles and the detection of partially visible body parts. Most existing
methods utilize off-the-shelf pose or parsing networks to generate pseudo labels,
which are prone to error. To address these issues, we propose a novel Occlusion
Correction Network (OCNet) that corrects features through relational-weight
learning and obtains diverse and representative features without using external
networks. In addition, we present a simple concept of a center feature in order
to provide an intuitive solution to pedestrian occlusion scenarios.
Furthermore, we suggest the idea of Separation Loss (SL) for focusing on
different parts between global features and part features. We conduct extensive
experiments on five challenging benchmark datasets for occluded and holistic
Re-ID tasks to demonstrate that our method achieves superior performance to
state-of-the-art methods, especially on occluded scenes.
Comment: ICASSP 202
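The abstract does not give the exact form of the Separation Loss; one plausible
reading, sketched below under that assumption, penalizes cosine similarity
between the global feature and each part feature so that parts focus on
complementary content.

```python
import torch.nn.functional as F

def separation_loss(global_feat, part_feats):
    """Hypothetical Separation Loss: push part features away from the
    global feature. global_feat: (B, C); part_feats: (B, P, C)."""
    g = F.normalize(global_feat, dim=-1).unsqueeze(1)  # (B, 1, C)
    p = F.normalize(part_feats, dim=-1)                # (B, P, C)
    cos = (g * p).sum(dim=-1)                          # (B, P) similarities
    return cos.clamp(min=0).mean()                     # penalize overlap only
```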
Domain Alignment and Temporal Aggregation for Unsupervised Video Object Segmentation
Unsupervised video object segmentation aims at detecting and segmenting the
most salient object in videos. In recent times, two-stream approaches that
collaboratively leverage appearance cues and motion cues have attracted
extensive attention thanks to their powerful performance. However, there are
two limitations faced by those methods: 1) the domain gap between appearance
and motion information is not well considered; and 2) long-term temporal
coherence within a video sequence is not exploited. To overcome these
limitations, we propose a domain alignment module (DAM) and a temporal
aggregation module (TAM). DAM resolves the domain gap between two modalities by
forcing the values to be in the same range using a cross-correlation mechanism.
TAM captures long-term coherence by extracting and leveraging global cues of a
video. On public benchmark datasets, our proposed approach demonstrates its
effectiveness, outperforming all existing methods by a substantial margin.
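As a rough sketch of the value-range alignment idea behind DAM, the snippet
below standardizes both modality features to a common range and fuses them
with a generic per-location correlation map; the exact cross-correlation
operator in the paper may differ.

```python
import torch

def align_and_correlate(app_feat, mot_feat, eps=1e-5):
    """app_feat, mot_feat: (B, C, H, W) appearance and motion features."""
    def standardize(x):  # force both modalities into the same value range
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True)
        return (x - mean) / (std + eps)
    a, m = standardize(app_feat), standardize(mot_feat)
    corr = (a * m).mean(dim=1, keepdim=True)  # per-location correlation
    return torch.cat([a, m, corr], dim=1)     # fused (B, 2C+1, H, W)
```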
Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition
Skeleton-based action recognition has attracted considerable attention due to
the compact skeletal representation of the human body. Many recent methods have
achieved remarkable performance using graph convolutional networks (GCNs) and
convolutional neural networks (CNNs), which extract spatial and temporal
features, respectively. Although spatial and temporal dependencies in the human
skeleton have been explored, spatio-temporal dependency is rarely considered.
In this paper, we propose the Inter-Frame Curve Network (IFC-Net) to
effectively leverage the spatio-temporal dependency of the human skeleton. Our
proposed network consists of two novel elements: 1) The Inter-Frame Curve (IFC)
module; and 2) Dilated Graph Convolution (D-GC). The IFC module increases the
spatio-temporal receptive field by identifying meaningful node connections
between every adjacent frame and generating spatio-temporal curves based on the
identified node connections. The D-GC allows the network to have a large
spatial receptive field, which specifically focuses on the spatial domain. The
kernels of D-GC are computed from the given adjacency matrices of the graph
and reflect a large receptive field in a manner similar to dilated CNNs. Our
IFC-Net
combines these two modules and achieves state-of-the-art performance on three
skeleton-based action recognition benchmarks: NTU-RGB+D 60, NTU-RGB+D 120, and
Northwestern-UCLA.
Comment: 12 pages, 5 figure
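A minimal sketch of a dilated graph convolution in the spirit of D-GC is given
below: k-hop neighborhoods are reached through powers of the adjacency matrix,
analogous to dilation in CNNs. The normalization and module structure are
assumptions, not the paper's exact kernel construction.

```python
import torch
import torch.nn as nn

class DilatedGraphConv(nn.Module):
    """Hypothetical dilated graph convolution over a skeleton graph."""
    def __init__(self, in_ch, out_ch, adj, dilation=2):
        super().__init__()
        # Reachability within `dilation` hops, including self-loops.
        hop = torch.linalg.matrix_power(adj + torch.eye(adj.size(0)), dilation)
        hop = (hop > 0).float()
        self.register_buffer("adj_k", hop / hop.sum(dim=1, keepdim=True))
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):  # x: (B, T, V, C) skeleton features
        x = torch.einsum("uv,btvc->btuc", self.adj_k, x)  # k-hop aggregation
        return self.proj(x)
```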
Pixel-Level Equalized Matching for Video Object Segmentation
Feature similarity matching, which transfers the information of the reference
frame to the query frame, is a key component in semi-supervised video object
segmentation. If surjective matching is adopted, background distractors can
easily be matched and degrade performance. Bijective matching mechanisms try to
prevent this by restricting the amount of information being transferred to the
query frame, but have two limitations: 1) surjective matching cannot be fully
leveraged as it is transformed to bijective matching at test time; and 2)
test-time manual tuning is required to find the optimal hyper-parameters.
To overcome these limitations while ensuring reliable information transfer, we
introduce an equalized matching mechanism. To prevent the reference frame
information from being overly referenced, the potential contribution to the
query frame is equalized by simply applying a softmax operation along the
query dimension. On public benchmark datasets, our proposed approach achieves
performance comparable to that of state-of-the-art methods.
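A minimal sketch of the equalized matching mechanism, under assumed tensor
shapes: taking the softmax along the query dimension caps how much any single
reference pixel can contribute; the final per-query renormalization is an
added assumption to keep the output well scaled.

```python
import torch
import torch.nn.functional as F

def equalized_matching(ref_key, ref_val, qry_key):
    """ref_key, ref_val: (B, C, N_r); qry_key: (B, C, N_q)."""
    sim = torch.einsum("bcn,bcm->bnm", ref_key, qry_key)  # (B, N_r, N_q)
    # Softmax along the query dimension equalizes each reference pixel's
    # total contribution, preventing it from being overly referenced.
    attn = F.softmax(sim, dim=2)
    # Assumption: renormalize per query location so weights sum to one.
    attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-6)
    return torch.einsum("bcn,bnm->bcm", ref_val, attn)    # (B, C, N_q)
```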
Global-Local Aggregation with Deformable Point Sampling for Camouflaged Object Detection
The camouflaged object detection (COD) task aims to find and segment objects
whose color or texture is very similar to that of the background.
Despite the difficulties of the task, COD is attracting attention in medical,
lifesaving, and anti-military fields. To overcome the difficulties of COD, we
propose a novel global-local aggregation architecture with a deformable point
sampling method. Further, we propose a global-local aggregation transformer
that integrates an object's global information, background, and boundary local
information, which is important in COD tasks. The proposed transformer obtains
global information from feature channels and effectively extracts important
local information from subdivided patches using the deformable point sampling
method. Accordingly, the model effectively integrates global and local
information for camouflaged objects and also shows that important boundary
information in COD can be efficiently utilized. Our method is evaluated on
three popular datasets and achieves state-of-the-art performance. We
demonstrate the effectiveness of the proposed method through comparative
experiments.
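As a hedged sketch of deformable point sampling, the module below predicts
per-location offsets from a regular grid and samples features at the offset
points with bilinear interpolation; the head design and number of points are
illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformablePointSampler(nn.Module):
    """Hypothetical deformable point sampling over a feature map."""
    def __init__(self, channels, num_points=9):
        super().__init__()
        self.offset_head = nn.Conv2d(channels, 2 * num_points, 1)
        self.num_points = num_points

    def forward(self, feat):  # feat: (B, C, H, W)
        B, _, H, W = feat.shape
        offsets = self.offset_head(feat).view(B, self.num_points, 2, H, W)
        # Regular base grid in [-1, 1] normalized coordinates.
        ys = torch.linspace(-1, 1, H, device=feat.device)
        xs = torch.linspace(-1, 1, W, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=0)  # (2, H, W), (x, y) order
        sampled = []
        for k in range(self.num_points):
            grid = (base + offsets[:, k]).permute(0, 2, 3, 1)  # (B, H, W, 2)
            sampled.append(F.grid_sample(feat, grid, align_corners=True))
        return torch.stack(sampled, dim=1).mean(dim=1)  # aggregate points
```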
Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation
Referring Image Segmentation (RIS) aims to segment target objects expressed
in natural language within a scene at the pixel level. Various recent RIS
models have achieved state-of-the-art performance by generating contextual
tokens to model multimodal features from pretrained encoders and effectively
fusing them using transformer-based cross-modal attention. While these methods
match language features with image features to effectively identify likely
target objects, they often struggle to correctly understand contextual
information in complex and ambiguous sentences and scenes. To address this
issue, we propose a novel bidirectional token-masking autoencoder (BTMAE)
inspired by the masked autoencoder (MAE). The proposed model learns the context
of image-to-language and language-to-image by reconstructing missing features
in both image and language features at the token level. In other words, the
image and language features mutually complement each other, enabling the
network to understand the deep, interconnected contextual information between
the two modalities. This learning method
enhances the robustness of RIS performance in complex sentences and scenes. Our
BTMAE achieves state-of-the-art performance on three popular datasets, and we
demonstrate the effectiveness of the proposed method through various ablation
studies.
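Below is a minimal sketch of the bidirectional token-masking idea: a random
subset of tokens in each modality is masked and reconstructed from the other
modality through cross-attention decoders. The masking ratio, loss, and module
names are assumptions for illustration.

```python
import torch

def mask_tokens(tokens, ratio=0.5):
    """tokens: (B, N, C). Returns zero-masked tokens and the boolean mask."""
    mask = torch.rand(tokens.shape[:2], device=tokens.device) < ratio
    return tokens.masked_fill(mask.unsqueeze(-1), 0.0), mask

def bidirectional_reconstruction(img_tokens, txt_tokens, img2txt, txt2img):
    """img2txt / txt2img: cross-attention decoders, e.g. nn.TransformerDecoder
    called as decoder(tgt, memory)."""
    img_masked, img_mask = mask_tokens(img_tokens)
    txt_masked, txt_mask = mask_tokens(txt_tokens)
    img_rec = txt2img(img_masked, txt_tokens)  # image tokens from language
    txt_rec = img2txt(txt_masked, img_tokens)  # language tokens from image
    # Reconstruction loss only on the masked positions, as in MAE.
    return ((img_rec - img_tokens)[img_mask] ** 2).mean() + \
           ((txt_rec - txt_tokens)[txt_mask] ** 2).mean()
```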
Integrating Metal-Oxide-Decorated CNT Networks with a CMOS Readout in a Gas Sensor
We have implemented a tin-oxide-decorated carbon nanotube (CNT) network gas sensor system on a single die. We have also demonstrated the deposition of metallic tin on the CNT network, its subsequent oxidation in air, and the resulting improvement in sensor lifetime. The fabricated array of CNT sensors contains 128 sensor cells for added redundancy and increased accuracy. The read-out integrated circuit (ROIC) was combined with coarse and fine time-to-digital converters to extend its resolution in a power-efficient way. The ROIC is fabricated using a 0.35 μm CMOS process, and the whole sensor system consumes 30 mA at 5 V. The sensor system was successfully tested in the detection of ammonia gas at elevated temperatures.
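To illustrate how a coarse/fine TDC pair extends resolution, here is a
back-of-the-envelope sketch; the clock period and number of fine steps are
illustrative values, not the paper's circuit parameters.

```python
def tdc_time(coarse_count, fine_count, t_coarse=10e-9, fine_steps=16):
    """Reconstruct a timestamp from coarse and fine TDC readings.
    The fine interpolator subdivides one coarse clock period."""
    t_fine = t_coarse / fine_steps
    return coarse_count * t_coarse + fine_count * t_fine

# Example: 12 coarse ticks plus 5 fine steps -> 123.125 ns.
print(tdc_time(12, 5))  # 1.23125e-07 seconds
```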