MVPNet: Multi-View Point Regression Networks for 3D Object Reconstruction from A Single Image
In this paper, we address the problem of reconstructing an object's surface
from a single image using generative networks. First, we represent a 3D surface
with an aggregation of dense point clouds from multiple views. Each point cloud
is embedded in a regular 2D grid aligned with the image plane of a viewpoint, making the point cloud ordered and convolution-friendly so that it fits into deep
network architectures. The point clouds can be easily triangulated by
exploiting connectivities of the 2D grids to form mesh-based surfaces. Second,
we propose an encoder-decoder network that generates such kind of multiple
view-dependent point clouds from a single image by regressing their 3D
coordinates and visibilities. We also introduce a novel geometric loss that measures discrepancy over 3D surfaces, as opposed to 2D projective planes, by leveraging the surface discretization of the constructed meshes. We
demonstrate that the multi-view point regression network outperforms
state-of-the-art methods with a significant improvement on challenging
datasets.
Comment: 8 pages; accepted by AAAI 2019
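The grid-based triangulation described above is straightforward to realize: because each point cloud lives on a regular 2D grid, every grid cell yields two triangles. The sketch below is an illustrative reconstruction, not the authors' code; the function name and visibility threshold are assumptions.

```python
import numpy as np

def triangulate_grid(points, visibility, threshold=0.5):
    """Triangulate an H x W grid of regressed 3D points into mesh faces.

    points:     (H, W, 3) regressed 3D coordinates.
    visibility: (H, W) predicted visibility scores in [0, 1].
    Returns (faces, vertices); faces index into the flattened grid, and
    only cells whose four corners are visible contribute triangles.
    """
    H, W = visibility.shape
    idx = np.arange(H * W).reshape(H, W)
    vis = visibility >= threshold
    faces = []
    for i in range(H - 1):
        for j in range(W - 1):
            if vis[i, j] and vis[i, j + 1] and vis[i + 1, j] and vis[i + 1, j + 1]:
                tl, tr = idx[i, j], idx[i, j + 1]
                bl, br = idx[i + 1, j], idx[i + 1, j + 1]
                # Split the quad into two consistently wound triangles.
                faces.append([tl, tr, bl])
                faces.append([tr, br, bl])
    return np.asarray(faces, dtype=np.int64), points.reshape(-1, 3)
```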
Learning Fully Dense Neural Networks for Image Semantic Segmentation
Semantic segmentation is a pixel-wise classification task that retains critical spatial information. "Feature map reuse" has been commonly adopted in CNN-based approaches, exploiting feature maps from early layers for later spatial reconstruction. Along this direction, we go a step further by
proposing a fully dense neural network with an encoder-decoder structure that
we abbreviate as FDNet. For each stage in the decoder module, the feature maps of all previous blocks are adaptively aggregated and fed forward as input. On the one hand, this reconstructs spatial boundaries accurately; on the other, it enables more efficient learning through more direct gradient backpropagation. In addition, we propose a boundary-aware loss function that focuses more attention on pixels near the boundary, improving the labeling of such "hard examples". We demonstrate that FDNet achieves the best performance on two benchmark datasets, PASCAL VOC 2012 and NYUDv2, among previous works that do not train on additional datasets.
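As a rough illustration of how a boundary-aware loss can up-weight pixels near label boundaries, the sketch below scales per-pixel cross-entropy by a distance-transform-based weight. The weighting scheme and hyperparameters are assumptions for exposition, not necessarily FDNet's exact formulation.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def boundary_aware_loss(logits, labels, sigma=5.0, ignore_index=255):
    """Cross-entropy up-weighted near segmentation boundaries (sketch).

    logits: (N, C, H, W) raw predictions; labels: (N, H, W) class ids.
    """
    per_pixel = F.cross_entropy(logits, labels, reduction="none",
                                ignore_index=ignore_index)
    weights = torch.ones_like(per_pixel)
    for n in range(labels.shape[0]):
        lab = labels[n].cpu().numpy()
        # Mark pixels whose 4-neighbours carry a different label.
        boundary = np.zeros(lab.shape, dtype=bool)
        boundary[:-1, :] |= lab[:-1, :] != lab[1:, :]
        boundary[1:, :] |= lab[:-1, :] != lab[1:, :]
        boundary[:, :-1] |= lab[:, :-1] != lab[:, 1:]
        boundary[:, 1:] |= lab[:, :-1] != lab[:, 1:]
        dist = distance_transform_edt(~boundary)  # distance to boundary
        w = 1.0 + np.exp(-dist / sigma)           # larger weight near edges
        weights[n] = torch.from_numpy(w).float().to(per_pixel.device)
    return (weights * per_pixel).mean()
```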
Hybrid Instance-aware Temporal Fusion for Online Video Instance Segmentation
Recently, transformer-based image segmentation methods have achieved notable
success against previous solutions. For video domains, however, how to effectively model temporal context with attention over object instances across frames remains an open problem. In this paper, we propose an online video instance
segmentation framework with a novel instance-aware temporal fusion method. We
first leverage a representation consisting of a latent code in the global context (the instance code) and CNN feature maps, capturing instance- and pixel-level features respectively. Based on this representation, we introduce a cropping-free temporal
fusion approach to model the temporal consistency between video frames.
Specifically, we encode global instance-specific information in the instance
code and build up inter-frame contextual fusion with hybrid attentions between
the instance codes and CNN feature maps. Inter-frame consistency between the instance codes is further enforced with order constraints. By leveraging the
learned hybrid temporal consistency, we are able to directly retrieve and
maintain instance identities across frames, eliminating the complicated
frame-wise instance matching in prior methods. Extensive experiments have been
conducted on popular VIS datasets, i.e., Youtube-VIS 2019/2021. Our model achieves the best performance among all online VIS methods. Notably, our model also eclipses all offline methods when using the ResNet-50 backbone.
Comment: AAAI 2022
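The hybrid attention between instance codes and CNN feature maps can be pictured as a cross-attention layer in which each per-instance latent code queries the flattened pixel features. The module below is a minimal sketch with hypothetical names; the full model adds inter-frame fusion and order constraints on top.

```python
import torch.nn as nn

class HybridInstanceFusion(nn.Module):
    """Instance codes cross-attend over pixel features (illustrative)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, instance_codes, feat_map):
        # instance_codes: (B, K, C), one latent code per instance slot.
        # feat_map:       (B, C, H, W) CNN features of the current frame.
        B, C, H, W = feat_map.shape
        pixels = feat_map.flatten(2).transpose(1, 2)  # (B, H*W, C)
        fused, _ = self.attn(instance_codes, pixels, pixels)
        return self.norm(instance_codes + fused)      # residual update
```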
Weakly-supervised Temporal Action Localization by Uncertainty Modeling
Weakly-supervised temporal action localization aims to learn to detect temporal intervals of action classes with only video-level labels. To this end,
it is crucial to separate frames of action classes from the background frames
(i.e., frames not belonging to any action classes). In this paper, we present a
new perspective on background frames: owing to their inconsistency, we model them as out-of-distribution samples. Then, background
frames can be detected by estimating the probability of each frame being
out-of-distribution, known as uncertainty, but it is infeasible to directly
learn uncertainty without frame-level labels. To realize uncertainty learning in the weakly-supervised setting, we leverage the multiple instance learning formulation. Moreover, we introduce a background entropy loss
to better discriminate background frames by encouraging their in-distribution
(action) probabilities to be uniformly distributed over all action classes.
Experimental results show that our uncertainty modeling is effective at
alleviating the interference of background frames and brings a large
performance gain without bells and whistles. We demonstrate that our model
significantly outperforms state-of-the-art methods on the benchmarks, THUMOS'14
and ActivityNet (1.2 & 1.3). Our code is available at
https://github.com/Pilhyeon/WTAL-Uncertainty-Modeling.
Comment: Accepted by the 35th AAAI Conference on Artificial Intelligence (AAAI 2021).
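The background entropy loss admits a compact formulation: pushing a background frame's action distribution toward uniform is equivalent to minimizing cross-entropy against a uniform target. The snippet below is a sketch of that idea, not the released code.

```python
import torch
import torch.nn.functional as F

def background_entropy_loss(action_logits):
    """Encourage uniform action probabilities for background frames.

    action_logits: (N, C) logits of frames selected as background.
    Cross-entropy against the uniform distribution, i.e. the negative
    mean log-probability, is minimized when all C classes are equally
    likely -- maximizing the entropy of the prediction.
    """
    log_probs = F.log_softmax(action_logits, dim=-1)
    num_classes = action_logits.shape[-1]
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    return -(uniform * log_probs).sum(dim=-1).mean()
```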
Thermal and mechanical quantitative sensory testing in Chinese patients with burning mouth syndrome: a probable neuropathic pain condition?
BACKGROUND: To explore the hypothesis that burning mouth syndrome (BMS) is probably a neuropathic pain condition, thermal and mechanical sensory and pain thresholds were tested and compared with age- and gender-matched control participants using a standardized battery of psychophysical techniques.
METHODS: Twenty-five BMS patients (men: 8, women: 17, age: 49.5 ± 11.4 years) and 19 age- and gender-matched healthy control participants were included. The cold detection threshold (CDT), warm detection threshold (WDT), cold pain threshold (CPT), heat pain threshold (HPT), mechanical detection threshold (MDT) and mechanical pain threshold (MPT) were measured, in accordance with the German Network of Neuropathic Pain guidelines, at the following four sites: the dorsum of the left hand (hand), the skin at the mental foramen (chin), the tip of the tongue (tongue), and the mucosa of the lower lip (lip). Statistical analysis was performed using ANOVA with repeated measures to compare the means within and between groups. Furthermore, Z-score profiles were generated, and exploratory correlation analyses between QST and clinical variables were performed. Two-tailed tests with a significance level of 5% were used throughout.
RESULTS: CDTs (P < 0.02) were significantly lower (less sensitivity) and HPTs (P < 0.001) were significantly higher (less sensitivity) at the tongue and lip in BMS patients compared to control participants. WDT (P = 0.007) was also significantly higher at the tongue in BMS patients compared to control participants. There were no significant differences in MDT and MPT between the BMS patients and healthy participants at any of the four test sites. Z-scores showed that significant loss of function can be identified for CDT (Z-score = −0.9 ± 1.1) and HPT (Z-score = 1.5 ± 0.4). There were no significant correlations between QST and clinical variables (pain intensity, duration, depression scores).
CONCLUSION: BMS patients had a significant loss of thermal function but not mechanical function, supporting the hypothesis that BMS may be a probable neuropathic pain condition. Further studies including, e.g., electrophysiological or imaging techniques are needed to clarify the underlying mechanisms of BMS.
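For context, the Z-score profiles mentioned in the results are conventionally computed per modality against the matched control group, along the lines of the cited German Network guidelines; a sketch of the standard formula follows (not reproduced from the paper itself):

```latex
% Each patient measurement expressed relative to the control group:
Z = \frac{X_{\text{patient}} - \mu_{\text{control}}}{\sigma_{\text{control}}}
% In the profile above, negative Z for CDT and positive Z for HPT both
% reflect reduced thermal sensitivity relative to controls.
```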
Online Video Instance Segmentation via Robust Context Fusion
Video instance segmentation (VIS) aims at classifying, segmenting and
tracking object instances in video sequences. Recent transformer-based neural
networks have demonstrated their powerful capability of modeling
spatio-temporal correlations for the VIS task. Relying on video- or clip-level
input, they suffer from high latency and computational cost. We propose a
robust context fusion network to tackle VIS in an online fashion, which
predicts instance segmentation frame-by-frame with a few preceding frames. To
acquire precise and temporally consistent predictions for each frame
efficiently, the key idea is to fuse effective and compact context from
reference frames into the target frame. Considering the different effects of
reference and target frames on the target prediction, we first summarize
contextual features through importance-aware compression. A transformer encoder
is adopted to fuse the compressed context. Then, we leverage an
order-preserving instance embedding to convey identity-aware information and match identities to the predicted instance masks. We demonstrate that
our robust fusion network achieves the best performance among existing online
VIS methods and is even better than previously published clip-level methods on
the Youtube-VIS 2019 and 2021 benchmarks. In addition, visual objects often
have acoustic signatures that are naturally synchronized with them in
audio-bearing video recordings. By leveraging the flexibility of our context
fusion network on multi-modal data, we further investigate the influence of
audio on the dense video prediction task, which has not been discussed in existing works. We build an Audio-Visual Instance Segmentation dataset and demonstrate that acoustic signals in in-the-wild scenarios can benefit the VIS task.
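To make the importance-aware compression concrete, here is a sketch of one plausible realization: score each spatial token of a reference frame and keep only the top-k, producing the compact context that the transformer encoder later fuses. Module names, the top-k selection, and the sigmoid gating are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ImportanceAwareCompression(nn.Module):
    """Keep the k most important reference-frame tokens (illustrative)."""

    def __init__(self, dim=256, keep=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned per-token importance
        self.keep = keep

    def forward(self, feat_map):
        # feat_map: (B, C, H, W) features from a reference frame.
        B, C, H, W = feat_map.shape
        tokens = feat_map.flatten(2).transpose(1, 2)     # (B, H*W, C)
        scores = self.score(tokens).squeeze(-1)          # (B, H*W)
        k = min(self.keep, H * W)
        top = scores.topk(k, dim=1).indices              # (B, k)
        gathered = tokens.gather(1, top.unsqueeze(-1).expand(-1, -1, C))
        # Gate by importance so token selection stays trainable.
        gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
        return gathered * gate                           # (B, k, C) context
```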