
    MVPNet: Multi-View Point Regression Networks for 3D Object Reconstruction from A Single Image

    In this paper, we address the problem of reconstructing an object's surface from a single image using generative networks. First, we represent a 3D surface with an aggregation of dense point clouds from multiple views. Each point cloud is embedded in a regular 2D grid aligned on an image plane of a viewpoint, making the point cloud convolution-favored and ordered so as to fit into deep network architectures. The point clouds can be easily triangulated by exploiting connectivities of the 2D grids to form mesh-based surfaces. Second, we propose an encoder-decoder network that generates such multiple view-dependent point clouds from a single image by regressing their 3D coordinates and visibilities. We also introduce a novel geometric loss that is able to interpret discrepancy over 3D surfaces as opposed to 2D projective planes, resorting to the surface discretization on the constructed meshes. We demonstrate that the multi-view point regression network outperforms state-of-the-art methods with a significant improvement on challenging datasets. Comment: 8 pages; accepted by AAAI 201

    Learning Fully Dense Neural Networks for Image Semantic Segmentation

    Semantic segmentation is pixel-wise classification that retains critical spatial information. "Feature map reuse" has been commonly adopted in CNN-based approaches to take advantage of feature maps in the early layers for the later spatial reconstruction. Along this direction, we go a step further by proposing a fully dense neural network with an encoder-decoder structure that we abbreviate as FDNet. For each stage in the decoder module, feature maps of all the previous blocks are adaptively aggregated and fed forward as input. On the one hand, this reconstructs the spatial boundaries accurately. On the other hand, it makes learning more efficient because gradients backpropagate more efficiently. In addition, we propose a boundary-aware loss function that focuses more attention on the pixels near the boundary, which improves the labeling of "hard examples". We demonstrate that FDNet achieves the best performance over previous works on two benchmark datasets, PASCAL VOC 2012 and NYUDv2, when training on other datasets is not considered.

    Hybrid Instance-aware Temporal Fusion for Online Video Instance Segmentation

    Recently, transformer-based image segmentation methods have achieved notable success against previous solutions. For video domains, however, how to effectively model temporal context with the attention of object instances across frames remains an open problem. In this paper, we propose an online video instance segmentation framework with a novel instance-aware temporal fusion method. We first leverage a hybrid representation, i.e., a latent code in the global context (instance code) and CNN feature maps, to represent instance- and pixel-level features. Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames. Specifically, we encode global instance-specific information in the instance code and build up inter-frame contextual fusion with hybrid attentions between the instance codes and CNN feature maps. Inter-frame consistency between the instance codes is further enforced with order constraints. By leveraging the learned hybrid temporal consistency, we are able to directly retrieve and maintain instance identities across frames, eliminating the complicated frame-wise instance matching in prior methods. Extensive experiments have been conducted on popular VIS datasets, i.e., Youtube-VIS-19/21. Our model achieves the best performance among all online VIS methods. Notably, our model also eclipses all offline methods when using the ResNet-50 backbone. Comment: AAAI 202

    Weakly-supervised Temporal Action Localization by Uncertainty Modeling

    Weakly-supervised temporal action localization aims to learn to detect temporal intervals of action classes with only video-level labels. To this end, it is crucial to separate frames of action classes from the background frames (i.e., frames not belonging to any action classes). In this paper, we present a new perspective on background frames where they are modeled as out-of-distribution samples regarding their inconsistency. Then, background frames can be detected by estimating the probability of each frame being out-of-distribution, known as uncertainty, but it is infeasible to directly learn uncertainty without frame-level labels. To realize the uncertainty learning in the weakly-supervised setting, we leverage the multiple instance learning formulation. Moreover, we introduce a background entropy loss to better discriminate background frames by encouraging their in-distribution (action) probabilities to be uniformly distributed over all action classes. Experimental results show that our uncertainty modeling is effective at alleviating the interference of background frames and brings a large performance gain without bells and whistles. We demonstrate that our model significantly outperforms state-of-the-art methods on the benchmarks THUMOS'14 and ActivityNet (1.2 & 1.3). Our code is available at https://github.com/Pilhyeon/WTAL-Uncertainty-Modeling. Comment: Accepted by the 35th AAAI Conference on Artificial Intelligence (AAAI 2021)
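    The abstract above does not give the exact form of the background entropy loss. A minimal sketch of one common way to push a frame's action probabilities toward uniform is the cross-entropy between a uniform target and the softmax output. The function names `softmax` and `background_entropy_loss` and the example logits are assumptions made for illustration, not taken from the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of per-class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def background_entropy_loss(action_logits):
    """Cross-entropy between a uniform target and the predicted action
    distribution (an assumed formulation, not the paper's stated one).

    The value is smallest, log(C) for C classes, when the predicted
    probabilities are exactly uniform, and grows as they become peaked.
    """
    probs = softmax(action_logits)
    num_classes = len(probs)
    return -sum(math.log(p) for p in probs) / num_classes


# Hypothetical logits for one background frame over four action classes.
uniform_loss = background_entropy_loss([0.0, 0.0, 0.0, 0.0])
peaked_loss = background_entropy_loss([5.0, 0.0, 0.0, 0.0])
print(uniform_loss, peaked_loss)
```

    Under this assumed formulation, a frame whose action scores are flat incurs a lower loss than one that strongly favors a single class, which matches the behavior the abstract describes for background frames.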

    Thermal and mechanical quantitative sensory testing in Chinese patients with burning mouth syndrome: a probable neuropathic pain condition?

    BACKGROUND: To explore the hypothesis that burning mouth syndrome (BMS) is probably a neuropathic pain condition, thermal and mechanical sensory and pain thresholds were tested and compared with age- and gender-matched control participants using a standardized battery of psychophysical techniques. METHODS: Twenty-five BMS patients (men: 8, women: 17, age: 49.5 ± 11.4 years) and 19 age- and gender-matched healthy control participants were included. The cold detection threshold (CDT), warm detection threshold (WDT), cold pain threshold (CPT), heat pain threshold (HPT), mechanical detection threshold (MDT) and mechanical pain threshold (MPT), in accordance with the German Network of Neuropathic Pain guidelines, were measured at the following four sites: the dorsum of the left hand (hand), the skin at the mental foramen (chin), the tip of the tongue (tongue), and the mucosa of the lower lip (lip). Statistical analysis was performed using ANOVA with repeated measures to compare the means within and between groups. Furthermore, Z-score profiles were generated, and exploratory correlation analyses between QST and clinical variables were performed. Two-tailed tests with a significance level of 5 % were used throughout. RESULTS: CDTs (P < 0.02) were significantly lower (less sensitivity) and HPTs (P < 0.001) were significantly higher (less sensitivity) at the tongue and lip in BMS patients compared to control participants. WDT (P = 0.007) was also significantly higher at the tongue in BMS patients compared to control participants. There were no significant differences in MDT and MPT between the BMS patients and control participants at any of the four test sites. Z-scores showed that significant loss of function can be identified for CDT (Z-score = −0.9 ± 1.1) and HPT (Z-score = 1.5 ± 0.4). There were no significant correlations between QST and clinical variables (pain intensity, duration, depression scores). CONCLUSION: BMS patients had a significant loss of thermal function but not mechanical function, supporting the hypothesis that BMS may be a probable neuropathic pain condition. Further studies including, e.g., electrophysiological or imaging techniques are needed to clarify the underlying mechanisms of BMS.
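    The abstract above mentions Z-score profiles without spelling out the computation. A minimal sketch, assuming the usual standardization of a patient's threshold against the control group's mean and sample standard deviation, is shown below. The function name `z_score` and the numbers are hypothetical, and any sign adjustment the study may have applied per parameter is not modeled here.

```python
import statistics


def z_score(patient_value, control_values):
    """Standardize one patient's threshold against the control group.

    Assumes z = (patient - control mean) / control sample SD; the
    study's exact procedure is not stated in the abstract.
    """
    control_mean = statistics.mean(control_values)
    control_sd = statistics.stdev(control_values)
    return (patient_value - control_mean) / control_sd


# Hypothetical heat pain thresholds in degrees Celsius (not study data).
controls = [44.0, 46.0, 48.0]
print(z_score(48.0, controls))
```

    A value far from zero indicates that the patient's threshold differs from the control distribution; whether that counts as loss or gain of function depends on the parameter and on the sign convention used.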

    Online Video Instance Segmentation via Robust Context Fusion

    Video instance segmentation (VIS) aims at classifying, segmenting and tracking object instances in video sequences. Recent transformer-based neural networks have demonstrated their powerful capability of modeling spatio-temporal correlations for the VIS task. Relying on video- or clip-level input, they suffer from high latency and computational cost. We propose a robust context fusion network to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames. To acquire a precise and temporally consistent prediction for each frame efficiently, the key idea is to fuse effective and compact context from reference frames into the target frame. Considering the different effects of reference and target frames on the target prediction, we first summarize contextual features through importance-aware compression. A transformer encoder is adopted to fuse the compressed context. Then, we leverage an order-preserving instance embedding to convey the identity-aware information and match the identities to predicted instance masks. We demonstrate that our robust fusion network achieves the best performance among existing online VIS methods and is even better than previously published clip-level methods on the Youtube-VIS 2019 and 2021 benchmarks. In addition, visual objects often have acoustic signatures that are naturally synchronized with them in audio-bearing video recordings. By leveraging the flexibility of our context fusion network on multi-modal data, we further investigate the influence of audio on the video-dense prediction task, which has never been discussed in existing works. We build up an Audio-Visual Instance Segmentation dataset, and demonstrate that acoustic signals in in-the-wild scenarios could benefit the VIS task.