Tied Block Convolution: Leaner and Better CNNs with Shared Thinner Filters
Convolution is the main building block of convolutional neural networks
(CNN). We observe that an optimized CNN often has highly correlated filters as
the number of channels increases with depth, reducing the expressive power of
feature representations. We propose Tied Block Convolution (TBC) that shares
the same thinner filters over equal blocks of channels and produces multiple
responses with a single filter. The concept of TBC can also be extended to
group convolution and fully connected layers, and can be applied to various
backbone networks and attention modules. Our extensive experimentation on
classification, detection, instance segmentation, and attention demonstrates
TBC's significant across-the-board gain over standard convolution and group
convolution. The proposed TiedSE attention module can even use 64 times fewer
parameters than the SE module to achieve comparable performance. In particular,
standard CNNs often fail to accurately aggregate information in the presence of
occlusion and result in multiple redundant partial object proposals. By sharing
filters across channels, TBC reduces correlation and can effectively handle
highly overlapping instances. TBC increases the average precision for object
detection on MS-COCO by 6% when the occlusion ratio is 80%. Our code will be
released.

Comment: 13 pages
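The core idea of TBC, splitting the channels into B equal blocks and reusing one thinner filter bank across all of them, can be sketched as follows. This is a minimal NumPy illustration restricted to 1x1 kernels, not the authors' implementation; function and variable names are ours:

```python
import numpy as np

def tied_block_conv1x1(x, w, B):
    """Tied Block Convolution with 1x1 kernels (illustrative sketch).

    x: input feature map of shape (C_in, H, W)
    w: one shared "thinner" filter bank of shape (C_out // B, C_in // B),
       reused over all B channel blocks
    B: number of equal channel blocks sharing the same filters
    """
    blocks = np.split(x, B, axis=0)                     # B blocks of C_in/B channels
    # Apply the SAME filter bank to every block (this is the tying).
    outs = [np.einsum('oc,chw->ohw', w, blk) for blk in blocks]
    return np.concatenate(outs, axis=0)                 # shape (C_out, H, W)
```

A standard 1x1 convolution would need C_out * C_in weights; the tied version needs only (C_out/B) * (C_in/B), a B^2 reduction in parameters while still producing B responses per filter.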
Zero-shot Building Attribute Extraction from Large-Scale Vision and Language Models
Existing building recognition methods, exemplified by BRAILS, utilize
supervised learning to extract information from satellite and street-view
images for classification and segmentation. However, each task module requires
human-annotated data, hindering the scalability and robustness to regional
variations and annotation imbalances. In response, we propose a new zero-shot
workflow for building attribute extraction that utilizes large-scale vision and
language models to mitigate reliance on external annotations. The proposed
workflow contains two key components: image-level captioning and segment-level
captioning for the building images based on the vocabularies pertinent to
structural and civil engineering. These two components generate descriptive
captions by computing feature representations of the image and the
vocabularies and performing a semantic match between the visual and textual
representations. Consequently, our framework offers a promising avenue to
enhance AI-driven captioning for building attribute extraction in the
structural and civil engineering domains, ultimately reducing reliance on human
annotations while bolstering performance and adaptability.

Comment: Accepted to WACV 2024, Project Page: https://sites.google.com/view/zobae/hom