Gongsun Longzi’s “form”: Minimal word meaning
Inspired by Gongsun Longzi’s “form-naming” idea about word meaning, this paper argues that 1) the internal lexicon contains only a list of word-meaning pairs, with no additional information either as part of word meaning or as a structural level above it; 2) the meaning of a word is a minimal C-Form, the identifying conceptual meaning that individuates a concept; 3) C-Form is the interface between word meaning and concept meaning; and 4) a sentence has a minimal semantic content, consisting of the minimal meanings of the words composing it, which is propositional and truth-evaluable, and contextual elements contribute nothing to the meaning of language expressions. This paper adheres to semantic minimalism, while holding that meaning holism helps in semantic inquiry, since reflection on language meaning differs from language meaning itself.
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Multimodal transformers exhibit high capacity and flexibility to align image
and text for visual grounding. However, the existing encoder-only grounding
framework (e.g., TransVG) suffers from heavy computation due to the
self-attention operation with quadratic time complexity. To address this issue,
we present a new multimodal transformer architecture, coined as Dynamic
Multimodal DETR (Dynamic MDETR), by decoupling the whole grounding process into
encoding and decoding phases. The key observation is that there exists high
spatial redundancy in images. Thus, we devise a new dynamic multimodal
transformer decoder by exploiting this sparsity prior to speed up the visual
grounding process. Specifically, our dynamic decoder is composed of a 2D
adaptive sampling module and a text-guided decoding module. The sampling module
aims to select informative patches by predicting offsets with respect
to a reference point, while the decoding module works for extracting the
grounded object information by performing cross attention between image
features and text features. These two modules are stacked alternately to
gradually bridge the modality gap and iteratively refine the reference point of
grounded object, eventually realizing the objective of visual grounding.
Extensive experiments on five benchmarks demonstrate that our proposed Dynamic
MDETR achieves competitive trade-offs between computation and accuracy.
Notably, using only 9% of the feature points in the decoder, we can reduce the
GFLOPs of the multimodal transformer by ~44% while still achieving higher accuracy than the
encoder-only counterpart. In addition, to verify its generalization ability and
scale up our Dynamic MDETR, we build the first one-stage CLIP-empowered visual
grounding framework, and achieve state-of-the-art performance on these
benchmarks.
Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI) in October 202
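The two decoder modules described in the Dynamic MDETR abstract (sampling a few patches at offsets from a reference point, then attending between image and text features) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names `bilinear_sample`, `sample_around`, and `cross_attention` are hypothetical, the offsets are passed in by hand rather than predicted by a learned layer, and a single text vector is assumed to act as the attention query over the sampled image features.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Read a feature vector at continuous pixel coordinates (x, y).

    feature_map is a nested list of shape H x W x C (an assumption of this sketch).
    """
    h, w = len(feature_map), len(feature_map[0])
    # Clamp so that offsets pointing outside the image stay well defined.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    channels = len(feature_map[0][0])
    return [
        (1 - dx) * (1 - dy) * feature_map[y0][x0][c]
        + dx * (1 - dy) * feature_map[y0][x1][c]
        + (1 - dx) * dy * feature_map[y1][x0][c]
        + dx * dy * feature_map[y1][x1][c]
        for c in range(channels)
    ]


def sample_around(feature_map, reference, offsets):
    """Sample one feature vector per (ox, oy) offset from the reference point."""
    rx, ry = reference
    return [bilinear_sample(feature_map, rx + ox, ry + oy) for ox, oy in offsets]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [wgt / total for wgt in weights]
    dim = len(values[0])
    return [sum(wgt * v[d] for wgt, v in zip(weights, values)) for d in range(dim)]


# Toy 4 x 4 feature map whose two channels store each pixel's own (x, y).
feature_map = [[[float(x), float(y)] for x in range(4)] for y in range(4)]

# Hand-picked offsets stand in for the offsets a learned layer would predict.
sampled = sample_around(feature_map, (1.0, 1.0), [(0.5, 1.0), (-1.0, 0.0)])

# Hypothetical text query; an all-zero query gives uniform attention weights.
text_query = [0.0, 0.0]
grounded = cross_attention(text_query, sampled, sampled)
```

Here only 2 of the 16 grid positions are read, which mirrors the idea of decoding from a small subset of feature points. In the abstract's design the two steps alternate and the reference point is refined at each stage, which this sketch does not attempt.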
A Supramolecular Strategy to Assemble Multifunctional Viral Nanoparticles
Using a one-pot approach driven by the supramolecular interaction between β-cyclodextrin and adamantyl moieties, multifunctional viral nanoparticles can be facilely formulated for biomedical applications.