End-to-end visual grounding via region proposal networks and bilinear pooling
Phrase-based visual grounding aims to localise the object in an image that is referred to by a textual query phrase. Most existing approaches adopt a two-stage mechanism to address this problem: first, an off-the-shelf proposal generation model is used to extract region-based visual features, and then a deep model is designed to score the proposals based on the query phrase and the extracted visual features. In contrast, the authors design an end-to-end approach to the visual grounding problem in this study. They use a region proposal network to generate object proposals and the corresponding visual features simultaneously, and a multi-modal factorised bilinear pooling model to fuse the multi-modal features effectively. Two novel losses are then imposed on top of the fused multi-modal features to rank and refine the proposals, respectively. To verify the effectiveness of the proposed approach, the authors conduct experiments on three real-world visual grounding datasets, namely Flickr-30k Entities, ReferItGame and RefCOCO. The experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art approaches.
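The fusion step named in the abstract can be illustrated with a minimal sketch of multi-modal factorised bilinear pooling in its commonly described form: project both feature vectors into a shared space, multiply element-wise, sum-pool over windows of size k, then apply power and L2 normalisation. The abstract does not give the authors' dimensions, weights or training setup, so the function name `mfb_fuse`, the toy sizes and the random projection matrices below are all assumptions made for illustration, not the paper's implementation.

```python
import math
import random


def mfb_fuse(x, y, U, V, k):
    """Fuse a visual feature x (length m) with a text feature y (length n).

    U is an m x (k*o) matrix and V an n x (k*o) matrix, given as nested lists.
    Returns an o-dimensional fused vector. All names are hypothetical.
    """
    width = len(U[0])
    # Project each modality into the shared (k*o)-dimensional space.
    px = [sum(x[i] * U[i][j] for i in range(len(x))) for j in range(width)]
    py = [sum(y[i] * V[i][j] for i in range(len(y))) for j in range(width)]
    # Element-wise product of the two projections.
    joint = [a * b for a, b in zip(px, py)]
    # Sum pooling over non-overlapping windows of size k.
    pooled = [sum(joint[i:i + k]) for i in range(0, width, k)]
    # Power normalisation: signed square root.
    powered = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    # L2 normalisation (skipped if the vector is all zeros).
    norm = math.sqrt(sum(v * v for v in powered))
    return [v / norm for v in powered] if norm > 0 else powered


# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
m, n, k, o = 8, 6, 3, 4
U = [[rng.gauss(0, 1) for _ in range(k * o)] for _ in range(m)]
V = [[rng.gauss(0, 1) for _ in range(k * o)] for _ in range(n)]
x = [rng.gauss(0, 1) for _ in range(m)]  # stand-in for one region's visual feature
y = [rng.gauss(0, 1) for _ in range(n)]  # stand-in for the query phrase feature
z = mfb_fuse(x, y, U, V, k)
```

In a grounding pipeline, a fused vector like `z` would be computed for each region proposal paired with the query phrase and then passed to the ranking and refinement heads; how those heads and their losses are defined is not specified in the abstract.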