
    End‐to‐end visual grounding via region proposal networks and bilinear pooling

    Phrase‐based visual grounding aims to localise the object in an image referred to by a textual query phrase. Most existing approaches adopt a two‐stage mechanism: first, an off‐the‐shelf proposal generation model extracts region‐based visual features, and then a deep model scores the proposals based on the query phrase and the extracted visual features. In contrast, the authors design an end‐to‐end approach to the visual grounding problem in this study. They use a region proposal network to generate object proposals and the corresponding visual features simultaneously, and a multi‐modal factorised bilinear pooling model to fuse the multi‐modal features effectively. Two novel losses are then imposed on top of the fused multi‐modal features to rank and refine the proposals, respectively. To verify the effectiveness of the proposed approach, the authors conduct experiments on three real‐world visual grounding datasets, namely Flickr‐30k Entities, ReferItGame and RefCOCO. The experimental results demonstrate the significant superiority of the proposed method over existing state‐of‐the‐art approaches.
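    The fusion step described above can be illustrated with a minimal NumPy sketch of multi‐modal factorised bilinear (MFB) pooling: both modalities are projected into a shared k·o‐dimensional space, combined by element‐wise multiplication, sum‐pooled over the k factors, then power‐ and L2‐normalised. The projection matrices, feature sizes, and hyper‐parameters below are illustrative stand‐ins (random rather than learned), not the authors' actual implementation.

    ```python
    import numpy as np

    def mfb_fuse(v, q, k=5, o=16, rng=None):
        """Sketch of multi-modal factorised bilinear (MFB) pooling.

        v: visual feature vector (e.g. from a region proposal network).
        q: textual query-phrase feature vector.
        k: number of factors; o: fused output dimension.
        The projections U, V are random here purely for illustration;
        in a trained model they are learned parameters.
        """
        rng = rng or np.random.default_rng(0)
        U = rng.standard_normal((v.size, k * o))       # visual projection (assumed)
        V = rng.standard_normal((q.size, k * o))       # textual projection (assumed)
        joint = (v @ U) * (q @ V)                      # element-wise (Hadamard) interaction
        pooled = joint.reshape(o, k).sum(axis=1)       # sum-pool over the k factors
        z = np.sign(pooled) * np.sqrt(np.abs(pooled))  # power normalisation
        return z / (np.linalg.norm(z) + 1e-12)         # L2 normalisation

    # Toy usage: an 8-d visual feature fused with a 6-d phrase feature.
    fused = mfb_fuse(np.ones(8), np.ones(6))
    print(fused.shape)  # (16,)
    ```

    In a full model, the fused vector would feed the ranking and refinement heads driven by the two losses mentioned above.
    
    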