
    SMAN : Stacked Multi-Modal Attention Network for cross-modal image-text retrieval

    This article focuses on tackling the task of cross-modal image-text retrieval, which has been an interdisciplinary topic in both the computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portions of images and texts, while local representation alignment schemes suffer from the huge computational burden of exhaustively aggregating the similarities between visual fragments and textual words. In this article, we propose a stacked multimodal attention network (SMAN) that makes use of a stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal information and multimodal information as guidance to perform multiple-step attention reasoning, so that the fine-grained correlation between image and text can be modeled. As a consequence, we are capable of discovering the semantically meaningful visual regions or words in a sentence, which contributes to measuring cross-modal similarity in a more precise manner. Moreover, we present a novel bidirectional ranking loss that enforces the distances between paired multimodal instances to be smaller. Doing so allows us to make full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that our SMAN consistently yields competitive performance compared to state-of-the-art methods.
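    To make the ranking objective concrete, below is a minimal PyTorch sketch of a bidirectional margin-based ranking loss for image-text retrieval; the hinge form, the margin value, and the function name are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: a bidirectional margin-based ranking loss over a batch of
# matched image-text pairs. The margin and hinge form are assumptions; the
# abstract only states that paired instances are pulled closer in both directions.
import torch

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, dim) L2-normalized embeddings of matched pairs."""
    scores = img_emb @ txt_emb.t()            # cosine similarity matrix (batch, batch)
    diag = scores.diag().view(-1, 1)          # similarities of the matched pairs
    # image-to-text direction: non-matching texts should score below the match
    cost_i2t = (margin + scores - diag).clamp(min=0)
    # text-to-image direction: non-matching images should score below the match
    cost_t2i = (margin + scores - diag.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_i2t.masked_fill(mask, 0).sum() + cost_t2i.masked_fill(mask, 0).sum()
```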

    Deep attentive video summarization with distribution consistency learning

    This article studies supervised video summarization by formulating it as a sequence-to-sequence learning problem, in which the input and output are the sequence of original video frames and the sequence of their predicted importance scores, respectively. Two critical issues are addressed: short-term contextual attention insufficiency and distribution inconsistency. The former lies in the failure to capture short-term contextual attention information within the video sequence itself, since existing approaches focus heavily on long-term encoder-decoder attention. The latter refers to the inconsistency between the distributions of the predicted importance-score sequence and the ground-truth sequence, which may lead to a suboptimal solution. To mitigate the first issue, we incorporate a self-attention mechanism in the encoder to highlight the important keyframes in a short-term context. The proposed mechanism, alongside the encoder-decoder attention, constitutes our deep attentive model for video summarization. For the second issue, we propose a distribution consistency learning method that employs a simple yet effective regularization loss term, which seeks a consistent distribution for the two sequences. Our final approach is dubbed Attentive and Distribution-consistent video Summarization (ADSum). Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ADSum approach over state-of-the-art approaches.
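    As a concrete illustration, here is a hedged PyTorch sketch of a per-video training loss pairing a pointwise regression term with a distribution-consistency regularizer; the choice of KL divergence over softmax-normalized sequences and the weight `lam` are assumptions, since the abstract only states that the term seeks consistent distributions for the two sequences.

```python
# Hedged sketch: regression plus a distribution-consistency regularizer for
# frame-importance sequences. The KL-over-softmax form and `lam` are
# illustrative assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F

def adsum_style_loss(pred, gt, lam=0.5):
    """pred, gt: (seq_len,) predicted and ground-truth frame importance scores."""
    mse = F.mse_loss(pred, gt)                         # pointwise regression term
    log_p = F.log_softmax(pred, dim=0)                 # predicted score distribution (log)
    q = F.softmax(gt, dim=0)                           # ground-truth score distribution
    consistency = F.kl_div(log_p, q, reduction="sum")  # distribution alignment term
    return mse + lam * consistency
```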

    Taking a look at small-scale pedestrians and occluded pedestrians

    Small-scale pedestrian detection and occluded pedestrian detection are two challenging tasks. However, most state-of-the-art methods handle only a single task at a time, which gives rise to relatively poor performance when, in practice, the two tasks are required simultaneously. In this paper, it is found that small-scale pedestrian detection and occluded pedestrian detection actually share a common problem, i.e., inaccurate localization. Therefore, solving this problem improves the performance of both tasks. To this end, we pay more attention to predicted bounding boxes with worse location precision and extract more contextual information around objects, for which two modules (i.e., location bootstrap and semantic transition) are proposed. The location bootstrap is used to reweight the regression loss: the loss of a predicted bounding box far from its ground truth is upweighted, and the loss of a predicted bounding box near its ground truth is downweighted. Additionally, the semantic transition adds more contextual information and relieves the semantic inconsistency of skip-layer fusion. Since the location bootstrap is not used at the test stage and the semantic transition is lightweight, the proposed method adds little extra computational cost during inference. Experiments on the challenging CityPersons and Caltech datasets show that the proposed method outperforms state-of-the-art methods on small-scale and occluded pedestrians (e.g., 5.20% and 4.73% improvements on Caltech).
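    The reweighting idea can be sketched as follows in PyTorch; the specific weighting function (here `1 + (1 - IoU)`) is an illustrative assumption standing in for the paper's location bootstrap rule.

```python
# Hedged sketch: reweight per-box regression loss by localization quality,
# so boxes far from their ground truth receive larger gradients. The
# 1 + (1 - IoU) weighting is an assumption, not the paper's exact rule.
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def bootstrap_regression_loss(pred_boxes, gt_boxes, reg_pred, reg_target):
    """pred_boxes, gt_boxes: (N, 4) xyxy boxes; reg_pred, reg_target: (N, 4) offsets."""
    with torch.no_grad():
        ious = box_iou(pred_boxes, gt_boxes).diag()  # localization quality per matched box
        weights = 1.0 + (1.0 - ious)                 # far from ground truth -> upweighted
    per_box = F.smooth_l1_loss(reg_pred, reg_target, reduction="none").sum(dim=1)
    return (weights * per_box).mean()
```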

    JCS-Net : joint classification and super-resolution network for small-scale pedestrian detection in surveillance images

    While Convolutional Neural Network (CNN)-based pedestrian detection methods have proven successful in various applications, detecting small-scale pedestrians in surveillance images remains challenging. The major reason is that small-scale pedestrians lack much of the detailed information that large-scale pedestrians carry. To solve this problem, we propose to utilize the relationship between large-scale pedestrians and the corresponding small-scale pedestrians to help recover the detailed information of the small-scale pedestrians, thus improving the performance of detecting them. Specifically, a unified network (called JCS-Net) is proposed for small-scale pedestrian detection, which integrates the classification task and the super-resolution task in a single framework. As a result, super-resolution and classification are tightly coupled, and the super-resolution sub-network can recover useful detailed information for the subsequent classification. Based on HOG+LUV and JCS-Net, multi-layer channel features (MCF) are constructed to train the detector. Experimental results on the Caltech pedestrian dataset and the KITTI benchmark demonstrate the effectiveness of the proposed method. To further enhance detection, multi-scale MCF based on JCS-Net is also proposed, which achieves state-of-the-art performance.
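    A minimal sketch of such a joint objective is shown below, assuming PyTorch; the layer sizes, the 2x upsampling factor, and the loss weight `lam` are illustrative assumptions, not the paper's exact JCS-Net configuration.

```python
# Hedged sketch: one shared trunk feeds a super-resolution head (reconstructs
# a high-resolution patch from a small-scale one) and a classification head
# (pedestrian vs. background). Sizes and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSRClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(                   # shared feature extractor
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.sr_head = nn.Conv2d(64, 3, 3, padding=1)  # reconstructs the HR patch
        self.cls_head = nn.Sequential(                 # pedestrian / background logits
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2))

    def forward(self, lr_patch):
        up = F.interpolate(lr_patch, scale_factor=2, mode="bilinear",
                           align_corners=False)        # 2x upsampling (assumption)
        feat = self.shared(up)
        return self.sr_head(feat), self.cls_head(feat)

def joint_loss(sr_out, hr_patch, logits, labels, lam=1.0):
    # classification loss plus super-resolution reconstruction loss
    return F.cross_entropy(logits, labels) + lam * F.mse_loss(sr_out, hr_patch)
```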

    Hierarchical shot detector

    A single-shot detector simultaneously predicts object categories and regression offsets of the default boxes. Despite its high efficiency, this structure has some inappropriate designs: (1) the classification result of the default box is improperly assigned to the regressed box during inference, and (2) regressing only once is not sufficient for accurate object detection. To solve the first problem, a novel reg-offset-cls (ROC) module is proposed. It contains three hierarchical steps: box regression, feature sampling location prediction, and classification of the regressed box with the features at the offset locations. To further solve the second problem, a hierarchical shot detector (HSD) is proposed, which stacks two ROC modules and one feature-enhanced module. The second ROC module takes the regressed boxes and the feature sampling locations from the first ROC module as its inputs. Meanwhile, the feature-enhanced module injected between the two ROC modules extracts local and non-local context. Experiments on the MS COCO and PASCAL VOC datasets demonstrate the superiority of the proposed HSD. Without bells and whistles, HSD outperforms all one-stage methods at real-time speed.
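    The three-step ROC idea can be sketched as follows, assuming PyTorch and torchvision's deformable convolution as the mechanism for sampling features at predicted offset locations; the channel sizes and the regression-to-offset mapping are illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch of a reg-offset-cls (ROC) step: regress boxes, derive feature
# sampling offsets from the regression, then classify using features sampled
# at those offsets via a deformable convolution. Sizes are assumptions.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ROCModule(nn.Module):
    def __init__(self, in_ch=256, num_classes=81, num_anchors=1):
        super().__init__()
        self.reg = nn.Conv2d(in_ch, num_anchors * 4, 3, padding=1)  # step 1: box regression
        self.to_offset = nn.Conv2d(num_anchors * 4, 2 * 3 * 3, 1)   # step 2: sampling offsets
        self.cls_weight = nn.Parameter(                             # step 3: classification
            torch.randn(num_anchors * num_classes, in_ch, 3, 3) * 0.01)

    def forward(self, feat):
        reg = self.reg(feat)              # regression offsets for the default boxes
        offsets = self.to_offset(reg)     # where to sample features for classification
        cls = deform_conv2d(feat, offsets, self.cls_weight, padding=1)
        return reg, cls                   # regressed boxes feed the next ROC module
```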

    A Case Report and Literature Review of Small Intestinal Metastasis of Large Cell Lung Cancer
