17 research outputs found

    Taking a look at small-scale pedestrians and occluded pedestrians

    Small-scale pedestrian detection and occluded pedestrian detection are two challenging tasks. However, most state-of-the-art methods handle only one of these tasks at a time, which leads to relatively poor performance when, in practice, both are required simultaneously. In this paper, we find that small-scale pedestrian detection and occluded pedestrian detection actually share a common problem: inaccurate localization. Solving this problem therefore improves the performance of both tasks. To this end, we pay more attention to predicted bounding boxes with poor localization precision and extract more contextual information around objects, proposing two modules: location bootstrap and semantic transition. The location bootstrap reweights the regression loss, upweighting the loss of predicted bounding boxes far from their corresponding ground truth and downweighting the loss of those near it. The semantic transition adds more contextual information and relieves the semantic inconsistency of skip-layer fusion. Since the location bootstrap is not used at test time and the semantic transition is lightweight, the proposed method adds little extra computational cost during inference. Experiments on the challenging CityPersons and Caltech datasets show that the proposed method outperforms state-of-the-art methods on small-scale and occluded pedestrians (e.g., 5.20% and 4.73% improvements on Caltech).
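    The exact weighting function is not specified in the abstract, so the following is only a minimal PyTorch sketch of the location-bootstrap idea: each matched prediction's smooth-L1 regression loss is scaled by a factor that grows as its IoU with the ground-truth box shrinks, so poorly located boxes dominate the gradient. The `(2 - IoU)` weighting is an assumed stand-in for the paper's actual scheme.

```python
# Hypothetical sketch of a "location bootstrap"-style reweighted regression loss.
# The actual weighting function is not given in the abstract; here each box's
# smooth-L1 loss is scaled by (2 - IoU), so poorly located predictions
# (low IoU with ground truth) contribute more to the gradient.
import torch
import torch.nn.functional as F


def box_iou(pred, gt):
    """Element-wise IoU between matched (x1, y1, x2, y2) boxes, both shaped (N, 4)."""
    x1 = torch.max(pred[:, 0], gt[:, 0])
    y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2])
    y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p + area_g - inter + 1e-6)


def bootstrap_regression_loss(pred_boxes, gt_boxes):
    """Smooth-L1 regression loss reweighted by localization quality."""
    iou = box_iou(pred_boxes, gt_boxes)        # (N,)
    weight = 2.0 - iou                         # far from GT -> ~2, near GT -> ~1
    per_box = F.smooth_l1_loss(pred_boxes, gt_boxes, reduction="none").sum(dim=1)
    return (weight * per_box).mean()


if __name__ == "__main__":
    pred = torch.tensor([[10.0, 10.0, 50.0, 90.0], [12.0, 11.0, 48.0, 88.0]])
    gt = torch.tensor([[12.0, 12.0, 48.0, 88.0], [12.0, 12.0, 48.0, 88.0]])
    print(bootstrap_regression_loss(pred, gt))
```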

    Context-Aware Zero-Shot Recognition

    We present a novel problem setting in zero-shot learning: zero-shot object recognition and detection in context. In contrast to traditional zero-shot learning methods, which simply infer unseen categories by transferring knowledge from objects belonging to semantically similar seen categories, we aim to identify novel objects in an image surrounded by known objects using an inter-object relation prior. Specifically, we leverage the visual context and the geometric relationships between all pairs of objects in a single image, capturing information useful for inferring unseen categories. We seamlessly integrate our context-aware zero-shot learning framework with traditional zero-shot learning techniques using a Conditional Random Field (CRF). The proposed algorithm is evaluated on both zero-shot region classification and zero-shot detection tasks. Results on the Visual Genome (VG) dataset show that our model significantly boosts performance with the additional visual context compared to traditional methods.
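    As an illustration of how per-region zero-shot scores and a pairwise relation prior can be combined into a CRF-style joint labeling, here is a toy brute-force sketch; the class names, score matrices, and weighting are invented for the example and do not come from the paper.

```python
# Toy illustration (not the paper's actual model) of combining per-region zero-shot
# scores with a pairwise co-occurrence/relation prior in a CRF-style energy, and
# picking the joint labeling by brute-force enumeration over a tiny label set.
import itertools
import numpy as np

labels = ["rider", "helmet", "surfboard"]      # hypothetical candidate unseen classes

# Unary: per-region compatibility with each candidate class
# (e.g., from semantic-embedding transfer).
unary = np.array([[0.6, 0.1, 0.3],             # region 0
                  [0.2, 0.7, 0.1]])            # region 1

# Pairwise: prior that two classes co-occur / relate in context
# (e.g., "rider" frequently appears with "helmet").
pairwise = np.array([[0.1, 0.8, 0.4],
                     [0.8, 0.1, 0.1],
                     [0.4, 0.1, 0.1]])

def joint_score(assignment, lam=0.5):
    """CRF-style score: sum of unaries plus weighted sum of pairwise priors."""
    u = sum(unary[i, c] for i, c in enumerate(assignment))
    p = sum(pairwise[ci, cj] for (i, ci), (j, cj)
            in itertools.combinations(enumerate(assignment), 2))
    return u + lam * p

# Exhaustive inference over all joint labelings of the two regions.
best = max(itertools.product(range(len(labels)), repeat=unary.shape[0]), key=joint_score)
print([labels[c] for c in best])
```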

    Contextual Object Detection with Multimodal Large Language Models

    Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks such as image captioning and question answering, but lack an essential perception ability: object detection. In this work, we address this limitation by introducing a novel research problem of contextual object detection -- understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated: the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. ContextDET involves three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. The new generate-then-detect framework enables us to detect object words within the human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation. Github: https://github.com/yuhangzang/ContextDET, Project Page: https://www.mmlab-ntu.com/project/contextdet/index.htm
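    The submodel interfaces are not given in the abstract, so the sketch below only mirrors the described structure with stand-in PyTorch modules: visual tokens from an encoder condition a decoder playing the role of the LLM, and a box head predicts coordinates only at positions flagged as object words (the generate-then-detect step). All module sizes and names are assumptions, not the authors' implementation.

```python
# Structural sketch (assumed interfaces, not the ContextDET code) of a
# generate-then-detect pipeline: visual encoder -> multimodal context decoding
# -> box prediction for generated object words.
import torch
import torch.nn as nn

class GenerateThenDetect(nn.Module):
    def __init__(self, dim=256, vocab=32000):
        super().__init__()
        self.visual_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify stand-in
        self.llm = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, 8, batch_first=True), num_layers=2)  # stand-in for a pre-trained LLM
        self.lm_head = nn.Linear(dim, vocab)
        self.box_head = nn.Linear(dim, 4)                                    # visual decoder stand-in

    def forward(self, image, text_embeds, object_token_mask):
        # 1) encode the image into a sequence of visual tokens
        feats = self.visual_encoder(image).flatten(2).transpose(1, 2)        # (B, N, dim)
        # 2) decode text conditioned on visual tokens (multimodal context decoding)
        hidden = self.llm(text_embeds, feats)                                # (B, T, dim)
        logits = self.lm_head(hidden)                                        # next-token prediction
        # 3) predict boxes only for positions flagged as object words
        boxes = self.box_head(hidden[object_token_mask]).sigmoid()           # normalized (cx, cy, w, h)
        return logits, boxes

model = GenerateThenDetect()
img = torch.randn(1, 3, 224, 224)
txt = torch.randn(1, 6, 256)                     # embedded prompt/caption tokens
mask = torch.zeros(1, 6, dtype=torch.bool)
mask[0, 3] = True                                # suppose token 3 is an object word
logits, boxes = model(img, txt, mask)
print(logits.shape, boxes.shape)
```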

    DeepScribe: Localization and Classification of Elamite Cuneiform Signs Via Deep Learning

    Twenty-five hundred years ago, the paperwork of the Achaemenid Empire was recorded on clay tablets. In 1933, archaeologists from the University of Chicago's Oriental Institute (OI) found tens of thousands of these tablets and fragments during the excavation of Persepolis. Many of these tablets have been painstakingly photographed and annotated by expert cuneiformists, and now provide a rich dataset consisting of over 5,000 annotated tablet images and 100,000 cuneiform sign bounding boxes. We leverage this dataset to develop DeepScribe, a modular computer vision pipeline capable of localizing cuneiform signs and providing suggestions for the identity of each sign. We investigate the difficulty of learning subtasks relevant to cuneiform tablet transcription on ground-truth data, finding that a RetinaNet object detector can achieve a localization mAP of 0.78 and a ResNet classifier can achieve a top-5 sign classification accuracy of 0.89. The end-to-end pipeline achieves a top-5 classification accuracy of 0.80. As part of the classification module, DeepScribe groups cuneiform signs into morphological clusters. We consider how this automatic clustering approach differs from the organization of standard, printed sign lists and what we may learn from it. These components, trained individually, are sufficient to produce a system that can analyze photos of cuneiform tablets from the Achaemenid period and provide useful transliteration suggestions to researchers. We evaluate the model's end-to-end performance on locating and classifying signs, providing a roadmap to a linguistically aware transliteration system, then consider the model's potential utility when applied to other periods of cuneiform writing. (Currently under review in the ACM JOCC.)
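    A rough sketch of the two-stage detect-then-classify idea follows, using off-the-shelf torchvision models as stand-ins; the number of sign classes, score threshold, and crop size are assumptions for illustration, not DeepScribe's released configuration.

```python
# Rough sketch (assumed, not the DeepScribe release) of the two-stage pipeline:
# a RetinaNet localizes sign-like regions, then a ResNet classifier scores each
# crop and the top-5 classes are kept as transliteration suggestions.
import torch
import torchvision
from torchvision.models.detection import retinanet_resnet50_fpn
from torchvision.transforms.functional import resized_crop

detector = retinanet_resnet50_fpn(weights=None, num_classes=2).eval()           # sign vs. background (assumed)
classifier = torchvision.models.resnet50(weights=None, num_classes=400).eval()  # sign-class count is assumed

@torch.no_grad()
def transcribe(image, score_thresh=0.5, k=5):
    """Return (box, top-k class indices) pairs for one tablet photo of shape (3, H, W) in [0, 1]."""
    detections = detector([image])[0]
    suggestions = []
    for box, score in zip(detections["boxes"], detections["scores"]):
        if score < score_thresh:
            continue
        x1, y1, x2, y2 = box.int().tolist()
        crop = resized_crop(image, y1, x1, max(y2 - y1, 1), max(x2 - x1, 1), [224, 224])
        topk = classifier(crop.unsqueeze(0)).softmax(-1).topk(k)
        suggestions.append((box.tolist(), topk.indices[0].tolist()))
    return suggestions

print(transcribe(torch.rand(3, 512, 512))[:3])
```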