5 research outputs found

    Multi-modal classifiers for open-vocabulary object detection

    Get PDF
    The goal of this paper is open-vocabulary object detection (OVOD) – building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways for specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) our multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.
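    To make the text-classifier idea concrete, here is a minimal sketch, not the authors' code: several LLM-generated descriptions of a novel class are embedded and averaged into a single classifier vector that scores detector region proposals. The embed_text encoder is a stand-in (random unit vectors) for a joint text/image model such as CLIP, and the okapi descriptions are invented for illustration.

```python
# Minimal sketch (not the authors' code): building a text-based classifier for a
# novel category from several LLM-generated descriptions. Any joint text/image
# embedding model (e.g. CLIP) could play the role of `embed_text`; here it is a
# stand-in that returns random unit vectors so the sketch runs on its own.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # embedding dimension (assumed)

def embed_text(description: str) -> np.ndarray:
    """Stand-in for a frozen text encoder; returns an L2-normalised vector."""
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)

def text_classifier(descriptions: list[str]) -> np.ndarray:
    """Average the embeddings of several descriptions, then re-normalise."""
    embs = np.stack([embed_text(d) for d in descriptions])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Hypothetical LLM-generated descriptions for one novel class.
okapi_descriptions = [
    "a large mammal with zebra-striped legs and a dark brown body",
    "an animal resembling a giraffe with a much shorter neck",
]
w_okapi = text_classifier(okapi_descriptions)

# A detected region is classified by cosine similarity against per-class vectors.
region_feature = rng.normal(size=DIM)
region_feature /= np.linalg.norm(region_feature)
score = float(region_feature @ w_okapi)
print(f"okapi score for this region: {score:.3f}")
```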

    Multi-Modal Classifiers for Open-Vocabulary Object Detection

    Full text link
    The goal of this paper is open-vocabulary object detection (OVOD) – building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways for specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) our multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector. (ICML 2023; project page: https://www.robots.ox.ac.uk/vgg/research/mm-ovod)
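    The other two ingredients can be sketched under stated simplifications, again not the paper's implementation: a vision-based classifier built by aggregating any number of exemplar embeddings (the paper uses a learned aggregator; mean pooling stands in here), and a simple convex-combination fusion into a multi-modal classifier. The weight alpha and the dimensionality are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation): aggregating any
# number of image-exemplar embeddings into a vision-based classifier and fusing
# it with a text-based classifier. The paper uses a learned transformer
# aggregator; mean pooling is used here only to keep the sketch self-contained.
import numpy as np

def l2norm(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def vision_classifier(exemplar_embeddings: np.ndarray) -> np.ndarray:
    """Aggregate K exemplar embeddings (K x D) into a single class vector."""
    return l2norm(exemplar_embeddings.mean(axis=0))

def fuse(text_vec: np.ndarray, vision_vec: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Simple convex combination of the two modalities (alpha is a free choice)."""
    return l2norm(alpha * text_vec + (1.0 - alpha) * vision_vec)

rng = np.random.default_rng(1)
D = 512
text_vec = l2norm(rng.normal(size=D))   # e.g. built from LLM descriptions as above
exemplars = rng.normal(size=(5, D))     # 5 image exemplars of the novel class
multi_modal_vec = fuse(text_vec, vision_classifier(exemplars))
print(multi_modal_vec.shape)            # (512,)
```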

    RSS-Net: Weakly-Supervised Multi-Class Semantic Segmentation with FMCW Radar

    Full text link
    This paper presents an efficient annotation procedure and an application thereof to end-to-end, rich semantic segmentation of the sensed environment using FMCW scanning radar. We advocate radar over the traditional sensors used for this task as it operates at longer ranges and is substantially more robust to adverse weather and illumination conditions. We avoid laborious manual labelling by exploiting the largest radar-focused urban autonomy dataset collected to date, correlating radar scans with RGB cameras and LiDAR sensors, for which semantic segmentation is an already consolidated procedure. The training procedure leverages a state-of-the-art natural image segmentation system which is publicly available and, as such, in contrast to previous approaches, allows for the production of copious labels for the radar stream by incorporating four camera and two LiDAR streams. Additionally, the losses are computed taking into account labels out to the radar sensor horizon by accumulating LiDAR returns along a pose-chain ahead of and behind the current vehicle position. Finally, we present the network with multi-channel radar scan inputs in order to deal with ephemeral and dynamic scene objects. (Submitted to IEEE Intelligent Vehicles Symposium (IV) 202)
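    A hedged sketch of the label-generation geometry described above, not the released pipeline: labelled LiDAR returns gathered along a pose-chain are transformed into the frame of one radar scan and rasterised into a Cartesian label grid, so that labels can extend towards the radar's sensing horizon. The pose conventions, grid size and last-write-wins rasterisation are assumptions made to keep the example short.

```python
# Minimal sketch (assumed geometry, not the released pipeline): accumulating
# labelled LiDAR returns along a pose-chain and rasterising them into the frame
# of a single radar scan, extending labels towards the radar's sensing horizon.
import numpy as np

def accumulate_labels(scans, poses, radar_pose, grid_m=100.0, cell_m=0.5):
    """scans: list of (N_i x 3) arrays [x, y, label]; poses: list of 3x3 SE(2)
    matrices mapping each scan frame to the world; radar_pose: 3x3 SE(2) of the
    target radar scan. Returns a label grid centred on the radar (0 = unlabelled)."""
    size = int(2 * grid_m / cell_m)
    grid = np.zeros((size, size), dtype=np.int32)
    world_to_radar = np.linalg.inv(radar_pose)
    for pts, pose in zip(scans, poses):
        xy1 = np.column_stack([pts[:, :2], np.ones(len(pts))])
        in_radar = (world_to_radar @ pose @ xy1.T).T[:, :2]  # points in radar frame
        ij = ((in_radar + grid_m) / cell_m).astype(int)
        ok = (ij >= 0).all(axis=1) & (ij < size).all(axis=1)
        grid[ij[ok, 1], ij[ok, 0]] = pts[ok, 2].astype(int)  # last write wins
    return grid

# Toy example: two labelled LiDAR scans taken at identity poses.
rng = np.random.default_rng(2)
scan = np.column_stack([rng.uniform(-40, 40, (200, 2)), rng.integers(1, 5, 200)])
labels = accumulate_labels([scan, scan], [np.eye(3), np.eye(3)], np.eye(3))
print(labels.shape, int((labels > 0).sum()), "labelled cells")
```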

    The Optical and RF Analysis of a Second Generation Integrated Optical Cancellation System

    No full text
    With ubiquitous computing and the 'internet of things' increasingly becoming modern-day realities, the demand for wireless communication links is increasing rapidly. With this growing density of wireless links comes the issue of managing the physical communication channel, the RF spectrum. The balanced optical cancellation system (BOCS) analyzed in this thesis serves as a potential solution to drastically increase the spectral efficiency of modern wireless communications. The BOCS acts to cancel self-interference in transceiver systems by utilizing knowledge of the nature of the unwanted self-interference signal. The system presented here was designed in the Lightwave Communications Laboratory and produced by the Fraunhofer Heinrich Hertz Institute through the JePPIX MPW services. The BOCS was fabricated in the HHI06 fabrication run at JePPIX MPW alongside another cancellation system called MTAP. This report describes the problem of self-interference and how cancellation can be performed. A brief outline of the overarching field of microwave photonics (MWP) is given to introduce the reader to the category of system the BOCS falls into. A review of previous work regarding photonic cancellation systems is provided to give context to the reader. A detailed overview of the system, with an explanation of the key benefits of the BOCS, is followed by the definition of key performance metrics and the derivation of their analytical expressions using system parameters. An analysis of the devices used in the BOCS is given through a series of experiments. This thesis concludes with an analysis of the overall performance of the BOCS, using experimental results and simulations.
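    For context on the kind of performance metric involved, the sketch below uses a textbook narrowband relation, not an expression taken from the thesis: the cancellation depth achievable against a single-tone interferer as a function of the amplitude and phase mismatch between the interference and its cancelling copy.

```python
# Minimal sketch (a textbook narrowband relation, not an expression from the
# thesis): how the achievable depth of an analogue self-interference canceller
# degrades with amplitude and phase mismatch between the interference and its
# cancelling copy.
import numpy as np

def cancellation_depth_db(amplitude_ratio: float, phase_error_rad: float) -> float:
    """Suppression of a single-tone interferer, in dB (larger is better)."""
    residual = 1.0 + amplitude_ratio**2 - 2.0 * amplitude_ratio * np.cos(phase_error_rad)
    return -10.0 * np.log10(residual)

# e.g. a 0.5 dB amplitude error and 2 degrees of phase error
a = 10 ** (-0.5 / 20)  # convert the dB mismatch to a linear amplitude ratio
print(f"{cancellation_depth_db(a, np.deg2rad(2.0)):.1f} dB of cancellation")
```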

    Developing object perception in the low data regime

    No full text
    Objects are central to human perception and understanding of the world. There is an abundance of images available on the internet covering the vast number of objects in the world; however, labelling these images exhaustively to cover all objects is infeasible, limiting the utility of systems requiring strong supervision through large labelled datasets. To address this issue, this thesis develops methods that enable novel objects to be learnt with limited use of manually labelled data. First, we consider few-shot object detection: learning to expand the set of objects which can be detected using only a few manually labelled examples. We show that the few examples available for novel categories can be used to accurately pseudo-label existing data, yielding a large number of novel pseudo-annotations for further detector training. Second, we address the more challenging problem of open-vocabulary object detection, which requires learning to detect novel object categories with no annotated data. We demonstrate the utility of detailed natural language descriptions to provide additional visual information for novel object detection. Moreover, we show that visual exemplars can be aggregated and combined with object descriptions to yield multi-modal classifiers for superior novel object detection. Finally, we consider the problem of object hallucinations in large vision-language models. We propose an automatic method to evaluate the presence of object hallucinations in detailed natural language descriptions of images generated by large vision-language models. We make use of language models and labelled detection data to automatically and robustly analyse the presence of object hallucinations in generated descriptions.
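    The sketch below illustrates the hallucination check in a deliberately simplified form; it is not the thesis's LLM-based pipeline. Object names mentioned in a generated description are matched against the image's ground-truth detection labels, and any mentioned-but-absent object is counted as a hallucination. The vocabulary, description and labels are invented for illustration.

```python
# Minimal sketch (a simplified stand-in for the thesis's LLM-based pipeline):
# flag object hallucinations by checking which object names mentioned in a
# generated image description are absent from the image's ground-truth labels.
OBJECT_VOCAB = {"dog", "cat", "frisbee", "car", "bicycle", "person"}  # assumed vocabulary

def hallucinated_objects(description: str, ground_truth: set[str]) -> set[str]:
    """Objects from the vocabulary that are mentioned but not actually present."""
    mentioned = {w.strip(".,").lower() for w in description.split()} & OBJECT_VOCAB
    return mentioned - ground_truth

description = "A dog leaps to catch a frisbee while a cat watches from a car."
ground_truth = {"dog", "frisbee"}
print(hallucinated_objects(description, ground_truth))  # {'cat', 'car'}
```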