
    Neural Baby Talk

    We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot-filling approaches (which are generally better grounded in images) with modern neural captioning approaches (which are generally more natural sounding and accurate). Our approach first generates a sentence 'template' with slot locations explicitly tied to specific image regions. These slots are then filled in by visual concepts identified in the regions by object detectors. The entire architecture (sentence template generation and slot filling with object detectors) is end-to-end differentiable. We verify the effectiveness of our proposed model on different image captioning tasks. On standard image captioning and novel object captioning, our model reaches state-of-the-art on both COCO and Flickr30k datasets. We also demonstrate that our model has unique advantages when the train and test distributions of scene compositions -- and hence language priors of associated captions -- are different. Code has been made available at: https://github.com/jiasenlu/NeuralBabyTalk
    Comment: 12 pages, 7 figures, CVPR 2018
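
The template-then-fill idea can be illustrated with a minimal sketch: a language model first emits a caption template whose slot tokens point at detected regions, and the slots are then replaced by the detector's labels. The function and token format below are illustrative assumptions, not the paper's actual implementation (where both stages are neural and trained end-to-end).

```python
# Minimal sketch of slot filling over a generated caption template.
# The "<region_k>" token format and the hard-coded outputs are hypothetical stand-ins.

def fill_template(template_tokens, region_labels):
    """Replace slot tokens such as '<region_0>' with the detector's label for that region."""
    caption = []
    for tok in template_tokens:
        if tok.startswith("<region_") and tok.endswith(">"):
            region_id = int(tok[len("<region_"):-1])
            caption.append(region_labels[region_id])  # visual word grounded in an image region
        else:
            caption.append(tok)                       # ordinary textual word from the template
    return " ".join(caption)

# Hypothetical template-generator and detector outputs.
template = ["a", "<region_0>", "is", "sitting", "on", "a", "<region_1>"]
labels = {0: "dog", 1: "couch"}
print(fill_template(template, labels))  # -> "a dog is sitting on a couch"
```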

    AI Modeling Approaches for Detecting, Characterizing, and Predicting Brief Daily Behaviors such as Toothbrushing using Wrist Trackers.

    Continuous advancements in wrist-worn sensors have opened up exciting possibilities for real-time monitoring of individuals' daily behaviors, with the aim of promoting healthier, more organized, and efficient lives. Understanding the duration of specific daily behaviors has become of interest to individuals seeking to optimize their lifestyles. However, there is still a research gap when it comes to monitoring short-duration, health-relevant behaviors with wrist-worn inertial sensors in natural environments. These behaviors often involve repetitive micro-events that last only a few seconds or even microseconds, making their detection and analysis challenging. Furthermore, these micro-events are often surrounded by non-repetitive boundary events, further complicating identification. Effective detection and timely intervention during these short-duration behaviors are crucial for designing personalized interventions that can positively impact individuals' lifestyles. To address these challenges, this dissertation introduces three models -- mORAL, mTeeth, and Brushing Prompt -- that leverage wrist-worn inertial sensors to accurately infer short-duration behaviors, identify repetitive micro-behaviors, and provide timely interventions related to oral hygiene.
    The dissertation's contributions extend beyond the development of these models. First, precise and detailed labels for each brief, micro-repetitive behavior were acquired to train and validate the models effectively; this involved meticulously marking the exact start and end times of each event, including any intervening pauses, at second-level granularity, in a comprehensive scientific research study that collected data from participants in their free-living natural environments. Second, a solution is proposed to address sensor placement variability: because the sensor can sit at different positions within a wristband and the wristband itself can be worn at different positions on the wrist, the model must accurately determine the relative configuration of the inertial sensor with respect to the wrist in order to infer the orientation of the hand. Additionally, time synchronization errors between the sensor data and the associated video, despite both being collected on the same smartphone, are addressed through an algorithm that tightly synchronizes the two data sources without relying on an explicit anchor event. Furthermore, an event-based approach is introduced to identify candidate segments of data for applying machine learning models, outperforming the traditional fixed-window approach; these candidate segments enable reliable detection of brief daily behaviors in a computationally efficient manner suitable for real-time use. The dissertation also presents a computationally lightweight method for identifying anchor events using wrist-worn inertial sensors; anchor events play a vital role in assigning unambiguous labels in fixed-length window-based data segmentation and in demarcating transitions between micro-repetitive events. Significant features are extracted, and explainable machine learning models are developed to ensure reliable detection of brief daily and micro-repetitive behaviors. Lastly, the dissertation addresses the crucial question of the opportune moment for intervention during brief daily behaviors using wrist-worn inertial sensors, so that users can receive timely, personalized interventions to enhance their performance and improve their lifestyles.
    Overall, this dissertation makes substantial contributions to real-time monitoring of short-duration behaviors. It tackles various technical challenges, provides innovative solutions, and demonstrates the potential of wrist-worn sensors to facilitate effective interventions and promote healthier behaviors. By advancing our understanding of these behaviors and optimizing intervention strategies, this research has the potential to significantly impact individuals' well-being and contribute to the development of personalized health solutions.
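
The event-based segmentation idea mentioned above can be sketched as follows: instead of slicing the signal into fixed-length windows, contiguous spans of elevated wrist motion are treated as candidate segments for the downstream classifier. The threshold, sampling rate, and smoothing window here are illustrative assumptions, not the dissertation's actual parameters or algorithm.

```python
# Minimal sketch of event-based candidate segmentation of wrist accelerometer data.
# All numeric parameters are placeholders chosen for illustration.
import numpy as np

def candidate_segments(accel, fs=50, smooth_s=0.5, thresh=1.2, min_dur_s=2.0):
    """accel: (N, 3) accelerometer samples in g; returns a list of (start_idx, end_idx) spans."""
    mag = np.linalg.norm(accel, axis=1)                     # per-sample motion intensity
    win = max(1, int(smooth_s * fs))
    smoothed = np.convolve(mag, np.ones(win) / win, mode="same")
    active = smoothed > thresh                              # mask of elevated motion
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                                       # a candidate segment opens
        elif not is_active and start is not None:
            if (i - start) / fs >= min_dur_s:               # keep only sufficiently long spans
                segments.append((start, i))
            start = None
    if start is not None and (len(active) - start) / fs >= min_dur_s:
        segments.append((start, len(active)))               # close a segment running to the end
    return segments
```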

    Improving Domain Generalization by Learning without Forgetting: Application in Retail Checkout

    Designing an automatic checkout system for retail stores that operates at human-level accuracy is challenging because products can look alike and appear in various poses. This paper addresses the problem with a two-stage pipeline: the first stage detects class-agnostic items, and the second classifies product categories. We also track objects across video frames to avoid duplicate counting. One major challenge is the domain gap, because the models are trained on synthetic data but tested on real images. To reduce this gap, we adopt domain generalization methods for the first-stage detector. In addition, a model ensemble is used to enhance the robustness of the second-stage classifier. The method is evaluated on the AI City Challenge 2022 -- Track 4 and achieves an F1 score of 40% on the test A set. Code is released at https://github.com/cybercore-co-ltd/aicity22-track4
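
One part of such a pipeline that is easy to make concrete is the de-duplicated counting: each tracked object is counted once, with its category decided by a majority vote over the per-frame predictions of the second-stage classifier. The data layout and function below are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of per-track majority voting to avoid duplicate counts across frames.
# The upstream detector, classifier, and tracker are assumed and not shown.
from collections import Counter, defaultdict

def count_products(frames):
    """frames: list (one entry per video frame) of lists of (track_id, predicted_class)."""
    votes = defaultdict(Counter)
    for detections in frames:
        for track_id, cls in detections:
            votes[track_id][cls] += 1                        # accumulate classifier votes per track
    counts = Counter()
    for cls_votes in votes.values():
        counts[cls_votes.most_common(1)[0][0]] += 1          # one count per tracked object
    return counts

# Hypothetical tracker/classifier output for three frames.
frames = [[(1, "soda"), (2, "chips")], [(1, "soda"), (2, "chips")], [(1, "soda")]]
print(count_products(frames))  # -> Counter({'soda': 1, 'chips': 1})
```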

    Segmenting Known Objects and Unseen Unknowns without Prior Knowledge

    Panoptic segmentation methods assign a known class to each pixel of the input. Even for state-of-the-art approaches, this inevitably enforces decisions that systematically lead to wrong predictions for objects outside the training categories. However, robustness against out-of-distribution samples and corner cases is crucial in safety-critical settings to avoid dangerous consequences. Since real-world datasets cannot contain enough data points to adequately sample the long tail of the underlying distribution, models must be able to deal with unseen and unknown scenarios as well. Previous methods targeted this by re-identifying already-seen unlabeled objects. In this work, we propose the necessary step to extend segmentation with a new setting, which we term holistic segmentation. Holistic segmentation aims to identify and separate objects of unseen, unknown categories into instances, without any prior knowledge about them, while performing panoptic segmentation of known classes. We tackle this new problem with U3HS, which finds unknowns as highly uncertain regions and clusters their corresponding instance-aware embeddings into individual objects. By doing so, for the first time in panoptic segmentation with unknown objects, our U3HS is trained without unknown categories, reducing assumptions and leaving the settings as unconstrained as in real-life scenarios. Extensive experiments on public data from MS COCO, Cityscapes, and Lost&Found demonstrate the effectiveness of U3HS for this new, challenging, and assumption-free setting called holistic segmentation.
    Comment: Accepted at ICCV 2023
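
The core mechanism -- treating highly uncertain pixels as potential unknowns and clustering their instance-aware embeddings into objects -- can be sketched in a few lines. The entropy threshold, the use of DBSCAN, and its parameters are illustrative assumptions, not U3HS's actual design choices.

```python
# Minimal sketch: uncertain pixels -> cluster their embeddings into unknown instances.
# Requires numpy and scikit-learn; all thresholds are placeholders.
import numpy as np
from sklearn.cluster import DBSCAN

def unknown_instances(class_probs, embeddings, entropy_thresh=1.0, eps=0.5, min_samples=20):
    """class_probs: (H, W, C) softmax scores; embeddings: (H, W, D) per-pixel embeddings.
    Returns an (H, W) map with -1 for known/noise pixels and an instance id otherwise."""
    entropy = -np.sum(class_probs * np.log(class_probs + 1e-8), axis=-1)  # per-pixel uncertainty
    unknown_mask = entropy > entropy_thresh                               # candidate unknown pixels
    instance_map = np.full(class_probs.shape[:2], -1, dtype=int)
    coords = np.argwhere(unknown_mask)
    if len(coords) == 0:
        return instance_map
    feats = embeddings[unknown_mask]                                      # (M, D) embeddings to cluster
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)  # -1 marks noise
    instance_map[coords[:, 0], coords[:, 1]] = labels                     # one id per unknown object
    return instance_map
```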

    Grounded Language-Image Pre-training

    This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images from COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing the prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP
    Comment: CVPR 2022; updated visualizations; fixed hyper-parameters in Appendix C.
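
The region-phrase alignment at the core of this style of grounding-based detection can be sketched as scoring every detected region against embeddings of the phrases in a text prompt and keeping the best match above a threshold. The cosine-similarity scoring, the threshold, and the function names below are simplifying assumptions; GLIP itself uses deep, jointly trained image and text encoders with word-level alignment.

```python
# Minimal sketch of zero-shot detection as region-to-phrase matching.
# region_feats and phrase_embs would come from image and text encoders (not shown).
import numpy as np

def ground_regions(region_feats, phrase_embs, phrases, score_thresh=0.5):
    """region_feats: (R, D); phrase_embs: (P, D); phrases: list of P strings.
    Returns one (phrase_or_None, score) per region."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    p = phrase_embs / np.linalg.norm(phrase_embs, axis=1, keepdims=True)
    scores = r @ p.T                                     # (R, P) region-phrase alignment scores
    results = []
    for i, j in enumerate(scores.argmax(axis=1)):
        best = scores[i, j]
        results.append((phrases[j] if best > score_thresh else None, float(best)))
    return results
```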