687 research outputs found

    Vision and language understanding with localized evidence

    Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning and cross-modal retrieval. First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism that assigns higher weights to regions that are more relevant to the question. The visual evidence used to derive the answer can be shown by visualizing the attention weights in images. We then address the problem of localizing temporal evidence in videos. For most vision and language tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events in videos. We propose an end-to-end model for temporal activity detection, which can detect activities of arbitrary length by coordinate regression with respect to anchors and contains a proposal stage that filters out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning than a class label. This gives rise to the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long videos and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling, in which the captioning training affects the activity localization. Lastly, the task of text-to-clip video retrieval requires localizing the event specified by a query instead of detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches that embed the whole sentence before performing late feature fusion. Furthermore, we use queries to regulate the proposal network so that it generates query-related proposals. In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks, such as video question answering, in future research.
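
The question-guided spatial attention described in this abstract can be illustrated with a minimal sketch. This is not the thesis's memory network: the dot-product scoring, the function name `spatial_attention`, and the toy vectors are all assumptions chosen for illustration. The sketch only shows the general idea that regions more relevant to the question receive higher weights, and that those weights can then be inspected or visualized.

```python
import math


def spatial_attention(question_vec, region_feats):
    """Weight image regions by relevance to a question (illustrative only).

    question_vec: list of floats, a hypothetical question embedding.
    region_feats: list of equal-length float lists, one per image region.
    Returns (weights, attended), where weights sum to 1.
    """
    # Relevance score per region: dot product with the question embedding.
    # The scoring function is an assumption, not taken from the thesis.
    scores = [
        sum(q * r for q, r in zip(question_vec, region))
        for region in region_feats
    ]

    # Numerically stable softmax turns the scores into attention weights.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attended evidence: attention-weighted sum of the region features.
    dim = len(region_feats[0])
    attended = [
        sum(w * region[d] for w, region in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, attended


# Toy example: three regions with 2-D features (made-up values).
question = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
weights, attended = spatial_attention(question, regions)
```

In this toy example the first region aligns best with the question vector, so it receives the largest weight; reshaping `weights` to the image grid is one way such weights could be shown as a heat map.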

    Multilevel Language and Vision Integration for Text-to-Clip Retrieval

    We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. Second, to learn a fine-grained similarity metric for retrieval, we use visual features to modulate the processing of query sentences at the word level in a recurrent neural network. A multi-task loss is also employed by adding query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions. Comment: AAAI 201

    Infinitely many homoclinic solutions for a class of damped vibration problems

    In this paper, we consider the multiplicity of homoclinic solutions for the following damped vibration problems: $\ddot{x}(t) + B\dot{x}(t) - A(t)x(t) + H_x(t, x(t)) = 0$, where $A(t) \in (\mathbb{R}, \mathbb{R}^{N})$ is a symmetric matrix for all $t \in \mathbb{R}$, $B = [b_{ij}]$ is an antisymmetric $N \times N$ constant matrix, and $H(t, x) \in C^1(\mathbb{R} \times B_\delta, \mathbb{R})$ is only locally defined near the origin in $x$ for some $\delta > 0$. With the nonlinearity $H(t, x)$ being partially sub-quadratic at zero, we obtain infinitely many homoclinic solutions near the origin by using Clark's theorem.

    The progress of clinical research on the detection of 1,5-anhydroglucitol in diabetes and its complications

    1,5-Anhydroglucitol (1,5-AG) is sensitive to short-term glucose fluctuations and postprandial hyperglycemia, which gives it great potential for clinical application in diabetes as a nontraditional blood glucose monitoring indicator. A large number of studies have found that 1,5-AG can be used to screen for diabetes, manage diabetes, and predict the risk of diabetes complications (diabetic nephropathy, diabetic cardiovascular disease, diabetic retinopathy, diabetic pregnancy complications, diabetic peripheral neuropathy, etc.). Additionally, 1,5-AG is associated with β-cell function. As a noninvasive blood glucose monitoring indicator, salivary 1,5-AG offers greater benefit for clinical application; however, its detection methods remain imperfect. Thus, considerable research is still needed to establish an accurate and simple enzyme assay for the detection of salivary 1,5-AG. More clinical studies will also be required to confirm the normal reference range of 1,5-AG and its role in diabetes complications, in order to further improve blood glucose monitoring for diabetes.