53,886 research outputs found
ForecastTKGQuestions: A Benchmark for Temporal Question Answering and Forecasting over Temporal Knowledge Graphs
Question answering over temporal knowledge graphs (TKGQA) has recently found
increasing interest. TKGQA requires temporal reasoning techniques to extract
the relevant information from temporal knowledge bases. The only existing TKGQA
dataset, i.e., CronQuestions, consists of temporal questions based on the facts
from a fixed time period, where a temporal knowledge graph (TKG) spanning the
same period can be fully used for answer inference, allowing the TKGQA models
to use even the future knowledge to answer the questions based on the past
facts. In real-world scenarios, however, it is also common that given the
knowledge until now, we wish the TKGQA systems to answer the questions asking
about the future. As humans constantly seek plans for the future, building
TKGQA systems for answering such forecasting questions is important.
Nevertheless, this has still been unexplored in previous research. In this
paper, we propose a novel task: forecasting question answering over temporal
knowledge graphs. We also propose a large-scale TKGQA benchmark dataset, i.e.,
ForecastTKGQuestions, for this task. It includes three types of questions,
i.e., entity prediction, yes-no, and fact reasoning questions. For every
forecasting question in our dataset, QA models can only have access to the TKG
information before the timestamp annotated in the given question for answer
inference. We find that the state-of-the-art TKGQA methods perform poorly on
forecasting questions, and they are unable to answer yes-no questions and fact
reasoning questions. To this end, we propose ForecastTKGQA, a TKGQA model that
employs a TKG forecasting module for future inference, to answer all three
types of questions. Experimental results show that ForecastTKGQA outperforms
recent TKGQA methods on the entity prediction questions, and it also shows
great effectiveness in answering the other two types of questions.Comment: Accepted to ISWC 202
Vision and language understanding with localized evidence
Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning and cross-modal retrieval.
First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism which assigns higher weights to regions that are more relevant to the question. The visual evidence used to derive the answer can be shown by visualizing the attention weights in images. We then address the problem of localizing temporal evidence in videos. For most language/vision tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events in videos. We propose an end-to-end model for temporal activity detection, which can detect arbitrary length activities by coordinate regression with respect to anchors and contains a proposal stage to filter out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning compared to a class label. This derives the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long video and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling in which the captioning training affects the activity localization. Lastly, the task of text-to-clip video retrieval requires one to localize the specified query instead of detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches which embed the whole sentence before performing late feature fusion. Furthermore, we use queries to regulate the proposal network to generate query related proposals.
In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks such as video question answering in future research
Joint Video and Text Parsing for Understanding Events and Answering Queries
We propose a framework for parsing video and text jointly for understanding
events and answering user queries. Our framework produces a parse graph that
represents the compositional structures of spatial information (objects and
scenes), temporal information (actions and events) and causal information
(causalities between events and fluents) in the video and text. The knowledge
representation of our framework is based on a spatial-temporal-causal And-Or
graph (S/T/C-AOG), which jointly models possible hierarchical compositions of
objects, scenes and events as well as their interactions and mutual contexts,
and specifies the prior probabilistic distribution of the parse graphs. We
present a probabilistic generative model for joint parsing that captures the
relations between the input video/text, their corresponding parse graphs and
the joint parse graph. Based on the probabilistic model, we propose a joint
parsing system consisting of three modules: video parsing, text parsing and
joint inference. Video parsing and text parsing produce two parse graphs from
the input video and text respectively. The joint inference module produces a
joint parse graph by performing matching, deduction and revision on the video
and text parse graphs. The proposed framework has the following objectives:
Firstly, we aim at deep semantic parsing of video and text that goes beyond the
traditional bag-of-words approaches; Secondly, we perform parsing and reasoning
across the spatial, temporal and causal dimensions based on the joint S/T/C-AOG
representation; Thirdly, we show that deep joint parsing facilitates subsequent
applications such as generating narrative text descriptions and answering
queries in the forms of who, what, when, where and why. We empirically
evaluated our system based on comparison against ground-truth as well as
accuracy of query answering and obtained satisfactory results
- …