Deep Learning for Semantic Video Understanding
The field of computer vision has long strived to extract understanding from images and video sequences. The recent flood of video data, along with massive increases in computing power, has provided the perfect environment for advanced research on extracting intelligence from video data. Video data is ubiquitous, occurring in numerous everyday activities such as surveillance, traffic, movies, and sports. This massive amount of video needs to be analyzed and processed efficiently to extract semantic features for video understanding. Such capabilities could benefit surveillance, video analytics, and visually impaired people. While watching a long video, humans have the uncanny ability to bypass unnecessary information and concentrate on the important events. These key events can be used as a higher-level description, or summary, of a long video. Inspired by the human visual cortex, this research affords such abilities in computers using neural networks. Useful or interesting events are first extracted from a video, and deep learning methodologies are then used to generate natural language summaries for each video sequence. Previous approaches to video description have either been domain specific or have used a template-based approach that slots detected objects, verbs, and actions into a grammatically correct sentence. This work exploits temporal contextual information for sentence generation while working on wide-domain datasets. Current state-of-the-art video description methodologies are well suited to small video clips, whereas this research can also be applied to long video sequences.
This work proposes methods to generate visual summaries of long videos and, in addition, techniques to annotate and generate textual summaries of the videos using recurrent networks. End-to-end video summarization depends heavily on abstractive summarization of video descriptions. State-of-the-art joint neural language and attention models have been used to generate textual summaries. Interesting segments of a long video are extracted based on image quality as well as cinematographic and consumer preferences. This novel approach will be a stepping stone for a variety of innovative applications such as video retrieval, automatic summarization for visually impaired persons, automatic movie review generation, and video question answering systems.
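The abstract describes a two-stage pipeline: select interesting segments of a long video, then caption them with a recurrent network. The following is a minimal sketch of that idea, not the authors' implementation; the scoring criterion, feature dimensions, and model sizes are assumptions, and pre-extracted per-frame CNN features are assumed as input.

```python
# Hedged sketch: key-segment selection followed by a recurrent caption decoder.
import torch
import torch.nn as nn

class SegmentCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # temporal context over frames
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)    # language model conditioned on the segment
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim); captions: (B, L) token ids
        _, h = self.encoder(frame_feats)       # (1, B, hidden) summary state of the video segment
        emb = self.embed(captions)             # (B, L, hidden)
        dec_out, _ = self.decoder(emb, h)      # decode words given the encoded segment
        return self.out(dec_out)               # (B, L, vocab_size) next-word logits

def select_key_segments(scores, window=16, top_k=3):
    """Pick the top-k windows by mean interest score (a crude stand-in for the
    image-quality / cinematographic criteria described in the abstract)."""
    means = scores.unfold(0, window, window).mean(dim=1)   # one score per window
    best = torch.topk(means, k=min(top_k, means.numel())).indices
    return [(int(i) * window, int(i) * window + window) for i in best.tolist()]

# Toy usage: 200 frames of 2048-d features with random interest scores.
feats = torch.randn(1, 200, 2048)
segments = select_key_segments(torch.rand(200))
model = SegmentCaptioner()
logits = model(feats[:, segments[0][0]:segments[0][1]], torch.zeros(1, 8, dtype=torch.long))
```

A real system would train the decoder with teacher forcing on caption data and feed the per-segment captions to the abstractive text summarizer mentioned above.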
Joint Video and Text Parsing for Understanding Events and Answering Queries
We propose a framework for parsing video and text jointly for understanding
events and answering user queries. Our framework produces a parse graph that
represents the compositional structures of spatial information (objects and
scenes), temporal information (actions and events) and causal information
(causalities between events and fluents) in the video and text. The knowledge
representation of our framework is based on a spatial-temporal-causal And-Or
graph (S/T/C-AOG), which jointly models possible hierarchical compositions of
objects, scenes and events as well as their interactions and mutual contexts,
and specifies the prior probabilistic distribution of the parse graphs. We
present a probabilistic generative model for joint parsing that captures the
relations between the input video/text, their corresponding parse graphs and
the joint parse graph. Based on the probabilistic model, we propose a joint
parsing system consisting of three modules: video parsing, text parsing and
joint inference. Video parsing and text parsing produce two parse graphs from
the input video and text respectively. The joint inference module produces a
joint parse graph by performing matching, deduction and revision on the video
and text parse graphs. The proposed framework has the following objectives:
Firstly, we aim at deep semantic parsing of video and text that goes beyond the
traditional bag-of-words approaches; Secondly, we perform parsing and reasoning
across the spatial, temporal and causal dimensions based on the joint S/T/C-AOG
representation; Thirdly, we show that deep joint parsing facilitates subsequent
applications such as generating narrative text descriptions and answering
queries in the forms of who, what, when, where and why. We empirically
evaluated our system through comparison against ground truth as well as the
accuracy of query answering, and obtained satisfactory results.
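For illustration only, the sketch below shows a tiny parse-graph structure and a naive merging step in the spirit of the joint-inference module's matching stage; the actual S/T/C-AOG representation, prior distributions, and the deduction and revision steps are not reproduced here, and all labels are hypothetical.

```python
# Hedged sketch of video/text parse graphs and a naive "matching" merge.
from dataclasses import dataclass, field

@dataclass
class ParseNode:
    label: str                                       # e.g. "person", "enter", "door-open"
    kind: str                                        # "object" | "scene" | "action" | "event" | "fluent"
    children: list = field(default_factory=list)     # hierarchical composition
    relations: list = field(default_factory=list)    # ("temporal" | "causal", other_label)

def match(video_graph, text_graph):
    """Merge nodes that share a label, keeping the union of their relations.
    Stands in for the matching step; deduction and revision would follow."""
    merged = {n.label: n for n in video_graph}
    for node in text_graph:
        if node.label in merged:
            merged[node.label].relations.extend(node.relations)
        else:
            merged[node.label] = node   # text contributes nodes the video parser missed
    return list(merged.values())

video = [ParseNode("person", "object"), ParseNode("enter", "action")]
text = [ParseNode("enter", "action", relations=[("causal", "door-open")]),
        ParseNode("door-open", "fluent")]
joint = match(video, text)   # joint parse graph used for descriptions and who/what/when/where/why queries
```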
Ontology For Europe's Space Situational Awareness Program
This paper presents an ontology architecture concept for the European Space Agencyâs (ESA) Space Situational Awareness (SSA) Program. It incorporates the authorâs domain ontology, The Space Situational Awareness Ontology and related ontology work. I summarize computational ontology, discuss the segments of ESA SSA, and introduce an option for a modular ontology framework reflecting the divisionsof the SSA program. Among other things, ontologies are used for data sharing and integration. By applying ontology to ESA data, the ESA may better achieve its integration and innovation goals, while simultaneously improving the state of peaceful SSA
Specification Patterns for Robotic Missions
Mobile and general-purpose robots increasingly support our everyday life,
requiring dependable robotics control software. Creating such software mainly
amounts to implementing their complex behaviors known as missions. Recognizing
the need, a large number of domain-specific specification languages have been
proposed. These, in addition to traditional logical languages, allow the use of
formally specified missions for synthesis, verification, simulation, or guiding
the implementation. For instance, the logical language LTL is commonly used by
experts to specify missions, as an input for planners, which synthesize the
behavior a robot should have. Unfortunately, domain-specific languages are
usually tied to specific robot models, while logical languages such as LTL are
difficult to use by non-experts. We present a catalog of 22 mission
specification patterns for mobile robots, together with tooling for
instantiating, composing, and compiling the patterns to create mission
specifications. The patterns provide solutions for recurrent specification
problems, each of which details the usage intent, known uses, relationships
to other patterns and, most importantly, a template mission specification in
temporal logic. Our tooling produces specifications expressed in the LTL and
CTL temporal logics to be used by planners, simulators, or model checkers. The
patterns originate from 245 realistic textual mission requirements extracted
from the robotics literature, and they are evaluated upon a total of 441
real-world mission requirements and 1251 mission specifications. Five of these
reflect scenarios we defined with two well-known industrial partners developing
human-size robots. We validated our patterns' correctness with simulators and
two real robots.
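As a hedged illustration of what instantiating such patterns can look like, the snippet below builds LTL formula strings for two common mission patterns (an ordered visit and a global avoidance). The template strings and pattern names are informal approximations, not quotations from the catalog or its tooling.

```python
# Sketch: instantiating mission-pattern templates into LTL strings
# that a planner or model checker could consume.

def sequenced_visit(locations):
    """Visit the given locations in order: F(l1 & F(l2 & ... F(ln)))."""
    formula = locations[-1]
    for loc in reversed(locations[:-1]):
        formula = f"{loc} & F({formula})"
    return f"F({formula})"

def global_avoidance(region):
    """Never enter the given region: G(!region)."""
    return f"G(!{region})"

# Compose two patterns into one mission specification.
mission = f"({sequenced_visit(['l1', 'l2', 'l3'])}) & ({global_avoidance('l4')})"
print(mission)   # (F(l1 & F(l2 & F(l3)))) & (G(!l4))
```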
Video semantic content analysis framework based on ontology combined MPEG-7
The rapid increase in the available amount of video data is creating a growing demand for efficient methods of understanding and managing it at the semantic level. The new multimedia standard, MPEG-7, provides rich functionality to enable the generation of audiovisual descriptions, but it is expressed solely in XML Schema, which provides little support for expressing semantic knowledge. In this paper, a video semantic content analysis framework based on ontology combined with MPEG-7 is presented. A domain ontology is used to define high-level semantic concepts and their relations in the context of the examined domain. MPEG-7 metadata terms for audiovisual descriptions and video content analysis algorithms are expressed in this ontology to enrich video semantic analysis. OWL is used for the ontology description. Rules in Description Logic are defined to describe how low-level features and algorithms for video analysis should be applied according to different perceptual content. Temporal Description Logic is used to describe the semantic events, and a reasoning algorithm is proposed for event detection. The proposed framework is demonstrated in the sports video domain and shows promising results.
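To make the rule-based event detection concrete, the following is a simplified sketch, not the paper's OWL/Description Logic machinery: hypothetical low-level shot labels are mapped to a higher-level sports event, with a crude time-window check standing in for the Temporal Description Logic reasoning. The rule contents and the "goal" concept are illustrative assumptions.

```python
# Hedged sketch: rule-based detection of a semantic event from low-level labels.
RULES = {
    # hypothetical domain concepts for a soccer video
    "goal": ["penalty_area_shot", "crowd_cheer", "replay"],
}

def detect_events(timeline, rules=RULES, window=5.0):
    """timeline: list of (time_sec, low_level_label). An event fires when all of
    a rule's labels occur within `window` seconds of one another."""
    events = []
    for event, required in rules.items():
        times = {lab: [t for t, l in timeline if l == lab] for lab in required}
        if all(times[lab] for lab in required):
            first = min(min(ts) for ts in times.values())
            last = max(min(ts) for ts in times.values())
            if last - first <= window:
                events.append((event, first, last))
    return events

timeline = [(12.0, "penalty_area_shot"), (13.5, "crowd_cheer"), (15.0, "replay")]
print(detect_events(timeline))   # [('goal', 12.0, 15.0)]
```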