Refining Vision Videos
[Context and motivation] Complex software-based systems involve several
stakeholders, their activities and interactions with the system. Vision videos
are used during the early phases of a project to complement textual
representations. They visualize previously abstract visions of the product and
its use. By creating, elaborating, and discussing vision videos, stakeholders
and developers gain an improved shared understanding of how those abstract
visions could translate into concrete scenarios and requirements to which
individuals can relate. [Question/problem] In this paper, we investigate two
aspects of refining vision videos: (1) Refining the vision by providing
alternative answers to previously open issues about the system to be built. (2)
A refined understanding of the camera perspective in vision videos. The impact
of using a subjective (or "ego") perspective is compared to the usual
third-person perspective. [Methodology] We use shopping in rural areas as a
real-world application domain for refining vision videos. Both aspects of
refining vision videos were investigated in an experiment with 20 participants.
[Contribution] Subjects made a significant number of additional contributions
when they received both video and text rather than only one of them, even with very
short text and short video clips. Subjective video elements were rated as
positive. However, there was no significant preference for either subjective or
non-subjective videos in general.
Comment: 15 pages, 25th International Working Conference on Requirements Engineering: Foundation for Software Quality 201
MoSculp: Interactive Visualization of Shape and Time
We present a system that allows users to visualize complex human motion via
3D motion sculptures---a representation that conveys the 3D structure swept by
a human body as it moves through space. Given an input video, our system
computes the motion sculpture and provides a user interface for rendering it
in different styles, including the options to insert the sculpture back into
the original video, render it in a synthetic scene or physically print it.
To provide this end-to-end workflow, we introduce an algorithm that estimates
the human's 3D geometry over time from a set of 2D images, and develop a
3D-aware image-based rendering approach that embeds the sculpture back into the
scene. By automating the process, our system takes motion sculpture creation
out of the realm of professional artists, and makes it applicable to a wide
range of existing video material.
By providing viewers with 3D information, motion sculptures reveal space-time
motion information that is difficult to perceive with the naked eye, and allow
viewers to interpret how different parts of the object interact over time. We
validate the effectiveness of this approach with user studies, finding that our
motion sculpture visualizations are significantly more informative about motion
than existing stroboscopic and space-time visualization methods.
Comment: UIST 2018. Project page: http://mosculp.csail.mit.edu
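The abstract describes a pipeline that turns per-frame 3D human geometry into a single swept shape. As a purely illustrative sketch (not the MoSculp system), the snippet below shows how per-frame human meshes, assumed to have been estimated by some other method, could be merged into one sculpture mesh using the trimesh library; the function and file names are hypothetical.

```python
import trimesh

def build_motion_sculpture(frame_meshes, keep_every=5):
    """Concatenate posed per-frame meshes into one swept sculpture mesh."""
    kept = frame_meshes[::keep_every]        # subsample frames so the sweep stays readable
    return trimesh.util.concatenate(kept)    # merge into a single mesh spanning space-time

# Usage (assumes `meshes` is a list of trimesh.Trimesh, one posed human mesh per frame):
# sculpture = build_motion_sculpture(meshes)
# sculpture.export("sculpture.stl")          # e.g. for physical printing
```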
CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos
Temporal action localization is an important yet challenging problem. Given a
long, untrimmed video consisting of multiple action instances and complex
background contents, we need not only to recognize their action categories, but
also to localize the start time and end time of each instance. Many
state-of-the-art systems use segment-level classifiers to select and rank
proposal segments of pre-determined boundaries. However, a desirable model
should move beyond segment-level and make dense predictions at a fine
granularity in time to determine precise temporal boundaries. To this end, we
design a novel Convolutional-De-Convolutional (CDC) network that places CDC
filters on top of 3D ConvNets, which have been shown to be effective for
abstracting action semantics but reduce the temporal length of the input data.
The proposed CDC filter performs the required temporal upsampling and spatial
downsampling operations simultaneously to predict actions at the frame-level
granularity. It is unique in jointly modeling action semantics in space-time
and fine-grained temporal dynamics. We train the CDC network in an end-to-end
manner efficiently. Our model not only achieves superior performance in
detecting actions in every frame, but also significantly boosts the precision
of localizing temporal boundaries. Finally, the CDC network demonstrates a very
high efficiency with the ability to process 500 frames per second on a single
GPU server. We will update the camera-ready version and publish the source code online soon.
Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 201
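As a rough illustration of the upsample-in-time, downsample-in-space idea (not the authors' implementation, which learns a single joint CDC filter), the hypothetical PyTorch sketch below factorizes one CDC stage into a spatial convolution that collapses the 4x4 feature map of a 3D-ConvNet layer, followed by a transposed convolution that doubles the temporal resolution; the channel sizes and class count are placeholder assumptions.

```python
import torch
import torch.nn as nn

class CDCBlock(nn.Module):
    """Factorized stand-in for one CDC stage: spatial downsampling + temporal upsampling."""
    def __init__(self, in_channels=512, num_classes=21):
        super().__init__()
        # Spatial part: collapse the 4x4 spatial map of 3D-ConvNet features to 1x1.
        self.spatial = nn.Conv3d(in_channels, in_channels, kernel_size=(1, 4, 4))
        # Temporal part: transposed convolution along time doubles the number of steps.
        self.temporal = nn.ConvTranspose3d(in_channels, num_classes,
                                           kernel_size=(4, 1, 1),
                                           stride=(2, 1, 1),
                                           padding=(1, 0, 0))

    def forward(self, x):
        # x: (batch, C, T/8, 4, 4) features from a 3D ConvNet backbone
        x = self.spatial(x)    # -> (batch, C, T/8, 1, 1)
        x = self.temporal(x)   # -> (batch, num_classes, T/4, 1, 1)
        return x.flatten(2)    # per-time-step class scores

feats = torch.randn(1, 512, 8, 4, 4)  # dummy features for a 64-frame clip
print(CDCBlock()(feats).shape)        # torch.Size([1, 21, 16])
```

Stacking such stages would recover frame-level granularity from the temporally reduced backbone features, which is the property the abstract emphasizes.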