Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
First-person vision is gaining interest as it offers a unique viewpoint on
people's interaction with objects, their attention, and even intention.
However, progress in this challenging domain has been relatively slow due to
the lack of sufficiently large datasets. In this paper, we introduce
EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32
participants in their native kitchen environments. Our videos depict
nonscripted daily activities: we simply asked each participant to start
recording every time they entered their kitchen. Recording took place in 4
cities (in North America and Europe) by participants belonging to 10 different
nationalities, resulting in highly diverse cooking styles. Our dataset features
55 hours of video consisting of 11.5M frames, which we densely labeled for a
total of 39.6K action segments and 454.3K object bounding boxes. Our annotation
is unique in that we had the participants narrate their own videos (after
recording), thus reflecting true intention, and we crowd-sourced ground-truths
based on these. We describe our object, action and anticipation challenges, and
evaluate several baselines over two test splits, seen and unseen kitchens.
Dataset and Project page: http://epic-kitchens.github.io
Comment: European Conference on Computer Vision (ECCV) 2018
Temporal Copy and Local Hallucination for Video Inpainting
Video inpainting is the task of removing objects from videos. In particular, the goal is not only to fill every frame with plausible content but also to maintain temporal consistency, so that no abrupt changes can be perceived. The current state of the art in video inpainting, which builds upon deep neural networks, struggles to handle large numbers of frames at reasonable resolutions. In our work, we propose to tackle video inpainting by dividing it into two independent sub-tasks. The first is a Dense Flow Prediction Network (DFPN) capable of predicting the movement of the background while taking into account the movement of the object to remove. The second is a Copy-and-Hallucinate Network (CHN) that uses the output of the previous network to copy the regions that are visible in reference frames while hallucinating those that are not. Both networks are trained independently and combined using one of our three proposed algorithms: the Frame-by-Frame (FF) algorithm, the Inpaint-and-Propagate (IP) algorithm, or the Copy-and-Propagate (CP) algorithm. We analyze our results with both objective and subjective evaluations on two different datasets. In both cases, we find that our models come close to the state of the art but do not surpass it.
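The two-stage pipeline with Frame-by-Frame (FF) mixing can be illustrated with a toy sketch. The `dfpn_stub` and `chn_stub` functions below are hypothetical stand-ins for the paper's DFPN and CHN networks, not the real models; the copy/hallucinate logic is reduced to simple pixel copying and mean filling so the control flow of the FF algorithm is visible:

```python
import numpy as np

def dfpn_stub(frames, masks, t):
    """Hypothetical stand-in for the Dense Flow Prediction Network:
    here it simply predicts zero background motion for frame t."""
    return np.zeros(frames[t].shape[:2] + (2,), dtype=np.float32)

def chn_stub(target, mask, references, flow):
    """Hypothetical stand-in for the Copy-and-Hallucinate Network.
    Copies pixels visible in any reference frame into the hole and
    fills what remains with the mean of the known target pixels
    ("hallucination"). The real CHN would use `flow` to warp the
    references before copying; this stub ignores it."""
    out = target.copy()
    hole = mask.astype(bool)
    for ref, ref_mask in references:
        visible = hole & ~ref_mask.astype(bool)  # hole pixels seen in ref
        out[visible] = ref[visible]
        hole &= ~visible
    if hole.any():
        out[hole] = target[~mask.astype(bool)].mean()
    return out

def inpaint_frame_by_frame(frames, masks, radius=2):
    """Frame-by-Frame (FF) mixing: each frame is completed
    independently using its temporal neighbours as references."""
    result = []
    for t in range(len(frames)):
        refs = [(frames[r], masks[r])
                for r in range(max(0, t - radius),
                               min(len(frames), t + radius + 1))
                if r != t]
        flow = dfpn_stub(frames, masks, t)
        result.append(chn_stub(frames[t], masks[t], refs, flow))
    return result
```

Since FF treats every frame independently, it is the simplest of the three mixing strategies; IP and CP additionally propagate already-inpainted content forward, which the sketch does not attempt.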
3DFill: Reference-guided Image Inpainting by Self-supervised 3D Image Alignment
Most existing image inpainting algorithms are based on a single view,
struggling with large holes or holes containing complicated scenes. Some
reference-guided algorithms fill the hole by referring to an image from
another viewpoint and use 2D image alignment. However, due to the camera
imaging process, a simple 2D transformation rarely achieves a satisfactory
result. In this paper, we
propose 3DFill, a simple and efficient method for reference-guided image
inpainting. Given a target image with arbitrary hole regions and a reference
image from another viewpoint, the 3DFill first aligns the two images by a
two-stage method: 3D projection + 2D transformation, which has better results
than 2D image alignment. The 3D projection is an overall alignment between
images and the 2D transformation is a local alignment focused on the hole
region. The entire process of image alignment is self-supervised. We then fill
the hole in the target image with the contents of the aligned image. Finally,
we use a conditional generation network to refine the filled image to obtain
the inpainting result. 3DFill achieves state-of-the-art performance on image
inpainting across a variety of wide view shifts and has a faster inference
speed than other inpainting models.
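The core "align, then copy" step of reference-guided filling can be sketched as below. This is a minimal illustration, not 3DFill itself: the paper's alignment is a learned, self-supervised 3D projection plus 2D transformation, and a conditional refinement network polishes the result. Here the alignment is reduced to a single known homography and the refinement is omitted:

```python
import numpy as np

def warp_with_homography(img, H):
    """Inverse-warp img by the 3x3 homography H (nearest neighbour).
    Stands in for the learned two-stage alignment in the full method."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    Hinv = np.linalg.inv(H)
    for y in range(h):
        for x in range(w):
            src = Hinv @ np.array([x, y, 1.0])
            sx = int(round(src[0] / src[2]))
            sy = int(round(src[1] / src[2]))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = img[sy, sx]
    return out

def reference_fill(target, hole_mask, reference, H):
    """Copy aligned reference content into the hole region.
    A refinement network would further polish this result."""
    aligned = warp_with_homography(reference, H)
    out = target.copy()
    hole = hole_mask.astype(bool)
    out[hole] = aligned[hole]
    return out
```

The point of the paper's 3D projection stage is precisely that a single homography like the one assumed here is only valid for planar scenes; the global 3D alignment handles the rest, with the 2D transformation as a local correction around the hole.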
EGO-TOPO: Environment Affordances from Egocentric Video
First-person video naturally brings the use of a physical environment to the
forefront, since it shows the camera wearer interacting fluidly in a space
based on his intentions. However, current methods largely separate the observed
actions from the persistent space itself. We introduce a model for environment
affordances that is learned directly from egocentric video. The main idea is to
gain a human-centric model of a physical space (such as a kitchen) that
captures (1) the primary spatial zones of interaction and (2) the likely
activities they support. Our approach decomposes a space into a topological map
derived from first-person activity, organizing an ego-video into a series of
visits to the different zones. Further, we show how to link zones across
multiple related environments (e.g., from videos of multiple kitchens) to
obtain a consolidated representation of environment functionality. On
EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene
affordances and anticipating future actions in long-form video.
Comment: Published in CVPR 2020, project page:
http://vision.cs.utexas.edu/projects/ego-topo
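The topological-map idea can be illustrated with a toy sketch. This is not EGO-TOPO's actual pipeline: the paper discovers zones from visual similarity in first-person video, whereas the sketch below assumes zone labels are already given and only shows how a sequence of visits yields a zone graph plus per-zone affordances:

```python
from collections import defaultdict

def build_topo_map(visits):
    """Toy sketch: organize an ego-video as a sequence of
    (zone, action) visits, link consecutively visited zones,
    and accumulate the actions each zone affords.
    Zone discovery itself (the hard part) is omitted."""
    edges = defaultdict(int)        # counts of zone-to-zone transitions
    affordances = defaultdict(set)  # actions observed in each zone
    prev = None
    for zone, action in visits:
        affordances[zone].add(action)
        if prev is not None and prev != zone:
            edges[(prev, zone)] += 1
        prev = zone
    return dict(edges), {z: sorted(a) for z, a in affordances.items()}
```

Linking zones across multiple kitchens, as the paper does, would amount to merging graph nodes that correspond to functionally similar zones.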
UVL: A Unified Framework for Video Tampering Localization
With the development of deep learning technology, various forgery methods
emerge endlessly. Meanwhile, methods to detect these fake videos have also
achieved excellent performance on some datasets. However, these methods suffer
from poor generalization to unknown videos and are inefficient for new forgery
methods. To address this challenging problem, we propose UVL, a novel unified
video tampering localization framework for synthesized forgeries.
Specifically, UVL extracts features common to synthetic forgeries: boundary
artifacts along synthetic edges, the unnatural distribution of generated
pixels, and the lack of correlation between the forgery region and the
original content. These features are widely present across different types of
synthetic forgery and improve generalization when detecting unknown videos.
Extensive experiments on three types of synthetic forgery (video inpainting,
video splicing, and DeepFake) show that the proposed UVL achieves
state-of-the-art performance on various benchmarks and outperforms existing
methods by a large margin in cross-dataset evaluation.
Learning Joint Spatial-Temporal Transformations for Video Inpainting
High-quality video inpainting that completes missing regions in video frames
is a promising yet challenging task. State-of-the-art approaches adopt
attention models to complete a frame by searching missing contents from
reference frames, and further complete whole videos frame by frame. However,
these approaches can suffer from inconsistent attention results along spatial
and temporal dimensions, which often leads to blurriness and temporal artifacts
in videos. In this paper, we propose to learn a joint Spatial-Temporal
Transformer Network (STTN) for video inpainting. Specifically, we
simultaneously fill missing regions in all input frames by self-attention, and
propose to optimize STTN by a spatial-temporal adversarial loss. To show the
superiority of the proposed model, we conduct both quantitative and qualitative
evaluations by using standard stationary masks and more realistic moving object
masks. Demo videos are available at https://github.com/researchmm/STTN.
Comment: Accepted by ECCV 2020
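The joint spatial-temporal attention idea can be sketched in a few lines: patch features from all frames are pooled into one token set, so a missing patch in any frame can attend to content anywhere in the video. This is a minimal single-head illustration, not STTN itself, which uses learned projections, multiple heads at multiple patch scales, and an adversarial training loss:

```python
import numpy as np

def joint_st_attention(frames_feat):
    """Minimal sketch of joint spatial-temporal self-attention.
    frames_feat: (T, N, C) array of N patch features per frame.
    Tokens from all T frames attend to each other jointly,
    so completion is consistent across space and time at once."""
    T, N, C = frames_feat.shape
    tokens = frames_feat.reshape(T * N, C)       # merge space and time
    scores = tokens @ tokens.T / np.sqrt(C)      # scaled dot-product
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over all patches
    out = weights @ tokens                       # attended features
    return out.reshape(T, N, C)
```

Contrast this with frame-by-frame attention, where each frame searches reference frames independently: there, attention maps can disagree across time, producing the blurriness and temporal artifacts the paper describes; joint attention over the merged token set avoids that by construction.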