
    Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. (European Conference on Computer Vision (ECCV) 2018.) Dataset and project page: http://epic-kitchens.github.io
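
    To make the annotation structure concrete, the sketch below shows one way such narration-derived labels could be represented in code; the field names and the split helper are illustrative assumptions, not the dataset's actual schema or loader.

        # Minimal sketch (not the official loader): one densely labeled action
        # segment derived from a participant's narration. Field names below are
        # illustrative assumptions, not the dataset's real schema.
        from dataclasses import dataclass

        @dataclass
        class ActionSegment:
            participant_id: str   # e.g. "P01" -- one of the 32 participants
            video_id: str         # recording made in the participant's own kitchen
            start_frame: int      # segment boundaries, crowd-sourced from narration
            stop_frame: int
            verb: str             # action class, e.g. "open"
            noun: str             # interacted object, e.g. "fridge"

        def is_unseen_kitchen(segment: ActionSegment, train_participants: set) -> bool:
            """Split-logic sketch: the 'unseen kitchens' split holds out whole participants."""
            return segment.participant_id not in train_participants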

    Temporal Copying and Local Hallucination for Video Inpainting

    Video inpainting is the task of removing objects from videos. In particular, the goal is not only to fill every frame with plausible content but also to maintain temporal consistency so that no abrupt changes can be perceived. The current state of the art in video inpainting, which builds upon deep neural networks, struggles to handle large numbers of frames at decent resolutions. In our work, we propose to tackle video inpainting by dividing it into two independent sub-tasks. The first is a Dense Flow Prediction Network (DFPN) capable of predicting the movement of the background by taking into account the movement of the object to remove. The second is a Copy-and-Hallucinate Network (CHN) that uses the output of the previous network to copy the regions that are visible in reference frames while hallucinating those that are not. Both networks are trained independently and combined using one of our three proposed algorithms: the Frame-by-Frame (FF) algorithm, the Inpaint-and-Propagate (IP) algorithm, or the Copy-and-Propagate (CP) algorithm. We analyze our results with both objective and subjective evaluations on two different datasets. In both cases, we find that our models are close to the state of the art but do not surpass it.
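
    As a rough illustration of the two-stage pipeline and the Frame-by-Frame mixing algorithm described above, the following sketch assumes hypothetical dfpn and chn callables standing in for the two networks; it is not the authors' implementation.

        import torch

        def frame_by_frame_inpaint(frames, masks, dfpn, chn, ref_offsets=(-2, -1, 1, 2)):
            """Sketch of the FF mixing algorithm described above (not the authors' code).

            frames: (T, C, H, W) video tensor; masks: (T, 1, H, W) with 1 = region to remove.
            dfpn, chn: hypothetical callables standing in for the DFPN and CHN networks.
            """
            T = frames.shape[0]
            output = frames.clone()
            for t in range(T):
                refs = [r for r in (t + o for o in ref_offsets) if 0 <= r < T]
                # DFPN: predict background motion between the target frame and each
                # reference frame, conditioned on the mask of the object to remove.
                flows = [dfpn(frames[t], frames[r], masks[t]) for r in refs]
                # CHN: copy the parts of the hole visible in the aligned references,
                # hallucinate the remainder.
                output[t] = chn(frames[t], masks[t], [frames[r] for r in refs], flows)
            return output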

    3DFill: Reference-guided Image Inpainting by Self-supervised 3D Image Alignment

    Most existing image inpainting algorithms are based on a single view and struggle with large holes or holes containing complicated scenes. Some reference-guided algorithms fill the hole by referring to an image from another viewpoint and use 2D image alignment. Due to the camera imaging process, a simple 2D transformation is rarely enough to achieve a satisfactory result. In this paper, we propose 3DFill, a simple and efficient method for reference-guided image inpainting. Given a target image with arbitrary hole regions and a reference image from another viewpoint, 3DFill first aligns the two images with a two-stage method, 3D projection + 2D transformation, which yields better results than 2D image alignment alone. The 3D projection is an overall alignment between the images, and the 2D transformation is a local alignment focused on the hole region. The entire image alignment process is self-supervised. We then fill the hole in the target image with the contents of the aligned image. Finally, we use a conditional generation network to refine the filled image and obtain the inpainting result. 3DFill achieves state-of-the-art performance on image inpainting across a variety of wide view shifts and has a faster inference speed than other inpainting models.
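
    The sketch below illustrates the align-fill-refine pipeline described above; project_3d, warp_2d and refine_net are hypothetical placeholders for the self-supervised alignment stages and the conditional refinement network, not 3DFill's actual components.

        def reference_guided_inpaint(target, mask, reference,
                                     project_3d, warp_2d, refine_net):
            """Sketch of the align -> fill -> refine pipeline described above.

            target, reference: (C, H, W) image tensors; mask: (1, H, W), 1 = hole.
            project_3d, warp_2d, refine_net are hypothetical callables standing in
            for the self-supervised 3D projection, the local 2D transformation, and
            the conditional refinement network.
            """
            # Stage 1: overall alignment of the reference onto the target via 3D projection.
            coarse_aligned = project_3d(reference, target)
            # Stage 2: local 2D transformation focused on the hole region.
            aligned = warp_2d(coarse_aligned, target, mask)
            # Fill the hole with content from the aligned reference.
            filled = target * (1 - mask) + aligned * mask
            # Refine the composited image with the conditional generator.
            return refine_net(filled, mask)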

    EGO-TOPO: Environment Affordances from Egocentric Video

    First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on their intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a human-centric model of a physical space (such as a kitchen) that captures (1) the primary spatial zones of interaction and (2) the likely activities they support. Our approach decomposes a space into a topological map derived from first-person activity, organizing an ego-video into a series of visits to the different zones. Further, we show how to link zones across multiple related environments (e.g., from videos of multiple kitchens) to obtain a consolidated representation of environment functionality. On EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene affordances and anticipating future actions in long-form video. (Published in CVPR 2020; project page: http://vision.cs.utexas.edu/projects/ego-topo)
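
    The following sketch illustrates the topological-map idea, accumulating per-zone affordances and zone-to-zone connectivity from an ordered list of visits; the visit representation is an assumption made for illustration, not the authors' code.

        from collections import defaultdict

        def build_topological_map(visits):
            """Sketch of the zone-graph idea described above (not the authors' code).

            visits: ordered list of (zone_id, actions) pairs, one per detected visit,
            e.g. [("sink", ["wash hand"]), ("hob", ["stir pan"]), ("sink", ["rinse cup"])].
            Returns per-zone supported actions and the set of traversed zone-to-zone edges.
            """
            affordances = defaultdict(set)   # likely activities each zone supports
            edges = set()                    # topological connectivity from consecutive visits
            prev_zone = None
            for zone, actions in visits:
                affordances[zone].update(actions)
                if prev_zone is not None and prev_zone != zone:
                    edges.add((prev_zone, zone))
                prev_zone = zone
            return affordances, edges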

    UVL: A Unified Framework for Video Tampering Localization

    With the development of deep learning technology, new forgery methods emerge endlessly. Meanwhile, methods to detect these fake videos have achieved excellent performance on some datasets. However, these methods generalize poorly to unknown videos and are inefficient against new forgery methods. To address this challenging problem, we propose UVL, a novel unified video tampering localization framework for synthesized forgeries. Specifically, UVL extracts features common to synthetic forgeries: boundary artifacts of synthesized edges, the unnatural distribution of generated pixels, and the non-correlation between the forged region and the original. These features are widely present in different types of synthetic forgeries and help improve generalization when detecting unknown videos. Extensive experiments on three types of synthetic forgery, video inpainting, video splicing and DeepFake, show that the proposed UVL achieves state-of-the-art performance on various benchmarks and outperforms existing methods by a large margin in cross-dataset evaluation.
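
    A minimal sketch of the three-branch idea is given below; the branch modules, their channel widths and the fusion head are hypothetical placeholders rather than the UVL architecture.

        import torch
        import torch.nn as nn

        class TamperingLocalizer(nn.Module):
            """Sketch of the three-feature idea described above; the branches and the
            fusion head are hypothetical placeholders, not the UVL architecture."""

            def __init__(self, edge_branch, pixel_branch, correlation_branch,
                         channels=3 * 16):  # assumes each hypothetical branch emits 16 channels
                super().__init__()
                self.edge_branch = edge_branch          # boundary artifacts of synthesized edges
                self.pixel_branch = pixel_branch        # unnatural distribution of generated pixels
                self.corr_branch = correlation_branch   # non-correlation with the original content
                self.head = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel tampering logit

            def forward(self, frames):
                # frames: (B, C, H, W) batch of frames; localization is done per frame here.
                feats = torch.cat([self.edge_branch(frames),
                                   self.pixel_branch(frames),
                                   self.corr_branch(frames)], dim=1)
                return torch.sigmoid(self.head(feats))  # localization mask in [0, 1]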

    Learning Joint Spatial-Temporal Transformations for Video Inpainting

    High-quality video inpainting that completes missing regions in video frames is a promising yet challenging task. State-of-the-art approaches adopt attention models to complete a frame by searching for missing contents in reference frames, and further complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along the spatial and temporal dimensions, which often leads to blurriness and temporal artifacts. In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN with a spatial-temporal adversarial loss. To show the superiority of the proposed model, we conduct both quantitative and qualitative evaluations using standard stationary masks and more realistic moving-object masks. Demo videos are available at https://github.com/researchmm/STTN (accepted at ECCV 2020).
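
    The sketch below illustrates the core idea of joint spatio-temporal self-attention, flattening patches from all frames into one token sequence so a hole in any frame can attend to content in every frame; it is a simplification for illustration, not the STTN architecture (no multi-head or multi-scale attention, no spatial-temporal adversarial loss).

        import torch
        import torch.nn as nn

        class SpatioTemporalSelfAttention(nn.Module):
            """Minimal sketch of joint self-attention over patches from all frames."""

            def __init__(self, channels=64, patch=8):
                super().__init__()
                self.patch = patch
                dim = channels * patch * patch           # one token per spatio-temporal patch
                self.to_qkv = nn.Linear(dim, 3 * dim)
                self.proj = nn.Linear(dim, dim)

            def forward(self, feats):
                # feats: (B, T, C, H, W) feature maps of all input frames (holes included).
                B, T, C, H, W = feats.shape
                p = self.patch
                patches = feats.unfold(3, p, p).unfold(4, p, p)          # (B, T, C, H/p, W/p, p, p)
                tokens = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, -1, C * p * p)
                q, k, v = self.to_qkv(tokens).chunk(3, dim=-1)
                # Every patch attends to every patch of every frame.
                attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
                return self.proj(attn @ v)               # (B, T*(H/p)*(W/p), C*p*p)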