4 research outputs found

    Visual Summary of Egocentric Photostreams by Representative Keyframes

    Building a visual summary from an egocentric photostream captured by a lifelogging wearable camera is of high interest for different applications (e.g. memory reinforcement). In this paper, we propose a new summarization method based on keyframe selection that uses visual features extracted by means of a convolutional neural network. Our method applies unsupervised clustering to divide the photostream into events and then extracts the most relevant keyframe for each event. We assess the results with a blind taste test in which a group of 20 people rated the quality of the summaries. Comment: Paper accepted at the IEEE First International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX), Turin, Italy, July 3, 201
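    As a rough illustration of the pipeline described in this abstract (CNN descriptors per photo, event clustering, one keyframe per event), the sketch below uses scikit-learn's agglomerative clustering and a nearest-to-centroid selection rule. Both choices are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def summarize(features: np.ndarray, n_events: int):
    """Pick one representative keyframe index per event.

    `features` is an (n_frames, d) array of CNN descriptors, one per photo,
    in temporal order.  The clustering algorithm and the "closest to the
    event centroid" rule are illustrative stand-ins, not the paper's method.
    """
    # 1. Group photos into events by clustering their visual descriptors.
    labels = AgglomerativeClustering(n_clusters=n_events).fit_predict(features)

    keyframes = []
    for event in range(n_events):
        idx = np.where(labels == event)[0]
        centroid = features[idx].mean(axis=0)
        # 2. The keyframe is the photo whose descriptor lies closest to the
        #    event centroid (Euclidean distance).
        dists = np.linalg.norm(features[idx] - centroid, axis=1)
        keyframes.append(int(idx[np.argmin(dists)]))
    return sorted(keyframes)
```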

    Multiple Trajectory Prediction of Moving Agents with Memory Augmented Networks

    Pedestrians and drivers are expected to safely navigate complex urban environments along with several non-cooperating agents. Autonomous vehicles will soon have to replicate this capability. Each agent acquires a representation of the world from an egocentric perspective and must make decisions ensuring safety for itself and others. This requires predicting the motion patterns of observed agents sufficiently far into the future. In this paper we propose MANTRA, a model that exploits memory augmented networks to effectively predict multiple trajectories of other agents observed from an egocentric perspective. Our model stores observations in memory and uses trained controllers to write meaningful pattern encodings and to read the trajectories that are most likely to occur in the future. We show that our method natively performs multi-modal trajectory prediction, obtaining state-of-the-art results on four datasets. Moreover, thanks to the non-parametric nature of the memory module, we show how, once trained, our system can continuously improve by ingesting novel patterns.
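    The memory read/write idea can be pictured with a toy key-value store of (past encoding, future encoding) pairs. The learned encoders, the trained write controller, and the trajectory decoder of MANTRA are omitted here; the cosine-similarity top-k read is an assumed stand-in used only to show why the memory is non-parametric and can keep growing after training.

```python
import numpy as np


class TrajectoryMemory:
    """Toy non-parametric memory of (past, future) trajectory encodings."""

    def __init__(self):
        self.keys, self.values = [], []

    def write(self, past_enc: np.ndarray, future_enc: np.ndarray) -> None:
        # New patterns can be ingested at any time (non-parametric growth).
        self.keys.append(past_enc)
        self.values.append(future_enc)

    def read(self, past_enc: np.ndarray, k: int = 5) -> np.ndarray:
        # Cosine similarity between the query encoding and all stored keys.
        keys = np.stack(self.keys)
        sims = keys @ past_enc / (
            np.linalg.norm(keys, axis=1) * np.linalg.norm(past_enc) + 1e-8
        )
        top = np.argsort(-sims)[:k]
        # Each retrieved future encoding would be decoded (together with the
        # observed past) into one candidate trajectory, giving k multi-modal
        # predictions.
        return np.stack([self.values[i] for i in top])
```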

    Open-ended visual question answering

    Wearable cameras generate a large number of photos which are, in many cases, useless or redundant. On the other hand, these devices provide an excellent opportunity to create automatic questions and answers for reminiscence therapy. This is a follow-up of the BSc thesis developed by Ricard Mestre during Fall 2014 and the MSc thesis developed by Aniol Lidon. This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-based Question-Answering. We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These features are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was successfully submitted to the Visual Question Answering Challenge 2016, where it achieved 53.62% accuracy on the test dataset. The developed software follows best programming practices and Python code style, providing a consistent baseline in Keras for different configurations. The source code and models are publicly available at https://github.com/imatge-upc/vqa-2016-cvprw.
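    A minimal sketch of the kind of fusion baseline described above, written with the modern tf.keras API: precomputed VGG-16 image features are concatenated with an LSTM encoding of the question and fed to a softmax over candidate answers. The layer sizes, vocabulary parameters, and feature dimensions are placeholders and do not reproduce the configuration used in the thesis or in the linked repository.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

# Placeholder dimensions -- not the thesis configuration.
VOCAB_SIZE = 10000   # question vocabulary size
MAX_Q_LEN = 25       # question length in tokens
IMG_DIM = 4096       # assumed VGG-16 fc7 feature size
NUM_ANSWERS = 1000   # most frequent answers treated as classes

# Question branch: word embeddings summarized by an LSTM.
question = Input(shape=(MAX_Q_LEN,), name="question_tokens")
q = Embedding(VOCAB_SIZE, 256)(question)
q = LSTM(256)(q)

# Image branch: precomputed VGG-16 features (feature extraction not shown).
image = Input(shape=(IMG_DIM,), name="vgg16_features")

# Fusion: concatenate both modalities and classify the answer.
merged = Concatenate()([q, image])
answer = Dense(NUM_ANSWERS, activation="softmax")(merged)

model = Model(inputs=[question, image], outputs=answer)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```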

    Visual summary of egocentric photostreams by representative keyframes

    Building a visual summary from an egocentric photostream captured by a lifelogging wearable camera is of high interest for different applications (e.g. memory reinforcement). In this paper, we propose a new summarization method based on keyframe selection that uses visual features extracted by means of a convolutional neural network. Our method applies unsupervised clustering to divide the photostream into events and then extracts the most relevant keyframe for each event. We assess the results with a blind taste test in which a group of 20 people rated the quality of the summaries. Peer Reviewed