10 research outputs found

    Behaviour understanding through the analysis of image sequences collected by wearable cameras

    Get PDF
    Describing people's lifestyle has become a hot topic in the field of artificial intelligence. Lifelogging is described as the process of collecting personal activity data describing the daily behaviour of a person. Nowadays, the development of new technologies and the increasing use of wearable sensors make it possible to automatically record data from our daily living. In this paper, we describe the automatic tools we developed for the analysis of collected visual data that describes the daily behaviour of a person. For this analysis, we rely on sequences of images collected by wearable cameras, called egocentric photo-streams. These images are a rich source of information about the behaviour of the camera wearer, since they show an objective, first-person view of his or her lifestyle.

    Mutual Context Network for Jointly Estimating Egocentric Gaze and Actions

    Full text link
    In this work, we address two coupled tasks, gaze prediction and action recognition in egocentric videos, by exploring their mutual context. Our assumption is that, while a person performs a manipulation task, what the person is doing determines where they are looking, and the gaze point reveals gaze and non-gaze regions which contain important and complementary information about the ongoing action. We propose a novel mutual context network (MCN) that jointly learns action-dependent gaze prediction and gaze-guided action recognition in an end-to-end manner. Experiments on public egocentric video datasets demonstrate that our MCN achieves state-of-the-art performance on both gaze prediction and action recognition.
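    A minimal sketch of how such a mutual-context model could be wired is given below, assuming a PyTorch implementation. The backbone, the single-frame input, the gaze/non-gaze pooling and all layer sizes are illustrative assumptions for exposition; they are not the authors' MCN architecture.

        # Sketch of a mutual-context style model (assumed design, not the authors' MCN):
        # a gaze branch predicts a saliency map from shared features, the action branch
        # pools features inside and outside the predicted gaze region, and both heads
        # are trained jointly end-to-end.
        import torch
        import torch.nn as nn

        class MutualContextSketch(nn.Module):
            def __init__(self, num_actions: int, feat_dim: int = 256):
                super().__init__()
                # shared backbone producing a spatial feature map (B, C, h, w)
                self.backbone = nn.Sequential(
                    nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
                    nn.ReLU(inplace=True),
                )
                # gaze branch: one spatial saliency map per frame
                self.gaze_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
                # action branch: consumes gaze-pooled and complementary context features
                self.action_head = nn.Linear(2 * feat_dim, num_actions)

            def forward(self, frames):                      # frames: (B, 3, H, W)
                feats = self.backbone(frames)               # (B, C, h, w)
                gaze_logits = self.gaze_head(feats)         # (B, 1, h, w)
                b, _, h, w = gaze_logits.shape
                gaze = gaze_logits.flatten(2).softmax(dim=-1).view(b, 1, h, w)
                # pool features inside the predicted gaze region ...
                gaze_feat = (feats * gaze).flatten(2).sum(dim=-1)          # (B, C)
                # ... and over the complementary (non-gaze) region
                non_gaze = 1.0 - gaze
                non_gaze = non_gaze / non_gaze.flatten(2).sum(dim=-1).view(b, 1, 1, 1)
                ctx_feat = (feats * non_gaze).flatten(2).sum(dim=-1)       # (B, C)
                action_logits = self.action_head(torch.cat([gaze_feat, ctx_feat], dim=1))
                return gaze, action_logits

    A joint loss (for example, a divergence term on the gaze map plus cross-entropy on the action logits) would then be back-propagated through both branches, which is what lets the two tasks inform each other during training.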

    Mining reality to explore the 21st century student experience

    Get PDF
    Understanding student experience is a key aspect of higher education research. To date, the dominant methods for advancing this area have been the use of surveys and interviews, methods that typically rely on post-event recollections or perceptions, which can be incomplete and unreliable. Advances in mobile sensor technologies afford the opportunity to capture continuous, naturally-occurring student activity. In this thesis, I propose a new research approach for higher education that redefines student experience in terms of objective activity observation, rather than a construct of perception. I argue that novel, technologically driven research practices such as ‘Reality Mining’ (the continuous capture of digital data from wearable devices) and the use of multi-modal datasets captured over prolonged periods offer a deeper, more accurate representation of students’ lived experience. To explore the potential of these new methods, I implemented and evaluated three approaches to gathering student activity and behaviour data. I collected data from 21 undergraduate health science students at the University of Otago, over the period of a single semester (approximately four months). The data captured included GPS trace data from a smartphone app to explore student spaces and movements; photo data from a wearable auto-camera (which takes a photo from the wearer’s point of view every 30 seconds) to investigate student activities; and computer usage data captured via the RescueTime software to gain insight into students’ digital practices. I explored the findings of these three datasets, visualising the student experience in different ways to demonstrate different perspectives on student activity, and utilised a number of new analytical approaches (such as Computer Vision algorithms for automatically categorising photostream data) to make sense of the voluminous data generated. To help future researchers wanting to utilise similar techniques, I also outlined the limitations and challenges encountered in using these new methods/devices for research. The findings of the three method explorations offer some insights into various aspects of the student experience, but serve mostly to highlight the idiographic nature of student life. The principal finding of this research is that these types of ‘student analytics’ are most readily useful to the students themselves, for highlighting their practices and informing self-improvement. I look at this aspect through the lens of a movement called the ‘Quantified Self’, which promotes the use of self-tracking technologies for personal development. To conclude my thesis, I discuss broadly how these methods could feature in higher education research, for researchers, for the institution, and, most importantly, for the students themselves. To this end, I develop a conceptual framework derived from Tschumi’s (1976) Space-Event-Movement framework. At the same time, I also take a critical perspective about the role of these types of personal analytics in the future of higher education, and question how involved the institution should be in the capture and utilisation of these data. Ultimately, there is value in exploring these data capture methods further, but always keeping the ‘student’ placed squarely at the centre of the ‘student experience’.
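    The thesis mentions Computer Vision algorithms for automatically categorising photostream data; a hedged sketch of one simple way to do this with an off-the-shelf classifier is shown below. The torchvision model, the directory layout and the daily tag summary are assumptions for illustration and are not the pipeline used in the thesis (ImageNet labels are object categories rather than activity categories, so a real analysis would need to map or retrain them).

        # Hedged sketch: auto-tag a day of wearable-camera photos with a pretrained
        # torchvision classifier and summarise the most frequent labels.
        from collections import Counter
        from pathlib import Path

        import torch
        from PIL import Image
        from torchvision import models

        weights = models.ResNet18_Weights.DEFAULT
        model = models.resnet18(weights=weights).eval()
        preprocess = weights.transforms()
        labels = weights.meta["categories"]

        def top_label(image_path: Path) -> str:
            """Return the most likely ImageNet label for one photo."""
            img = Image.open(image_path).convert("RGB")
            with torch.no_grad():
                logits = model(preprocess(img).unsqueeze(0))
            return labels[int(logits.argmax(dim=1))]

        # Summarise one day of images (e.g. a photo every 30 seconds, as in the study);
        # the directory name is hypothetical.
        day_dir = Path("photostream/day_01")
        counts = Counter(top_label(p) for p in sorted(day_dir.glob("*.jpg")))
        print(counts.most_common(10))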

    An Outlook into the Future of Egocentric Vision

    Full text link
    What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward-facing cameras and digital overlays, is expected to be integrated into our everyday lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies and available datasets, then reflect on shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate exploration so as to unlock our path to the future always-on, personalised and life-enhancing egocentric vision.
    Comment: We invite comments, suggestions and corrections here: https://openreview.net/forum?id=V3974SUk1

    Text-to-Video: Image Semantics and NLP

    Get PDF
    When aiming at automatically translating an arbitrary text into a visual story, the main challenge consists in finding a semantically close visual representation whereby the displayed meaning should remain the same as in the given text. Moreover, the appearance of an image itself largely influences how its meaningful information is conveyed to an observer. This thesis demonstrates that investigating both image semantics and the semantic relatedness between visual and textual sources enables us to tackle the challenging semantic gap and to find a semantically close translation from natural language to a corresponding visual representation. Within the last years, social networking has become highly popular, leading to an enormous and still increasing amount of data available online. Photo sharing sites like Flickr allow users to associate textual information with their uploaded imagery. This thesis exploits this huge knowledge source of user-generated data, which provides initial links between images, words, and other meaningful data. In order to approach visual semantics, this work presents various methods to analyze the visual structure as well as the appearance of images in terms of meaningful similarities, aesthetic appeal, and emotional effect on an observer. In detail, our GPU-based approach efficiently finds visual similarities between images in large datasets across visual domains and identifies various meanings for ambiguous words by exploring similarity in online search results. Further, we investigate the highly subjective aesthetic appeal of images and make use of deep learning to directly learn aesthetic rankings from a broad diversity of user reactions in social online behavior. To gain even deeper insights into the influence of visual appearance on an observer, we explore how simple image processing is capable of actually changing the emotional perception, and derive a simple but effective image filter. To identify meaningful connections between written text and visual representations, we employ methods from Natural Language Processing (NLP). Extensive textual processing allows us to create semantically relevant illustrations for simple text elements as well as complete storylines. More precisely, we present an approach that resolves dependencies in textual descriptions to arrange 3D models correctly. Further, we develop a method that finds semantically relevant illustrations for texts of different types based on a novel hierarchical querying algorithm. Finally, we present an optimization-based framework that is capable of generating not only semantically relevant but also visually coherent picture stories in different styles.
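    As one concrete reading of the hierarchical querying idea mentioned above, the sketch below backs off from the most specific keyword set of a sentence to smaller subsets until a tagged image matches. The toy tag index, the naive keyword extraction and the back-off order are simplifying assumptions; the thesis relies on extensive textual processing rather than this keyword split.

        # Hedged sketch of hierarchical (back-off) querying against a tagged image
        # collection: try the full keyword set first, then progressively relax it.
        from typing import Optional

        # toy index: image id -> set of user tags (as on Flickr-style photo sharing)
        IMAGE_TAGS = {
            "img_001.jpg": {"dog", "park", "ball"},
            "img_002.jpg": {"dog", "beach"},
            "img_003.jpg": {"park", "sunset"},
        }

        def extract_keywords(sentence: str) -> list[str]:
            # naive keyword extraction; real NLP (POS tagging, dependency parsing) goes here
            stopwords = {"a", "the", "in", "on", "with", "is", "plays"}
            return [w for w in sentence.lower().strip(".").split() if w not in stopwords]

        def find_illustration(sentence: str) -> Optional[str]:
            keywords = extract_keywords(sentence)
            # back off from the full keyword set to smaller subsets (hierarchy of queries)
            for k in range(len(keywords), 0, -1):
                query = set(keywords[:k])
                for image_id, tags in IMAGE_TAGS.items():
                    if query <= tags:
                        return image_id
            return None

        print(find_illustration("A dog plays with a ball in the park"))  # -> img_001.jpg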

    Ramon Llull's Ars Magna

    Get PDF