
    Self-Supervised Relative Depth Learning for Urban Scene Understanding

    As an agent moves through the world, the apparent motion of scene elements is (usually) inversely proportional to their depth. It is natural for a learning agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, faraway mountains don't move much; nearby trees move a lot. This natural relationship between the appearance of objects and their motion is a rich source of information about the world. In this work, we start by training a deep network, using fully automatic supervision, to predict relative scene depth from single images. The relative depth training images are automatically derived from simple videos of cars moving through a scene, using recent motion segmentation techniques and no human-provided labels. This proxy task of predicting relative depth from a single image induces features in the network that yield large improvements, over a network trained from scratch, on a set of downstream tasks including semantic segmentation, joint road segmentation and car detection, and monocular (absolute) depth estimation. The improvement on the semantic segmentation task is greater than that produced by any other automatically supervised method. Moreover, for monocular depth estimation, our unsupervised pre-training even outperforms supervised pre-training with ImageNet. In addition, we demonstrate benefits from learning to predict (unsupervised) relative depth on the specific videos associated with various downstream tasks: we adapt to the specific scenes of those tasks in an unsupervised manner to improve performance. In summary, for semantic segmentation we present state-of-the-art results among methods that do not use supervised pre-training, and for monocular depth estimation we even exceed the performance of supervised ImageNet pre-trained models, achieving results comparable with state-of-the-art methods.
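
    The proxy-label idea in this abstract can be illustrated compactly. The sketch below is a loose illustration rather than the authors' implementation: it derives relative-depth proxy targets from optical-flow magnitude (assuming apparent motion is roughly inversely proportional to depth) and trains a network with a pairwise ranking loss. The flow source, the tiny network, and all names are hypothetical placeholders.

```python
# Minimal sketch: deriving relative-depth proxy labels from apparent motion
# and training with a pairwise ranking loss. Assumes larger flow magnitude
# implies smaller depth (inverse relationship); names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def proxy_relative_depth(flow):
    """flow: (B, 2, H, W) optical flow between consecutive frames.
    Returns a per-pixel proxy that grows with distance, i.e. larger values
    for slow-moving, faraway regions (hypothetical inverse mapping)."""
    mag = flow.norm(dim=1, keepdim=True)          # (B, 1, H, W) apparent motion
    return 1.0 / (mag + 1e-3)

def pairwise_ranking_loss(pred, proxy, n_pairs=512):
    """Sample random pixel pairs and penalise predictions whose ordering
    disagrees with the proxy relative depth (a margin ranking loss)."""
    B, _, H, W = pred.shape
    idx = torch.randint(0, H * W, (B, n_pairs, 2), device=pred.device)
    p = pred.flatten(2).squeeze(1)                # (B, H*W) predicted relative depth
    t = proxy.flatten(2).squeeze(1)               # (B, H*W) proxy targets
    pa, pb = p.gather(1, idx[..., 0]), p.gather(1, idx[..., 1])
    ta, tb = t.gather(1, idx[..., 0]), t.gather(1, idx[..., 1])
    sign = torch.sign(ta - tb)                    # which pixel is farther per proxy
    return F.relu(0.1 - sign * (pa - pb)).mean()  # margin of 0.1, chosen arbitrarily

# Usage: `depth_net` stands in for any image-to-map network; `flow` would come
# from an optical-flow / motion segmentation stage applied to driving video.
depth_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1))
image = torch.rand(2, 3, 64, 64)
flow = torch.rand(2, 2, 64, 64)
loss = pairwise_ranking_loss(depth_net(image), proxy_relative_depth(flow))
loss.backward()
```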

    Media aesthetics based multimedia storytelling.

    Since the earliest of times, humans have been interested in recording their life experiences, for future reference and for storytelling purposes. This task of recording experiences (i.e., both image and video capture) has never before in history been as easy as it is today. This is creating a digital information overload that is becoming a great concern for the people trying to preserve their life experiences. As high-resolution digital still and video cameras become increasingly pervasive, unprecedented amounts of multimedia are being downloaded to personal hard drives, and also uploaded to online social networks, on a daily basis. The work presented in this dissertation is a contribution in the area of multimedia organization, as well as automatic selection of media for storytelling purposes, which eases the human task of summarizing a collection of images or videos to be shared with other people. As opposed to some prior art in this area, our approach takes into account neither user-generated tags nor comments (which describe the photographs in their local or online repositories), and it expects no user interaction with the algorithms. We take an image analysis approach in which both the context images (e.g., images from the online social networks to which the image stories are going to be uploaded) and the collection images (i.e., the collection of images or videos that needs to be summarized into a story) are analyzed using image processing algorithms. This allows us to extract relevant metadata that can be used in the summarization process. Multimedia storytellers usually follow three main steps when preparing their stories: first they choose the main story characters, then the main events to describe, and finally, from these media sub-groups, they choose the media based on their relevance to the story as well as on their aesthetic value. Therefore, one of the main contributions of our work has been the design of computational models, both regression based and classification based, that correlate well with human perception of the aesthetic value of images and videos. These computational aesthetics models have been integrated into automatic selection algorithms for multimedia storytelling, which are another important contribution of our work. A human-centric approach has been used in all experiments where it was feasible, and also to assess the final summarization results; i.e., humans are always the final judges of our algorithms, either by inspecting the aesthetic quality of the media or by inspecting the final story generated by our algorithms. We are aware that a perfect automatically generated story summary is very hard to obtain, given the many subjective factors that play a role in such a creative process; rather, the presented approach should be seen as a first step in the storytelling creative process which removes some of the ground work that would be tedious and time-consuming for the user. Overall, the main contributions of this work can be summarized in three points: (1) new media aesthetics models for both images and videos that correlate with human perception, (2) new scalable multimedia collection structures that ease the process of media summarization, and finally, (3) new media selection algorithms that are optimized for multimedia storytelling purposes.
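
    The final selection step described above (choosing media by relevance to the story and by aesthetic value) can be sketched as a simple scoring-and-ranking procedure. The code below is a minimal illustration under assumed inputs, not the dissertation's regression or classification models; the scores, weighting, and field names are hypothetical.

```python
# Minimal sketch of a story-oriented selection step: rank candidate media by a
# weighted mix of a (pre-computed) aesthetic score and relevance to the chosen
# story events/characters, then keep the top-k. All values and names are
# illustrative assumptions, not the dissertation's actual models.
from dataclasses import dataclass

@dataclass
class MediaItem:
    path: str
    aesthetic: float   # e.g. output of a learned aesthetics model, in [0, 1]
    relevance: float   # e.g. similarity to the selected event/characters, in [0, 1]

def select_for_story(items, k=5, w_aesthetic=0.5):
    """Greedy selection: score = w * aesthetic + (1 - w) * relevance."""
    scored = sorted(items,
                    key=lambda m: w_aesthetic * m.aesthetic
                                  + (1 - w_aesthetic) * m.relevance,
                    reverse=True)
    return scored[:k]

collection = [MediaItem("img_001.jpg", 0.82, 0.40),
              MediaItem("img_002.jpg", 0.35, 0.90),
              MediaItem("img_003.jpg", 0.70, 0.75)]
story = select_for_story(collection, k=2)
print([m.path for m in story])
```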

    Future of networking is the future of Big Data, The

    Scientific domains such as Climate Science, High Energy Particle Physics (HEP), Genomics, Biology, and many others are increasingly moving towards data-oriented workflows, where each of these communities generates, stores and uses massive datasets that reach into terabytes and petabytes and are projected soon to reach exabytes. These communities are also increasingly moving towards a global collaborative model where scientists routinely exchange a significant amount of data. The sheer volume of data, and the complexities associated with maintaining, transferring, and using it, continue to push the limits of current technologies in multiple dimensions: storage, analysis, networking, and security. This thesis tackles the networking aspect of big-data science. Networking is the glue that binds all the components of modern scientific workflows, and these communities are becoming increasingly dependent on high-speed, highly reliable networks. The network, as the common layer across big-science communities, provides an ideal place for implementing common services. Big-science applications also need to work closely with the network to ensure optimal usage of resources and intelligent routing of requests and data. Finally, as more communities move towards data-intensive, connected workflows, adopting a service model in which the network provides some of the common services reduces not only application complexity but also the need for duplicate implementations. Named Data Networking (NDN) is a new network architecture whose service model aligns better with the needs of these data-oriented applications. NDN's name-based paradigm makes it easier to provide intelligent features at the network layer rather than at the application layer. This thesis shows that NDN can push several standard features to the network. This work is the first attempt to apply NDN in the context of large scientific data; in the process, this thesis touches upon scientific data naming, name discovery, real-world deployment of NDN for scientific data, feasibility studies, and the design of in-network protocols for big-data science.
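
    The name-based service model that the thesis builds on can be illustrated with a toy example. The sketch below shows hierarchical naming, name discovery, and name-based fetching for scientific datasets; the naming scheme and the in-memory catalog are hypothetical illustrations, not the thesis's actual design and not an NDN library API.

```python
# Minimal sketch of hierarchical, name-based access to scientific data in the
# spirit of NDN: data is requested by name, not by host. The name scheme and
# the in-memory "catalog" are hypothetical illustrations only.
def build_name(domain, experiment, variable, time_range):
    """Compose a hierarchical data name, e.g. /climate/CMIP5/tas/2000-2010."""
    return "/" + "/".join([domain, experiment, variable, time_range])

catalog = {
    build_name("climate", "CMIP5", "tas", "2000-2010"): "blob-0001",
    build_name("climate", "CMIP5", "pr", "2000-2010"): "blob-0002",
}

def discover(prefix):
    """Name discovery: list all published names under a given prefix."""
    return [name for name in catalog if name.startswith(prefix)]

def fetch(name):
    """Stand-in for an Interest/Data exchange: return the object published
    under `name`, regardless of which node actually holds it."""
    return catalog.get(name)

print(discover("/climate/CMIP5"))
print(fetch("/climate/CMIP5/tas/2000-2010"))
```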

    Temporal multimodal video and lifelog retrieval

    The past decades have seen exponential growth in both the consumption and the production of data, with multimedia such as images and videos contributing significantly to said growth. The widespread proliferation of smartphones has provided everyday users with the ability to consume and produce such content easily. As the complexity and diversity of multimedia data have grown, so has the need for more complex retrieval models which address the information needs of users. Finding relevant multimedia content is central in many scenarios, from internet search engines and medical retrieval to querying one's personal multimedia archive, also called a lifelog. Traditional retrieval models have often focused on queries targeting small units of retrieval, yet users usually remember temporal context and expect results to include it. However, there is little research into enabling these information needs in interactive multimedia retrieval. In this thesis, we aim to close this research gap by making several contributions to multimedia retrieval, with a focus on two scenarios, namely video and lifelog retrieval. We provide a retrieval model for complex information needs with temporal components, including a data model for multimedia retrieval, a query model for complex information needs, and a modular and adaptable query execution model which includes novel algorithms for result fusion. The concepts and models are implemented in vitrivr, an open-source multimodal multimedia retrieval system, which covers all aspects from extraction to query formulation and browsing. vitrivr has proven its usefulness in evaluation campaigns and is now used in two large-scale interdisciplinary research projects. We show the feasibility and effectiveness of our contributions in two ways: first, through results from user-centric evaluations which pit different user-system combinations against one another; second, through a system-centric evaluation in which we create a new dataset for temporal information needs in video and lifelog retrieval and use it to quantitatively evaluate our models. The results show significant benefits for systems that enable users to specify more complex information needs with temporal components. Participation in interactive retrieval evaluation campaigns over multiple years provides insight into possible future developments and challenges of such campaigns.
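
    The temporal query idea can be made concrete with a small example. The sketch below fuses per-segment scores from two ordered sub-queries into sequence scores that respect temporal order; the scoring scheme and data layout are illustrative assumptions, not vitrivr's actual fusion algorithms.

```python
# Minimal sketch of temporal result fusion: a query is split into ordered
# sub-queries ("a person runs" THEN "crosses the finish line"); per-segment
# scores from each sub-query are fused into scores for ordered sequences.
# Scoring and data layout are illustrative only.
def fuse_temporal(results_a, results_b, max_gap=30.0):
    """results_*: dicts mapping (video_id, start_time) -> score per sub-query.
    Returns candidate sequences where the second segment follows the first
    within `max_gap` seconds, scored by the sum of the two segment scores."""
    fused = []
    for (vid_a, t_a), score_a in results_a.items():
        for (vid_b, t_b), score_b in results_b.items():
            if vid_a == vid_b and 0 < t_b - t_a <= max_gap:
                fused.append(((vid_a, t_a, t_b), score_a + score_b))
    return sorted(fused, key=lambda item: item[1], reverse=True)

results_run = {("v1", 12.0): 0.8, ("v1", 95.0): 0.4, ("v2", 5.0): 0.7}
results_finish = {("v1", 30.0): 0.9, ("v2", 80.0): 0.6}
print(fuse_temporal(results_run, results_finish))
```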

    In-vehicle filming of driver fatigue on YouTube: vlogs, crashes and bad advice

    Background: Driver fatigue contributes to 15-30% of crashes; however, it is difficult to measure objectively. Fatigue mitigation relies on driver self-moderation, placing great importance on the need for road safety campaigns to engage with their audience. The popular self-archiving website YouTube.com is a relatively untapped source of public perceptions. Method: A systematic YouTube.com search (videos uploaded 2/12/09 - 2/12/14) was conducted using driver fatigue related search terms. 442 relevant videos were identified, and in-vehicle footage was separated for further analysis. Video reception was quantified in terms of number of views, likes, comments, dislikes and times duplicated. Qualitative analysis of comments was undertaken to identify key themes. Results: 4.2% (n=107) of relevant uploaded videos contained in-vehicle footage. Three types of videos were identified: (1) dashcam footage (n=82); (2) speaking directly to the camera, i.e. vlogs (n=16); (3) passengers filming drivers (n=9). Two distinct types of comments emerged: those directly relating to driver fatigue, and those more broadly about the video or its uploader. Driver fatigue comments included attribution of the cause of the behaviour, the emotion experienced when watching the video, and personal advice on staying awake while driving. Discussion: In-vehicle footage related to driver fatigue is prevalent on YouTube.com and is actively engaged with by viewers. Comments were mixed in terms of criticism and sympathy for drivers. Willingness to share advice on staying awake suggests driver fatigue may be seen as a common yet controllable occurrence. This project provides new insight into driver fatigue perception, which may be considered by safety authorities when designing education campaigns.

    CineScale2: a dataset of cinematic camera features in movies

    The position and orientation of the camera in relation to the subject(s) in a movie scene, namely the camera "level" and the camera "angle", are essential features in the film-making process due to their influence on the viewer's perception of the scene. We provide a database containing camera feature annotations on camera angle and camera level for about 25,000 image frames. Frames are sampled from a wide range of movies, freely available images, and shots from cinematographic websites, and are annotated with one of five camera-angle categories (Overhead, High, Neutral, Low, and Dutch) and one of six camera-level classes (Aerial, Eye, Shoulder, Hip, Knee, and Ground level). This dataset is an extension of the CineScale dataset [1], which contains movie frames and related annotations regarding shot scale. The CineScale2 database enables AI-driven interpretation of shot scale data and opens up a large set of research activities related to the automatic visual analysis of cinematic material, such as movie stylistic analysis, video recommendation, and media psychology. For these purposes, we also provide the model and the code for building a Convolutional Neural Network (CNN) architecture for automated camera feature recognition. All the material is provided on the project website; video frames can also be provided upon request to the authors, for research purposes under fair use.
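
    The automated camera feature recognition mentioned above can be sketched as a CNN with two classification heads, one for the five camera-angle categories and one for the six camera-level classes. The code below is a minimal illustrative architecture, not the model released with the dataset.

```python
# Minimal sketch of a two-head CNN for camera feature recognition: one head for
# the five camera-angle classes (Overhead, High, Neutral, Low, Dutch) and one
# for the six camera-level classes (Aerial, Eye, Shoulder, Hip, Knee, Ground).
# Backbone and layer sizes are illustrative, not the released CineScale2 model.
import torch
import torch.nn as nn

class CameraFeatureNet(nn.Module):
    def __init__(self, n_angle=5, n_level=6):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.angle_head = nn.Linear(64, n_angle)   # camera-angle logits
        self.level_head = nn.Linear(64, n_level)   # camera-level logits

    def forward(self, x):
        feats = self.backbone(x)
        return self.angle_head(feats), self.level_head(feats)

model = CameraFeatureNet()
frames = torch.rand(4, 3, 224, 224)               # a batch of movie frames
angle_logits, level_logits = model(frames)
print(angle_logits.shape, level_logits.shape)      # (4, 5) and (4, 6)
```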

    MarathOn Multiscreen: group television watching and interaction in a viewing ecology

    This paper reports and discusses the findings of an exploratory study into collaborative user practice with a multiscreen television application. MarathOn Multiscreen allows users to view, share and curate amateur and professional video footage of a community marathon event. Our investigations focused on collaborative sharing practices across different viewing activities and devices, the roles taken by different devices in a viewing ecology, and observations on how users consume professional and amateur content. Our work uncovers significant differences in user behaviour and collaboration when engaged in more participatory viewing activities, such as sorting and ranking footage, which has implications for awareness of other users’ interactions while viewing together and alone. In addition, user appreciation and use of amateur video content is dependent not only on quality and activity but also on their personal involvement in the content.