13 research outputs found

    Ten Research Questions for Scalable Multimedia Analytics

    The scale and complexity of multimedia collections is ever increasing, as is the desire to harvest useful insight from them. To optimally support the complex quest for insight, multimedia analytics has emerged as a new research area that combines concepts and techniques from multimedia analysis and visual analytics into a single framework. State-of-the-art multimedia analytics solutions are highly interactive and give users freedom in how they perform their analytics task, but they do not scale well. State-of-the-art scalable database management solutions, on the other hand, are not yet designed for multimedia analytics workloads. In this position paper we therefore argue the need for research on scalable multimedia analytics, a new research area built on the three pillars of visual analytics, multimedia analysis and database management. We propose a specific goal for scalable multimedia analytics and present several important research questions that we believe must be addressed in order to achieve that goal.

    Urban Image Geo-Localization Using Open Data on Public Spaces

    In this paper, we study the problem of urban image geo-localization, where the aim is to estimate the real-world location in which an image was taken. Among the previous approaches to this task, we note three distinct categories: one only analyzes metadata; another only analyzes the image content; and the third combines the two. However, most previous approaches require large annotated collections of images or their metadata. Instead of relying on large collections of images, we propose to use publicly available geographical (GIS) data, which contains information about urban objects in public spaces, as a backbone database to query images against. We argue that images can be effectively represented by the objects they contain, and that the spatial geometry of a scene - i.e., the positioning of these objects relative to each other - can function as a unique identifier for a particular physical location. Our experiments demonstrate the potential of using open GIS data for precise image geo-location estimation and serve as a baseline for future research in multimedia geo-localization.
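    The core idea above - that the relative positions of public-space objects can act as a geometric fingerprint for a location - can be illustrated with a minimal sketch. This is not the paper's actual matching algorithm; the scene encoding, the tolerance parameter, and the toy GIS records below are all hypothetical, chosen only to show how a scene's pairwise object geometry could be queried against open GIS data.

```python
import math
from itertools import combinations

def scene_signature(objects):
    """Encode a scene as edges of (sorted type pair, distance) between
    all object pairs; objects are hypothetical (type, x, y) tuples."""
    sig = []
    for (t1, x1, y1), (t2, x2, y2) in combinations(objects, 2):
        pair = tuple(sorted((t1, t2)))
        sig.append((pair, math.hypot(x2 - x1, y2 - y1)))
    return sig

def match_score(query_sig, candidate_sig, tol=1.0):
    """Count query edges that find a same-typed candidate edge of
    similar length (within tol metres); higher = better match."""
    used = [False] * len(candidate_sig)
    score = 0
    for pair, dist in query_sig:
        for i, (cpair, cdist) in enumerate(candidate_sig):
            if not used[i] and cpair == pair and abs(cdist - dist) <= tol:
                used[i] = True
                score += 1
                break
    return score

def localize(query_objects, gis_locations):
    """Return the GIS location whose objects best match the query
    scene's geometry; gis_locations is a list of {'name', 'objects'}."""
    qsig = scene_signature(query_objects)
    best = max(gis_locations,
               key=lambda loc: match_score(qsig, scene_signature(loc["objects"])))
    return best["name"]
```

    Note that the signature is translation-invariant by construction, since only pairwise distances enter it; a real system would also have to handle rotation, partial visibility, and detection noise.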

    Interactive Multimodal Learning on 100 Million Images

    This paper presents Blackthorn, an efficient interactive multimodal learning approach facilitating analysis of multimedia collections of 100 million items on a single high-end workstation. This is achieved by efficient data compression and optimizations to the interactive learning process. The compressed i-I64 data representation costs tens of bytes per item yet preserves most of the visual and textual semantic information. The optimized interactive learning model scores the i-I64-compressed data directly, greatly reducing the computational requirements. The experiments show that Blackthorn is up to 105x faster than the conventional relevance feedback baseline. Blackthorn is shown to vastly outperform the baseline with respect to recall over time. Blackthorn reaches up to 92% of the precision achieved by the baseline, validating the efficacy of the i-I64 representation. On the YFCC100M dataset, Blackthorn performs one complete interaction round in 0.7 seconds. Blackthorn thus opens multimedia collections comprising 100 million items to learning-based analysis in fully interactive time.
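    The interaction round described above - collect relevance judgments, fit a linear model, score the whole compressed collection, return a new ranking - can be sketched minimally. The sketch below uses a simple Rocchio-style update as a stand-in classifier and plain Python lists as a stand-in for the i-I64 representation; both are assumptions for illustration, not Blackthorn's actual model or encoding.

```python
def feedback_model(pos, neg):
    """Build a linear scoring vector from positive and negative
    feedback examples (Rocchio-style: mean(pos) - mean(neg))."""
    dim = len(pos[0])
    w = [0.0] * dim
    for v in pos:
        for i, x in enumerate(v):
            w[i] += x / len(pos)
    for v in neg:
        for i, x in enumerate(v):
            w[i] -= x / len(neg)
    return w

def interaction_round(w, items):
    """Score every (compressed) item with the linear model and return
    item indices ranked by decreasing relevance - one feedback round."""
    scores = [(sum(wi * xi for wi, xi in zip(w, v)), idx)
              for idx, v in enumerate(items)]
    return [idx for _, idx in sorted(scores, reverse=True)]
```

    The key property Blackthorn exploits is that linear scoring touches each item exactly once per round, so reducing bytes per item directly reduces the round's wall-clock time.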

    Exquisitor at the Video Browser Showdown 2022

    Exquisitor is the state-of-the-art large-scale interactive learning approach for media exploration that utilizes user relevance feedback at its core and is capable of interacting with collections containing more than 100M multimedia items at sub-second latency. In this work, we propose improvements to Exquisitor that include new features extracted at shot level for semantic concepts, scenes and actions. In addition, we introduce extensions to the video summary interface providing a better overview of the shots. Finally, we replace a simple keyword search featured in the previous versions of the system with a semantic search based on modern contextual representations.
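    The replacement of keyword search with semantic search described above boils down to ranking shots by similarity between a query embedding and precomputed shot embeddings. The sketch below shows the ranking step only, with cosine similarity over toy vectors; the embedding model itself, the vector dimensionality, and the function names are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def semantic_search(query_vec, shot_vecs, k=3):
    """Return indices of the k shots whose (precomputed) embeddings
    are most similar to the query embedding."""
    ranked = sorted(range(len(shot_vecs)),
                    key=lambda i: cosine(query_vec, shot_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

    Unlike keyword matching, this retrieves shots whose content is semantically close to the query even when no annotation term matches exactly, since similarity is computed in the embedding space rather than over words.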

    Discovering Geographic Regions in the City Using Social Multimedia and Open Data

    In this paper we investigate the potential of social multimedia and open data for automatically identifying regions within the city. We conjecture that the regions may be characterized by specific patterns related to their visual appearance, the manner in which social media users describe them, and human mobility patterns. Therefore, we collect a dataset of Foursquare venues, their associated images and users, which we further enrich with a collection of city-specific Flickr images, annotations and users. Additionally, we collect a large number of neighbourhood statistics related to, e.g., demographics, housing and services. We then represent the visual content of the images using a large set of semantic concepts output by a convolutional neural network and extract latent Dirichlet topics from their annotations. User, text and visual information as well as the neighbourhood statistics are further aggregated at the level of postal code regions, which we use as the basis for detecting larger regions in the city. To identify those regions, we perform clustering based on individual modalities as well as their ensemble. The experimental analysis shows that the automatically detected regions are meaningful and have potential for better understanding the dynamics and complexity of a city.
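    The ensemble step above - combining per-modality clusterings of postal-code regions into one final partition - can be sketched with a standard co-association construction. This is a common clustering-ensemble technique, used here as an illustrative stand-in; the paper does not specify this exact method, and the threshold and toy labelings below are assumptions.

```python
def co_association(labelings):
    """Build a co-association matrix from several clusterings: entry
    (i, j) is the fraction of modalities that place regions i and j
    in the same cluster."""
    n = len(labelings[0])
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(labelings)
    return m

def ensemble_clusters(m, tau=0.5):
    """Group regions whose co-association exceeds tau, via connected
    components over the thresholded matrix; returns a cluster id per
    region."""
    n = len(m)
    cluster = [-1] * n
    cid = 0
    for i in range(n):
        if cluster[i] == -1:
            cluster[i] = cid
            stack = [i]
            while stack:
                u = stack.pop()
                for v in range(n):
                    if cluster[v] == -1 and m[u][v] > tau:
                        cluster[v] = cid
                        stack.append(v)
            cid += 1
    return cluster
```

    The appeal of the co-association form is that the modality-specific clusterings need not agree on the number of clusters or on label names; only same-cluster co-occurrence matters.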