420 research outputs found

    SVS-JOIN : efficient spatial visual similarity join for geo-multimedia

    Get PDF
    In the big data era, massive amount of multimedia data with geo-tags has been generated and collected by smart devices equipped with mobile communications module and position sensor module. This trend has put forward higher request on large-scale geo-multimedia retrieval. Spatial similarity join is one of the significant problems in the area of spatial database. Previous works focused on spatial textual document search problem, rather than geo-multimedia retrieval. In this paper, we investigate a novel geo-multimedia retrieval paradigm named spatial visual similarity join (SVS-JOIN for short), which aims to search similar geo-image pairs in both aspects of geo-location and visual content. Firstly, the definition of SVS-JOIN is proposed and then we present the geographical similarity and visual similarity measurement. Inspired by the approach for textual similarity join, we develop an algorithm named SVS-JOIN B by combining the PPJOIN algorithm and visual similarity. Besides, an extension of it named SVS-JOIN G is developed, which utilizes spatial grid strategy to improve the search efficiency. To further speed up the search, a novel approach called SVS-JOIN Q is carefully designed, in which a quadtree and a global inverted index are employed. Comprehensive experiments are conducted on two geo-image datasets and the results demonstrate that our solution can address the SVS-JOIN problem effectively and efficiently

    Applications of a Graph Theoretic Based Clustering Framework in Computer Vision and Pattern Recognition

    Full text link
    Recently, several clustering algorithms have been used to solve variety of problems from different discipline. This dissertation aims to address different challenging tasks in computer vision and pattern recognition by casting the problems as a clustering problem. We proposed novel approaches to solve multi-target tracking, visual geo-localization and outlier detection problems using a unified underlining clustering framework, i.e., dominant set clustering and its extensions, and presented a superior result over several state-of-the-art approaches.Comment: doctoral dissertatio

    Communication and Automatic Interpretation of Affect from Facial Expressions

    Get PDF

    Deep Learning for Semantic Video Understanding

    Get PDF
    The field of computer vision has long strived to extract understanding from images and videos sequences. The recent flood of video data along with massive increments in computing power have provided the perfect environment to generate advanced research to extract intelligence from video data. Video data is ubiquitous, occurring in numerous everyday activities such as surveillance, traffic, movies, sports, etc. This massive amount of video needs to be analyzed and processed efficiently to extract semantic features towards video understanding. Such capabilities could benefit surveillance, video analytics and visually challenged people. While watching a long video, humans have the uncanny ability to bypass unnecessary information and concentrate on the important events. These key events can be used as a higher-level description or summary of a long video. Inspired by the human visual cortex, this research affords such abilities in computers using neural networks. Useful or interesting events are first extracted from a video and then deep learning methodologies are used to extract natural language summaries for each video sequence. Previous approaches of video description either have been domain specific or use a template based approach to fill detected objects such as verbs or actions to constitute a grammatically correct sentence. This work involves exploiting temporal contextual information for sentence generation while working on wide domain datasets. Current state-of- the-art video description methodologies are well suited for small video clips whereas this research can also be applied to long sequences of video. This work proposes methods to generate visual summaries of long videos, and in addition proposes techniques to annotate and generate textual summaries of the videos using recurrent networks. End to end video summarization immensely depends on abstractive summarization of video descriptions. State-of- the-art neural language & attention joint models have been used to generate textual summaries. Interesting segments of long video are extracted based on image quality as well as cinematographic and consumer preference. This novel approach will be a stepping stone for a variety of innovative applications such as video retrieval, automatic summarization for visually impaired persons, automatic movie review generation, video question and answering systems

    Service Abstractions for Scalable Deep Learning Inference at the Edge

    Get PDF
    Deep learning driven intelligent edge has already become a reality, where millions of mobile, wearable, and IoT devices analyze real-time data and transform those into actionable insights on-device. Typical approaches for optimizing deep learning inference mostly focus on accelerating the execution of individual inference tasks, without considering the contextual correlation unique to edge environments and the statistical nature of learning-based computation. Specifically, they treat inference workloads as individual black boxes and apply canonical system optimization techniques, developed over the last few decades, to handle them as yet another type of computation-intensive applications. As a result, deep learning inference on edge devices still face the ever increasing challenges of customization to edge device heterogeneity, fuzzy computation redundancy between inference tasks, and end-to-end deployment at scale. In this thesis, we propose the first framework that automates and scales the end-to-end process of deploying efficient deep learning inference from the cloud to heterogeneous edge devices. The framework consists of a series of service abstractions that handle DNN model tailoring, model indexing and query, and computation reuse for runtime inference respectively. Together, these services bridge the gap between deep learning training and inference, eliminate computation redundancy during inference execution, and further lower the barrier for deep learning algorithm and system co-optimization. To build efficient and scalable services, we take a unique algorithmic approach of harnessing the semantic correlation between the learning-based computation. Rather than viewing individual tasks as isolated black boxes, we optimize them collectively in a white box approach, proposing primitives to formulate the semantics of the deep learning workloads, algorithms to assess their hidden correlation (in terms of the input data, the neural network models, and the deployment trials) and merge common processing steps to minimize redundancy

    Feature based dynamic intra-video indexing

    Get PDF
    A thesis submitted in partial fulfillment for the degree of Doctor of PhilosophyWith the advent of digital imagery and its wide spread application in all vistas of life, it has become an important component in the world of communication. Video content ranging from broadcast news, sports, personal videos, surveillance, movies and entertainment and similar domains is increasing exponentially in quantity and it is becoming a challenge to retrieve content of interest from the corpora. This has led to an increased interest amongst the researchers to investigate concepts of video structure analysis, feature extraction, content annotation, tagging, video indexing, querying and retrieval to fulfil the requirements. However, most of the previous work is confined within specific domain and constrained by the quality, processing and storage capabilities. This thesis presents a novel framework agglomerating the established approaches from feature extraction to browsing in one system of content based video retrieval. The proposed framework significantly fills the gap identified while satisfying the imposed constraints of processing, storage, quality and retrieval times. The output entails a framework, methodology and prototype application to allow the user to efficiently and effectively retrieved content of interest such as age, gender and activity by specifying the relevant query. Experiments have shown plausible results with an average precision and recall of 0.91 and 0.92 respectively for face detection using Haar wavelets based approach. Precision of age ranges from 0.82 to 0.91 and recall from 0.78 to 0.84. The recognition of gender gives better precision with males (0.89) compared to females while recall gives a higher value with females (0.92). Activity of the subject has been detected using Hough transform and classified using Hiddell Markov Model. A comprehensive dataset to support similar studies has also been developed as part of the research process. A Graphical User Interface (GUI) providing a friendly and intuitive interface has been integrated into the developed system to facilitate the retrieval process. The comparison results of the intraclass correlation coefficient (ICC) shows that the performance of the system closely resembles with that of the human annotator. The performance has been optimised for time and error rate

    A Location-Aware Middleware Framework for Collaborative Visual Information Discovery and Retrieval

    Get PDF
    This work addresses the problem of scalable location-aware distributed indexing to enable the leveraging of collaborative effort for the construction and maintenance of world-scale visual maps and models which could support numerous activities including navigation, visual localization, persistent surveillance, structure from motion, and hazard or disaster detection. Current distributed approaches to mapping and modeling fail to incorporate global geospatial addressing and are limited in their functionality to customize search. Our solution is a peer-to-peer middleware framework based on XOR distance routing which employs a Hilbert Space curve addressing scheme in a novel distributed geographic index. This allows for a universal addressing scheme supporting publish and search in dynamic environments while ensuring global availability of the model and scalability with respect to geographic size and number of users. The framework is evaluated using large-scale network simulations and a search application that supports visual navigation in real-world experiments

    Large-scale Content-based Visual Information Retrieval

    Get PDF
    Rather than restricting search to the use of metadata, content-based information retrieval methods attempt to index, search and browse digital objects by means of signatures or features describing their actual content. Such methods have been intensively studied in the multimedia community to allow managing the massive amount of raw multimedia documents created every day (e.g. video will account to 84% of U.S. internet traffic by 2018). Recent years have consequently witnessed a consistent growth of content-aware and multi-modal search engines deployed on massive multimedia data. Popular multimedia search applications such as Google images, Youtube, Shazam, Tineye or MusicID clearly demonstrated that the first generation of large-scale audio-visual search technologies is now mature enough to be deployed on real-world big data. All these successful applications did greatly benefit from 15 years of research on multimedia analysis and efficient content-based indexing techniques. Yet the maturity reached by the first generation of content-based search engines does not preclude an intensive research activity in the field. There is actually still a lot of hard problems to be solved before we can retrieve any information in images or sounds as easily as we do in text documents. Content-based search methods actually have to reach a finer understanding of the contents as well as a higher semantic level. This requires modeling the raw signals by more and more complex and numerous features, so that the algorithms for analyzing, indexing and searching such features have to evolve accordingly. This thesis describes several of my works related to large-scale content-based information retrieval. The different contributions are presented in a bottom-up fashion reflecting a typical three-tier software architecture of an end-to-end multimedia information retrieval system. The lowest layer is only concerned with managing, indexing and searching large sets of high-dimensional feature vectors, whatever their origin or role in the upper levels (visual or audio features, global or part-based descriptions, low or high semantic level, etc. ). The middle layer rather works at the document level and is in charge of analyzing, indexing and searching collections of documents. It typically extracts and embeds the low-level features, implements the querying mechanisms and post-processes the results returned by the lower layer. The upper layer works at the applicative level and is in charge of providing useful and interactive functionalities to the end-user. It typically implements the front-end of the search application, the crawler and the orchestration of the different indexing and search services

    Advances in Monocular Exemplar-based Human Body Pose Analysis: Modeling, Detection and Tracking

    Get PDF
    Esta tesis contribuye en el análisis de la postura del cuerpo humano a partir de secuencias de imágenes adquiridas con una sola cámara. Esta temática presenta un amplio rango de potenciales aplicaciones en video-vigilancia, video-juegos o aplicaciones biomédicas. Las técnicas basadas en patrones han tenido éxito, sin embargo, su precisión depende de la similitud del punto de vista de la cámara y de las propiedades de la escena entre las imágenes de entrenamiento y las de prueba. Teniendo en cuenta un conjunto de datos de entrenamiento capturado mediante un número reducido de cámaras fijas, paralelas al suelo, se han identificado y analizado tres escenarios posibles con creciente nivel de dificultad: 1) una cámara estática paralela al suelo, 2) una cámara de vigilancia fija con un ángulo de visión considerablemente diferente, y 3) una secuencia de video capturada con una cámara en movimiento o simplemente una sola imagen estática

    Human-machine cooperation in large-scale multimedia retrieval : a survey

    Get PDF
    Large-Scale Multimedia Retrieval(LSMR) is the task to fast analyze a large amount of multimedia data like images or videos and accurately find the ones relevant to a certain semantic meaning. Although LSMR has been investigated for more than two decades in the fields of multimedia processing and computer vision, a more interdisciplinary approach is necessary to develop an LSMR system that is really meaningful for humans. To this end, this paper aims to stimulate attention to the LSMR problem from diverse research fields. By explaining basic terminologies in LSMR, we first survey several representative methods in chronological order. This reveals that due to prioritizing the generality and scalability for large-scale data, recent methods interpret semantic meanings with a completely different mechanism from humans, though such humanlike mechanisms were used in classical heuristic-based methods. Based on this, we discuss human-machine cooperation, which incorporates knowledge about human interpretation into LSMR without sacrificing the generality and scalability. In particular, we present three approaches to human-machine cooperation (cognitive, ontological, and adaptive), which are attributed to cognitive science, ontology engineering, and metacognition, respectively. We hope that this paper will create a bridge to enable researchers in different fields to communicate about the LSMR problem and lead to a ground-breaking next generation of LSMR systems
    corecore