
    Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval

    Where previous reviews on content-based image retrieval emphasize what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems is presented: image tag assignment, refinement, and tag-based image retrieval. While existing works vary in their targeted tasks and methodology, they all rely on the key functionality of tag relevance, i.e., estimating the relevance of a specific tag with respect to the visual content of a given image and its social context. By analyzing what information a specific method exploits to construct its tag relevance function and how that information is exploited, this paper introduces a taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and differences, and recognize their merits and limitations. For a head-to-head comparison among the state of the art, a new experimental protocol is presented, with training sets containing 10k, 100k, and 1M images and an evaluation on three test sets contributed by various research groups. Eleven representative works are implemented and evaluated. Putting all this together, the survey aims to provide an overview of the past and to foster progress in the near future. Comment: to appear in ACM Computing Surveys
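    The tag relevance function at the core of this taxonomy can be illustrated with the classic neighbor-voting scheme, one family of methods such surveys cover: a tag is deemed relevant to an image when it occurs among the image's visual neighbors more often than its global frequency predicts. Below is a minimal Python sketch, assuming precomputed visual features and per-image tag sets; all names are illustrative, not any particular paper's implementation.

```python
import numpy as np

def tag_relevance(query_feat, neighbor_feats, neighbor_tags, tag, prior, k=100):
    """Neighbor-voting tag relevance: count how often `tag` occurs among
    the k visual neighbors of the query image, minus the count expected
    from the tag's global prior frequency."""
    # Rank the collection by Euclidean distance to the query feature.
    dists = np.linalg.norm(neighbor_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    # Votes: neighbors whose user-assigned tag set contains the tag.
    votes = sum(1 for i in nearest if tag in neighbor_tags[i])
    # Subtracting k * prior keeps globally frequent tags from dominating.
    return votes - k * prior
```

    A positive score means the tag co-occurs with visually similar images beyond chance; scores can then be used for assignment, refinement, or ranking in retrieval.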

    Image Understanding by Socializing the Semantic Gap

    Technological developments such as the Internet, mobile devices, and social networks have spurred the sharing of images in unprecedented volumes, making tagging and commenting a common habit. Despite recent progress in image analysis, the semantic gap still hinders machines from fully understanding the rich semantics of a shared photo. In this book, we tackle this problem by exploiting social network contributions. A comprehensive treatise of three linked problems in image annotation is presented, with a novel experimental protocol used to test eleven state-of-the-art methods. Three novel approaches to annotate an image, understand its sentiment, and predict its popularity are presented. We conclude with the many challenges and opportunities ahead for the multimedia community.

    VISIR : visual and semantic image label refinement

    The social media explosion has populated the Internet with a wealth of images. There are two existing paradigms for image retrieval: 1) content-based image retrieval (CBIR), which has traditionally used visual features for similarity search (e.g., SIFT features), and 2) tag-based image retrieval (TBIR), which has relied on user tagging (e.g., Flickr tags). CBIR now gains semantic expressiveness from advances in deep-learning-based detection of visual labels. TBIR benefits from query-and-click logs to automatically infer more informative labels. However, learning-based tagging still yields noisy labels and is restricted to concrete objects, missing out on generalizations and abstractions. Click-based tagging is limited to terms that appear in the textual context of an image or in queries that lead to a click. This paper addresses the above limitations by semantically refining and expanding the labels suggested by learning-based object detection. We consider the semantic coherence between the labels for different objects, leverage lexical and commonsense knowledge, and cast the label assignment into a constrained optimization problem solved by an integer linear program. Experiments show that our method, called VISIR, improves the quality of state-of-the-art visual labeling tools such as LSDA and YOLO.
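    To make the constrained-optimization view concrete, here is a hedged sketch of a generic ILP for coherent label selection using the PuLP modeling library: maximize detector confidence plus pairwise semantic coherence under a budget on the number of labels. The scores, coherence values, and budget are placeholder inputs, and this illustrates the general formulation rather than VISIR's exact objective.

```python
import pulp

def select_labels(scores, coherence, budget=5):
    """Pick a coherent subset of candidate labels via a small ILP.

    scores:    {label: detector confidence}
    coherence: {(label_a, label_b): semantic coherence of the pair}
    """
    labels = list(scores)
    idx = {l: i for i, l in enumerate(labels)}
    prob = pulp.LpProblem("label_refinement", pulp.LpMaximize)
    # x[l] = 1 iff label l is kept in the final label set.
    x = {l: pulp.LpVariable(f"x_{idx[l]}", cat="Binary") for l in labels}
    # y[a, b] linearizes the product x[a] * x[b] for the pairwise term.
    y = {p: pulp.LpVariable(f"y_{idx[p[0]]}_{idx[p[1]]}", cat="Binary")
         for p in coherence}
    prob += (pulp.lpSum(scores[l] * x[l] for l in labels)
             + pulp.lpSum(coherence[p] * y[p] for p in coherence))
    for (a, b), var in y.items():
        prob += var <= x[a]
        prob += var <= x[b]
        prob += var >= x[a] + x[b] - 1
    prob += pulp.lpSum(x.values()) <= budget   # keep the label set small
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [l for l in labels if x[l].value() == 1]
```

    The linearization constraints are the standard trick for products of binary variables, so the pairwise coherence term rewards selecting semantically compatible labels together.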

    Performance evaluation of depth completion neural networks for various RGB-D camera technologies

    RGB-D cameras are devices used today in various application fields and research areas that require three-dimensional knowledge of the environment, expressed as a depth image in which each pixel represents the distance from the camera to the object it belongs to. The most popular acquisition techniques include active stereoscopy, which triangulates two camera views, and structured-light cameras, which do the same with a camera image and a laser projector. Another popular technology that does not require triangulation, used in LiDAR cameras, is Time of Flight (ToF): depth is estimated from the round-trip time of an emitted signal, such as an IR pulse, across the camera's field of view. The main difficulties encountered with RGB-D cameras stem from the image acquisition environment and from the characteristics of the camera itself: poorly defined edges and variations in lighting conditions can lead to noisy or incomplete depth maps, which negatively impact the performance of computer vision and robotics applications that rely on accurate depth information. Several depth enhancement techniques have been proposed in recent years, many of them using neural networks for depth completion. The goal of depth completion is to generate a dense depth prediction, continuous over the entire image, from the RGB image and the raw depth image acquired by the RGB-D sensor. Depth completion methods feed RGB and sparse depth inputs into encoder-decoder architectures, with recent variants adding refinement stages and auxiliary information such as semantic data to improve accuracy and handle object edges and occluded items. However, current methods rely on a small receptive field, as in CNNs and local spatial propagation networks; when the regions of invalid pixels in the depth map are too large, this limited receptive field produces incorrect predictions. In this thesis, a performance evaluation of the current depth completion state of the art on a real indoor scenario is proposed. Several RGB-D sensors were taken into account for the experimental evaluation, highlighting the pros and cons of different technologies for camera-based depth measurement. The acquisitions were carried out in different environments and with cameras using different technologies, to analyze the quality of the depths obtained first directly from the cameras and then after applying state-of-the-art depth completion networks. According to the findings of this thesis work, state-of-the-art networks are not yet mature enough to be used in scenarios too dissimilar from those used by their respective authors. In particular, we discovered the following limitation: deep networks trained on outdoor scenes are not effective when analyzing indoor scenes; in such cases, a straightforward approach based on morphological operators is more accurate.
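    The morphological baseline mentioned in the conclusion can be as simple as a closing operation that fills small invalid regions from the surrounding valid depth. Below is a minimal OpenCV sketch, assuming invalid pixels are reported as zero; the kernel size and iteration count are illustrative choices, not the thesis's exact pipeline.

```python
import cv2
import numpy as np

def fill_depth_holes(depth, kernel_size=5, iterations=2):
    """Fill small invalid (zero) regions in a raw depth map with a
    morphological closing, then copy the filled values only into the
    holes so valid measurements are left untouched."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT,
                                       (kernel_size, kernel_size))
    closed = cv2.morphologyEx(depth, cv2.MORPH_CLOSE, kernel,
                              iterations=iterations)
    filled = depth.copy()
    holes = depth == 0          # invalid pixels reported by the sensor
    filled[holes] = closed[holes]
    return filled
```

    Because closing only propagates nearby valid depth, it cannot hallucinate structure the way a network trained on mismatched scenes can, which is consistent with the indoor findings reported above.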

    Algorithms, applications and systems towards interpretable pattern mining from multi-aspect data

    How do humans move around in urban space, and how does their movement change when a city undergoes terrorist attacks? How do users behave in Massive Open Online Courses (MOOCs), and how do those who achieve certificates differ from those who do not? From which areas of the court do elite players, such as Stephen Curry and LeBron James, like to take their shots over the course of a game? How can we uncover the hidden habits that govern our online purchases? Are there unspoken agendas in how different states pass legislation of certain kinds? At the heart of these seemingly unconnected puzzles is the same mystery of multi-aspect mining, i.e., how can we mine and interpret hidden patterns from a dataset that simultaneously reveals the associations, or changes in the associations, among various aspects of the data (e.g., a shot can be described with three aspects: player, time of the game, and area of the court)? Solving this problem could open the gates to a deep understanding of the underlying mechanisms behind many real-world phenomena. While much of the research in multi-aspect mining contributes a broad scope of innovations to the mining part, interpretation of patterns from the perspective of users (or domain experts) is often overlooked. Questions such as what users require of patterns, how good the patterns are, or how to read them have barely been addressed. Without efficient and effective ways of involving users in the process of multi-aspect mining, the results are likely to be difficult for them to comprehend. This dissertation proposes the M^3 framework, which consists of multiplex pattern discovery, multifaceted pattern evaluation, and multipurpose pattern presentation, to tackle the challenges of multi-aspect pattern discovery. Based on this framework, we develop algorithms, applications, and analytic systems to enable interpretable pattern discovery from multi-aspect data. Following the concept of meaningful multiplex pattern discovery, we propose PairFac to close the gap between human information needs and naive mining optimization, and demonstrate its effectiveness in the context of impact discovery in the aftermath of urban disasters. We develop iDisc to target the crossing of multiplex pattern discovery with multifaceted pattern evaluation; iDisc meets the specific information need of understanding multi-level, contrastive behavior patterns. As an example, we use iDisc to predict student performance outcomes in Massive Open Online Courses given users' latent behaviors. FacIt is an interactive visual analytic system that sits at the intersection of all three components and enables interpretable, fine-tunable, and scrutinizable pattern discovery from multi-aspect data. We demonstrate each work's significance and implications in its respective problem context. As a whole, this series of studies is an effort to instantiate the M^3 framework and push the field of multi-aspect mining towards a more human-centric process in real-world applications.
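    The shot example (player x time of game x court area) hints at the standard tensor view of multi-aspect data on which factorization-based approaches build. As an illustration of that data model only, and not of any specific algorithm in the dissertation, here is a minimal CP (PARAFAC) decomposition by alternating least squares in plain NumPy.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J x R) and C (K x R) -> (J*K x R)."""
    return np.einsum("jr,kr->jkr", B, C).reshape(-1, B.shape[1])

def cp_als(X, rank, n_iter=100, seed=0):
    """CP decomposition of a 3-way tensor X (I x J x K) via alternating
    least squares: X ~ sum_r a_r (outer) b_r (outer) c_r."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.standard_normal((d, rank)) for d in (I, J, K))
    X0 = X.reshape(I, J * K)                     # mode-0 unfolding
    X1 = X.transpose(1, 0, 2).reshape(J, I * K)  # mode-1 unfolding
    X2 = X.transpose(2, 0, 1).reshape(K, I * J)  # mode-2 unfolding
    for _ in range(n_iter):
        A = X0 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X1 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X2 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C  # per-aspect factors, e.g. player/time/area profiles
```

    Each column triple (a_r, b_r, c_r) is one latent pattern coupling the three aspects, which is exactly the kind of object that the interpretation stages of a framework like M^3 must evaluate and present to users.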

    L1 Graph Based Sparse Model for Label De-noising


    Motion capture data processing, retrieval and recognition.

    Character animation plays an essential role in feature films and computer games. Manually creating character animation is both tedious and inefficient, so motion capture (MoCap) techniques have been developed and have become the most popular method for creating realistic character animation. Commercial MoCap systems are expensive, and the capturing process itself usually requires an indoor studio environment. Procedural animation creation often lacks extensive user control during the generation process. Therefore, efficiently and effectively reusing MoCap data can bring significant benefits, which has motivated wider research in machine-learning-based MoCap data processing. A typical workflow of MoCap data reuse can be divided into three stages: data capture, data management, and data reuse. There are still many challenges at each stage. For instance, data capture and management often suffer from data quality problems, and efficient and effective retrieval methods are demanding because of the large amount of data involved. In addition, classification and understanding of actions are the fundamental basis of data reuse. This thesis proposes to use machine learning on MoCap data for reuse purposes, and designs a framework for motion capture data processing whose modular design enables motion data refinement, retrieval, and recognition. The first part of this thesis reviews the methods used in existing motion capture processing approaches in the literature and briefly introduces the relevant machine learning methods used in this framework; in general, the frameworks related to refinement, retrieval, and recognition are discussed. A motion refinement algorithm based on dictionary learning is then presented, in which kinematic structural and temporal information is exploited; the designed optimization method and data preprocessing technique ensure a smoothness property for the recovered result. After that, a motion refinement algorithm based on matrix completion is presented, in which the low-rank property and spatio-temporal information are exploited. Such a model does not require preparing data for training, and the designed optimization method outperforms existing approaches in both effectiveness and efficiency. A motion retrieval method based on multi-view feature selection is also proposed, in which the intrinsic relations between visual words in each motion feature subspace are discovered as a means of improving retrieval performance; a provisional trace-ratio objective function and an iterative optimization method are also included. A non-negative matrix factorization based motion data clustering method is proposed for recognition purposes, aimed at large-scale unsupervised and semi-supervised problems. In addition, deep learning models are used for motion data recognition, e.g., 2D gait recognition and 3D MoCap recognition. To sum up, this thesis presents research on motion data refinement, retrieval, and recognition with the aim of tackling the major challenges in motion reuse: the proposed refinement methods provide high-quality clean motion data for downstream applications, the designed multi-view feature selection algorithm improves motion retrieval performance, and the proposed recognition methods are equally essential for motion understanding. A collection of publications by the author of this thesis is listed in the publications section.
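    The low-rank refinement idea can be sketched with a generic SoftImpute-style completion: stack marker coordinates into a matrix (e.g., 3 x markers by frames), then alternate an SVD soft-thresholding step with re-imposing the observed entries. This is a standard algorithm under the stated low-rank assumption, not the specific optimization method proposed in the thesis.

```python
import numpy as np

def soft_impute(M, observed, tau=5.0, n_iter=200):
    """Low-rank completion of a MoCap matrix M (e.g. 3*markers x frames).

    M:        data matrix with arbitrary values at missing entries
    observed: boolean mask, True where the entry was measured
    tau:      singular-value soft threshold controlling the effective rank
    """
    X = np.where(observed, M, 0.0)  # initialize missing entries to zero
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - tau, 0.0)   # shrink singular values
        low_rank = (U * s) @ Vt        # proximal step for the nuclear norm
        # Keep measured entries fixed; only the missing ones are taken
        # from the low-rank estimate.
        X = np.where(observed, M, low_rank)
    return X
```

    Because marker trajectories are highly correlated across joints and over time, such matrices are approximately low-rank, which is why a nuclear-norm proximal step can recover dropped markers without any training data.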