
    Transfer Learning from Audio-Visual Grounding to Speech Recognition

    Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks. This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts. As the semantics of speech are largely determined by its lexical content, grounding models learn to preserve phonetic information while disregarding uncorrelated factors, such as speaker and channel. To study the properties of features distilled from different layers, we use them separately as input to train multiple speech recognition models. Empirical results demonstrate that layers closer to the input retain more phonetic information, while subsequent layers exhibit greater invariance to domain shift. Moreover, while most previous studies include the speech recognition training data when training the feature extractor, our grounding models are not trained on any of those data, indicating more universal applicability to new domains. Comment: Accepted to Interspeech 2019. 4 pages, 2 figures.
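    A minimal sketch of how such layer-wise feature extraction could be wired up (this is not the authors' implementation; the toy encoder, layer choices, and shapes below are assumptions made purely for illustration): forward hooks capture activations at different depths of a pretrained speech branch, and each captured tensor would then serve as input features for a separately trained recognizer.

        import torch
        import torch.nn as nn

        # Hypothetical stand-in for the speech branch of a pretrained grounding model.
        speech_encoder = nn.Sequential(
            nn.Conv1d(40, 128, kernel_size=5, padding=2), nn.ReLU(),   # index 0: close to the input
            nn.Conv1d(128, 256, kernel_size=5, padding=2), nn.ReLU(),  # index 2: deeper, more invariant
        )

        features = {}

        def save_output(name):
            def hook(module, inputs, output):
                features[name] = output.detach()
            return hook

        # Capture activations at two depths so they can be compared as ASR inputs.
        speech_encoder[0].register_forward_hook(save_output("layer0"))
        speech_encoder[2].register_forward_hook(save_output("layer2"))

        log_mel = torch.randn(8, 40, 300)    # (batch, mel bins, frames): dummy input features
        with torch.no_grad():
            speech_encoder(log_mel)

        for name, feat in features.items():
            print(name, feat.shape)          # each tensor would feed a separate ASR model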

    ICONCLASS - A Classification System for Art and Iconography

    Documenting is a crucial activity for any museum or art institution. Today, its importance is growing, because the metadata a museum provides is essential for retrieving information from the vast amount of data in the modern world. The goal of this study is to discuss the design of thesauri, how they work, and what their purpose is in documenting museum objects. It further discusses content indexing together with aboutness, isness, and ofness, drawing a parallel with Panofsky's categories in iconography. The central focus of the work is an analysis of Iconclass, its features, and its usage. Additionally, it examines new developments in machine learning within artificial intelligence that use Iconclass to generate and automate new data and connections. Finally, it gives a brief overview of folksonomy and social tagging.

    Learning to detect video events from zero or very few video examples

    In this work we deal with the problem of high-level event detection in video. Specifically, we study the challenging problems of i) learning to detect video events solely from a textual description of the event, without using any positive video examples, and ii) additionally exploiting very few positive training samples together with a small number of "related" videos. For learning only from an event's textual description, we first identify a general learning framework and then study the impact of different design choices for various stages of this framework. For additionally learning from example videos, when true positive training samples are scarce, we employ an extension of the Support Vector Machine that allows us to exploit "related" event videos by automatically introducing different weights for subsets of the videos in the overall training set. Experimental evaluations performed on the large-scale TRECVID MED 2014 video dataset provide insight into the effectiveness of the proposed methods. Comment: Image and Vision Computing Journal, Elsevier, 2015, accepted for publication.
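    The weighting idea can be approximated in a few lines (this is a simplified sketch, not the paper's SVM extension; the feature dimension, class sizes, and the fixed weight of 0.3 for "related" videos are arbitrary assumptions): scikit-learn's per-sample weights let the "related" videos act as down-weighted positives.

        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)

        # Dummy video-level feature vectors: a few true positives, some "related"
        # videos treated as weak positives, and many negatives.
        X_pos = rng.normal(1.0, 1.0, (5, 64))
        X_rel = rng.normal(0.6, 1.0, (20, 64))
        X_neg = rng.normal(0.0, 1.0, (200, 64))
        X = np.vstack([X_pos, X_rel, X_neg])
        y = np.array([1] * 25 + [0] * 200)

        # Full weight for true positives, a smaller weight for "related" videos.
        sample_weight = np.array([1.0] * 5 + [0.3] * 20 + [1.0] * 200)

        clf = SVC(kernel="linear", C=1.0)
        clf.fit(X, y, sample_weight=sample_weight)
        scores = clf.decision_function(X)    # ranking scores for event detection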

    Image Retrieval Method Combining Bayes and SVM Classifier Based on Relevance Feedback with Application to Small-scale Datasets

    A vast number of images has been generated due to the diversity and digitalization of image acquisition devices. However, the gap between low-level visual features and high-level semantic representations remains a major concern that hinders retrieval accuracy. A retrieval method based on a transfer learning model and the relevance feedback technique was formulated in this study to optimize the trade-off between structural complexity and retrieval performance for small- and medium-scale content-based image retrieval (CBIR) systems. First, a pretrained deep learning model was fine-tuned to extract features from the target datasets. Then, the target dataset was clustered into relevant and irrelevant image libraries using a Bayes classifier. Next, a support vector machine (SVM) classifier was used to retrieve similar images from the relevant library. Finally, the relevance feedback technique was employed to update the parameters of both classifiers iteratively until the retrieval request was met. Results demonstrate that the proposed method achieves 95.87% on the classification index F1-score, surpassing the second-best approach, DCNN-BSVM, by 6.76%. The performance of the proposed method is also superior to that of other approaches on retrieval criteria such as average precision, average recall, and mean average precision. The study indicates that the combined Bayes + SVM classifier reaches these results more efficiently than either the Bayes or the SVM classifier alone under the transfer learning framework, and that transfer learning clearly outperforms training from scratch for feature extraction. This study provides a reference for applications of small- and medium-scale CBIR systems with limited samples.
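    One way to read the Bayes + SVM loop is as follows (a rough sketch under assumed details: dummy features stand in for the fine-tuned CNN embeddings, the user feedback is simulated with ground-truth labels, and the iteration count and top-k are arbitrary): the Bayes classifier prunes the library to a relevant subset, the SVM ranks that subset, and feedback on the returned images enlarges the training set for the next round.

        import numpy as np
        from sklearn.naive_bayes import GaussianNB
        from sklearn.svm import SVC

        rng = np.random.default_rng(1)
        gallery = rng.normal(size=(500, 128))   # stand-in for fine-tuned CNN features of the image library
        labels = rng.integers(0, 2, size=500)   # dummy relevance labels, used only to simulate user feedback

        # Seed the feedback set with a few labelled examples of each class.
        seed = np.concatenate([np.flatnonzero(labels == 1)[:5], np.flatnonzero(labels == 0)[:5]])
        X, y = gallery[seed], labels[seed]

        for _ in range(3):                       # relevance-feedback iterations
            bayes = GaussianNB().fit(X, y)
            keep = bayes.predict(gallery) == 1   # Bayes splits the library into relevant / irrelevant
            if not keep.any():
                keep[:] = True
            svm = SVC(kernel="rbf", probability=True).fit(X, y)
            scores = svm.predict_proba(gallery[keep])[:, 1]
            top = np.flatnonzero(keep)[np.argsort(scores)[::-1][:20]]
            # A user would now mark these images; the dummy labels stand in for that
            # feedback, and both classifiers are refit on the enlarged set.
            X = np.vstack([X, gallery[top]])
            y = np.concatenate([y, labels[top]])

        print("retrieved image indices:", top[:10])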

    Recent Trends in Computational Intelligence

    Traditional models struggle to cope with complexity, noise, and changing environments, while Computational Intelligence (CI) offers solutions to complicated problems as well as inverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically inspired techniques such as swarm intelligence, as part of evolutionary computation, and encompasses wider areas such as image processing, data collection, and natural language processing. This book aims to discuss the use of CI for the optimal solving of various applications, demonstrating its wide reach and relevance. Combining optimization methods and data mining strategies makes for a strong and reliable prediction tool for handling real-life applications.

    Text2Light: Zero-Shot Text-Driven HDR Panorama Generation

    High-quality HDRIs (high dynamic range images), typically HDR panoramas, are one of the most popular ways to create photorealistic lighting and 360-degree reflections of 3D scenes in graphics. Given the difficulty of capturing HDRIs, a versatile and controllable generative model is highly desired, so that lay users can intuitively control the generation process. However, existing state-of-the-art methods still struggle to synthesize high-quality panoramas for complex scenes. In this work, we propose a zero-shot text-driven framework, Text2Light, to generate 4K+ resolution HDRIs without paired training data. Given a free-form text description of the scene, we synthesize the corresponding HDRI with two dedicated steps: 1) text-driven panorama generation in low dynamic range (LDR) and low resolution, and 2) super-resolution inverse tone mapping to scale up the LDR panorama in both resolution and dynamic range. Specifically, to achieve zero-shot text-driven panorama generation, we first build dual codebooks as the discrete representation for diverse environmental textures. Then, driven by the pre-trained CLIP model, a text-conditioned global sampler learns to sample holistic semantics from the global codebook according to the input text. Furthermore, a structure-aware local sampler learns to synthesize LDR panoramas patch-by-patch, guided by the holistic semantics. To achieve super-resolution inverse tone mapping, we derive a continuous representation of 360-degree imaging from the LDR panorama as a set of structured latent codes anchored to the sphere. This continuous representation enables a versatile module to upscale the resolution and dynamic range simultaneously. Extensive experiments demonstrate the superior capability of Text2Light in generating high-quality HDR panoramas. In addition, we show the feasibility of our work in realistic rendering and immersive VR. Comment: SIGGRAPH Asia 2022; project page: https://frozenburning.github.io/projects/text2light/; code is available at https://github.com/FrozenBurning/Text2Light
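    As a small illustration of the text-conditioning step only (the codebooks, samplers, and inverse tone mapping are far beyond a snippet; the checkpoint name is the public CLIP release, and the global_sampler mentioned in the comment is a hypothetical module): the caption is embedded with a pre-trained CLIP text encoder, and that embedding is what would drive the global sampler.

        import torch
        from transformers import CLIPModel, CLIPProcessor

        model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        caption = "a sunny beach with palm trees at golden hour"
        inputs = processor(text=[caption], return_tensors="pt", padding=True)

        with torch.no_grad():
            text_emb = model.get_text_features(**inputs)            # (1, 512) caption embedding
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)   # unit-normalized, as CLIP similarity expects

        # A full pipeline would pass text_emb to the text-conditioned global sampler,
        # e.g. holistic_codes = global_sampler(text_emb)   # hypothetical module
        print(text_emb.shape)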

    Salient Objects in Clutter: Bringing Salient Object Detection to the Foreground

    We provide a comprehensive evaluation of salient object detection (SOD) models. Our analysis identifies a serious design bias of existing SOD datasets, which assume that each image contains at least one clearly outstanding salient object in low clutter. This design bias has led to saturated, high performance for state-of-the-art SOD models when evaluated on existing datasets; the models, however, remain far from satisfactory when applied to real-world daily scenes. Based on our analyses, we first identify 7 crucial aspects that a comprehensive and balanced dataset should fulfill. Then, we propose a new high-quality dataset and update the previous saliency benchmark. Specifically, our SOC (Salient Objects in Clutter) dataset includes images with salient and non-salient objects from daily object categories. Beyond object category annotations, each salient image is accompanied by attributes that reflect common challenges in real-world scenes. Finally, we report attribute-based performance assessment on our dataset. Comment: ECCV 2018
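    Attribute-based assessment boils down to scoring a model separately on each subset of images that shares a given attribute; a minimal sketch (the metric, attribute names, and dummy maps below are placeholders, not the benchmark's exact protocol):

        import numpy as np

        def mae(pred, gt):
            # Mean absolute error between a predicted saliency map and its ground truth.
            return np.abs(pred.astype(float) - gt.astype(float)).mean()

        # Dummy predictions, ground truths, and per-image attribute tags.
        preds = [np.random.rand(64, 64) for _ in range(6)]
        gts = [(np.random.rand(64, 64) > 0.5).astype(float) for _ in range(6)]
        attributes = [{"occlusion"}, {"small object"}, {"occlusion", "clutter"},
                      {"clutter"}, {"small object"}, {"clutter"}]

        # Report the metric separately for every attribute subset.
        for attr in sorted({a for tags in attributes for a in tags}):
            idx = [i for i, tags in enumerate(attributes) if attr in tags]
            score = np.mean([mae(preds[i], gts[i]) for i in idx])
            print(f"{attr:>12}: MAE = {score:.3f} over {len(idx)} images")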

    Adapting Stream Processing Framework for Video Analysis

    Stream processing (SP) became relevant mainly due to the inexpensive and hence ubiquitous deployment of sensors in many domains (e.g., environmental monitoring, battlefield monitoring). Other continuous data generators (surveillance, traffic data) have also prompted processing and analysis of these streams for applications such as traffic congestion/accident detection and personalized marketing. Image processing has been researched for several decades. Recently there has been an emphasis on video stream analysis for situation monitoring, due to the ubiquitous deployment of video cameras and unmanned aerial vehicles for security and other applications. This paper elaborates on the research and development issues that need to be addressed to extend the traditional stream processing framework for video analysis, especially for situation awareness. This entails extensions to the data model, the operators and language for expressing complex situations, and the QoS (quality of service) specifications and the algorithms needed to satisfy them. Specifically, this paper demonstrates the inadequacy of current data representations (e.g., relation and arrable) and querying capabilities, and from that infers long-term research and development issues.