Transfer Learning from Audio-Visual Grounding to Speech Recognition
Transfer learning aims to reduce the amount of data required to excel at a
new task by re-using the knowledge acquired from learning other related tasks.
This paper proposes a novel transfer learning scenario, which distills robust
phonetic features from grounding models that are trained to tell whether an
image and a speech signal are semantically correlated, without using any textual
transcripts. As semantics of speech are largely determined by its lexical
content, grounding models learn to preserve phonetic information while
disregarding uncorrelated factors, such as speaker and channel. To study the
properties of features distilled from different layers, we use them as input
separately to train multiple speech recognition models. Empirical results
demonstrate that layers closer to the input retain more phonetic information,
while later layers exhibit greater invariance to domain shift. Moreover, while
most previous studies include the speech recognition training data when training
the feature extractor, our grounding models are not trained on any of those
data, indicating broader applicability to new domains. Comment: Accepted to Interspeech 2019. 4 pages, 2 figures
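The core mechanism, distilling features by tapping an intermediate layer of a trained grounding model and feeding them to a downstream recognizer, can be sketched as follows. This is a toy illustration with random placeholder weights, not the paper's model; all names and dimensions are assumptions.

```python
import numpy as np

# Toy stand-in for a trained audio-visual grounding model: a stack of
# dense layers with ReLU activations. In the paper's setting the weights
# would come from training on image-speech pairs; here they are random
# placeholders for illustration only.
rng = np.random.default_rng(0)
layer_dims = [40, 128, 128, 64]  # e.g. 40-dim filterbank input (assumed)
weights = [rng.standard_normal((i, o)) * 0.1
           for i, o in zip(layer_dims[:-1], layer_dims[1:])]

def distill_features(frames, weights, layer_index):
    """Run speech frames through the grounding model and return the
    activations of the requested intermediate layer."""
    h = frames
    activations = []
    for w in weights:
        h = np.maximum(h @ w, 0.0)  # ReLU
        activations.append(h)
    return activations[layer_index]

# 100 speech frames; take layer 1's output as the distilled
# representation that a speech recognizer would then be trained on.
frames = rng.standard_normal((100, 40))
feats = distill_features(frames, weights, layer_index=1)
print(feats.shape)  # (100, 128)
```

Varying `layer_index` is how one would compare layers, as the paper does when probing which layers retain phonetic information.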
ICONCLASS - A Classification System for Art and Iconography
Documenting is a crucial activity for any museum or art institution. Today that importance is growing, because the metadata museums provide is essential for retrieving information from the vast amount of data in the modern world. The goal of this study is to discuss the design of thesauri, how they work, and what their purpose is in documenting museum objects. It further discusses content indexing together with aboutness, isness, and ofness, drawing a parallel with Panofsky's categories in iconography. The central focus of the work is the analysis of Iconclass, its features, and its usage. Additionally, it examines new developments in machine learning within artificial intelligence that use Iconclass to generate and automate new data and connections. Finally, it gives a brief overview of folksonomy and social tagging.
Learning to detect video events from zero or very few video examples
In this work we deal with the problem of high-level event detection in video.
Specifically, we study the challenging problems of i) learning to detect video
events from solely a textual description of the event, without using any
positive video examples, and ii) additionally exploiting very few positive
training samples together with a small number of "related" videos. For
learning only from an event's textual description, we first identify a general
learning framework and then study the impact of different design choices for
various stages of this framework. For additionally learning from example
videos, when true positive training samples are scarce, we employ an extension
of the Support Vector Machine that allows us to exploit "related" event
videos by automatically introducing different weights for subsets of the videos
in the overall training set. Experimental evaluations performed on the
large-scale TRECVID MED 2014 video dataset provide insight into the effectiveness
of the proposed methods. Comment: Image and Vision Computing Journal, Elsevier, 2015, accepted for
publication
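The weighting idea behind the SVM extension, letting scarce true positives count more than loosely "related" videos, can be sketched with a linear SVM trained by subgradient descent on a weighted hinge loss. This is a minimal illustration of per-sample weighting, not the paper's exact formulation; the data and weight values are assumptions.

```python
import numpy as np

# Linear SVM via subgradient descent on a weighted hinge loss:
# each training sample carries its own weight, so "related" videos
# contribute less than true positives.
def weighted_svm(X, y, sample_weights, lam=0.01, lr=0.1, epochs=200):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1.0  # samples violating the margin
        grad = lam * w - (sample_weights[active] * y[active]) @ X[active] / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
X_pos = rng.normal(2.0, 1.0, size=(5, 2))    # few true positive examples
X_rel = rng.normal(1.0, 1.0, size=(20, 2))   # "related" videos, noisier
X_neg = rng.normal(-2.0, 1.0, size=(25, 2))  # negatives
X = np.vstack([X_pos, X_rel, X_neg])
y = np.array([1] * 25 + [-1] * 25)
weights = np.array([1.0] * 5 + [0.3] * 20 + [1.0] * 25)  # down-weight "related"

w = weighted_svm(X, y, weights)
acc = np.mean(np.sign(X @ w) == y)
print(round(acc, 2))
```

In the paper the weights for the video subsets are introduced automatically; here they are fixed by hand purely to show the mechanism.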
Image Retrieval Method Combining Bayes and SVM Classifier Based on Relevance Feedback with Application to Small-scale Datasets
A vast number of images has been generated due to the diversity and digitalization of devices for image acquisition. However, the gap between low-level visual features and high-level semantic representations has been a major concern that hinders retrieval accuracy. A retrieval method based on a transfer learning model and the relevance feedback technique was formulated in this study to optimize the dynamic trade-off between the structural complexity and retrieval performance of small- and medium-scale content-based image retrieval (CBIR) systems. First, a pretrained deep learning model was fine-tuned to extract features from the target datasets. Then, the target dataset was partitioned into relevant and irrelevant image libraries using a Bayes classifier. Next, a support vector machine (SVM) classifier was used to retrieve similar images from the relevant library. Finally, the relevance feedback technique was employed to update the parameters of both classifiers iteratively until the retrieval request was satisfied. Results demonstrate that the proposed method achieves 95.87% on the classification index F1-score, surpassing the suboptimal approach DCNN-BSVM by 6.76%. The performance of the proposed method is superior to that of other approaches on retrieval criteria such as average precision, average recall, and mean average precision. The study indicates that, under the transfer learning framework, the combined Bayes + SVM classifier reaches the optimal values more efficiently than either the Bayes or the SVM classifier alone, and that transfer learning clearly outperforms training from scratch with respect to feature extraction. This study provides a reference for applications of small- and medium-scale CBIR systems with inadequate samples
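The iterative relevance-feedback loop described above can be sketched in skeleton form. A nearest-centroid ranker stands in for the Bayes and SVM classifiers, and the "user feedback" is simulated by a distance test; every name, threshold, and the two-cluster dataset are assumptions for illustration, not the paper's method.

```python
import numpy as np

def retrieve(query, library, feedback_rounds=3, top_k=5):
    """Skeleton of a relevance-feedback retrieval loop: score, rank,
    collect feedback, and iterate until the rounds are exhausted."""
    relevant = query[None, :]  # seed the relevant set with the query
    irrelevant = np.empty((0, query.shape[0]))
    ranked = np.arange(top_k)
    for _ in range(feedback_rounds):
        # Stage 1 (Bayes stand-in): score images toward the relevant
        # centroid and away from the irrelevant centroid.
        rel_c = relevant.mean(axis=0)
        scores = -np.linalg.norm(library - rel_c, axis=1)
        if len(irrelevant):
            irr_c = irrelevant.mean(axis=0)
            scores += np.linalg.norm(library - irr_c, axis=1)
        # Stage 2 (SVM stand-in): rank and keep the top-k candidates.
        ranked = np.argsort(-scores)[:top_k]
        # Simulated user feedback: candidates near the query are marked
        # relevant, the rest irrelevant; both sets feed the next round.
        close = np.linalg.norm(library[ranked] - query, axis=1) < 2.0
        relevant = np.vstack([relevant, library[ranked[close]]])
        irrelevant = np.vstack([irrelevant, library[ranked[~close]]])
    return ranked

rng = np.random.default_rng(2)
library = np.vstack([rng.normal(0, 0.3, (20, 8)),   # cluster matching the query
                     rng.normal(3, 0.3, (30, 8))])  # unrelated images
query = np.zeros(8)
result = retrieve(query, library)
print(result)
```

The real system would replace the two stand-ins with the fine-tuned-feature Bayes partition and the SVM ranker, and would stop when the retrieval request is satisfied rather than after a fixed number of rounds.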
Recent Trends in Computational Intelligence
Traditional models struggle to cope with complexity, noise, and changing environments, while Computational Intelligence (CI) offers solutions to complicated problems as well as inverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically inspired techniques such as swarm intelligence, as part of evolutionary computation, and encompasses wider areas such as image processing, data collection, and natural language processing. This book aims to discuss the use of CI for optimally solving various applications, demonstrating its wide reach and relevance. The combination of optimization methods and data mining strategies makes a strong and reliable prediction tool for handling real-life applications
Text2Light: Zero-Shot Text-Driven HDR Panorama Generation
High-quality HDRIs (high dynamic range images), typically HDR panoramas, are
one of the most popular ways to create photorealistic lighting and 360-degree
reflections of 3D scenes in graphics. Given the difficulty of capturing HDRIs,
a versatile and controllable generative model is highly desired, with which
non-expert users can intuitively control the generation process. However, existing
state-of-the-art methods still struggle to synthesize high-quality panoramas
for complex scenes. In this work, we propose a zero-shot text-driven framework,
Text2Light, to generate 4K+ resolution HDRIs without paired training data.
Given a free-form text as the description of the scene, we synthesize the
corresponding HDRI with two dedicated steps: 1) text-driven panorama generation
in low dynamic range (LDR) and low resolution, and 2) super-resolution inverse
tone mapping to scale up the LDR panorama both in resolution and dynamic range.
Specifically, to achieve zero-shot text-driven panorama generation, we first
build dual codebooks as the discrete representation for diverse environmental
textures. Then, driven by the pre-trained CLIP model, a text-conditioned global
sampler learns to sample holistic semantics from the global codebook according
to the input text. Furthermore, a structure-aware local sampler learns to
synthesize LDR panoramas patch-by-patch, guided by holistic semantics. To
achieve super-resolution inverse tone mapping, we derive a continuous
representation of 360-degree imaging from the LDR panorama as a set of
structured latent codes anchored to the sphere. This continuous representation
enables a versatile module to upscale the resolution and dynamic range
simultaneously. Extensive experiments demonstrate the superior capability of
Text2Light in generating high-quality HDR panoramas. In addition, we show the
feasibility of our work in realistic rendering and immersive VR. Comment: SIGGRAPH Asia 2022; Project page:
https://frozenburning.github.io/projects/text2light/ Code is available at
https://github.com/FrozenBurning/Text2Light
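The "structured latent codes anchored to the sphere" rest on mapping equirectangular panorama coordinates to directions on the unit sphere. A minimal sketch of that coordinate transform follows, using our own axis convention; Text2Light's exact parameterization may differ.

```python
import numpy as np

def equirect_to_sphere(u, v):
    """u, v in [0, 1): horizontal/vertical equirectangular coordinates.
    Returns unit vectors (x, y, z) on the sphere."""
    theta = 2.0 * np.pi * u - np.pi    # longitude in [-pi, pi)
    phi = np.pi * v - np.pi / 2.0      # latitude in [-pi/2, pi/2)
    x = np.cos(phi) * np.cos(theta)
    y = np.cos(phi) * np.sin(theta)
    z = np.sin(phi)
    return np.stack([x, y, z], axis=-1)

# Anchor a coarse 4-by-8 grid of latent positions over the panorama.
uu, vv = np.meshgrid(np.linspace(0, 1, 8, endpoint=False),
                     np.linspace(0, 1, 4, endpoint=False))
anchors = equirect_to_sphere(uu, vv)
print(anchors.shape)                                       # (4, 8, 3)
print(np.allclose(np.linalg.norm(anchors, axis=-1), 1.0))  # True
```

Anchoring codes to sphere directions rather than pixel coordinates is what lets a continuous representation be resampled at arbitrary resolution.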
Salient Objects in Clutter: Bringing Salient Object Detection to the Foreground
We provide a comprehensive evaluation of salient object detection (SOD)
models. Our analysis identifies a serious design bias of existing SOD datasets
which assumes that each image contains at least one clearly outstanding salient
object in low clutter. The design bias has led to a saturated high performance
for state-of-the-art SOD models when evaluated on existing datasets. The
models, however, still perform far from being satisfactory when applied to
real-world daily scenes. Based on our analyses, we first identify 7 crucial
aspects that a comprehensive and balanced dataset should fulfill. Then, we
propose a new high-quality dataset and update the previous saliency benchmark.
Specifically, our SOC (Salient Objects in Clutter) dataset includes images
with salient and non-salient objects from daily object categories. Beyond
object category annotations, each salient image is accompanied by attributes
that reflect common challenges in real-world scenes. Finally, we report
attribute-based performance assessment on our dataset. Comment: ECCV 2018
Adapting Stream Processing Framework for Video Analysis
Stream processing (SP) became relevant mainly due to the inexpensive and hence ubiquitous deployment of sensors in many domains (e.g., environmental monitoring, battlefield monitoring). Other continuous data generators (surveillance, traffic data) have also prompted the processing and analysis of these streams for applications such as traffic congestion/accident detection and personalized marketing. Image processing has been researched for several decades. Recently, there has been an emphasis on video stream analysis for situation monitoring, owing to the ubiquitous deployment of video cameras and unmanned aerial vehicles for security and other applications. This paper elaborates on the research and development issues that need to be addressed to extend the traditional stream processing framework to video analysis, especially for situation awareness. This entails extensions to the data model, the operators and language for expressing complex situations, and the QoS (quality of service) specifications and algorithms needed for their satisfaction. Specifically, this paper demonstrates the inadequacy of current data representations (e.g., relation and arrable) and querying capabilities, in order to infer long-term research and development issues
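The kind of operator such an extended framework would need, a window over a frame stream feeding a situation-detection predicate, can be sketched with Python generators. The frame schema, the `detect_congestion` predicate, and its threshold are illustrative assumptions, not operators proposed by the paper.

```python
from collections import deque

def sliding_window(stream, size):
    """Yield successive windows of `size` frames from a frame stream."""
    window = deque(maxlen=size)
    for frame in stream:
        window.append(frame)
        if len(window) == size:
            yield list(window)

def detect_congestion(window, threshold=10):
    # Toy situation predicate: average detected-object count per window.
    return sum(w["objects"] for w in window) / len(window) > threshold

# Simulated frame stream: timestamp plus per-frame object count.
frames = [{"t": t, "objects": c} for t, c in enumerate([3, 5, 12, 14, 15, 4])]
alerts = [w[-1]["t"] for w in sliding_window(frames, 3) if detect_congestion(w)]
print(alerts)  # [3, 4, 5]
```

A full video-stream language would also need the richer data model and QoS controls the paper calls for; this sketch only shows the windowed-operator shape that relational stream systems already provide.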