4,040 research outputs found

    Semantic multimedia modelling & interpretation for annotation

    The emergence of multimedia-enabled devices, particularly the incorporation of cameras into mobile phones, together with the rapid spread of low-cost storage, has drastically boosted the rate of multimedia data production. Faced with such ubiquity of digital images and videos, the research community has turned its attention to their meaningful utilization and management. Stored in monumental multimedia corpora, digital data need to be retrieved and organized intelligently, drawing on the rich semantics they involve. Exploiting these image and video collections demands proficient image and video annotation and retrieval techniques, and the multimedia research community has recently been shifting its emphasis towards the personalization of these media. The main impediment in image and video analysis is the semantic gap: the discrepancy between a user's high-level interpretation of an image or video and its low-level computational interpretation. Content-based image and video annotation systems are particularly susceptible to the semantic gap because they rely on low-level visual features to describe semantically rich image and video content. Since visual similarity is not semantic similarity, this dilemma has to be tackled in an alternative way. The semantic gap can be narrowed by incorporating high-level and user-generated information into the annotation. High-level descriptions of images and videos are better at capturing the semantic meaning of multimedia content, but it is not always feasible to collect this information, and it is commonly agreed that high-level semantic annotation of multimedia is still far from solved. This dissertation puts forward approaches for intelligent multimedia semantic extraction for high-level annotation, intending to bridge the gap between visual features and semantics. It proposes a framework for annotation enhancement and refinement for object/concept-annotated image and video datasets: the datasets are first purified of noisy keywords, and the concepts are then expanded lexically and with commonsense knowledge to fill the vocabulary and lexical gap and achieve high-level semantics for the corpus. The dissertation also explores a novel approach for propagating high-level semantics (HLS) through image corpora. HLS propagation exploits semantic intensity (SI), the dominance factor of a concept in an image, together with annotation-based semantic similarity between images. An image is a combination of various concepts, some of which are more dominant than others, and the semantic similarity of two images is computed from their SI values and the concept-level semantic similarity of the pair. Moreover, HLS propagation uses clustering to group similar images, so that a single effort by a human expert to assign high-level semantics to a randomly selected image is propagated to the other images in its cluster. The investigation was carried out on the LabelMe image and LabelMe video datasets. Experiments show that the proposed approaches make a noticeable improvement towards bridging the semantic gap and that the proposed system outperforms traditional systems.
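    To make the propagation step concrete, here is a minimal sketch assuming each image is described by per-concept semantic-intensity (SI) weights: an SI-weighted overlap serves as the similarity measure, and the expert-assigned label of a seed image is copied to sufficiently similar images. The function names, similarity formula, and threshold are illustrative assumptions, not the dissertation's implementation.

```python
# Hedged sketch of SI-weighted high-level-semantic (HLS) propagation.
# Each image is described by {concept: semantic_intensity} weights; the
# similarity measure and threshold below are illustrative assumptions.

def si_similarity(a, b):
    """Weighted overlap of two concept->SI dictionaries (0..1)."""
    shared = set(a) & set(b)
    overlap = sum(min(a[c], b[c]) for c in shared)
    total = sum(a.values()) + sum(b.values()) - overlap
    return overlap / total if total else 0.0

def propagate_hls(seed_image, seed_label, corpus, threshold=0.5):
    """Copy the expert-assigned HLS label of the seed to similar images."""
    labels = {seed_image: seed_label}
    for name, concepts in corpus.items():
        if si_similarity(corpus[seed_image], concepts) >= threshold:
            labels[name] = seed_label
    return labels

corpus = {
    "img_001": {"beach": 0.7, "people": 0.2, "sky": 0.1},
    "img_002": {"beach": 0.6, "sea": 0.3, "sky": 0.1},
    "img_003": {"office": 0.8, "people": 0.2},
}
print(propagate_hls("img_001", "summer vacation", corpus))
```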

    VISIR: visual and semantic image label refinement

    The social media explosion has populated the Internet with a wealth of images. There are two existing paradigms for image retrieval: 1) content-based image retrieval (CBIR), which has traditionally used visual features for similarity search (e.g., SIFT features), and 2) tag-based image retrieval (TBIR), which has relied on user tagging (e.g., Flickr tags). CBIR now gains semantic expressiveness from advances in deep-learning-based detection of visual labels, while TBIR benefits from query-and-click logs to automatically infer more informative labels. However, learning-based tagging still yields noisy labels and is restricted to concrete objects, missing out on generalizations and abstractions; click-based tagging is limited to terms that appear in the textual context of an image or in queries that lead to a click. This paper addresses the above limitations by semantically refining and expanding the labels suggested by learning-based object detection. We consider the semantic coherence between the labels for different objects, leverage lexical and commonsense knowledge, and cast the label assignment into a constrained optimization problem solved by an integer linear program. Experiments show that our method, called VISIR, improves the quality of state-of-the-art visual labeling tools like LSDA and YOLO.
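    As a rough illustration of casting label assignment as an integer linear program, the sketch below selects a coherent subset of candidate labels by maximizing detector confidence plus pairwise semantic coherence, with the usual product-linearization constraints. The scores are invented and PuLP with its default CBC solver is an arbitrary choice; this is not the VISIR formulation itself.

```python
# Illustrative ILP for selecting a coherent subset of candidate labels.
# Scores are invented; PuLP/CBC is an arbitrary solver choice, not the
# one used by VISIR.
from itertools import combinations
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

candidates = ["dog", "wolf", "leash", "violin"]
confidence = {"dog": 0.8, "wolf": 0.6, "leash": 0.7, "violin": 0.3}
coherence = {("dog", "wolf"): -0.8, ("dog", "leash"): 0.9,
             ("dog", "violin"): -0.2, ("wolf", "leash"): -0.4,
             ("wolf", "violin"): -0.2, ("leash", "violin"): -0.2}

prob = LpProblem("label_refinement", LpMaximize)
x = {l: LpVariable(f"x_{l}", cat=LpBinary) for l in candidates}
# y[i, j] linearizes the product x[i] * x[j] (both labels selected).
y = {p: LpVariable(f"y_{p[0]}_{p[1]}", cat=LpBinary)
     for p in combinations(candidates, 2)}

prob += (lpSum(confidence[l] * x[l] for l in candidates)
         + lpSum(coherence[p] * y[p] for p in y))
for (i, j), var in y.items():
    prob += var <= x[i]
    prob += var <= x[j]
    prob += var >= x[i] + x[j] - 1
prob += lpSum(x.values()) <= 3  # keep at most three labels per image

prob.solve()
print([l for l in candidates if x[l].value() == 1])  # -> ['dog', 'leash']
```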

    Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling

    Visual storytelling is the task of creating a short story from a photo stream. Unlike existing visual captioning, storytelling aims to contain not only factual descriptions but also human-like narration and semantics. However, the VIST dataset consists of only a small, fixed number of photos per story, so the main challenge of visual storytelling is to fill the visual gap between photos with a narrative and imaginative story. In this paper, we propose to explicitly learn to imagine a storyline that bridges the visual gap. During training, one or more photos are randomly omitted from the input stack, and we train the network to produce a full, plausible story even with the missing photo(s). Furthermore, we propose a hide-and-tell model for visual storytelling, designed to learn non-local relations across the photo stream and to refine and improve conventional RNN-based models. In experiments, we show that our hide-and-tell scheme and the network design are indeed effective at storytelling, and that our model outperforms previous state-of-the-art methods on automatic metrics. Finally, we qualitatively show the learned ability to interpolate a storyline over visual gaps. Comment: AAAI 2020 paper.
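    The "hide" step can be pictured with a small sketch: given pre-extracted photo features for a story, a few positions are randomly masked so the story decoder must imagine the missing content. Zero-masking, the tensor shapes, and the function name are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of the "hide" step: randomly mask photo features in a
# stream so the story decoder must imagine the missing content.
# Zero-masking and the names below are assumptions for illustration.
import torch

def hide_photos(photo_feats: torch.Tensor, max_hidden: int = 1):
    """photo_feats: [batch, num_photos, feat_dim] of pre-extracted features.
    Returns masked features and a boolean mask of hidden positions."""
    batch, num_photos, _ = photo_feats.shape
    hidden_mask = torch.zeros(batch, num_photos, dtype=torch.bool)
    for b in range(batch):
        k = torch.randint(1, max_hidden + 1, (1,)).item()
        idx = torch.randperm(num_photos)[:k]
        hidden_mask[b, idx] = True
    masked = photo_feats.masked_fill(hidden_mask.unsqueeze(-1), 0.0)
    return masked, hidden_mask

feats = torch.randn(2, 5, 512)           # 2 stories, 5 photos each
masked, mask = hide_photos(feats, max_hidden=2)
print(mask)                              # which photos the model must "imagine"
```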

    Towards a Universal Wordnet by Learning from Combined Evidence

    Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. The resource is bootstrapped from WordNet, the well-known English lexical database. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that the resulting wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.
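    A minimal sketch of combining evidence for candidate word-sense links is shown below: each link starts with a weighted sum over its evidence sources and is then iteratively reinforced by other links that support the same synset, a crude stand-in for the paper's graph-based scoring. The source weights, example links, and update rule are all illustrative assumptions.

```python
# Illustrative iterative scoring of candidate word-sense links built
# from several evidence sources. The source weights and the smoothing
# update are assumptions, not the paper's actual scoring functions.

# evidence[(word, synset)] -> {source: strength}
evidence = {
    ("Hund@de", "dog.n.01"):     {"dictionary": 1.0, "parallel_corpus": 0.6},
    ("Hund@de", "hot_dog.n.01"): {"parallel_corpus": 0.2},
    ("chien@fr", "dog.n.01"):    {"wordnet": 1.0, "dictionary": 0.8},
}
source_weight = {"wordnet": 1.0, "dictionary": 0.7, "parallel_corpus": 0.4}

# Initial score: weighted sum of evidence per link.
score = {link: sum(source_weight[s] * v for s, v in srcs.items())
         for link, srcs in evidence.items()}

# Iteratively reinforce links whose synset is supported by other words,
# a crude stand-in for graph-based smoothing over the output graph.
for _ in range(5):
    synset_support = {}
    for (word, synset), sc in score.items():
        synset_support[synset] = synset_support.get(synset, 0.0) + sc
    score = {(w, s): 0.8 * sc + 0.2 * (synset_support[s] - sc)
             for (w, s), sc in score.items()}

for link, sc in sorted(score.items(), key=lambda kv: -kv[1]):
    print(link, round(sc, 3))
```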

    Target-Tailored Source-Transformation for Scene Graph Generation

    Scene graph generation aims to provide a semantic and structural description of an image, denoting the objects (as nodes) and their relationships (as edges). The best performing works to date exploit the context surrounding objects or relations, e.g., by passing information among objects. In these approaches, transforming the representation of a source object is a critical step in extracting information for use by target objects. In this work, we argue that a source object should give each target object what it needs, providing different information to different targets rather than contributing the same information to all of them. To achieve this goal, we propose a Target-Tailored Source-Transformation (TTST) method to efficiently propagate information among object proposals and relations. In particular, for a source object proposal that will contribute information to other target objects, we transform the source object feature into the target object feature domain by taking both the source and the target into account. We further explore more powerful representations by integrating language priors with the visual context in the transformation for scene graph generation. By doing so, the target object is able to extract target-specific information from the source object and the source relation and refine its representation accordingly. Our framework is validated on the Visual Genome benchmark and demonstrates state-of-the-art performance for scene graph generation. The experimental results show that object detection and visual relationship detection are mutually improved by our method.
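    The core idea of a target-tailored transformation can be sketched as a small module that conditions the source-to-target message on both features, for instance by concatenating them and passing them through an MLP. The layer sizes and the concatenation scheme below are assumptions, not the authors' architecture.

```python
# Hedged sketch of a target-tailored source transformation: the message
# a source proposal sends depends on the target it is sent to. Layer
# sizes and the concatenation scheme are illustrative assumptions.
import torch
import torch.nn as nn

class TargetTailoredTransform(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, source: torch.Tensor, target: torch.Tensor):
        """source, target: [num_pairs, dim] object-proposal features.
        Returns a per-pair message in the target feature domain."""
        return self.transform(torch.cat([source, target], dim=-1))

ttst = TargetTailoredTransform(dim=256)
src = torch.randn(10, 256)    # source proposal features
tgt = torch.randn(10, 256)    # target proposal features
msg = ttst(src, tgt)
refined_target = tgt + msg    # target refines itself with tailored info
print(refined_target.shape)   # torch.Size([10, 256])
```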

    ๋งˆ์Šคํฌ ์–ธ์–ด ๋ชจ๋ธ์„ ์ด์šฉํ•œ 3์ฐจ์› ์† ์ขŒํ‘œ์˜ ๋ฏธ์„ธ์กฐ์ •

    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2021.8. Advisor: Byung-Ro Moon.
    Accurately estimating hand/body pose from a single viewpoint under occlusion is challenging for most current approaches. Recent approaches have tried to address the occlusion problem by collecting or synthesizing images that contain joint occlusions. However, these data-driven approaches failed to tackle occlusion because they assumed that joints are independent or used only the explicit physical joint connections. To mitigate this problem, I propose a method that learns relations between joints and refines the occluded information based on those relations. Inspired by BERT in natural language processing, I pre-train a refinement module and add it at the end of the proposed framework. Refinement improves not only the accuracy of the occluded joints but also the accuracy of all joints. In addition, instead of using the physical connections between joints, the proposed model learns their relations from the data. The learned joint relations are visualized in the thesis, and they suggest that assuming only explicit connections hinders the model from accurately predicting joint locations.
    Contents: 1 Introduction; 2 Related Works; 3 Preliminaries (Attention Mechanism, Transformer, Masked Language Model); 4 Method (Problem Definition; 3D Hand Pose Estimation Framework: Dense Representation Module, 3D Regression Module, Joint Refinement Module; Pre-training: Stacked Hourglass, Joint Refinement Module; Training); 5 Experiments (Dataset; Experimental Results: Quantitative Results, Qualitative Results, Computational Complexity); 6 Conclusion; Bibliography; Abstract (in Korean).
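    The BERT-inspired pre-training idea can be sketched as masked joint modelling: a few 3D joint coordinates are hidden behind a learnable mask token and a transformer encoder is trained to reconstruct them. The dimensions, masking rate, and loss below are assumptions for illustration, not the thesis's exact refinement module.

```python
# Hedged sketch of BERT-style pre-training for joint refinement: mask a
# few 3D joint coordinates and train a transformer to reconstruct them.
# Dimensions, masking rate, and the mask token are assumptions.
import torch
import torch.nn as nn

NUM_JOINTS, D_MODEL = 21, 64

class JointRefiner(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(3, D_MODEL)
        self.mask_token = nn.Parameter(torch.zeros(D_MODEL))
        self.pos = nn.Parameter(torch.randn(NUM_JOINTS, D_MODEL) * 0.02)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, 3)

    def forward(self, joints, mask):
        """joints: [B, 21, 3]; mask: [B, 21] bool, True = hidden joint."""
        tok = self.embed(joints)
        tok = torch.where(mask.unsqueeze(-1), self.mask_token, tok)
        return self.head(self.encoder(tok + self.pos))

model = JointRefiner()
joints = torch.randn(8, NUM_JOINTS, 3)              # noisy 3D estimates
mask = torch.rand(8, NUM_JOINTS) < 0.15             # hide ~15% of joints
pred = model(joints, mask)
loss = ((pred - joints)[mask] ** 2).mean()          # reconstruct hidden joints
loss.backward()
```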