283 research outputs found
Effective Graph-Based Content-Based Image Retrieval Systems for Large-Scale and Small-Scale Image Databases
This dissertation proposes two novel manifold graph-based ranking systems for Content-Based Image Retrieval (CBIR). The two proposed systems exploit the synergism between relevance feedback-based transductive short-term learning and semantic feature-based long-term learning to improve retrieval performance. The proposed systems first apply an active learning mechanism to construct a log of users' relevance feedback and to extract high-level semantic features for each image. They then create manifold graphs that incorporate both low-level visual similarity and high-level semantic similarity to achieve a more meaningful structure for the image space. Finally, asymmetric relevance vectors are created to propagate relevance scores from labeled images to unlabeled images via the manifold graphs. Extensive experimental results demonstrate that the two proposed systems outperform other state-of-the-art CBIR systems in the context of both correct and erroneous user feedback.
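The ranking step the abstract describes (propagating relevance scores of labeled images to unlabeled ones over a similarity graph) can be illustrated with classic closed-form manifold ranking. This is a generic sketch under the assumption of a connected affinity graph with no isolated nodes, not the dissertation's exact system; `manifold_rank`, `W`, and `alpha` are illustrative names:

```python
import numpy as np

def manifold_rank(W, y, alpha=0.5):
    """Propagate relevance scores y over an affinity graph W.

    W: symmetric non-negative affinity matrix (zero diagonal, no isolated nodes)
    y: initial relevance vector (1 for labeled-relevant images, 0 otherwise)
    alpha: propagation strength in (0, 1)
    Returns the closed-form ranking f = (1 - alpha) (I - alpha * S)^{-1} y,
    where S is the symmetrically normalized affinity matrix.
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, (1 - alpha) * y)
```

Images closer (on the graph) to the labeled image receive higher scores, which is the behavior the manifold graphs above rely on.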
Sampling-Based Methods for Factored Task and Motion Planning
This paper presents a general-purpose formulation of a large class of
discrete-time planning problems, with hybrid state and control spaces, as
factored transition systems. Factoring allows state transitions to be described
as the intersection of several constraints each affecting a subset of the state
and control variables. Robotic manipulation problems with many movable objects
involve constraints that only affect several variables at a time and therefore
exhibit large amounts of factoring. We develop a theoretical framework for
solving factored transition systems with sampling-based algorithms. The
framework characterizes conditions on the submanifold in which solutions lie,
leading to a characterization of robust feasibility that incorporates
dimensionality-reducing constraints. It then connects those conditions to
corresponding conditional samplers that can be composed to produce values on
this submanifold. We present two domain-independent, probabilistically complete
planning algorithms that take, as input, a set of conditional samplers. We
demonstrate the empirical efficiency of these algorithms on a set of
challenging task and motion planning problems involving picking, placing, and
pushing.
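The composition of conditional samplers described above can be sketched in a few lines. This is a hypothetical toy, not the paper's implementation: `sample_pose` and `sample_grasp` stand in for samplers of two constraints, each touching only a few variables, and composing them yields values on the solution submanifold.

```python
import random

def sample_pose():
    """Unconditional sampler: candidate placement poses for an object."""
    while True:
        yield (random.uniform(0, 1), random.uniform(0, 1))

def sample_grasp(pose):
    """Conditional sampler: grasps constrained to lie near a given pose."""
    while True:
        dx, dy = random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)
        yield (pose[0] + dx, pose[1] + dy)

def compose(n=5):
    """Compose the two samplers: each output satisfies both constraints."""
    poses = sample_pose()
    out = []
    for _ in range(n):
        p = next(poses)
        g = next(sample_grasp(p))
        out.append((p, g))
    return out
```

Because each sampler only ever reads the variables its constraint touches, the factoring of the transition system is reflected directly in the sampler signatures.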
Learning Narrative Text Generators from Visual Stories via Latent Embedding
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: Byoung-Tak Zhang.

The ability to understand stories is an essential capability that distinguishes humans from other primates and animals. Story understanding is crucial for AI agents that live with people in everyday life and must understand their context. However, most research on story AI focuses on automated story generation based on manually designed closed worlds, which are widely used for computational authoring. Machine learning techniques on story corpora face the same problems as natural language processing in general, such as omitted details and missing commonsense knowledge. Since the remarkable success of deep learning in computer vision, interest in research bridging vision and language has been increasing, and vision-grounded story data can potentially improve the performance of story understanding and narrative text generation.
Let us assume that AI agents are placed in an environment where sensing information arrives through a camera. Such agents observe their surroundings, translate the observations into a story in natural language, and predict the following event, or several events sequentially. This dissertation studies the related problems: learning stories and generating narrative text from image streams or videos.
The first problem is generating a narrative text from a sequence of ordered images. As a solution, we introduce GLAC Net (Global-Local Attention Cascading Network), which translates image sequences into narrative paragraphs using an encoder-decoder framework in a sequence-to-sequence setting. It uses convolutional neural networks to extract information from images and recurrent neural networks to generate text. We introduce visual cue encoders built on stacked bidirectional LSTMs, where the outputs of all layers are aggregated into contextualized image vectors that capture visual clues. The coherence of the generated text is further improved by conveying (cascading) the information of the previous sentence serially into the decoder for the next sentence. We evaluate its performance on the Visual Storytelling (VIST) dataset. It outperforms other state-of-the-art results, achieving the best total score and the best score on all six aspects under human evaluation in the visual storytelling challenge.
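The cascading idea above reduces to a simple control flow: the decoder for sentence i is initialized from the final state of the decoder for sentence i-1, so context flows through the whole paragraph. A minimal sketch, with a hypothetical `decode_sentence` standing in for the full attention decoder:

```python
def decode_story(image_vectors, decode_sentence, init_state):
    """Generate one sentence per image, cascading decoder state.

    decode_sentence(image_vec, state) -> (sentence, final_state)
    The final state of each sentence decoder seeds the next one,
    which is what couples the sentences into a coherent paragraph.
    """
    state, story = init_state, []
    for vec in image_vectors:
        sentence, state = decode_sentence(vec, state)  # cascade the state
        story.append(sentence)
    return story
```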
The second problem is predicting the following events or narrative texts from the earlier parts of a story. Prediction should be possible at any step, for stories of arbitrary length. As a solution, we propose recurrent event retrieval models. They train a context accumulation function and two embedding functions that bring the cumulative context at the current time close to the next probable events in a latent space. They update the cumulative context with each new input event using bilinear operations, and the updated cumulative context is used to retrieve candidate next events. We evaluate the models on the Story Cloze Test, where they show competitive performance and the best results in the open-ended generation setting. We also demonstrate working examples in an interactive setting.
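The two ingredients named above (a bilinear context update and retrieval by distance in the latent space) can be sketched as follows. This is an illustrative simplification, not the dissertation's trained model; `W` is a hypothetical third-order parameter tensor:

```python
import numpy as np

def update_context(c, e, W):
    """Bilinear update of the cumulative context c with a new event e.

    c: (d,) current cumulative context
    e: (d,) embedding of the newly observed event
    W: (d, d, d) bilinear parameter tensor
    Returns the new (d,) cumulative context.
    """
    return np.tanh(np.einsum('i,ijk,j->k', c, W, e))

def rank_candidates(c, candidates):
    """Retrieve candidate next events, nearest to the context first."""
    return sorted(candidates, key=lambda e: np.linalg.norm(c - e))
```

Because retrieval only compares the current cumulative context against candidate embeddings, prediction works at any step of a story of any length, as the abstract requires.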
The third part studies composite representation learning of semantics and order for video stories. We embed each episode as a trajectory-like sequence of events in the latent space and propose ViStoryNet, which regenerates video stories from these embeddings (the story completion task). We convert event sentences into thought vectors and train functions that embed successive events close to each other, so that each episode forms a trajectory. Bidirectional LSTMs are trained as sequence models, and GRU-based decoders generate event sentences. We evaluate the model experimentally on the PororoQA dataset and observe that most episodes take the form of trajectories. Using the model to complete blocked-out parts of stories yields results that are not perfect but broadly similar to the originals.
The results above can be applied to AI agents that sense their surroundings with cameras, describe situations as stories, infer unobserved parts, and predict future stories.

Abstract
Chapter 1 Introduction
1.1 Story of Everyday lives in Videos and Story Understanding
1.2 Problems to be addressed
1.3 Approach and Contribution
1.4 Organization of Dissertation
Chapter 2 Background and Related Work
2.1 Why We Study Stories
2.2 Latent Embedding
2.3 Order Embedding and Ordinal Embedding
2.4 Comparison to Story Understanding
2.5 Story Generation
2.5.1 Abstract Event Representations
2.5.2 Seq-to-seq Attentional Models
2.5.3 Story Generation from Images
Chapter 3 Visual Storytelling via Global-local Attention Cascading Networks
3.1 Introduction
3.2 Evaluation for Visual Storytelling
3.3 Global-local Attention Cascading Networks (GLAC Net)
3.3.1 Encoder: Contextualized Image Vector Extractor
3.3.2 Decoder: Story Generator with Attention and Cascading Mechanism
3.4 Experimental Results
3.4.1 VIST Dataset
3.4.2 Experiment Settings
3.4.3 Network Training Details
3.4.4 Qualitative Analysis
3.4.5 Quantitative Analysis
3.5 Summary
Chapter 4 Common Space Learning on Cumulative Contexts and the Next Events: Recurrent Event Retrieval Models
4.1 Overview
4.2 Problems of Context Accumulation
4.3 Recurrent Event Retrieval Models for Next Event Prediction
4.4 Experimental Results
4.4.1 Preliminaries
4.4.2 Story Cloze Test
4.4.3 Open-ended Story Generation
4.5 Summary
Chapter 5 ViStoryNet: Order Embedding of Successive Events and the Networks for Story Regeneration
5.1 Introduction
5.2 Order Embedding with Triple Learning
5.2.1 Embedding Ordered Objects in Sequences
5.3 Problems and Contextual Events
5.3.1 Problem Definition
5.3.2 Contextual Event Vectors from Kids Videos
5.4 Architectures for the Story Regeneration Task
5.4.1 Two Sentence Generators as Decoders
5.4.2 Successive Event Order Embedding (SEOE)
5.4.3 Sequence Models of the Event Space
5.5 Experimental Results
5.5.1 Experimental setup
5.5.2 Quantitative Analysis
5.5.3 Qualitative Analysis
5.6 Summary
Chapter 6 Concluding Remarks
6.1 Summary of Methods and Contributions
6.2 Limitation and Outlook
6.3 Suggestions for Future Research
Abstract (in Korean)
Hyperbolic Deep Neural Networks: A Survey
Recently, there has been a rising surge of momentum for deep representation
learning in hyperbolic spaces, due to their high capacity for modeling data
with hierarchical structure, such as knowledge graphs and synonym hierarchies.
We refer to such models as hyperbolic deep neural networks in this paper. A
hyperbolic neural architecture can potentially lead to a drastically more
compact model with much more physical interpretability than its counterpart in
Euclidean space. To stimulate future research, this paper presents a coherent
and comprehensive review of the literature on the neural components used in
the construction of hyperbolic deep neural networks, as well as the
generalization of the leading deep approaches to hyperbolic space. It also
presents current applications across various machine learning tasks on several
publicly available datasets, together with insightful observations, and
identifies open questions and promising future directions.
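The capacity claim above rests on hyperbolic geometry: in the Poincaré ball model, distances grow without bound as points approach the unit boundary, which gives room to embed trees with low distortion. A minimal sketch of the standard Poincaré distance (curvature -1), using only the textbook formula:

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between points x, y inside the unit Poincaré ball:
    d(x, y) = arcosh(1 + 2 * |x - y|^2 / ((1 - |x|^2) * (1 - |y|^2)))."""
    sq = lambda v: sum(c * c for c in v)
    diff = [a - b for a, b in zip(x, y)]
    return math.acosh(1.0 + 2.0 * sq(diff) / ((1.0 - sq(x)) * (1.0 - sq(y))))
```

For example, the points (0, 0) and (0.5, 0) are Euclidean distance 0.5 apart but hyperbolic distance ln 3 apart, and the gap widens rapidly near the boundary.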
Unsupervised Representation Learning for Homogeneous, Heterogeneous, and Tree-like Graphs
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Information Engineering, August 2022. Advisor: Jin Young Choi.

The goal of unsupervised graph representation learning is to extract useful node-wise or graph-wise vector representations that are aware of the intrinsic structure of the graph and its attributes. Recently, the design of unsupervised graph representation learning methods based on graph neural networks has attracted growing attention due to their powerful representation ability. Many methods focus on homogeneous graphs, i.e., networks with a single type of node and a single type of edge. However, since many types of relationships exist in the world, graphs can also be classified into various types by their structural and semantic properties. For this reason, to learn useful representations from graphs, an unsupervised learning framework must take the characteristics of the input graph into account. In this dissertation, we focus on designing unsupervised learning models using graph neural networks for three widely available graph structures: homogeneous graphs, tree-like graphs, and heterogeneous graphs.
First, we propose a symmetric graph convolutional autoencoder which produces a low-dimensional latent representation from a homogeneous graph. In contrast to existing graph autoencoders with asymmetric decoder parts, the proposed autoencoder has a newly designed decoder which yields a completely symmetric autoencoder. For the reconstruction of node features, the decoder is designed around Laplacian sharpening as the counterpart of the Laplacian smoothing in the encoder, which allows the graph structure to be utilized throughout the proposed autoencoder architecture. To prevent the numerical instability introduced by Laplacian sharpening, we further propose a new, numerically stable form of Laplacian sharpening based on signed graphs. Experimental results on clustering, link prediction, and visualization tasks on homogeneous graphs strongly support that the proposed model is stable and outperforms various state-of-the-art algorithms.
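The smoothing/sharpening duality above can be made concrete. Laplacian smoothing averages each node's features with its neighbors' (the usual GCN-style propagation), while sharpening moves features away from the neighborhood average. A minimal dense-matrix sketch, assuming a graph with no isolated nodes and not the dissertation's numerically stabilized signed-graph form:

```python
import numpy as np

def normalized_adj(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def smooth(H, A):
    """Encoder-style Laplacian smoothing: pull features toward neighbors."""
    return normalized_adj(A) @ H

def sharpen(H, A):
    """Decoder-style Laplacian sharpening: 2H - A_tilde H, the mirror
    operation, pushing features away from the neighborhood average."""
    return 2 * H - normalized_adj(A) @ H
```

Composing `sharpen` after `smooth` is what makes the autoencoder symmetric: the decoder undoes the neighborhood averaging of the encoder while still using the graph structure.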
Second, we analyze how unsupervised tasks can benefit from representations learned in hyperbolic space. To explore how well the hierarchical structure of unlabeled data can be represented in hyperbolic spaces, we design a novel hyperbolic message passing autoencoder whose entire auto-encoding is performed in hyperbolic space. The proposed model auto-encodes networks by fully utilizing hyperbolic geometry in message passing. Through extensive quantitative and qualitative analyses, we validate the properties and benefits of the unsupervised hyperbolic representations of tree-like graphs.
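To perform message passing "in hyperbolic space", aggregated tangent-space vectors must be mapped back into the manifold. The standard tool is the exponential map; a sketch of the curvature -1 exponential map at the origin of the Poincaré ball (a generic formula, not the model's specific layer):

```python
import math

def exp_map_zero(v):
    """Exponential map at the origin of the Poincaré ball (curvature -1):
    maps a Euclidean (tangent) vector v into the open unit ball via
    exp_0(v) = tanh(|v|) * v / |v|."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm
    return [scale * c for c in v]
```

Since tanh is bounded by 1, every output lies strictly inside the ball, so subsequent hyperbolic operations (such as the Poincaré distance) remain well defined.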
Third, we propose the novel concept of a metanode for message passing, to learn both heterogeneous and homogeneous relationships between any two nodes without meta-paths and meta-graphs. Unlike conventional methods, metanodes do not require a predetermined step that manipulates the given relations between different types to enrich relational information. Going one step further, we propose a metanode-based message passing layer and a contrastive learning model using the proposed layer. In our experiments, we show that the proposed metanode-based message passing method performs competitively on node clustering and node classification tasks compared to state-of-the-art message passing networks for heterogeneous graphs.

1 Introduction
2 Representation Learning on Graph-Structured Data
2.1 Basic Introduction
2.1.1 Notations
2.2 Traditional Approaches
2.2.1 Graph Statistic
2.2.2 Neighborhood Overlap
2.2.3 Graph Kernel
2.2.4 Spectral Approaches
2.3 Node Embeddings I: Factorization and Random Walks
2.3.1 Factorization-based Methods
2.3.2 Random Walk-based Methods
2.4 Node Embeddings II: Graph Neural Networks
2.4.1 Overview of Framework
2.4.2 Representative Models
2.5 Learning in Unsupervised Environments
2.5.1 Predictive Coding
2.5.2 Contrastive Coding
2.6 Applications
2.6.1 Classifications
2.6.2 Link Prediction
3 Autoencoder Architecture for Homogeneous Graphs
3.1 Overview
3.2 Preliminaries
3.2.1 Spectral Convolution on Graphs
3.2.2 Laplacian Smoothing
3.3 Methodology
3.3.1 Laplacian Sharpening
3.3.2 Numerically Stable Laplacian Sharpening
3.3.3 Subspace Clustering Cost for Image Clustering
3.3.4 Training
3.4 Experiments
3.4.1 Datasets
3.4.2 Experimental Settings
3.4.3 Comparing Methods
3.4.4 Node Clustering
3.4.5 Image Clustering
3.4.6 Ablation Studies
3.4.7 Link Prediction
3.4.8 Visualization
3.5 Summary
4 Autoencoder Architecture for Tree-like Graphs
4.1 Overview
4.2 Preliminaries
4.2.1 Hyperbolic Embeddings
4.2.2 Hyperbolic Geometry
4.3 Methodology
4.3.1 Geometry-Aware Message Passing
4.3.2 Nonlinear Activation
4.3.3 Loss Function
4.4 Experiments
4.4.1 Datasets
4.4.2 Compared Methods
4.4.3 Experimental Details
4.4.4 Node Clustering and Link Prediction
4.4.5 Image Clustering
4.4.6 Structure-Aware Unsupervised Embeddings
4.4.7 Hyperbolic Distance to Filter Training Samples
4.4.8 Ablation Studies
4.5 Further Discussions
4.5.1 Connection to Contrastive Learning
4.5.2 Failure Cases of Hyperbolic Embedding Spaces
4.6 Summary
5 Contrastive Learning for Heterogeneous Graphs
5.1 Overview
5.2 Preliminaries
5.2.1 Meta-path
5.2.2 Representation Learning on Heterogeneous Graphs
5.2.3 Contrastive methods for Heterogeneous Graphs
5.3 Methodology
5.3.1 Definitions
5.3.2 Metanode-based Message Passing Layer
5.3.3 Contrastive Learning Framework
5.4 Experiments
5.4.1 Experimental Details
5.4.2 Node Classification
5.4.3 Node Clustering
5.4.4 Visualization
5.4.5 Effectiveness of Metanodes
5.5 Summary
6 Conclusions
Deep Learning Techniques for Music Generation -- A Survey
This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
Objective - What musical content is to be generated? Examples are: melody,
polyphony, accompaniment or counterpoint. - For what destination and for what
use? To be performed by a human(s) (in the case of a musical score), or by a
machine (in the case of an audio file).
Representation - What are the concepts to be manipulated? Examples are:
waveform, spectrogram, note, chord, meter and beat. - What format is to be
used? Examples are: MIDI, piano roll or text. - How will the representation be
encoded? Examples are: scalar, one-hot or many-hot.
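The scalar, one-hot, and many-hot encodings named above are simple to spell out for symbolic music. A minimal illustration over the MIDI pitch range (the names `one_hot` and `many_hot` are ours, not the survey's):

```python
PITCHES = 128  # MIDI pitch numbers 0-127

def one_hot(pitch):
    """Encode a single note: a 128-dim vector with a 1 at its pitch."""
    v = [0] * PITCHES
    v[pitch] = 1
    return v

def many_hot(chord):
    """Encode a chord: a 128-dim vector with a 1 at each sounding pitch."""
    v = [0] * PITCHES
    for pitch in chord:
        v[pitch] = 1
    return v
```

A scalar encoding would instead keep the raw pitch number (e.g. 60 for middle C); the one-hot form trades compactness for a representation that feeds directly into softmax-based architectures.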
Architecture - What type(s) of deep neural network is (are) to be used?
Examples are: feedforward network, recurrent network, autoencoder or generative
adversarial networks.
Challenge - What are the limitations and open challenges? Examples are:
variability, interactivity and creativity.
Strategy - How do we model and control the process of generation? Examples
are: single-step feedforward, iterative feedforward, sampling or input
manipulation.
For each dimension, we conduct a comparative analysis of various models and
techniques and we propose some tentative multidimensional typology. This
typology is bottom-up, based on the analysis of many existing deep-learning
based systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.

Comment: 209 pages. This paper is a simplified version of the book: J.-P.
Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music
Generation, Computational Synthesis and Creative Systems, Springer, 201