127 research outputs found
Disentangled Representation Learning
Disentangled Representation Learning (DRL) aims to learn a model capable of
identifying and disentangling the underlying factors hidden in the observable
data in representation form. Separating the underlying factors of
variation into variables with semantic meaning yields explainable
representations of data, imitating how humans meaningfully understand an
object or relation they observe. As a general learning strategy,
DRL has demonstrated its power in improving model explainability,
controllability, robustness, and generalization capacity in a wide range
of scenarios such as computer vision, natural language processing, and data
mining. In this article, we comprehensively review DRL from various aspects
including motivations, definitions, methodologies, evaluations, applications
and model designs. We discuss works on DRL based on two well-recognized
definitions, i.e., Intuitive Definition and Group Theory Definition. We further
categorize the methodologies for DRL into five groups, i.e., Traditional
Statistical Approaches, Variational Auto-encoder Based Approaches, Generative
Adversarial Networks Based Approaches, Hierarchical Approaches, and Other
Approaches. We also analyze principles for designing DRL models that may
benefit different tasks in practical applications. Finally, we point out
challenges in DRL as well as potential research directions deserving future
investigations. We believe this work may provide insights for promoting DRL
research in the community.
Comment: 22 pages, 9 figures
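The VAE-based family surveyed above typically encourages disentanglement by reweighting the KL term of the evidence lower bound, as in beta-VAE. Below is a minimal numpy sketch of that objective for diagonal-Gaussian posteriors; the function name and squared-error reconstruction term are illustrative choices, not the survey's notation.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """beta-VAE objective: reconstruction error plus a beta-weighted
    KL divergence between q(z|x) = N(mu, diag(exp(log_var))) and N(0, I).
    A larger beta pressures the latent dimensions toward independence."""
    recon = np.sum((x - x_recon) ** 2)  # squared-error reconstruction term
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)  # closed-form Gaussian KL
    return recon + beta * kl

# toy check: a posterior matching the prior N(0, I) has zero KL,
# so beta only matters when the posterior deviates from the prior
x = np.zeros(4); mu = np.zeros(2); log_var = np.zeros(2)
print(beta_vae_loss(x, x, mu, log_var))  # -> 0.0
```

Raising `beta` above 1 trades reconstruction fidelity for latent independence, which is the basic dial most VAE-based disentanglement methods turn.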
Disentangled Generative Causal Representation Learning
This paper proposes a Disentangled gEnerative cAusal Representation (DEAR)
learning method. Unlike existing disentanglement methods that enforce
independence of the latent variables, we consider the general case where the
underlying factors of interest can be causally correlated. We show that
previous methods with independent priors fail to disentangle causally
correlated factors. Motivated by this finding, we propose a new disentangled
learning method called DEAR that enables causal controllable generation and
causal representation learning. The key ingredient of this new formulation is
to use a structural causal model (SCM) as the prior for a bidirectional
generative model. The prior is then trained jointly with a generator and an
encoder using a suitable GAN loss incorporated with supervision. We provide
theoretical justification on the identifiability and asymptotic consistency of
the proposed method, which guarantees disentangled causal representation
learning under appropriate conditions. We conduct extensive experiments on both
synthesized and real data sets to demonstrate the effectiveness of DEAR in
causal controllable generation, and the benefits of the learned representations
for downstream tasks in terms of sample efficiency and distributional
robustness.
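The key contrast DEAR draws, a causally structured prior versus an independent one, can be illustrated with a linear structural causal model over the latent factors. The linear form and the specific adjacency matrix below are illustrative assumptions, not DEAR's exact SCM parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear SCM prior: z = A @ z + eps, with A strictly lower-triangular
# (an acyclic causal graph), so z = (I - A)^{-1} @ eps.
A = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],   # factor 1 is caused by factor 0
              [0.0, 0.5, 0.0]])  # factor 2 is caused by factor 1
eps = rng.standard_normal((10000, 3))
z = eps @ np.linalg.inv(np.eye(3) - A).T

# Unlike an independent N(0, I) prior, the SCM prior yields causally
# correlated factors: the latent covariance is not diagonal.
cov = np.cov(z, rowvar=False)
print(np.round(cov, 2))
```

A prior that enforces a diagonal covariance cannot represent these off-diagonal dependencies, which is why independence-based methods fail on causally correlated factors.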
A Study on Improving the Conditional Generation of Musical Components: Focusing on Harmony and Expression
Thesis (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology (Program in Digital Contents and Information Studies), February 2023. Advisor: Kyogu Lee.
Conditional generation of musical components (CGMC) creates a part of music based on partial musical components such as melody or chord. CGMC is beneficial for discovering complex relationships among musical attributes. It can also assist non-experts who face difficulties in making music. However, recent studies on CGMC still face two challenges in terms of generation quality and model controllability. First, the structure of the generated music is not robust. Second, only limited ranges of musical factors and tasks have been examined as targets for flexible control of generation. In this thesis, we aim to mitigate these two challenges to improve CGMC systems. For musical structure, we focus on intuitive modeling of musical hierarchy to help the model explicitly learn musically meaningful dependency. To this end, we utilize alignment paths between the raw music data and the musical units such as notes or chords. For musical creativity, we facilitate smooth control of novel musical attributes using latent representations. We attempt to achieve disentangled representations of the intended factors by regularizing them with data-driven inductive bias. This thesis verifies the proposed approaches particularly in two representative CGMC tasks, melody harmonization and expressive performance rendering. A variety of experimental results show that the proposed approaches can expand musical creativity under stable generation quality.
Chapter 1 Introduction 1
1.1 Motivation 5
1.2 Definitions 8
1.3 Tasks of Interest 10
1.3.1 Generation Quality 10
1.3.2 Controllability 12
1.4 Approaches 13
1.4.1 Modeling Musical Hierarchy 14
1.4.2 Regularizing Latent Representations 16
1.4.3 Target Tasks 18
1.5 Outline of the Thesis 19
Chapter 2 Background 22
2.1 Music Generation Tasks 23
2.1.1 Melody Harmonization 23
2.1.2 Expressive Performance Rendering 25
2.2 Structure-enhanced Music Generation 27
2.2.1 Hierarchical Music Generation 27
2.2.2 Transformer-based Music Generation 28
2.3 Disentanglement Learning 29
2.3.1 Unsupervised Approaches 30
2.3.2 Supervised Approaches 30
2.3.3 Self-supervised Approaches 31
2.4 Controllable Music Generation 32
2.4.1 Score Generation 32
2.4.2 Performance Rendering 33
2.5 Summary 34
Chapter 3 Translating Melody to Chord: Structured and Flexible Harmonization of Melody with Transformer 36
3.1 Introduction 36
3.2 Proposed Methods 41
3.2.1 Standard Transformer Model (STHarm) 41
3.2.2 Variational Transformer Model (VTHarm) 44
3.2.3 Regularized Variational Transformer Model (rVTHarm) 46
3.2.4 Training Objectives 47
3.3 Experimental Settings 48
3.3.1 Datasets 49
3.3.2 Comparative Methods 50
3.3.3 Training 50
3.3.4 Metrics 51
3.4 Evaluation 56
3.4.1 Chord Coherence and Diversity 57
3.4.2 Harmonic Similarity to Human 59
3.4.3 Controlling Chord Complexity 60
3.4.4 Subjective Evaluation 62
3.4.5 Qualitative Results 67
3.4.6 Ablation Study 73
3.5 Conclusion and Future Work 74
Chapter 4 Sketching the Expression: Flexible Rendering of Expressive Piano Performance with Self-supervised Learning 76
4.1 Introduction 76
4.2 Proposed Methods 79
4.2.1 Data Representation 79
4.2.2 Modeling Musical Hierarchy 80
4.2.3 Overall Network Architecture 81
4.2.4 Regularizing the Latent Variables 84
4.2.5 Overall Objective 86
4.3 Experimental Settings 87
4.3.1 Dataset and Implementation 87
4.3.2 Comparative Methods 88
4.4 Evaluation 88
4.4.1 Generation Quality 89
4.4.2 Disentangling Latent Representations 90
4.4.3 Controllability of Expressive Attributes 91
4.4.4 KL Divergence 93
4.4.5 Ablation Study 94
4.4.6 Subjective Evaluation 95
4.4.7 Qualitative Examples 97
4.4.8 Extent of Control 100
4.5 Conclusion 102
Chapter 5 Conclusion and Future Work 103
5.1 Conclusion 103
5.2 Future Work 106
5.2.1 Deeper Investigation of Controllable Factors 106
5.2.2 More Analysis of Qualitative Evaluation Results 107
5.2.3 Improving Diversity and Scale of Dataset 108
Bibliography 109
Abstract (in Korean) 137
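The alignment-path idea in the thesis above, tying raw music data to musical units such as chords, can be sketched as segment-wise pooling: per-frame melody features are summarized once per chord span. The function name, the (start, end) boundary encoding, and the choice of mean pooling are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def pool_frames_by_alignment(frames, boundaries):
    """Summarize frame-level features per chord-level unit.
    frames: (T, D) array of per-frame melody features.
    boundaries: list of (start, end) frame indices, one pair per chord,
    i.e. the alignment path between frames and chord units."""
    return np.stack([frames[s:e].mean(axis=0) for s, e in boundaries])

# 8 frames of 2-dim features; two chords aligned to frames [0, 4) and [4, 8)
frames = np.arange(16, dtype=float).reshape(8, 2)
chords = pool_frames_by_alignment(frames, [(0, 4), (4, 8)])
print(chords)  # one pooled feature vector per chord
```

Pooling along an explicit alignment path gives the model a chord-rate view of the melody, which is one concrete way to make the musical hierarchy explicit.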
LatentKeypointGAN: Controlling GANs via Latent Keypoints
Generative adversarial networks (GANs) have attained photo-realistic quality
in image generation. However, how to best control the image content remains an
open challenge. We introduce LatentKeypointGAN, a two-stage GAN which is
trained end-to-end on the classical GAN objective with internal conditioning on
a set of spatial keypoints. These keypoints have associated appearance embeddings
that respectively control the position and style of the generated objects and
their parts. A major difficulty that we address with suitable network
architectures and training schemes is disentangling the image into spatial and
appearance factors without domain knowledge and supervision signals. We
demonstrate that LatentKeypointGAN provides an interpretable latent space that
can be used to re-arrange the generated images by re-positioning and exchanging
keypoint embeddings, such as generating portraits by combining the eyes, nose,
and mouth from different images. In addition, the explicit generation of
keypoints and matching images enables a new, GAN-based method for unsupervised
keypoint detection.
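Conditioning spatial generation on keypoints is commonly implemented by rendering each keypoint as a Gaussian heatmap that gates its appearance embedding into a feature grid. The sketch below assumes that common construction; it is not LatentKeypointGAN's exact architecture, and all names are illustrative.

```python
import numpy as np

def keypoints_to_feature_grid(keypoints, embeddings, size=16, sigma=2.0):
    """Render keypoints (N, 2) in (x, y) pixel coords into a (size, size, D)
    grid: each keypoint contributes its appearance embedding weighted by a
    Gaussian bump at its position (position -> where, embedding -> style)."""
    ys, xs = np.mgrid[0:size, 0:size]
    grid = np.zeros((size, size, embeddings.shape[1]))
    for (kx, ky), emb in zip(keypoints, embeddings):
        heat = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))
        grid += heat[..., None] * emb  # broadcast the embedding over the bump
    return grid

kps = np.array([[4.0, 4.0], [12.0, 10.0]])
embs = np.array([[1.0, 0.0], [0.0, 1.0]])
grid = keypoints_to_feature_grid(kps, embs)
# the first embedding's channel peaks at the first keypoint's location
print(np.unravel_index(grid[..., 0].argmax(), grid[..., 0].shape))  # -> (4, 4)
```

Moving a keypoint moves where its embedding is injected, while swapping embeddings changes style in place, which is the position/style split the abstract describes.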
S2-Flow: Joint Semantic and Style Editing of Facial Images
The high-quality images yielded by generative adversarial networks (GANs)
have motivated investigations into their application for image editing.
However, GANs are often limited in the control they provide for performing
specific edits. One of the principal challenges is the entangled latent space
of GANs, which is not directly suitable for performing independent and detailed
edits. Recent editing methods allow for either controlled style edits or
controlled semantic edits. In addition, methods that use semantic masks to edit
images have difficulty preserving the identity and are unable to perform
controlled style edits. We propose a method to disentangle a GAN's
latent space into semantic and style spaces, enabling controlled semantic and
style edits for face images independently within the same framework. To achieve
this, we design an encoder-decoder based network architecture (S2-Flow),
which incorporates two proposed inductive biases. We show the suitability of
S2-Flow quantitatively and qualitatively by performing various semantic and
style edits.
Comment: Accepted to BMVC 2022
Hierarchically Organized Latent Modules for Exploratory Search in Morphogenetic Systems
Self-organization of complex morphological patterns from local interactions
is a fascinating phenomenon in many natural and artificial systems. In the
artificial world, typical examples of such morphogenetic systems are cellular
automata. Yet, their mechanisms are often very hard to grasp and so far
scientific discoveries of novel patterns have primarily been relying on manual
tuning and ad hoc exploratory search. The problem of automated diversity-driven
discovery in these systems was recently introduced [26, 62], highlighting that
two key ingredients are autonomous exploration and unsupervised representation
learning to describe "relevant" degrees of variations in the patterns. In this
paper, we motivate the need for what we call Meta-diversity search, arguing
that there is not a unique ground truth interesting diversity as it strongly
depends on the final observer and its motives. Using a continuous game-of-life
system for experiments, we provide empirical evidence that relying on
monolithic architectures for the behavioral embedding design tends to bias the
final discoveries (both for hand-defined and unsupervisedly-learned features)
which are unlikely to be aligned with the interest of a final end-user. To
address these issues, we introduce a novel dynamic and modular architecture
that enables unsupervised learning of a hierarchy of diverse representations.
Combined with intrinsically motivated goal exploration algorithms, we show that
this system forms a discovery assistant that can efficiently adapt its
diversity search towards preferences of a user using only a very small amount
of user feedback.
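Intrinsically motivated goal exploration, the engine behind the discovery assistant above, can be sketched in a few lines: sample a goal in the behavioral embedding space, replay the archived parameter whose outcome landed closest to that goal, and perturb it. The quadratic stand-in "system" and the Gaussian mutation are placeholder assumptions, not the paper's morphogenetic setup.

```python
import random

random.seed(0)

def system(theta):
    """Placeholder system: maps a parameter to a 1-D behavior descriptor."""
    return theta * theta

# archive of (parameter, behavior) pairs, seeded with one random rollout
archive = [(0.5, system(0.5))]
for _ in range(200):
    goal = random.uniform(0.0, 4.0)                            # sample a goal in behavior space
    theta, _ = min(archive, key=lambda pb: abs(pb[1] - goal))  # nearest archived outcome
    theta += random.gauss(0.0, 0.2)                            # mutate its parameter
    archive.append((theta, system(theta)))

behaviors = [b for _, b in archive]
print(len(archive), min(behaviors), max(behaviors))
```

The behavior space here is hand-defined; the paper's point is that replacing it with a learned, hierarchically organized embedding changes which diversity the loop ends up finding.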
BlobGAN: Spatially Disentangled Scene Representations
We propose an unsupervised, mid-level representation for a generative model
of scenes. The representation is mid-level in that it is neither per-pixel nor
per-image; rather, scenes are modeled as a collection of spatial, depth-ordered
"blobs" of features. Blobs are differentiably placed onto a feature grid that
is decoded into an image by a generative adversarial network. Due to the
spatial uniformity of blobs and the locality inherent to convolution, our
network learns to associate different blobs with different entities in a scene
and to arrange these blobs to capture scene layout. We demonstrate this
emergent behavior by showing that, despite training without any supervision,
our method enables applications such as easy manipulation of objects within a
scene (e.g., moving, removing, and restyling furniture), creation of feasible
scenes given constraints (e.g., plausible rooms with drawers at a particular
location), and parsing of real-world images into constituent parts. On a
challenging multi-category dataset of indoor scenes, BlobGAN outperforms
StyleGAN2 in image quality as measured by FID. See our project page for video
results and interactive demo: https://www.dave.ml/blobgan
Comment: ECCV 2022. Project webpage available at https://www.dave.ml/blobgan
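The depth-ordered placement of blob features described above can be sketched as back-to-front alpha compositing of 2-D Gaussians onto a feature grid. This is a simplified stand-in for BlobGAN's differentiable splatting; the dict fields and parameter names are illustrative.

```python
import numpy as np

def composite_blobs(blobs, size=32):
    """blobs: dicts with center (x, y), scale, a feature vector, and depth.
    Blobs are alpha-composited back-to-front, so shallower (nearer) blobs
    occlude deeper ones where their Gaussian alpha is high."""
    ys, xs = np.mgrid[0:size, 0:size]
    grid = np.zeros((size, size, len(blobs[0]["feat"])))
    for b in sorted(blobs, key=lambda b: -b["depth"]):  # far blobs drawn first
        alpha = np.exp(-((xs - b["x"]) ** 2 + (ys - b["y"]) ** 2) / (2 * b["scale"] ** 2))
        grid = (1 - alpha[..., None]) * grid + alpha[..., None] * np.asarray(b["feat"])
    return grid

blobs = [
    {"x": 10, "y": 10, "scale": 4, "feat": [1.0, 0.0], "depth": 2.0},  # deeper
    {"x": 12, "y": 10, "scale": 4, "feat": [0.0, 1.0], "depth": 1.0},  # nearer
]
grid = composite_blobs(blobs)
# at the nearer blob's center, its feature fully occludes the deeper one
print(grid[10, 12])  # -> [0. 1.]
```

Because every step is differentiable in the blob parameters, moving, removing, or restyling an entity reduces to editing one blob's entries, which is what makes the representation easy to manipulate.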
Learning Disentangled Representations
Artificial intelligence systems are seeking to learn better representations. One of the most desirable properties in these representations is disentanglement. Disentangled representations show merits of interpretability and generalizability. Through these representations, the world around us can be decomposed into explanatory factors of variation, and can thus be more easily understood by not only machines but humans. Disentanglement is akin to the reverse engineering process of a video game, where based on exploring the beautiful open world we need to figure out what underlying controllable factors actually render/generate these fantastic dynamics. This thesis mainly discusses the problem of how such "reverse engineering" can be achieved using deep learning techniques in the computer vision domain. Although there have been plenty of works tackling this challenging problem, this thesis shows that an important ingredient that is highly effective but largely neglected by existing works is the modeling of visual variation. We show from various perspectives that by integrating the modeling of visual variation in generative models, we can achieve superior unsupervised disentanglement performance that has never been seen before. Specifically, this thesis covers various novel methods based on technical insights such as variation consistency, variation predictability, perceptual simplicity, spatial constriction, Lie group decomposition, and the contrastive nature of semantic changes. Besides the proposed methods, this thesis also touches on topics such as variational autoencoders, generative adversarial networks, latent space examination, unsupervised disentanglement metrics, and neural network architectures. We hope the observations, analysis, and methods presented in this thesis can inspire and contribute to future work in disentanglement learning and related machine learning fields.
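The disentanglement metrics this thesis touches on typically reward representations in which each ground-truth factor is captured by a single latent dimension. The score below is a simplified, correlation-based stand-in for such gap-style metrics (in the spirit of the mutual-information gap, but not any specific published metric); the function name and test setup are illustrative.

```python
import numpy as np

def disentanglement_score(latents, factors):
    """latents: (N, L) learned codes, factors: (N, F) ground-truth factors.
    For each factor, take the gap between its strongest and second-strongest
    absolute correlation across latent dimensions; a large gap means the
    factor is concentrated in one latent, i.e. well disentangled."""
    scores = []
    for f in factors.T:
        corr = np.array([abs(np.corrcoef(z, f)[0, 1]) for z in latents.T])
        top2 = np.sort(corr)[-2:]
        scores.append(top2[1] - top2[0])
    return float(np.mean(scores))

rng = np.random.default_rng(1)
factors = rng.standard_normal((1000, 2))
aligned = factors + 0.01 * rng.standard_normal((1000, 2))   # one latent per factor
entangled = factors @ np.array([[1.0, 1.0], [1.0, -1.0]])   # factors mixed across latents
print(disentanglement_score(aligned, factors) > disentanglement_score(entangled, factors))  # -> True
```

The axis-aligned code scores near 1 per factor while the mixed code scores near 0, matching the intuition that rotation of a disentangled latent space destroys disentanglement without changing the information content.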