Improving Deep Representation Learning with Complex and Multimodal Data.
Representation learning has emerged as a way to learn meaningful representations from data and has driven breakthroughs in many applications, including visual object recognition, speech recognition, and text understanding. However, learning representations from complex, high-dimensional sensory data is challenging because many irrelevant factors of variation (e.g., data transformations, random noise) are present. On the other hand, to build an end-to-end prediction system for structured output variables, one needs to incorporate probabilistic inference to properly model the mapping from a single input to the possible configurations of output variables. This thesis addresses limitations of current representation learning in two parts.
The first part discusses efficient algorithms for learning invariant representations based on restricted Boltzmann machines (RBMs). After examining why these models are difficult to train, we develop an efficient initialization method for sparse and convolutional RBMs. Building on this, we develop variants of the RBM that learn representations invariant to data transformations, such as translation, rotation, or scale variation, by pooling the filter responses of transformed input data, or invariant to irrelevant patterns, such as random or structured noise, by jointly performing feature selection and feature learning. We demonstrate improved performance on visual object recognition and weakly supervised foreground object segmentation.
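The pooling idea can be pictured with a toy sketch (this is an illustration, not the thesis's RBM implementation): a single filter's response is max-pooled over transformed copies of the input, here the four 90-degree rotations, so the pooled feature does not change when the input itself is rotated.

```python
import numpy as np

def invariant_response(image, filt):
    """Max-pool one filter's response over the four 90-degree rotations.

    Rotating the input merely permutes the pooled set of responses,
    so the max is unchanged: the pooled feature is rotation-invariant.
    """
    responses = [np.sum(np.rot90(image, k) * filt) for k in range(4)]
    return max(responses)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
filt = rng.random((8, 8))
# The response is identical for the original and a rotated input.
assert np.isclose(invariant_response(img, filt),
                  invariant_response(np.rot90(img), filt))
```

The same pattern extends to other transformation groups (translations, scales) by pooling over the corresponding set of transformed copies.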
The second part discusses conditional graphical models and learning frameworks for structured output variables using deep generative models as priors. First, we combine the best properties of the CRF and the RBM to enforce both local and global (e.g., object shape) consistencies for visual object segmentation. Furthermore, we develop a deep conditional generative model of structured output variables that forms an end-to-end system trainable by backpropagation. We demonstrate the importance of a global prior and of probabilistic inference for visual object segmentation. Second, we develop a novel multimodal learning framework by casting the problem as a structured output representation learning problem, where the output is one data modality to be predicted from the other modalities, and vice versa. We explain how our method can be more effective than maximum likelihood learning and demonstrate state-of-the-art performance on visual-text and visual-only recognition tasks.
PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113549/1/kihyuks_1.pd
JOINT CODING OF MULTIMODAL BIOMEDICAL IMAGES USING CONVOLUTIONAL NEURAL NETWORKS
The massive volume of data generated daily by the acquisition of medical images in different modalities can be difficult to store in medical facilities and to share over communication networks. To alleviate this issue, efficient compression methods must be implemented to reduce the storage and transmission resources required by such applications. Moreover, since preserving every image detail is critical in the medical context, the use of lossless image compression algorithms is of utmost importance.
This thesis presents research results on a lossless compression scheme designed to jointly encode computerized tomography (CT) and positron emission tomography (PET) images. Different techniques, such as image-to-image translation, intra prediction, and inter prediction, are used, and the redundancies between the two image modalities are also investigated. In the image-to-image translation approach, the original CT data is losslessly compressed and a cross-modality image-translation generative adversarial network is applied to obtain an estimate of the corresponding PET.
Two approaches were implemented and evaluated to determine a PET residue to be compressed along with the original CT. In the first method, the residue is the difference between the original PET and its estimation; in the second, the residue is obtained using an encoder's inter-prediction coding tools. Thus, instead of compressing the two modalities independently, i.e., both images of the original PET-CT pair, the proposed method independently encodes only the CT alongside the PET residue.
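The first residue method can be sketched as a minimal toy example. Here `pet_est` stands in for the GAN's CT-to-PET estimate (which the decoder can regenerate from the decoded CT), images are assumed integer-valued, and the entropy coding of the CT and the residue is omitted:

```python
import numpy as np

def encode(pet, pet_estimate):
    """Residue between the true PET and its cross-modality estimate.

    Only this residue (plus the losslessly coded CT) needs to be stored;
    the decoder re-runs the same CT-to-PET translation to rebuild the estimate.
    """
    return pet.astype(np.int32) - pet_estimate.astype(np.int32)

def decode(residue, pet_estimate):
    """Exact (lossless) reconstruction of the original PET."""
    return (residue + pet_estimate.astype(np.int32)).astype(np.uint16)

rng = np.random.default_rng(1)
pet = rng.integers(0, 4096, size=(4, 4), dtype=np.uint16)      # 12-bit PET
pet_est = rng.integers(0, 4096, size=(4, 4), dtype=np.uint16)  # stand-in GAN estimate
# Round trip is bit-exact: the scheme is lossless.
assert np.array_equal(decode(encode(pet, pet_est), pet_est), pet)
```

The better the estimate, the closer the residue is to zero and the fewer bits it costs to encode.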
Along with the proposed pipeline, a post-processing optimization algorithm that modifies the estimated PET image, by altering its contrast and rescaling it, is implemented to maximize compression efficiency.
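As an illustration of what such a contrast-and-rescaling correction can do, a simple least-squares gain-and-offset fit already shrinks the residue energy; this is a hedged sketch under assumed data, not the thesis's actual optimization algorithm:

```python
import numpy as np

def fit_contrast(pet, pet_estimate):
    """Least-squares gain/offset that best matches the estimate to the PET.

    A smaller residue is cheaper to entropy-code, so shrinking its energy
    with a simple affine correction can improve compression efficiency.
    """
    x = pet_estimate.ravel().astype(np.float64)
    y = pet.ravel().astype(np.float64)
    a, b = np.polyfit(x, y, 1)  # y ~ a * x + b
    return a, b

rng = np.random.default_rng(2)
est = rng.random((8, 8)) * 100
pet = 1.3 * est + 7 + rng.normal(0, 0.1, est.shape)  # estimate off by a gain/offset
a, b = fit_contrast(pet, est)
corrected = a * est + b
# The corrected estimate leaves a much smaller residue to encode.
assert np.abs(pet - corrected).mean() < np.abs(pet - est).mean()
```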
Four different versions (subsets) of a publicly available PET-CT pair dataset were tested. The first subset was used to demonstrate that the concept developed in this work can surpass traditional compression schemes: the results showed gains of up to 8.9% over HEVC. JPEG 2000, on the other hand, proved unsuitable, reaching a compression gain of only -9.1%. For the remaining (more challenging) subsets, the results reveal that the proposed refined post-processing scheme attains, compared to conventional compression methods, compression gains of up to 6.33% using HEVC and 7.78% using VVC.
Autoencoding sensory substitution
Tens of millions of people live blind, and their number is ever increasing. Visual-to-auditory sensory substitution (SS) encompasses a family of cheap, generic solutions that assist the visually impaired by conveying visual information through sound. The required SS training is lengthy: months of effort are necessary to reach a practical level of adaptation. There are two reasons for this tedious training process: the elongated substituting audio signal, and the disregard for the compressive characteristics of the human hearing system.
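For context, a conventional scan-based visual-to-auditory mapping (in the spirit of systems such as The vOICe; all parameter values here are illustrative) shows why the substituting signal is elongated: the audio duration grows with image width, since columns are played one after another.

```python
import numpy as np

def image_to_sound(image, duration=1.0, sr=8000, f_lo=200.0, f_hi=4000.0):
    """Scan conversion: columns become time, rows become pitch.

    Each column of the image is played in turn; every pixel row drives a
    sinusoid (top rows = highest pitch) with brightness as amplitude.
    The signal length scales with image width, which is why conventional
    SS stimuli are long.
    """
    h, w = image.shape
    freqs = np.geomspace(f_hi, f_lo, h)       # top row maps to highest pitch
    col_len = int(duration * sr / w)          # samples per scanned column
    t = np.arange(col_len) / sr
    sound = np.concatenate([
        (image[:, c, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(0)
        for c in range(w)
    ])
    return sound / max(np.abs(sound).max(), 1e-9)  # normalise to [-1, 1]

img = np.zeros((16, 16))
img[4, :] = 1.0                               # a horizontal bright line
audio = image_to_sound(img)
assert audio.shape == (16 * (8000 // 16),) and np.abs(audio).max() <= 1.0
```

A recurrent autoencoder, as proposed in this work, instead learns a compressed code for the image and can therefore emit a much shorter substituting signal.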
To overcome these obstacles, we developed a novel class of SS methods, by training deep recurrent autoencoders for image-to-sound conversion. We successfully trained deep learning models on different datasets to execute visual-to-auditory stimulus conversion. By constraining the visual space, we demonstrated the viability of shortened substituting audio signals, while proposing mechanisms, such as the integration of computational hearing models, to optimally convey visual features in the substituting stimulus as perceptually discernible auditory components. We tested our approach in two separate cases. In the first experiment, the author went blindfolded for 5 days, while performing SS training on hand posture discrimination. The second experiment assessed the accuracy of reaching movements towards objects on a table. In both test cases, above-chance-level accuracy was attained after a few hours of training.
Our novel SS architecture broadens the horizon of rehabilitation methods engineered for the visually impaired. Further improvements to the proposed model should yield hastened rehabilitation of the blind and, as a consequence, wider adoption of SS devices.
Improving the Performance of Autoencoder-Based Computer Vision Models Using Body Embedding
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Industrial Engineering, College of Engineering, August 2021. Advisor: 박종헌.
Deep learning models have dominated the field of computer vision, achieving state-of-the-art performance in various tasks. In particular, with recent increases in images and videos of people being posted on social media, research on computer vision tasks for analyzing human visual information is being used in various ways.
This thesis addresses two human-related computer vision tasks: fashion style classification and motion similarity measurement. In real-world fashion style classification, the number of samples collected for each style class varies with the fashion trend at the time of data collection, resulting in class imbalance. To cope with this imbalance, generalized few-shot learning, in which both minority and majority classes are used for learning and evaluation, is employed. Additionally, two modalities, the foreground image (cropped to show only the body and fashion items) and the fashion attribute information, are reflected in the fashion image embedding through a variational autoencoder. The K-fashion dataset, collected from a Korean fashion shopping mall, is used for model training and evaluation.
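Cyclic oversampling, named in the thesis outline as part of classifier training, can be pictured with a minimal sketch; the helper and data below are illustrative, not the author's implementation:

```python
import itertools

def cyclic_oversample(samples_by_class, n_per_class):
    """Balance an imbalanced dataset by cycling through each class's samples.

    Minority-class samples are repeated in order, wrapping around, until
    every class contributes the same number of training samples.
    """
    balanced = {}
    for label, samples in samples_by_class.items():
        balanced[label] = list(itertools.islice(itertools.cycle(samples),
                                                n_per_class))
    return balanced

data = {"casual": ["c1", "c2", "c3", "c4"], "retro": ["r1"]}  # imbalanced
out = cyclic_oversample(data, n_per_class=4)
assert out["retro"] == ["r1", "r1", "r1", "r1"]   # minority class repeated
assert out["casual"] == ["c1", "c2", "c3", "c4"]  # majority class untouched
```

Unlike random oversampling, cycling guarantees every minority sample appears equally often in each epoch.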
Motion similarity measurement is used as a sub-module in various tasks such as action recognition, anomaly detection, and person re-identification; however, it has attracted less attention than those tasks, partly because the same motion can look different depending on the performer's body structure and the camera angle, and partly because the lack of public datasets for training and evaluation makes research challenging. We therefore propose a synthetic dataset for model training and use an autoencoder architecture to learn motion embeddings disentangled from the body structure and camera angle attributes. The autoencoder is designed to generate a motion embedding for each body part, so that motion similarity can be measured per body part. Furthermore, motion speed is synchronized by matching patches performing similar motions using dynamic time warping. The similarity score dataset for evaluation was collected through a crowdsourcing platform using videos from NTU RGB+D 120, an action recognition dataset.
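The speed synchronization relies on dynamic time warping. A textbook 1-D DTW sketch (the thesis applies it to body-part embedding sequences rather than raw scalars) shows how a speed difference stops inflating dissimilarity:

```python
import numpy as np

def dtw_distance(a, b):
    """Plain dynamic-time-warping cost between two 1-D sequences.

    Aligns similarly-moving stretches of the two motions, so performing
    the same motion at a different speed does not add cost.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

fast = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
slow = np.repeat(fast, 2)                 # same motion at half speed
assert dtw_distance(fast, slow) == 0.0    # DTW sees them as identical
assert np.abs(fast - slow[:5]).sum() > 0  # naive frame-wise diff does not
```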
When the proposed models were verified on their respective evaluation datasets, both outperformed the baselines. In the fashion style classification problem, the proposed model showed the most balanced performance among all models, without bias toward either the minority or the majority classes. In the motion similarity measurement experiments, the correlation of the proposed model with the human-measured similarity scores was higher than that of the baselines.
Chapter 1 Introduction
1.1 Background and motivation
1.2 Research contribution
1.2.1 Fashion style classification
1.2.2 Human motion similarity
1.2.3 Summary of the contributions
1.3 Thesis outline
Chapter 2 Literature Review
2.1 Fashion style classification
2.1.1 Machine learning and deep learning-based approaches
2.1.2 Class imbalance
2.1.3 Variational autoencoder
2.2 Human motion similarity
2.2.1 Measuring the similarity between two people
2.2.2 Human body embedding
2.2.3 Datasets for measuring the similarity
2.2.4 Triplet and quadruplet losses
2.2.5 Dynamic time warping
Chapter 3 Fashion Style Classification
3.1 Dataset for fashion style classification: K-fashion
3.2 Multimodal variational inference for fashion style classification
3.2.1 CADA-VAE
3.2.2 Generating multimodal features
3.2.3 Classifier training with cyclic oversampling
3.3 Experimental results for fashion style classification
3.3.1 Implementation details
3.3.2 Settings for experiments
3.3.3 Experimental results on K-fashion
3.3.4 Qualitative analysis
3.3.5 Effectiveness of the cyclic oversampling
Chapter 4 Motion Similarity Measurement
4.1 Datasets for motion similarity
4.1.1 Synthetic motion dataset: SARA dataset
4.1.2 NTU RGB+D 120 similarity annotations
4.2 Framework for measuring motion similarity
4.2.1 Body part embedding model
4.2.2 Measuring motion similarity
4.3 Experimental results for measuring motion similarity
4.3.1 Implementation details
4.3.2 Experimental results on NTU RGB+D 120 similarity annotations
4.3.3 Visualization of motion latent clusters
4.4 Application
4.4.1 Real-world application with dancing videos
4.4.2 Tuning similarity scores to match human perception
Chapter 5 Conclusions
5.1 Summary and contributions
5.2 Limitations and future research
Appendices
Chapter A NTU RGB+D 120 Similarity Annotations
A.1 Data collection
A.2 AMT score analysis
Chapter B Data Cleansing of NTU RGB+D 120 Skeletal Data
Chapter C Motion Sequence Generation Using Mixamo
Bibliography
Abstract (in Korean)
Exploring variability in medical imaging
Although recent successes of deep learning and novel machine learning techniques have improved the performance of classification and (anomaly) detection in computer vision problems, applying these methods in medical imaging pipelines remains very challenging. One of the main reasons is the amount of variability encountered and encapsulated in human anatomy and subsequently reflected in medical images. This fundamental factor impacts most stages of modern medical image processing pipelines.
The variability of human anatomy makes it virtually impossible to build, for each disease, the large labelled and annotated datasets that fully supervised machine learning requires. An efficient way to cope with this is to learn from normal samples only, since such data are much easier to collect. A case study of such an automatic anomaly detection system based on normative learning is presented in this work: a framework for detecting fetal cardiac anomalies during ultrasound screening using generative models trained only on normal/healthy subjects.
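Normative learning can be illustrated with a linear stand-in for the generative model: fit the model on healthy samples only, then score new cases by reconstruction error. PCA here is a toy proxy for the actual generative model, and all data below are synthetic:

```python
import numpy as np

def fit_normative_model(normal_data, n_components=2):
    """Fit a linear 'autoencoder' (PCA) on healthy samples only."""
    mean = normal_data.mean(axis=0)
    _, _, vt = np.linalg.svd(normal_data - mean, full_matrices=False)
    return mean, vt[:n_components]

def anomaly_score(x, mean, components):
    """Reconstruction error: large when x leaves the normal manifold."""
    z = (x - mean) @ components.T
    recon = z @ components + mean
    return float(np.linalg.norm(x - recon))

rng = np.random.default_rng(3)
basis = rng.normal(size=(2, 16))
normal = rng.normal(size=(200, 2)) @ basis       # healthy data on a 2-D plane
mean, comps = fit_normative_model(normal)
healthy = rng.normal(size=2) @ basis             # on-manifold sample
anomalous = healthy + 5.0 * rng.normal(size=16)  # off-manifold sample
# Abnormal cases reconstruct poorly and receive a higher score.
assert anomaly_score(anomalous, mean, comps) > anomaly_score(healthy, mean, comps)
```

The key property carries over to deep generative models: having only seen healthy anatomy, the model reconstructs it well and fails on anomalies.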
However, despite the significant improvement in automatic abnormality detection systems, clinical routine continues to rely exclusively on overburdened medical experts to diagnose and localise abnormalities. Integrating human expert knowledge into the medical imaging processing pipeline entails uncertainty that is mainly correlated with inter-observer variability. From the perspective of building an automated medical imaging system, it is still an open issue to what extent this kind of variability and the resulting uncertainty are introduced during the training of a model, and how they affect the final performance on the task. Consequently, it is very important to explore the effect of inter-observer variability both on the reliable estimation of a model's uncertainty and on the model's performance in a specific machine learning task. A thorough investigation of this issue is presented in this work by leveraging automated estimates of machine learning model uncertainty, inter-observer variability, and segmentation task performance on lung CT scans.
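One common way to quantify inter-observer variability for segmentation, shown here as an illustrative sketch rather than the study's exact protocol, is the mean pairwise Dice overlap between the observers' masks:

```python
import numpy as np
from itertools import combinations

def dice(a, b):
    """Dice overlap between two binary masks (1.0 = perfect agreement)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def inter_observer_agreement(masks):
    """Mean pairwise Dice across observers; lower means more disagreement."""
    scores = [dice(a, b) for a, b in combinations(masks, 2)]
    return float(np.mean(scores))

m1 = np.zeros((8, 8), bool); m1[2:6, 2:6] = True
m2 = np.zeros((8, 8), bool); m2[2:6, 3:7] = True  # annotation shifted by one pixel
m3 = m1.copy()                                    # a third observer agrees with m1
v = inter_observer_agreement([m1, m2, m3])
assert 0.0 < v < 1.0  # partial agreement across the three observers
```

Such per-case agreement scores can then be compared against the model's own uncertainty estimates to see whether the two track each other.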
Finally, an overview of existing anomaly detection methods in medical imaging is presented. This state-of-the-art survey includes both conventional pattern recognition methods and deep learning-based methods, and is one of the first literature surveys attempted in this specific research area.
Open Access