Improving Deep Representation Learning with Complex and Multimodal Data.
Representation learning has emerged as a way to learn meaningful representations from data and has driven breakthroughs in many applications, including visual object recognition, speech recognition, and text understanding. However, learning representations from complex, high-dimensional sensory data is challenging because many irrelevant factors of variation (e.g., data transformations, random noise) are present. On the other hand, to build an end-to-end prediction system for structured output variables, one needs to incorporate probabilistic inference to properly model the mapping from a single input to the possible configurations of output variables. This thesis addresses limitations of current representation learning in two parts.
The first part discusses efficient algorithms for learning invariant representations based on restricted Boltzmann machines (RBMs). After examining why these models are difficult to train, we develop an efficient initialization method for sparse and convolutional RBMs. Building on this, we develop variants of the RBM that learn representations invariant to data transformations, such as translation, rotation, or scale variation, by pooling the filter responses of transformed input data, or invariant to irrelevant patterns, such as random or structured noise, by jointly performing feature selection and feature learning. We demonstrate improved performance on visual object recognition and weakly supervised foreground object segmentation.
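The pooling idea can be pictured with a toy sketch (this is an illustration, not the thesis's RBM implementation): a single filter's response is max-pooled over transformed copies of the input, here the four 90-degree rotations, so the pooled feature does not change when the input itself is rotated.

```python
import numpy as np

def invariant_response(image, filt):
    """Max-pool one filter's response over the four 90-degree rotations.

    Rotating the input merely permutes the pooled set of responses,
    so the max is unchanged: the pooled feature is rotation-invariant.
    """
    responses = [np.sum(np.rot90(image, k) * filt) for k in range(4)]
    return max(responses)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
filt = rng.random((8, 8))
# The response is identical for the original and a rotated input.
assert np.isclose(invariant_response(img, filt),
                  invariant_response(np.rot90(img), filt))
```

The same pattern extends to other transformation groups (translations, scales) by pooling over the corresponding set of transformed copies.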
The second part discusses conditional graphical models and learning frameworks for structured output variables using deep generative models as priors. First, we combine the best properties of the CRF and the RBM to enforce both local and global (e.g., object shape) consistencies for visual object segmentation. Furthermore, we develop a deep conditional generative model of structured output variables that forms an end-to-end system trainable by backpropagation. We demonstrate the importance of a global prior and of probabilistic inference for visual object segmentation. Second, we develop a novel multimodal learning framework by casting the problem as a structured output representation learning problem, where the output is one data modality to be predicted from the other modalities, and vice versa. We explain how our method can be more effective than maximum likelihood learning and demonstrate state-of-the-art performance on visual-text and visual-only recognition tasks.
PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113549/1/kihyuks_1.pd
JOINT CODING OF MULTIMODAL BIOMEDICAL IMAGES USING CONVOLUTIONAL NEURAL NETWORKS
The massive volume of data generated daily by the acquisition of medical images in different modalities can be difficult to store in medical facilities and to share over communication networks. To alleviate this issue, efficient compression methods must be implemented to reduce the storage and transmission resources required by such applications. Moreover, since preserving every image detail is critical in the medical context, the use of lossless image compression algorithms is of utmost importance.
This thesis presents research results on a lossless compression scheme designed to jointly encode computerized tomography (CT) and positron emission tomography (PET) images. Different techniques, such as image-to-image translation, intra prediction, and inter prediction, are used, and the redundancies between the two image modalities are also investigated. In the image-to-image translation approach, the original CT data is losslessly compressed and a cross-modality image-translation generative adversarial network is applied to obtain an estimate of the corresponding PET.
Two approaches were implemented and evaluated to determine a PET residue to be compressed along with the original CT. In the first method, the residue is the difference between the original PET and its estimation; in the second, the residue is obtained using an encoder's inter-prediction coding tools. Thus, instead of compressing the two modalities independently, i.e., both images of the original PET-CT pair, the proposed method independently encodes only the CT alongside the PET residue.
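The first residue method can be sketched as a minimal toy example. Here `pet_est` stands in for the GAN's CT-to-PET estimate (which the decoder can regenerate from the decoded CT), images are assumed integer-valued, and the entropy coding of the CT and the residue is omitted:

```python
import numpy as np

def encode(pet, pet_estimate):
    """Residue between the true PET and its cross-modality estimate.

    Only this residue (plus the losslessly coded CT) needs to be stored;
    the decoder re-runs the same CT-to-PET translation to rebuild the estimate.
    """
    return pet.astype(np.int32) - pet_estimate.astype(np.int32)

def decode(residue, pet_estimate):
    """Exact (lossless) reconstruction of the original PET."""
    return (residue + pet_estimate.astype(np.int32)).astype(np.uint16)

rng = np.random.default_rng(1)
pet = rng.integers(0, 4096, size=(4, 4), dtype=np.uint16)      # 12-bit PET
pet_est = rng.integers(0, 4096, size=(4, 4), dtype=np.uint16)  # stand-in GAN estimate
# Round trip is bit-exact: the scheme is lossless.
assert np.array_equal(decode(encode(pet, pet_est), pet_est), pet)
```

The better the estimate, the closer the residue is to zero and the fewer bits it costs to encode.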
Along with the proposed pipeline, a post-processing optimization algorithm that modifies the estimated PET image, by altering its contrast and rescaling it, is implemented to maximize compression efficiency.
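As an illustration of what such a contrast-and-rescaling correction can do, a simple least-squares gain-and-offset fit already shrinks the residue energy; this is a hedged sketch under assumed data, not the thesis's actual optimization algorithm:

```python
import numpy as np

def fit_contrast(pet, pet_estimate):
    """Least-squares gain/offset that best matches the estimate to the PET.

    A smaller residue is cheaper to entropy-code, so shrinking its energy
    with a simple affine correction can improve compression efficiency.
    """
    x = pet_estimate.ravel().astype(np.float64)
    y = pet.ravel().astype(np.float64)
    a, b = np.polyfit(x, y, 1)  # y ~ a * x + b
    return a, b

rng = np.random.default_rng(2)
est = rng.random((8, 8)) * 100
pet = 1.3 * est + 7 + rng.normal(0, 0.1, est.shape)  # estimate off by a gain/offset
a, b = fit_contrast(pet, est)
corrected = a * est + b
# The corrected estimate leaves a much smaller residue to encode.
assert np.abs(pet - corrected).mean() < np.abs(pet - est).mean()
```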
Four different versions (subsets) of a publicly available PET-CT pair dataset were tested. The first subset was used to demonstrate that the concept developed in this work can surpass traditional compression schemes: the results showed gains of up to 8.9% over HEVC. JPEG 2000, on the other hand, proved unsuitable, reaching a compression gain of only -9.1%. For the remaining (more challenging) subsets, the results reveal that the proposed refined post-processing scheme attains, compared to conventional compression methods, compression gains of up to 6.33% using HEVC and 7.78% using VVC.
Autoencoding sensory substitution
Tens of millions of people live blind, and their number is ever increasing. Visual-to-auditory sensory substitution (SS) encompasses a family of cheap, generic solutions that assist the visually impaired by conveying visual information through sound. The required SS training is lengthy: months of effort are necessary to reach a practical level of adaptation. There are two reasons for this tedious training process: the elongated substituting audio signal, and the disregard for the compressive characteristics of the human hearing system.
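For context, a conventional scan-based visual-to-auditory mapping (in the spirit of systems such as The vOICe; all parameter values here are illustrative) shows why the substituting signal is elongated: the audio duration grows with image width, since columns are played one after another.

```python
import numpy as np

def image_to_sound(image, duration=1.0, sr=8000, f_lo=200.0, f_hi=4000.0):
    """Scan conversion: columns become time, rows become pitch.

    Each column of the image is played in turn; every pixel row drives a
    sinusoid (top rows = highest pitch) with brightness as amplitude.
    The signal length scales with image width, which is why conventional
    SS stimuli are long.
    """
    h, w = image.shape
    freqs = np.geomspace(f_hi, f_lo, h)       # top row maps to highest pitch
    col_len = int(duration * sr / w)          # samples per scanned column
    t = np.arange(col_len) / sr
    sound = np.concatenate([
        (image[:, c, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(0)
        for c in range(w)
    ])
    return sound / max(np.abs(sound).max(), 1e-9)  # normalise to [-1, 1]

img = np.zeros((16, 16))
img[4, :] = 1.0                               # a horizontal bright line
audio = image_to_sound(img)
assert audio.shape == (16 * (8000 // 16),) and np.abs(audio).max() <= 1.0
```

A recurrent autoencoder, as proposed in this work, instead learns a compressed code for the image and can therefore emit a much shorter substituting signal.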
To overcome these obstacles, we developed a novel class of SS methods, by training deep recurrent autoencoders for image-to-sound conversion. We successfully trained deep learning models on different datasets to execute visual-to-auditory stimulus conversion. By constraining the visual space, we demonstrated the viability of shortened substituting audio signals, while proposing mechanisms, such as the integration of computational hearing models, to optimally convey visual features in the substituting stimulus as perceptually discernible auditory components. We tested our approach in two separate cases. In the first experiment, the author went blindfolded for 5 days, while performing SS training on hand posture discrimination. The second experiment assessed the accuracy of reaching movements towards objects on a table. In both test cases, above-chance-level accuracy was attained after a few hours of training.
Our novel SS architecture broadens the horizon of rehabilitation methods engineered for the visually impaired. Further improvements to the proposed model should yield hastened rehabilitation of the blind and, as a consequence, wider adoption of SS devices.
Improving the Performance of Autoencoder-Based Computer Vision Models Using Body Embedding
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Industrial Engineering, College of Engineering, August 2021. Advisor: 박종헌.
Deep learning models have dominated the field of computer vision, achieving state-of-the-art performance in various tasks. In particular, with recent increases in images and videos of people being posted on social media, research on computer vision tasks for analyzing human visual information is being used in various ways.
This thesis addresses two human-related computer vision tasks: fashion style classification and motion similarity measurement. In real-world fashion style classification, the number of samples collected for each style class varies with the fashion trend at the time of data collection, resulting in class imbalance. To cope with this imbalance, generalized few-shot learning, in which both minority and majority classes are used for learning and evaluation, is employed. Additionally, two modalities, the foreground image (cropped to show only the body and fashion items) and the fashion attribute information, are reflected in the fashion image embedding through a variational autoencoder. The K-fashion dataset, collected from a Korean fashion shopping mall, is used for model training and evaluation.
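Cyclic oversampling, named in the thesis outline as part of classifier training, can be pictured with a minimal sketch; the helper and data below are illustrative, not the author's implementation:

```python
import itertools

def cyclic_oversample(samples_by_class, n_per_class):
    """Balance an imbalanced dataset by cycling through each class's samples.

    Minority-class samples are repeated in order, wrapping around, until
    every class contributes the same number of training samples.
    """
    balanced = {}
    for label, samples in samples_by_class.items():
        balanced[label] = list(itertools.islice(itertools.cycle(samples),
                                                n_per_class))
    return balanced

data = {"casual": ["c1", "c2", "c3", "c4"], "retro": ["r1"]}  # imbalanced
out = cyclic_oversample(data, n_per_class=4)
assert out["retro"] == ["r1", "r1", "r1", "r1"]   # minority class repeated
assert out["casual"] == ["c1", "c2", "c3", "c4"]  # majority class untouched
```

Unlike random oversampling, cycling guarantees every minority sample appears equally often in each epoch.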
Motion similarity measurement is used as a sub-module in various tasks such as action recognition, anomaly detection, and person re-identification; however, it has attracted less attention than those tasks, partly because the same motion can look different depending on the performer's body structure and the camera angle, and partly because the lack of public datasets for training and evaluation makes research challenging. We therefore propose a synthetic dataset for model training and use an autoencoder architecture to learn motion embeddings disentangled from the body structure and camera angle attributes. The autoencoder is designed to generate a motion embedding for each body part, so that motion similarity can be measured per body part. Furthermore, motion speed is synchronized by matching patches performing similar motions using dynamic time warping. The similarity score dataset for evaluation was collected through a crowdsourcing platform using videos from NTU RGB+D 120, an action recognition dataset.
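The speed synchronization relies on dynamic time warping. A textbook 1-D DTW sketch (the thesis applies it to body-part embedding sequences rather than raw scalars) shows how a speed difference stops inflating dissimilarity:

```python
import numpy as np

def dtw_distance(a, b):
    """Plain dynamic-time-warping cost between two 1-D sequences.

    Aligns similarly-moving stretches of the two motions, so performing
    the same motion at a different speed does not add cost.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

fast = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
slow = np.repeat(fast, 2)                 # same motion at half speed
assert dtw_distance(fast, slow) == 0.0    # DTW sees them as identical
assert np.abs(fast - slow[:5]).sum() > 0  # naive frame-wise diff does not
```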
When the proposed models were verified on their respective evaluation datasets, both outperformed the baselines. In the fashion style classification problem, the proposed model showed the most balanced performance among all models, without bias toward either the minority or the majority classes. In the motion similarity measurement experiments, the correlation of the proposed model with the human-measured similarity scores was higher than that of the baselines.
Chapter 1 Introduction
1.1 Background and motivation
1.2 Research contribution
1.2.1 Fashion style classification
1.2.2 Human motion similarity
1.2.3 Summary of the contributions
1.3 Thesis outline
Chapter 2 Literature Review
2.1 Fashion style classification
2.1.1 Machine learning and deep learning-based approaches
2.1.2 Class imbalance
2.1.3 Variational autoencoder
2.2 Human motion similarity
2.2.1 Measuring the similarity between two people
2.2.2 Human body embedding
2.2.3 Datasets for measuring the similarity
2.2.4 Triplet and quadruplet losses
2.2.5 Dynamic time warping
Chapter 3 Fashion Style Classification
3.1 Dataset for fashion style classification: K-fashion
3.2 Multimodal variational inference for fashion style classification
3.2.1 CADA-VAE
3.2.2 Generating multimodal features
3.2.3 Classifier training with cyclic oversampling
3.3 Experimental results for fashion style classification
3.3.1 Implementation details
3.3.2 Settings for experiments
3.3.3 Experimental results on K-fashion
3.3.4 Qualitative analysis
3.3.5 Effectiveness of the cyclic oversampling
Chapter 4 Motion Similarity Measurement
4.1 Datasets for motion similarity
4.1.1 Synthetic motion dataset: SARA dataset
4.1.2 NTU RGB+D 120 similarity annotations
4.2 Framework for measuring motion similarity
4.2.1 Body part embedding model
4.2.2 Measuring motion similarity
4.3 Experimental results for measuring motion similarity
4.3.1 Implementation details
4.3.2 Experimental results on NTU RGB+D 120 similarity annotations
4.3.3 Visualization of motion latent clusters
4.4 Application
4.4.1 Real-world application with dancing videos
4.4.2 Tuning similarity scores to match human perception
Chapter 5 Conclusions
5.1 Summary and contributions
5.2 Limitations and future research
Appendices
Chapter A NTU RGB+D 120 Similarity Annotations
A.1 Data collection
A.2 AMT score analysis
Chapter B Data Cleansing of NTU RGB+D 120 Skeletal Data
Chapter C Motion Sequence Generation Using Mixamo
Bibliography
Abstract (in Korean)
Exploring variability in medical imaging
Although recent successes of deep learning and novel machine learning techniques have improved the performance of classification and (anomaly) detection in computer vision problems, applying these methods in medical imaging pipelines remains very challenging. One of the main reasons is the amount of variability encountered and encapsulated in human anatomy and subsequently reflected in medical images. This fundamental factor impacts most stages of modern medical image processing pipelines.
The variability of human anatomy makes it virtually impossible to build, for each disease, the large labelled and annotated datasets that fully supervised machine learning requires. An efficient way to cope with this is to learn from normal samples only, since such data are much easier to collect. A case study of such an automatic anomaly detection system based on normative learning is presented in this work: a framework for detecting fetal cardiac anomalies during ultrasound screening using generative models trained only on normal/healthy subjects.
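Normative learning can be illustrated with a linear stand-in for the generative model: fit the model on healthy samples only, then score new cases by reconstruction error. PCA here is a toy proxy for the actual generative model, and all data below are synthetic:

```python
import numpy as np

def fit_normative_model(normal_data, n_components=2):
    """Fit a linear 'autoencoder' (PCA) on healthy samples only."""
    mean = normal_data.mean(axis=0)
    _, _, vt = np.linalg.svd(normal_data - mean, full_matrices=False)
    return mean, vt[:n_components]

def anomaly_score(x, mean, components):
    """Reconstruction error: large when x leaves the normal manifold."""
    z = (x - mean) @ components.T
    recon = z @ components + mean
    return float(np.linalg.norm(x - recon))

rng = np.random.default_rng(3)
basis = rng.normal(size=(2, 16))
normal = rng.normal(size=(200, 2)) @ basis       # healthy data on a 2-D plane
mean, comps = fit_normative_model(normal)
healthy = rng.normal(size=2) @ basis             # on-manifold sample
anomalous = healthy + 5.0 * rng.normal(size=16)  # off-manifold sample
# Abnormal cases reconstruct poorly and receive a higher score.
assert anomaly_score(anomalous, mean, comps) > anomaly_score(healthy, mean, comps)
```

The key property carries over to deep generative models: having only seen healthy anatomy, the model reconstructs it well and fails on anomalies.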
However, despite the significant improvement in automatic abnormality detection systems, clinical routine continues to rely exclusively on overburdened medical experts to diagnose and localise abnormalities. Integrating human expert knowledge into the medical imaging processing pipeline entails uncertainty that is mainly correlated with inter-observer variability. From the perspective of building an automated medical imaging system, it is still an open issue to what extent this kind of variability and the resulting uncertainty are introduced during the training of a model, and how they affect the final performance on the task. Consequently, it is very important to explore the effect of inter-observer variability both on the reliable estimation of a model's uncertainty and on the model's performance in a specific machine learning task. A thorough investigation of this issue is presented in this work by leveraging automated estimates of machine learning model uncertainty, inter-observer variability, and segmentation task performance on lung CT scans.
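One common way to quantify inter-observer variability for segmentation, shown here as an illustrative sketch rather than the study's exact protocol, is the mean pairwise Dice overlap between the observers' masks:

```python
import numpy as np
from itertools import combinations

def dice(a, b):
    """Dice overlap between two binary masks (1.0 = perfect agreement)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def inter_observer_agreement(masks):
    """Mean pairwise Dice across observers; lower means more disagreement."""
    scores = [dice(a, b) for a, b in combinations(masks, 2)]
    return float(np.mean(scores))

m1 = np.zeros((8, 8), bool); m1[2:6, 2:6] = True
m2 = np.zeros((8, 8), bool); m2[2:6, 3:7] = True  # annotation shifted by one pixel
m3 = m1.copy()                                    # a third observer agrees with m1
v = inter_observer_agreement([m1, m2, m3])
assert 0.0 < v < 1.0  # partial agreement across the three observers
```

Such per-case agreement scores can then be compared against the model's own uncertainty estimates to see whether the two track each other.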
Finally, an overview of existing anomaly detection methods in medical imaging is presented. This state-of-the-art survey includes both conventional pattern recognition methods and deep learning-based methods, and is one of the first literature surveys attempted in this specific research area.
Open Access