
    Improving Deep Representation Learning with Complex and Multimodal Data.

    Representation learning has emerged as a way to learn meaningful representations from data and has driven breakthroughs in many applications, including visual object recognition, speech recognition, and text understanding. However, learning representations from complex high-dimensional sensory data is challenging, since there exist many irrelevant factors of variation (e.g., data transformation, random noise). On the other hand, to build an end-to-end prediction system for structured output variables, one needs to incorporate probabilistic inference to properly model the mapping from a single input to the possible configurations of the output variables. This thesis addresses the limitations of current representation learning in two parts. The first part discusses efficient learning algorithms for invariant representations based on restricted Boltzmann machines (RBMs). After pointing out the difficulties of learning, we develop an efficient initialization method for sparse and convolutional RBMs. On top of that, we develop variants of the RBM that learn representations invariant to data transformations such as translation, rotation, or scale variation, by pooling the filter responses of the input data after a transformation, or invariant to irrelevant patterns such as random or structured noise, by jointly performing feature selection and feature learning. We demonstrate improved performance on visual object recognition and weakly supervised foreground object segmentation. The second part discusses conditional graphical models and learning frameworks for structured output variables that use deep generative models as priors. For example, we combine the best properties of the CRF and the RBM to enforce both local and global (e.g., object shape) consistency for visual object segmentation. Furthermore, we develop a deep conditional generative model of structured output variables that is an end-to-end system trainable by backpropagation, and we demonstrate the importance of a global prior and probabilistic inference for visual object segmentation. Finally, we develop a novel multimodal learning framework by casting the problem as a structured output representation learning problem in which the output is one data modality to be predicted from the other modalities, and vice versa. We explain how our method can be more effective than maximum likelihood learning and demonstrate state-of-the-art performance on visual-text and visual-only recognition tasks.
    PhD, Electrical Engineering: Systems. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113549/1/kihyuks_1.pd
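
    To make the pooling idea above concrete, here is a minimal numpy sketch of transformation-invariant feature extraction: filter responses are pooled over a set of input transformations (rotations, in this toy example), so the resulting features do not change when the input itself is transformed. All names are hypothetical, and random filters stand in for the learned RBM filters described in the thesis.

```python
import numpy as np

def invariant_features(image, filters, transforms):
    """Pool linear filter responses over a set of input transformations.

    If the transformation set forms a closed group (here: rotations by
    0/90/180/270 degrees), max-pooling over it makes the features exactly
    invariant to those transformations of the input.
    """
    responses = [filters @ t(image).ravel() for t in transforms]
    return np.max(np.stack(responses), axis=0)

rng = np.random.default_rng(0)
filters = rng.normal(size=(16, 64))        # random stand-ins for learned filters
transforms = [lambda im, k=k: np.rot90(im, k) for k in range(4)]
patch = rng.normal(size=(8, 8))

f_orig = invariant_features(patch, filters, transforms)
f_rot = invariant_features(np.rot90(patch), filters, transforms)
assert np.allclose(f_orig, f_rot)          # rotating the input leaves features unchanged
```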

    JOINT CODING OF MULTIMODAL BIOMEDICAL IMAGES USING CONVOLUTIONAL NEURAL NETWORKS

    The massive volume of data generated daily by the acquisition of medical images with different modalities can be difficult to store in medical facilities and to share over communication networks. To alleviate this issue, efficient compression methods must be employed to reduce the storage and transmission resources required by such applications. However, since the preservation of all image details is critically important in the medical context, the use of lossless image compression algorithms is of utmost importance. This thesis presents research results on a lossless compression scheme designed to jointly encode computerized tomography (CT) and positron emission tomography (PET) images. Different techniques, such as image-to-image translation, intra prediction, and inter prediction, are used, and redundancies between the two image modalities are also investigated. In the image-to-image translation approach, we losslessly compress the original CT data and apply a cross-modality image-translation generative adversarial network to obtain an estimate of the corresponding PET. Two approaches were implemented and evaluated to determine a PET residue that is compressed along with the original CT. In the first method, the residue resulting from the differences between the original PET and its estimate is encoded, whereas in the second method, the residue is obtained using the encoder's inter-prediction coding tools. Thus, instead of compressing two independent image modalities, i.e., both images of the original PET-CT pair, the proposed method independently encodes only the CT alongside the PET residue. In addition to the proposed pipeline, a post-processing optimization algorithm that modifies the estimated PET image by altering its contrast and rescaling it is implemented to maximize compression efficiency. Four different versions (subsets) of a publicly available PET-CT pair dataset were tested. The first subset was used to demonstrate that the concept developed in this work can surpass traditional compression schemes: the results showed gains of up to 8.9% using HEVC. JPEG2k, on the other hand, proved unsuitable, reaching a compression gain of only -9.1%. For the remaining (more challenging) subsets, the results reveal that the proposed refined post-processing scheme attains, compared to conventional compression methods, up to 6.33% compression gain using HEVC and 7.78% using VVC.
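
    The first residue method described above can be sketched in a few lines: the GAN-estimated PET only needs to be close enough to the original that the difference residue is cheap to encode, and with integer arithmetic the round trip is exactly lossless. This is a minimal numpy illustration, not the thesis pipeline; the estimate here is synthetic and the function names are invented for the example.

```python
import numpy as np

def pet_residue(pet, pet_estimate):
    """Residue between the original PET and its cross-modality estimate
    (assumed to come from an image-translation GAN). Integer arithmetic
    keeps the mapping exactly invertible."""
    return pet.astype(np.int32) - pet_estimate.astype(np.int32)

def reconstruct_pet(pet_estimate, residue):
    """Inverse mapping used at the decoder: estimate + residue."""
    return (pet_estimate.astype(np.int32) + residue).astype(np.uint16)

# Synthetic 12-bit slices stand in for real PET data and a GAN estimate.
rng = np.random.default_rng(1)
pet = rng.integers(0, 4096, size=(64, 64), dtype=np.uint16)
estimate = np.clip(pet.astype(np.int32) + rng.integers(-8, 9, pet.shape),
                   0, 4095).astype(np.uint16)

res = pet_residue(pet, estimate)
assert np.array_equal(reconstruct_pet(estimate, res), pet)  # lossless round trip
print(np.abs(res).mean())  # the better the estimate, the cheaper the residue
```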

    Autoencoding sensory substitution

    Tens of millions of people live blind, and their number is ever increasing. Visual-to-auditory sensory substitution (SS) encompasses a family of cheap, generic solutions to assist the visually impaired by conveying visual information through sound. The required SS training is lengthy: months of effort are necessary to reach a practical level of adaptation. There are two reasons for the tedious training process: the elongated substituting audio signal, and the disregard for the compressive characteristics of the human hearing system. To overcome these obstacles, we developed a novel class of SS methods by training deep recurrent autoencoders for image-to-sound conversion. We successfully trained deep learning models on different datasets to perform visual-to-auditory stimulus conversion. By constraining the visual space, we demonstrated the viability of shortened substituting audio signals, while proposing mechanisms, such as the integration of computational hearing models, to optimally convey visual features in the substituting stimulus as perceptually discernible auditory components. We tested our approach in two separate cases. In the first experiment, the author went blindfolded for 5 days while performing SS training on hand posture discrimination. The second experiment assessed the accuracy of reaching movements towards objects on a table. In both test cases, above-chance-level accuracy was attained after a few hours of training. Our novel SS architecture broadens the horizon of rehabilitation methods engineered for the visually impaired. Further improvements to the proposed model should yield faster rehabilitation of the blind and, as a consequence, wider adoption of SS devices.
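
    Below is a minimal sketch of the core architectural idea, assuming a PyTorch implementation (the abstract does not specify one): an autoencoder whose bottleneck is a short sequence of bounded "audio" frames, read by a recurrent decoder. Because the image must be reconstructed from the sequence alone, training forces the sound to carry the visual content. Layer sizes and all names are illustrative, not the thesis's model.

```python
import torch
import torch.nn as nn

class SSAutoencoder(nn.Module):
    """Autoencoder whose bottleneck is a short audio-like sequence.

    The encoder maps an image to seq_len frames of bounded 'audio'
    parameters; the recurrent decoder must reconstruct the image from
    that sequence alone.
    """
    def __init__(self, img_dim=32 * 32, seq_len=16, audio_dim=8, hidden=128):
        super().__init__()
        self.seq_len, self.audio_dim = seq_len, audio_dim
        self.encoder = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, seq_len * audio_dim), nn.Tanh(),  # bounded signal
        )
        self.decoder_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, img_dim)

    def forward(self, img):
        b = img.size(0)
        audio = self.encoder(img.flatten(1)).view(b, self.seq_len, self.audio_dim)
        _, h = self.decoder_rnn(audio)     # "listen" to the whole sequence
        return audio, self.readout(h.squeeze(0))

# One training step on random tensors standing in for images.
model = SSAutoencoder()
imgs = torch.rand(4, 32 * 32)
audio, recon = model(imgs)
loss = nn.functional.mse_loss(recon, imgs)
loss.backward()
```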

    신체 μž„λ² λ”©μ„ ν™œμš©ν•œ μ˜€ν† μΈμ½”λ” 기반 컴퓨터 λΉ„μ „ λͺ¨ν˜•μ˜ μ„±λŠ₯ κ°œμ„ 

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, 2021.8. λ°•μ’…ν—Œ.
    Deep learning models have dominated the field of computer vision, achieving state-of-the-art performance in various tasks. In particular, with recent increases in images and videos of people posted on social media, research on computer vision tasks that analyze visual information about humans is being applied in various ways. This thesis addresses two human-related computer vision tasks: classifying fashion styles and measuring motion similarity. In real-world fashion style classification problems, the number of samples collected for each style class varies according to the fashion trend at the time of data collection, resulting in class imbalance. In this thesis, to cope with this class imbalance problem, generalized few-shot learning, in which both minority and majority classes are used for learning and evaluation, is employed. Additionally, two modalities, foreground images cropped to show only the body and fashion item parts, and fashion attribute information, are reflected in the fashion image embedding through a variational autoencoder. The K-fashion dataset, collected from a Korean fashion shopping mall, is used for model training and evaluation. Motion similarity measurement is used as a sub-module in various tasks such as action recognition, anomaly detection, and person re-identification; however, it has attracted less attention than those tasks because the same motion can be represented differently depending on the performer's body structure and the camera angle. The lack of public datasets for model training and evaluation also makes research challenging. Therefore, we propose an artificial dataset for model training and use an autoencoder architecture to learn motion embeddings disentangled from body-structure and camera-angle attributes. The autoencoder is designed to generate a motion embedding for each body part, so that motion similarity can be measured per body part. Furthermore, motion speed is synchronized by matching patches performing similar motions using dynamic time warping. The similarity score dataset for evaluation was collected through a crowdsourcing platform using videos from NTU RGB+D 120, an action recognition dataset. When the proposed models were verified on their respective evaluation datasets, both outperformed the baselines. In the fashion style classification problem, the proposed model showed the most balanced performance among all models, without bias toward either the minority or the majority classes. In the motion similarity measurement experiments, the correlation of the proposed model's scores with human-measured similarity scores was higher than that of the baselines.
    Table of contents:
    Chapter 1 Introduction: 1.1 Background and motivation; 1.2 Research contribution (1.2.1 Fashion style classification, 1.2.2 Human motion similarity, 1.2.3 Summary of the contributions); 1.3 Thesis outline
    Chapter 2 Literature Review: 2.1 Fashion style classification (2.1.1 Machine learning and deep learning-based approaches, 2.1.2 Class imbalance, 2.1.3 Variational autoencoder); 2.2 Human motion similarity (2.2.1 Measuring the similarity between two people, 2.2.2 Human body embedding, 2.2.3 Datasets for measuring the similarity, 2.2.4 Triplet and quadruplet losses, 2.2.5 Dynamic time warping)
    Chapter 3 Fashion Style Classification: 3.1 Dataset for fashion style classification: K-fashion; 3.2 Multimodal variational inference for fashion style classification (3.2.1 CADA-VAE, 3.2.2 Generating multimodal features, 3.2.3 Classifier training with cyclic oversampling); 3.3 Experimental results for fashion style classification (3.3.1 Implementation details, 3.3.2 Settings for experiments, 3.3.3 Experimental results on K-fashion, 3.3.4 Qualitative analysis, 3.3.5 Effectiveness of the cyclic oversampling)
    Chapter 4 Motion Similarity Measurement: 4.1 Datasets for motion similarity (4.1.1 Synthetic motion dataset: SARA dataset, 4.1.2 NTU RGB+D 120 similarity annotations); 4.2 Framework for measuring motion similarity (4.2.1 Body part embedding model, 4.2.2 Measuring motion similarity); 4.3 Experimental results for measuring motion similarity (4.3.1 Implementation details, 4.3.2 Experimental results on NTU RGB+D 120 similarity annotations, 4.3.3 Visualization of motion latent clusters); 4.4 Application (4.4.1 Real-world application with dancing videos, 4.4.2 Tuning similarity scores to match human perception)
    Chapter 5 Conclusions: 5.1 Summary and contributions; 5.2 Limitations and future research
    Appendices: A NTU RGB+D 120 Similarity Annotations (A.1 Data collection, A.2 AMT score analysis); B Data Cleansing of NTU RGB+D 120 Skeletal Data; C Motion Sequence Generation Using Mixamo
    Bibliography; Abstract in Korean
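
    The dynamic-time-warping step used above to compensate for differences in motion speed can be illustrated with the standard textbook DTW recurrence. This is a minimal numpy sketch with hypothetical names, not the thesis's implementation: it aligns two embedding sequences frame by frame so that a slowed-down copy of a motion scores a distance of zero.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping between two embedding sequences (frames x dim).

    Aligning frames that perform similar motion compensates for
    differences in execution speed. Minimal O(n*m) version.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # frame distance
            cost[i, j] = d + min(cost[i - 1, j],              # skip a frame in A
                                 cost[i, j - 1],              # skip a frame in B
                                 cost[i - 1, j - 1])          # match the frames
    return cost[n, m]

# The same motion at half speed aligns perfectly; a reversed one does not.
t = np.linspace(0, 2 * np.pi, 40)
fast = np.stack([np.sin(t), np.cos(t)], axis=1)   # 40-frame toy "motion"
slow = np.repeat(fast, 2, axis=0)                 # identical motion, half speed
print(dtw_distance(fast, slow))                   # 0.0: speed fully compensated
print(dtw_distance(fast, fast[::-1]) > 0)         # True: different motion
```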

    Exploring variability in medical imaging

    Although recent successes of deep learning and novel machine learning techniques have improved the performance of classification and (anomaly) detection in computer vision problems, applying these methods in medical imaging pipelines remains a very challenging task. One of the main reasons for this is the amount of variability that is encountered and encapsulated in human anatomy and subsequently reflected in medical images. This fundamental factor impacts most stages of modern medical image processing pipelines. The variability of human anatomy makes it virtually impossible to build large labelled and annotated datasets for every disease for fully supervised machine learning. An efficient way to cope with this is to learn only from normal samples, since such data are much easier to collect. A case study of such an automatic anomaly detection system based on normative learning is presented in this work: a framework for detecting fetal cardiac anomalies during ultrasound screening using generative models trained only on normal/healthy subjects. However, despite the significant improvement in automatic abnormality detection systems, clinical routine continues to rely exclusively on overburdened medical experts to diagnose and localise abnormalities. Integrating human expert knowledge into the medical image processing pipeline entails uncertainty, which is mainly correlated with inter-observer variability. From the perspective of building an automated medical imaging system, it is still an open issue to what extent this kind of variability, and the resulting uncertainty, is introduced during the training of a model and how it affects the model's final task performance. Consequently, it is very important to explore the effect of inter-observer variability both on the reliable estimation of a model's uncertainty and on the model's performance in a specific machine learning task. A thorough investigation of this issue is presented in this work by leveraging automated estimates of machine learning model uncertainty, inter-observer variability, and segmentation task performance on lung CT scans. Finally, an overview of existing anomaly detection methods in medical imaging is presented. This state-of-the-art survey covers both conventional pattern recognition methods and deep learning-based methods, and it is one of the first literature surveys in this specific research area.
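
    The normative-learning idea in the case study can be sketched with a deliberately simple stand-in: fit a model of "normal" on healthy samples only, then score new samples by how badly the model reconstructs them. Here PCA replaces the generative models used in the thesis, and all names and data are synthetic.

```python
import numpy as np

def fit_normative_model(normal_data, k=8):
    """Fit a linear 'normative model' (PCA) on healthy samples only."""
    mean = normal_data.mean(axis=0)
    _, _, vt = np.linalg.svd(normal_data - mean, full_matrices=False)
    return mean, vt[:k]                    # top-k principal directions

def anomaly_score(x, mean, components):
    """Reconstruction error: large when x lies off the 'normal' subspace."""
    centered = x - mean
    recon = centered @ components.T @ components
    return np.linalg.norm(centered - recon)

# Normal samples live near a low-dimensional subspace; anomalies do not.
rng = np.random.default_rng(2)
basis = rng.normal(size=(8, 100))
normal = rng.normal(size=(500, 8)) @ basis + 0.01 * rng.normal(size=(500, 100))
mean, comps = fit_normative_model(normal)

healthy, anomalous = normal[0], rng.normal(size=100)
print(anomaly_score(healthy, mean, comps) < anomaly_score(anomalous, mean, comps))  # True
```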