
    Unsupervised Clustering of Quantitative Imaging Phenotypes using Autoencoder and Gaussian Mixture Model

    Quantitative medical image computing (radiomics) has been widely applied to build prediction models from medical images. However, overfitting is a significant issue in conventional radiomics, where a large number of radiomic features are used directly to train and test models that predict genotypes or clinical outcomes. To tackle this problem, we propose an unsupervised learning pipeline composed of an autoencoder for representation learning of radiomic features and a Gaussian mixture model based on the minimum message length criterion for clustering. By incorporating probabilistic modeling, disease heterogeneity is taken into account. The performance of the proposed pipeline was evaluated on an institutional MRI cohort of 108 patients with colorectal cancer liver metastases. Our approach automatically selects the optimal number of clusters and assigns patients to clusters (imaging subtypes) with significantly different survival rates. Our method outperforms other unsupervised clustering methods used for radiomics analysis and performs comparably to a state-of-the-art imaging biomarker. Comment: Accepted at MICCAI 201
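    The model-selection step described above can be hinted at with a toy example. Note this is a hedged sketch: the paper's pipeline selects the cluster count with a minimum-message-length criterion on autoencoder features, whereas this stand-in uses scikit-learn's BIC-based Gaussian mixture selection on synthetic 2-D features.

```python
# Sketch only: BIC-based GMM selection as a stand-in for the paper's
# minimum-message-length criterion; the "latent radiomic features" are
# two planted synthetic clusters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (60, 2)),   # hypothetical subtype A
               rng.normal(3, 0.3, (60, 2))])  # hypothetical subtype B

def select_gmm(X, max_k=6):
    """Fit GMMs with 1..max_k components and keep the lowest-BIC model."""
    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, max_k + 1)]
    return min(models, key=lambda m: m.bic(X))

best = select_gmm(X)
print(best.n_components)  # the criterion should recover the 2 planted clusters
```

In the paper's setting, `X` would be the autoencoder's low-dimensional representation of the radiomic features rather than raw synthetic points.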

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
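    One of the five challenges, class imbalance, admits a very small illustration: inverse-frequency class weighting, a common remedy when disease cases are rare relative to controls. This sketch is illustrative and not taken from the review.

```python
# Illustrative sketch (not from the review): inverse-frequency class
# weights, one simple remedy for class imbalance in clinical datasets.
import numpy as np

def class_weights(y):
    """Weight each class inversely to its frequency so that minority
    cases contribute equally to a weighted training loss."""
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

y = np.array([0] * 90 + [1] * 10)   # hypothetical 9:1 class imbalance
print(class_weights(y))             # minority class gets the larger weight
```

Such weights are typically passed to a classifier's loss function so the rare class is not ignored during training.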

    Machine learning to empower electrohydrodynamic processing

    Electrohydrodynamic (EHD) processes are promising healthcare fabrication technologies, as evidenced by the number of commercialised and Food and Drug Administration (FDA)-approved products produced by these processes. Their ability to produce nano-sized products both rapidly and precisely gives them a unique set of qualities that cannot be matched by other fabrication technologies. Consequently, this has stimulated the development of EHD processing to tackle other healthcare challenges. However, as with most technologies, time and resources will be needed to fully realise the potential EHD processes can offer. To address this bottleneck, researchers are adopting machine learning (ML), a subset of artificial intelligence, into their workflows. ML has already made ground-breaking advancements in the healthcare sector, and it is anticipated to do the same in the materials domain. At present, the application of ML in fabrication technologies lags behind other sectors. To that end, this review showcases the progress made by ML for EHD workflows, demonstrating how the latter can benefit greatly from the former. In addition, we provide an introduction to the ML pipeline to help encourage its use by other EHD researchers. As discussed, the merger of ML with EHD has the potential to expedite novel discoveries and to automate the EHD workflow.
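    The kind of ML pipeline the review introduces can be hinted at with a minimal sketch: a least-squares model mapping hypothetical EHD process parameters (voltage, flow rate, polymer concentration) to a product-size target. All parameter names, ranges and coefficients below are invented for illustration.

```python
# Hypothetical sketch of an EHD process-parameter model; the linear
# relationship and all numbers are assumptions, not data from the review.
import numpy as np

rng = np.random.default_rng(1)
# columns: voltage (kV), flow rate (mL/h), polymer concentration (% w/v)
X = rng.uniform([10, 0.5, 2], [25, 3.0, 10], size=(200, 3))
true_w = np.array([-8.0, 40.0, 25.0])            # planted "ground truth"
y = 300 + X @ true_w + rng.normal(0, 5, 200)     # synthetic size target (nm)

# Ordinary least-squares fit with an intercept column.
A = np.hstack([np.ones((200, 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef.round(1))  # should approximate [300, -8, 40, 25]
```

In practice such a fitted model can be inverted or searched to suggest process settings for a desired product size, which is the automation the review anticipates.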

    Towards generalizable machine learning models for computer-aided diagnosis in medicine

    Hidden stratification is a phenomenon in which a training dataset contains unlabeled (hidden) subsets of cases that may affect machine learning model performance. Machine learning models that ignore hidden stratification, despite promising overall performance measured by accuracy and sensitivity, often fail at predicting low-prevalence cases, yet those cases remain important. In the medical domain, patients with diseases are often less common than healthy patients, and a misdiagnosis of a patient with a disease can have significant clinical impacts. Therefore, to build a robust and trustworthy computer-aided diagnosis (CAD) system and a reliable treatment effect prediction model, we cannot pursue only machine learning models with high overall accuracy; we also need to discover any hidden stratification in the data and evaluate the proposed machine learning models with respect to both overall performance and performance on certain subsets (groups) of the data, such as the 'worst group'. In this study, I investigated three approaches for data stratification: a novel algorithmic deep learning (DL) approach that learns similarities among cases, and two schema completion approaches that utilize domain expert knowledge. I further proposed an innovative way to integrate the discovered latent groups into the loss functions of DL models to allow for better model generalizability under the domain shift caused by data heterogeneity. My results on lung nodule Computed Tomography (CT) images and breast cancer histopathology images demonstrate that learning homogeneous groups within heterogeneous data significantly improves the performance of the CAD system, particularly for low-prevalence or worst-performing cases. This study emphasizes the importance of discovering and learning the latent stratification within the data, as it is a critical step towards building ML models that are generalizable and reliable. Ultimately, this discovery can have a profound impact on clinical decision-making, particularly for low-prevalence cases.
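    The 'worst group' evaluation the thesis advocates can be sketched in a few lines; the data and group labels below are invented to show how a high overall accuracy can mask complete failure on a rare hidden subgroup.

```python
# Sketch of worst-group evaluation with invented data: overall accuracy
# looks good while the rare group is entirely misclassified.
import numpy as np

def group_accuracies(y_true, y_pred, groups):
    """Per-group accuracy plus the worst-group accuracy."""
    accs = {g: float(np.mean(y_pred[groups == g] == y_true[groups == g]))
            for g in np.unique(groups)}
    return accs, min(accs.values())

y_true = np.array([1] * 80 + [1] * 20)            # 80 common + 20 rare cases
y_pred = np.array([1] * 80 + [0] * 20)            # model fails on rare cases
groups = np.array(["common"] * 80 + ["rare"] * 20)

accs, worst = group_accuracies(y_true, y_pred, groups)
print(accs, worst)  # overall accuracy is 0.8, worst-group accuracy is 0.0
```

Reporting `worst` alongside overall accuracy is precisely the kind of group-aware evaluation the thesis argues a trustworthy CAD system needs.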

    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse and heterogeneous, and have complicated dependency structures, making analysis with conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey of deep learning in single-cell analysis. We first introduce the background of single-cell technologies and their development, as well as fundamental concepts of deep learning, including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications, noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline: multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. For each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations. Comment: 77 pages, 11 figures, 15 tables; deep learning, single-cell analysi
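    One of the surveyed tasks, imputation, can be illustrated with a deliberately tiny sketch: filling dropout zeros in a sparse expression matrix by averaging each cell's most-correlated neighbours. This neighbourhood rule is a simplification for illustration, not any specific method from the survey.

```python
# Toy imputation sketch: replace a cell's zero entries with the mean of
# its k most-correlated other cells. Data are invented and tiny.
import numpy as np

def knn_impute(X, k=2):
    """Fill zeros in each row using the k most-correlated other rows."""
    X = X.astype(float)
    out = X.copy()
    for i in range(len(X)):
        # correlation of cell i with every other cell (self excluded)
        sims = np.array([np.corrcoef(X[i], X[j])[0, 1] if j != i else -np.inf
                         for j in range(len(X))])
        nn = np.argsort(sims)[-k:]          # indices of top-k neighbours
        zeros = X[i] == 0
        out[i, zeros] = X[nn][:, zeros].mean(axis=0)
    return out

# 4 cells x 3 genes; cell 0 has a dropout zero at gene 1
X = np.array([[5., 0., 3.], [4., 2., 3.], [5., 2., 2.], [0., 9., 0.]])
print(knn_impute(X))
```

Real single-cell imputation methods work in a learned latent space and distinguish biological zeros from technical dropouts, which this sketch deliberately ignores.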

    Neuroimaging Feature Extraction using a Neural Network Classifier for Imaging Genetics

    A major issue in associating genes with neuroimaging phenotypes is the high dimension of both genetic and neuroimaging data. In this article, we tackle the latter problem with an eye toward developing solutions that are relevant for disease prediction. Supported by a vast literature on the predictive power of neural networks, our proposed solution uses neural networks to extract, from neuroimaging data, features that are relevant for predicting Alzheimer's Disease (AD) for subsequent relation to genetics. Our neuroimaging-genetic pipeline comprises image processing, neuroimaging feature extraction and genetic association steps. We propose a neural network classifier for extracting neuroimaging features that are related to disease, and a multivariate Bayesian group sparse regression model for genetic association. We compare the predictive power of these features to expert-selected features and take a closer look at the SNPs identified with the new neuroimaging features. Comment: Under revie
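    The feature-extraction idea, reading off a classifier's penultimate layer rather than its class scores, can be sketched as follows. The weights here are random stand-ins for a network trained to predict AD status, and all dimensions are invented.

```python
# Sketch: use a classifier's hidden layer as a low-dimensional feature
# extractor. Random weights stand in for a trained AD-vs-control network.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 16)), np.zeros(16)   # inputs  -> hidden
W2, b2 = rng.normal(size=(16, 2)), np.zeros(2)     # hidden  -> AD / control

def extract_features(x):
    """Return penultimate-layer activations, not the class scores."""
    return np.maximum(x @ W1 + b1, 0.0)   # ReLU hidden layer = features

x = rng.normal(size=(5, 64))              # 5 subjects, 64 imaging inputs
feats = extract_features(x)
print(feats.shape)  # (5, 16): compact features for genetic association
```

In the pipeline above, these 16-dimensional features (rather than the raw images) would feed the Bayesian group sparse regression against the SNP data.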

    Multimodal Data Fusion and Quantitative Analysis for Medical Applications

    Medical big data is not only enormous in size but also heterogeneous and complex in structure, which makes it difficult for conventional systems or algorithms to process. These heterogeneous medical data include imaging data (e.g., Positron Emission Tomography (PET), Computerized Tomography (CT), and Magnetic Resonance Imaging (MRI)) and non-imaging data (e.g., laboratory biomarkers, electronic medical records, and hand-written doctor notes). Multimodal data fusion is an emerging field that addresses this urgent challenge, aiming to process and analyze complex, diverse and heterogeneous multimodal data. Fusion algorithms bring great potential to medical data analysis by 1) taking advantage of complementary information from different sources (such as the functional-structural complementarity of PET/CT images) and 2) exploiting consensus information that reflects the intrinsic essence (such as the genetic essence underlying medical imaging and clinical symptoms). Thus, multimodal data fusion benefits a wide range of quantitative medical applications, including personalized patient care, more effective medical operation planning, and preventive public health. Though there has been extensive research on computational approaches for multimodal fusion, three major challenges remain in quantitative medical applications, summarized as feature-level, information-level and knowledge-level fusion:
    • Feature-level fusion. The first challenge is to mine multimodal biomarkers from high-dimensional, small-sample multimodal medical datasets, where the effective discovery of informative multimodal biomarkers is hindered. Specifically, efficient dimension reduction algorithms are required to alleviate the "curse of dimensionality" problem and to satisfy the criteria for discovering interpretable, relevant, non-redundant and generalizable multimodal biomarkers.
    • Information-level fusion. The second challenge is to exploit and interpret inter-modal and intra-modal information for precise clinical decisions. Although radiomics and multi-branch deep learning have been used for implicit information fusion guided by label supervision, methods that explicitly explore inter-modal relationships in medical applications are lacking. Unsupervised multimodal learning can mine inter-modal relationships, reduce the need for labor-intensive labeled data and uncover potential biomarkers; however, mining discriminative information without label supervision remains an open challenge. Furthermore, the interpretation of complex non-linear cross-modal associations, especially in deep multimodal learning, is another critical challenge, which hinders the exploration of multimodal interactions in disease mechanisms.
    • Knowledge-level fusion. The third challenge is quantitative knowledge distillation from multi-focus regions in medical imaging. Although characterizing imaging features from single lesions using either feature engineering or deep learning has been investigated in recent years, both approaches neglect the importance of inter-region spatial relationships. A topological profiling tool for multi-focus regions is therefore in high demand, yet missing from current feature engineering and deep learning methods. Incorporating domain knowledge with the knowledge distilled from multi-focus regions is a further challenge in knowledge-level fusion.
    To address these three challenges, this thesis provides a multi-level fusion framework for multimodal biomarker mining, multimodal deep learning, and knowledge distillation from multi-focus regions. Specifically, our major contributions include:
    • To address the challenges in feature-level fusion, we propose an Integrative Multimodal Biomarker Mining framework to select interpretable, relevant, non-redundant and generalizable multimodal biomarkers from high-dimensional, small-sample imaging and non-imaging data for diagnostic and prognostic applications. The feature-selection criteria of representativeness, robustness, discriminability, and non-redundancy are addressed by consensus clustering, a Wilcoxon filter, sequential forward selection, and correlation analysis, respectively. The SHapley Additive exPlanations (SHAP) method and a nomogram are employed to further enhance feature interpretability in machine learning models.
    • To address the challenges in information-level fusion, we propose an Interpretable Deep Correlational Fusion framework, based on canonical correlation analysis (CCA), for 1) cohesive multimodal fusion of medical imaging and non-imaging data and 2) interpretation of complex non-linear cross-modal associations. Specifically, two novel loss functions are proposed to optimize the discovery of informative multimodal representations in both supervised and unsupervised deep learning, by jointly learning inter-modal consensus and intra-modal discriminative information. An interpretation module deciphers the complex non-linear cross-modal associations by leveraging interpretation methods from both deep learning and multimodal consensus learning.
    • To address the challenges in knowledge-level fusion, we propose a Dynamic Topological Analysis (DTA) framework, based on persistent homology, for knowledge distillation from inter-connected multi-focus regions in medical imaging and for the incorporation of domain knowledge. Unlike conventional feature engineering and deep learning, the DTA framework explicitly quantifies inter-region topological relationships, including global-level geometric structure and community-level clusters. A K-simplex Community Graph is proposed to construct the dynamic community graph representing community-level multi-scale graph structure. The constructed dynamic graph is then tracked with a novel Decomposed Persistence algorithm. Domain knowledge is incorporated into an Adaptive Community Profile, which summarizes the tracked multi-scale community topology together with additional customizable, clinically important factors.
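    Two of the feature-selection criteria named in the thesis, discriminability and non-redundancy, can be sketched with a toy filter. The scoring rule and threshold below are simplified illustrations, not the thesis's exact Wilcoxon-plus-forward-selection procedure.

```python
# Toy sketch of discriminability + non-redundancy feature selection;
# the standardized mean-difference score stands in for a rank-sum test.
import numpy as np

def select_features(X, y, corr_thresh=0.9):
    # discriminability: class-mean difference scaled by feature spread
    score = np.abs(X[y == 1].mean(0) - X[y == 0].mean(0)) / (X.std(0) + 1e-9)
    order = np.argsort(score)[::-1]           # best-scoring features first
    kept = []
    for j in order:
        # non-redundancy: drop features highly correlated with kept ones
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < corr_thresh
               for k in kept):
            kept.append(int(j))
    return kept

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
f0 = rng.normal(y, 0.5)                   # informative feature
f1 = f0 + rng.normal(0, 0.01, 100)        # near-duplicate -> redundant
f2 = rng.normal(0, 1, 100)                # uninformative noise
X = np.column_stack([f0, f1, f2])
print(select_features(X, y))              # one of {f0, f1} is dropped
```

The thesis's framework layers further criteria (representativeness via consensus clustering, robustness) on top of this basic keep-informative, drop-redundant logic.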

    Machine learning approaches to model cardiac shape in large-scale imaging studies

    Recent improvements in non-invasive imaging, together with the introduction of fully-automated segmentation algorithms and big data analytics, have paved the way for large-scale population-based imaging studies. These studies promise to increase our understanding of a large number of medical conditions, including cardiovascular diseases. However, analysis of cardiac shape in such studies is often limited to simple morphometric indices, ignoring a large part of the information available in medical images. Discovery of new biomarkers by machine learning has recently gained traction, but often lacks interpretability. The research presented in this thesis aimed at developing novel explainable machine learning and computational methods capable of better summarizing shape variability, to better inform association and predictive clinical models in large-scale imaging studies. A powerful and flexible framework to model the relationship between three-dimensional (3D) cardiac atlases, encoding multiple phenotypic traits, and genetic variables is first presented. The proposed approach enables the detection of regional phenotype-genotype associations that would otherwise be neglected by conventional association analysis. Three learning-based systems based on deep generative models are then proposed. In the first model, I propose a classifier of cardiac shapes which exploits task-specific generative shape features and is designed to enable the visualisation in 3D of the anatomical effects these features encode, making the classification task transparent. The second approach models a database of anatomical shapes via a hierarchy of conditional latent variables and is capable of detecting, quantifying and visualising, on a template shape, the most discriminative anatomical features that characterize distinct clinical conditions. Finally, a preliminary analysis of a deep learning system capable of reconstructing 3D high-resolution cardiac segmentations from a sparse set of 2D view segmentations is reported. This thesis demonstrates that machine learning approaches can facilitate high-throughput analysis of normal and pathological anatomy and of its determinants without losing clinical interpretability.
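    The shape-variability modelling this thesis builds on can be hinted at with a classical PCA "atlas" over synthetic landmark shapes; the deep generative models in the thesis are far more expressive, and everything below (landmarks, the planted mode of variation) is an invented minimal example.

```python
# Toy statistical shape model: PCA over flattened landmark coordinates
# recovers a planted dominant mode of variation. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
template = np.array([0., 0., 1., 0., 1., 1., 0., 1.])  # 4 (x, y) landmarks
mode = np.array([1., 0., 1., 0., 1., 0., 1., 0.])      # planted x-shift mode
coeffs = rng.normal(0, 0.3, 50)                        # 50 synthetic subjects
shapes = template + coeffs[:, None] * mode + rng.normal(0, 0.01, (50, 8))

X = shapes - shapes.mean(0)                 # center the population
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = S**2 / np.sum(S**2)
print(explained.round(3))                   # first mode should dominate
```

Where PCA gives linear modes like this, the thesis's deep generative models learn non-linear, hierarchical and task-specific shape factors, with the added machinery needed to keep them clinically interpretable.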

    Oil and Gas flow Anomaly Detection on offshore naturally flowing wells using Deep Neural Networks

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
    The Oil and Gas industry faces multiple challenges as never before. It is criticized as dirty and polluting, which fuels demand for green alternatives. Nevertheless, the world still relies heavily on hydrocarbons, the most traditional and stable source of energy, as opposed to the extensively promoted hydro, solar and wind power. Major operators are challenged to produce oil more efficiently to counteract newly arising energy sources, with a smaller climate footprint and more scrutinized expenditure, while facing high skepticism regarding the industry's future. It has to become greener, and hence to act in ways not required previously. While most of the tools used by the hydrocarbon E&P industry are expensive and have been in use for many years, it is paramount for the industry's survival and prosperity to apply predictive maintenance technologies that can foresee potential failures, making production safer, lowering downtime, increasing productivity and diminishing maintenance costs. Many efforts have been made to define the most accurate and effective predictive methods; however, data scarcity limits the speed and capacity for further experimentation. While it would be highly beneficial for the industry to invest in Artificial Intelligence, this research explores, in depth, the subject of anomaly detection, using the open public data from Petrobras that was developed by experts. For this research, deep recurrent neural networks with LSTM and GRU backbones were implemented for multi-class classification of undesirable events on naturally flowing wells. Several hyperparameter optimization tools were also explored, focusing mainly on genetic algorithms as among the most advanced methods for such tasks.
The research concluded with the best-performing model using 2 stacked GRU layers and the hyperparameter vector [1, 47, 40, 14], which stands for a timestep of 1, 47 hidden units, 40 epochs and a batch size of 14, producing an F1 score of 0.97. As the world faces many issues, one of which is the detrimental effect of heavy industries on the environment and the resulting adverse global climate change, this project is an attempt to contribute to the field of applying Artificial Intelligence in the Oil and Gas industry, with the intention of making it more efficient, transparent and sustainable.
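    The genetic-algorithm hyperparameter search described above can be sketched as a skeleton over the same four genes (timestep, hidden units, epochs, batch size). The fitness function below is a stand-in (the real one would train and score a GRU network on the well data), and the bounds are invented.

```python
# GA skeleton for hyperparameter search; fitness is a synthetic stand-in
# for "train a GRU with these settings and return validation F1".
import random

random.seed(0)
BOUNDS = [(1, 10), (8, 64), (5, 50), (8, 64)]  # timestep, units, epochs, batch

def fitness(ind):
    # stand-in score: pretends more units/epochs and a mid-size batch help
    t, u, e, b = ind
    return u / 64 + e / 50 - abs(b - 32) / 64 - t / 20

def mutate(ind):
    """Resample one randomly chosen gene within its bounds."""
    j = random.randrange(4)
    lo, hi = BOUNDS[j]
    child = list(ind)
    child[j] = random.randint(lo, hi)
    return child

pop = [[random.randint(lo, hi) for lo, hi in BOUNDS] for _ in range(20)]
for _ in range(30):                        # generations
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]                   # elitist selection
    pop = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

best = max(pop, key=fitness)
print(best, round(fitness(best), 3))
```

A full implementation would add crossover between survivors and swap the stand-in `fitness` for an actual train-and-validate run, which is what makes GA search expensive but effective for this task.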