Machine Learning Applied to Raman Spectroscopy to Classify Cancers
Cancer diagnosis is notoriously difficult, evident in the inter-rater variability between
histopathologists classifying cancerous sub-types. Although there are many cancer
pathologies, they have in common that earlier diagnosis would maximise treatment
potential. To reduce this variability and expedite diagnosis, there has been a drive to
arm histopathologists with additional tools. One such tool is Raman spectroscopy,
which has demonstrated potential in distinguishing between various cancer types.
However, Raman data are high-dimensional and often contain artefacts; together
with the challenges inherent to medical data, these properties can frustrate
classification attempts. Deep learning has recently emerged with the promise of unlocking many
complex datasets, but it is not clear how this modelling paradigm can best exploit
Raman data for cancer diagnosis.
Three Raman oncology datasets (from ovarian, colonic and oesophageal tissue)
were used to examine various methodological challenges to machine learning applied
to Raman data, in conjunction with a thorough review of the recent literature.
Classification performance on each dataset is assessed with two traditional models
and one deep learning model. A technique is then applied to the deep learning model to aid interpretability
and relate biochemical antecedents to disease classes. In addition, a clinical problem
for each dataset was addressed, including the transferability of models developed
using multi-centre Raman data taken on different spectrometers of the same make.
Many subtleties of data processing were found to be important to the realistic
assessment of machine learning models. In particular, appropriate cross-validation
during hyperparameter selection, splitting data into training and test sets according
to the inherent structure of biomedical data, and addressing the number of samples
per disease class are all found to be important factors. Additionally, it was found that
instrument correction was not needed to ensure system transferability if Raman data
is collected with a common protocol on spectrometers of the same make.
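The importance of splitting according to the inherent structure of biomedical data (for example, keeping all spectra from one patient on the same side of the split) can be sketched as a group-aware split. This is an illustrative stand-in, not the thesis's code; the function name and interface are assumptions:

```python
import random

def group_train_test_split(sample_ids, group_ids, test_fraction=0.3, seed=0):
    """Split samples so that no group (e.g. patient) spans train and test.

    Splitting spectra at random would leak patient-specific signal into the
    test set; splitting by group gives a more realistic performance estimate.
    """
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [s for s, g in zip(sample_ids, group_ids) if g not in test_groups]
    test = [s for s, g in zip(sample_ids, group_ids) if g in test_groups]
    return train, test
```

The same grouping must also be respected inside the cross-validation folds used for hyperparameter selection, otherwise the tuned model is optimistic for the same leakage reason.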
Radiogenomics Framework for Associating Medical Image Features with Tumour Genetic Characteristics
Significant progress has been made in the understanding of human cancers at the molecular genetics level and it is providing new insights into their underlying pathophysiology. This progress has enabled the subclassification of the disease and the development of targeted therapies that address specific biological pathways. However, obtaining genetic information remains invasive and costly. Medical imaging is a non-invasive technique that captures important visual characteristics (i.e. image features) of abnormalities and plays an important role in routine clinical practice. Advancements in computerised medical image analysis have enabled quantitative approaches to extract image features that can reflect tumour genetic characteristics, leading to the emergence of ‘radiogenomics’. Radiogenomics investigates the relationships between medical imaging features and tumour molecular characteristics, and enables the derivation of imaging surrogates (radiogenomics features) to genetic biomarkers that can provide alternative approaches to non-invasive and accurate cancer diagnosis.
This thesis presents a new framework that combines several novel methods for radiogenomics analysis, associating medical image features with tumour genetic characteristics, with the main objectives being: i) a comprehensive characterisation of tumour image features that reflect underlying genetic information; ii) a method that identifies radiogenomics features encoding common pathophysiological information across different diseases, overcoming the dependence on large annotated datasets; and iii) a method that quantifies radiogenomics features from multi-modal imaging data and accounts for unique information encoded in tumour heterogeneity sub-regions. The presented methods advance radiogenomics analysis and contribute to improving research in computerised medical image analysis.
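The core radiogenomics association step, relating image features to genetic characteristics measured across the same patients, can be sketched as a simple correlation screen. This is a minimal illustration of the idea only; the thesis's framework is far richer, and the feature and gene names below are hypothetical:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between an image feature and a gene's
    measurements taken across the same set of patients."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def associate(image_features, expression, threshold=0.7):
    """Return (feature, gene) pairs whose |correlation| exceeds threshold.

    image_features / expression: dicts mapping a name to per-patient values.
    """
    pairs = []
    for f, xs in image_features.items():
        for g, ys in expression.items():
            if abs(pearson(xs, ys)) >= threshold:
                pairs.append((f, g))
    return pairs
```

Pairs passing such a screen would be the candidate imaging surrogates (radiogenomics features) for the underlying genetic biomarkers.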
Network-based methods for biological data integration in precision medicine
The vast and continuously increasing volume of biomedical data produced during the last decades opens new opportunities for large-scale modelling of disease biology, facilitating a more comprehensive and integrative understanding of its processes. Nevertheless, this type of modelling requires highly efficient computational systems capable of dealing with such data volumes.
Computational approaches commonly used in machine learning and data analysis, namely dimensionality reduction and network-based methods, have been developed with the goal of effectively integrating biomedical data. Among these methods, network-based machine learning stands out due to its major advantage in terms of biomedical interpretability, providing a highly intuitive framework for the integration and modelling of biological processes.
This PhD thesis aims to explore the potential of integration of complementary available biomedical knowledge with patient-specific data to provide novel computational approaches to solve biomedical scenarios characterized by data scarcity. The primary focus is on studying how high-order graph analysis (i.e., community detection in multiplex and multilayer networks) may help elucidate the interplay of different types of data in contexts where statistical power is heavily impacted by small sample sizes, such as rare diseases and precision oncology.
The central focus of this thesis is to illustrate how network biology, among the several data integration approaches with the potential to achieve this task, can play a pivotal role in addressing this challenge, given its advantages in molecular interpretability. Through its insights and methodologies, this thesis shows how network biology, and in particular models based on multilayer networks, helps bring the vision of precision medicine to these complex scenarios, providing a natural approach for the discovery of new biomedical relationships that overcomes the difficulties of studying cohorts with limited sample sizes (data-scarce scenarios).
Delving into the potential of current artificial intelligence (AI) and network biology applications to address data granularity issues in the precision medicine field, this PhD thesis presents pivotal research works, based on multilayer networks, for the analysis of two rare disease scenarios with specific data granularities, effectively overcoming the classical constraints hindering rare disease and precision oncology research.
The first research article presents a personalized medicine study of the molecular determinants of severity in congenital myasthenic syndromes (CMS), a group of rare disorders of the neuromuscular junction (NMJ). The analysis of severity in rare diseases, despite its importance, is typically neglected due to limited data availability. In this study, modelling of biomedical knowledge via multilayer networks made it possible to understand the functional implications of individual mutations in the cohort under study, as well as their relationships with the causal mutations of the disease and the different levels of severity observed. Moreover, the study presents experimental evidence of the role of a previously unsuspected gene in NMJ activity, validating the role hypothesized using the newly introduced methodologies.
The second research article focuses on the applicability of multilayer networks for gene prioritization. Building on concepts for the analysis of different data granularities first introduced in the previous article, the presented research provides a methodology based on the persistence of network community structures across a range of modularity resolutions, effectively providing a new framework for gene prioritization for patient stratification.
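The idea of ranking genes by how persistently they co-cluster across resolutions can be sketched with a toy scan. Here, connected components of a similarity graph thresholded at increasing levels stand in for the multilayer modularity communities of the article (an assumption for illustration, not the exact method):

```python
from itertools import combinations

def components(nodes, edges):
    """Connected components via union-find with path halving."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for a, b in edges:
        parent[find(a)] = find(b)
    comps = {}
    for n in nodes:
        comps.setdefault(find(n), set()).add(n)
    return list(comps.values())

def co_membership_persistence(nodes, weighted_edges, thresholds):
    """Count, for each gene pair, at how many thresholds they share a
    component. Pairs that stay together across many 'resolutions' are
    prioritized, loosely mirroring persistence of community structure."""
    persist = {frozenset(p): 0 for p in combinations(sorted(nodes), 2)}
    for t in thresholds:
        edges = [(a, b) for a, b, w in weighted_edges if w >= t]
        for comp in components(nodes, edges):
            for p in combinations(sorted(comp), 2):
                persist[frozenset(p)] += 1
    return persist
```

Gene pairs with high persistence counts would be the stable associations worth prioritizing for stratification.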
In summary, this PhD thesis presents major advances in the use of multilayer network-based approaches for the application of precision medicine to data-scarce scenarios, exploring the potential of integrating extensive available biomedical knowledge with patient-specific data.
Interpretable neural architecture search and transfer learning for understanding CRISPR/Cas9 off-target enzymatic reactions
Finely-tuned enzymatic pathways control cellular processes, and their
dysregulation can lead to disease. Creating predictive and interpretable models
for these pathways is challenging because of the complexity of the pathways and
of the cellular and genomic contexts. Here we introduce Elektrum, a deep
learning framework which addresses these challenges with data-driven and
biophysically interpretable models for determining the kinetics of biochemical
systems. First, it uses in vitro kinetic assays to rapidly hypothesize an
ensemble of high-quality Kinetically Interpretable Neural Networks (KINNs) that
predict reaction rates. It then employs a novel transfer learning step, where
the KINNs are inserted as intermediary layers into deeper convolutional neural
networks, fine-tuning the predictions for reaction-dependent in vivo outcomes.
Elektrum makes effective use of the limited, but clean in vitro data and the
complex, yet plentiful in vivo data that captures cellular context. We apply
Elektrum to predict CRISPR-Cas9 off-target editing probabilities and
demonstrate that Elektrum achieves state-of-the-art performance, regularizes
neural network architectures, and maintains physical interpretability.
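The transfer-learning step, a pretrained and frozen kinetic submodel inserted into a larger trainable network, can be sketched in miniature. The affine readout below is a stand-in for the deeper convolutional layers described above, and `kinn` is a plain function standing in for a trained KINN (both assumptions made for illustration):

```python
def make_transfer_model(kinn, w=0.0, b=0.0):
    """Trainable affine readout on top of a frozen, pretrained kinetic
    submodel `kinn`. Only w and b are updated during fine-tuning; the
    kinetics stay fixed, so physical interpretability is preserved."""
    params = {"w": w, "b": b}

    def predict(x):
        return params["w"] * kinn(x) + params["b"]

    def train_step(x, y, lr=0.1):
        # Gradient of the squared error w.r.t. the outer parameters only;
        # no gradient flows into (or modifies) the frozen kinn.
        r = kinn(x)
        err = predict(x) - y
        params["w"] -= lr * 2 * err * r
        params["b"] -= lr * 2 * err
    return predict, train_step
```

In a deep-learning framework the same effect is usually achieved by marking the inserted layers' parameters as non-trainable before fine-tuning on the in vivo data.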
Applications
Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production and milling for quality control during manufacturing processes; in traffic and logistics for smart cities; and for mobile communications.
Enabling high-throughput image analysis with deep learning-based tools
Microscopes are a valuable tool in biological research, facilitating information gathering with different magnification scales, samples and markers in single-cell and whole-population studies. However, image acquisition and analysis are very time-consuming, so efficient solutions are needed for the required speed-up to allow high-throughput microscopy.
Throughout the work presented in this thesis, I developed new computational methods and software packages to facilitate high-throughput microscopy. My work comprised not only the development of these methods themselves but also their integration into the workflow of the lab, starting from automating the microscopy acquisition to deploying scalable analysis services and providing user-friendly local user interfaces.
The main focus of my thesis was YeastMate, a tool for automatic detection and segmentation of yeast cells and sub-type classification of their life-cycle transitions. Development of YeastMate was mainly driven by research on quality control mechanisms of the mitochondrial genome in S. cerevisiae, where yeast cells are imaged during their sexual and asexual reproduction life-cycle stages. YeastMate can automatically detect both single cells and life-cycle transitions, perform segmentation and enable pedigree analysis by determining origin and offspring cells. I developed a novel adaptation of the Mask R-CNN object detection model to integrate the classification of inter-cell connections into the usual detection and segmentation analysis pipelines.
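The pedigree-analysis step, determining origin and offspring cells, can be sketched as nearest-centroid matching between detected mother cells and buds. This is an illustrative simplification with hypothetical mask data; YeastMate itself uses a Mask R-CNN adaptation that classifies the inter-cell connections directly:

```python
def centroid(pixels):
    """Centroid of a segmentation mask given as (x, y) pixel tuples."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def assign_offspring(mothers, buds):
    """Link each detected bud (offspring) to the nearest mother cell by
    centroid distance -- a simplified stand-in for the pedigree step.

    mothers / buds: dicts mapping a cell id to its mask pixels.
    """
    pedigree = {}
    for bud_id, bud_px in buds.items():
        bx, by = centroid(bud_px)
        def dist(mid):
            mx, my = centroid(mothers[mid])
            return (mx - bx) ** 2 + (my - by) ** 2
        pedigree[bud_id] = min(mothers, key=dist)
    return pedigree
```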
Another part of my work focused on the automation of the microscopes themselves, using deep learning models to detect wings of D. melanogaster. A microscope was programmed to acquire large overview images and then to acquire detailed images at higher magnification at the detected coordinates of each wing. The implementation of this workflow replaced the manual imaging of slides, a process that usually takes hours, with a fully automated, end-to-end solution.
A Hybrid Metaheuristics based technique for Mutation Based Disease Classification
Due to recent advancements in computational biology, DNA microarray technology has evolved into a useful tool for detecting mutations in various complex diseases such as cancer. The availability of thousands of microarray datasets makes this field an active area of research. Early cancer detection can reduce both the mortality rate and the treatment cost. Cancer classification is a process that provides a detailed overview of the disease microenvironment for better diagnosis. However, gene microarray datasets suffer from the curse of dimensionality, and classification models are prone to overfitting due to their small sample sizes and large feature spaces. To address these issues, the authors propose an Improved Binary Competitive Swarm Optimization Whale Optimization Algorithm (IBCSOWOA) for cancer classification, in which IBCSO is employed to refine the informative gene subset obtained using minimum redundancy maximum relevance (mRMR) as a filter method. The IBCSOWOA technique is tested on an artificial neural network (ANN) model, with the whale optimization algorithm (WOA) used for parameter tuning. The performance of the proposed IBCSOWOA is evaluated on six mutation-based microarray datasets and compared with existing disease prediction methods. The experimental results indicate the superiority of the proposed technique over existing nature-inspired methods in terms of optimal feature subset, classification accuracy, and convergence rate. The proposed technique achieved above 98% accuracy on all six datasets, with the highest accuracy of 99.45% on the lung cancer dataset.
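The mRMR filter stage can be sketched as a greedy trade-off between relevance to the label and redundancy with already-selected genes. In this sketch |Pearson r| stands in for the mutual-information scores typically used, and the labels are numeric; both are assumptions for illustration, not the paper's exact formulation:

```python
from math import sqrt

def corr(xs, ys):
    """Pearson correlation, guarded against constant inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs)) or 1e-12
    sy = sqrt(sum((y - my) ** 2 for y in ys)) or 1e-12
    return cov / (sx * sy)

def mrmr_select(features, labels, k):
    """Greedy minimum-redundancy maximum-relevance selection: at each step
    pick the feature maximizing relevance-to-label minus the mean
    redundancy with the features already selected."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(f):
            rel = abs(corr(features[f], labels))
            red = (sum(abs(corr(features[f], features[s])) for s in selected)
                   / len(selected)) if selected else 0.0
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The reduced subset returned by such a filter is what the swarm-based wrapper stage would then refine further.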
Large-scale monitoring campaigns of contaminants of emerging concern in the environment employing high- resolution mass spectrometry
Thousands of contaminants of emerging concern (CECs) are released from diffuse and point sources into surface waters. Point sources represent a major input of high loads of CECs into the environment because the technology applied in wastewater treatment plants (WWTPs) is insufficient to eliminate them. Consequently, the most persistent CECs end up in freshwater reservoirs, groundwater and even drinking water. Additionally, CECs may enter the trophic chain depending on their properties. The occurrence of CECs in biota can threaten the stability of ecosystems due to their toxicity and potential bioaccumulation in animals of higher trophic levels. Even though there is an increasing number of studies dealing with CECs in the literature, the investigation of their behaviour in the ecosystem and the various natural and non-natural processes involved remains a challenge.
CECs are known to create complex mixtures of unknown composition, which makes monitoring these substances challenging unless wide-scope screening methods and state-of-the-art analytical instrumentation is utilized.
This thesis is divided into three work packages (WP) that aim at (i) characterizing CECs in ecosystems of decisive environmental importance (the Danube river basin) using advanced analytical methods and data processing tools, (ii) performing risk assessment to prioritize the compounds based on their hazard and (iii) evaluating the concentration levels of CECs in influent wastewater and applying Wastewater-based Epidemiology (WBE), a chemical tool used to reflect the lifestyle and public health of the population served by the WWTP. Due to the ongoing Corona Virus Disease 2019 (COVID-19) pandemic and SARS-CoV-2 variants, an analytical protocol including three steps (concentration, extraction and detection) will be developed and validated in order to estimate the virus load in influent wastewater from Athens. Wastewater surveillance could be used as an early warning system for epidemics.
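The WBE back-calculation that turns a measured influent concentration into a population-normalized daily load rests on simple arithmetic. The sketch below uses the standard formulation with hypothetical numbers; correction factors for excretion and in-sewer stability, which real WBE studies apply, are omitted:

```python
def wbe_per_capita_load(conc_ng_per_l, flow_l_per_day, population):
    """Back-calculate a population-normalized daily load, the core
    arithmetic of wastewater-based epidemiology:

        load [mg/day per 1000 people] = C [ng/L] * Q [L/day] / 1e6
                                        / (population / 1000)
    """
    load_mg_per_day = conc_ng_per_l * flow_l_per_day / 1e6
    return load_mg_per_day / (population / 1000)
```

For example, 100 ng/L in a plant treating 1e8 L/day for one million people corresponds to 10 mg/day per 1000 inhabitants.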
Iterative Machine Learning of a Cis-Regulatory Grammar
Gene regulation allows for the quantitative control of gene expression. Gene regulation is a complex process encoded through cis-regulatory sequences, short DNA sequences containing clusters of transcription factor binding sites. Each binding site can occur millions of times in multicellular genomes, and seemingly similar collections of binding sites can have very different activities. A leading model to explain these degeneracies is that cis-regulatory sequences follow a “grammar” defined by the number, identity, strength, arrangement, and/or context of the underlying binding sites. Understanding cis-regulatory grammar requires high-throughput technology, quantitative measurements, and computational modeling. This thesis describes an iterative machine learning approach to study cis-regulatory grammar using mouse photoreceptors as a model system. First, I characterized sequence features associated with enhancer and silencer activity in sequences bound by the transcription factor CRX. I showed that both enhancers and silencers are highly occupied by CRX compared to inactive sequences, and enhancers are uniquely enriched for a diverse but degenerate collection of eight motifs. I demonstrated that this information captures a majority of the available signal in genomic sequences and developed an information content metric that summarizes the effects of motif number and diversity. Second, I developed an active machine learning framework that iteratively samples informative perturbations to address the limitations of training quantitative models on genomic sequences alone. I showed that this approach, when complemented with human decision-making, effectively guides machine learning models towards a biologically relevant representation of cis-regulatory grammar. I also highlighted how perturbations selected with active learning are more informative than other perturbations generated by the same procedure. 
The final machine learning model can capture global and local context-dependencies of transcription factor binding motifs. Using this model, I found that the same motifs can produce the same activity in multiple arrangements. Thus, active machine learning is an effective way to sample perturbations that improve quantitative models of cis-regulatory grammar. Collectively, these results provide an iterative framework to design and sample perturbations that reveal the complexities of cis-regulatory grammar underlying gene regulation.
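The active-learning loop, iteratively selecting the most informative perturbations, can be sketched with ensemble disagreement as the informativeness score. Disagreement-based sampling is a common choice assumed here for illustration, not necessarily the criterion used in the thesis:

```python
def disagreement(models, x):
    """Variance of ensemble predictions at x: a proxy for how informative
    measuring that candidate perturbation would be."""
    preds = [m(x) for m in models]
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)

def select_perturbations(models, candidates, n):
    """Pick the n candidate perturbations the ensemble disagrees on most --
    the selection step of an uncertainty-based active learning loop."""
    ranked = sorted(candidates, key=lambda x: disagreement(models, x),
                    reverse=True)
    return ranked[:n]
```

Each round, the selected perturbations would be measured experimentally, added to the training set, and the ensemble retrained before the next selection.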
DEEP LEARNING METHODS FOR PREDICTION OF AND ESCAPE FROM PROTEIN RECOGNITION
Protein interactions drive diverse processes essential to living organisms, and thus numerous biomedical applications center on understanding, predicting, and designing how proteins recognize their partners. Unfortunately, the number of interactions of interest still vastly exceeds the capabilities of experimental determination methods, but computational methods promise to fill the gap. My thesis pursues the development and application of computational methods for several protein interaction prediction and design tasks. First, to improve protein-glycan interaction specificity prediction, I developed GlyBERT, which learns biologically relevant glycan representations encapsulating the components most important for glycan recognition within their structures. GlyBERT encodes glycans with a branched biochemical language and employs an attention-based deep language model to embed the correlation between local and global structural contexts. This approach enables the development of predictive models from limited data, supporting applications such as lectin binding prediction. Second, to improve protein-protein interaction prediction, I developed a unified geometric deep neural network, ‘PInet’ (Protein Interface Network), which leverages the best properties of both data- and physics-driven methods, learning and utilizing models capturing both geometrical and physicochemical molecular surface complementarity. In addition to obtaining state-of-the-art performance in predicting protein-protein interactions, PInet can serve as the backbone for other protein-protein interaction modeling tasks such as binding affinity prediction. Finally, I turned from prediction to design, addressing two important tasks in the context of antibody-antigen recognition. The first problem is to redesign a given antigen to evade antibody recognition, e.g., to help biotherapeutics avoid pre-existing immunity or to focus vaccine responses on key portions of an antigen.
The second problem is to design a panel of variants of a given antigen to use as “bait” in experimental identification of antibodies that recognize different parts of the antigen, e.g., to support classification of immune responses or to help select among different antibody candidates. I developed a geometry-based algorithm to generate variants to address these design problems, seeking to maximize utility subject to experimental constraints. During the design process, the algorithm accounts for and balances the effects of candidate mutations on antibody recognition and on antigen stability. In retrospective case studies, the algorithm demonstrated promising precision, recall, and robustness in finding good designs. This work represents the first algorithm to systematically design antigen variants for characterization and evasion of polyclonal antibody responses.
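The design search, maximizing utility subject to constraints while balancing recognition and stability effects, can be sketched as a greedy selection over per-mutation scores. The scores, budget, and function below are hypothetical stand-ins; the actual algorithm is geometry-based and considerably more sophisticated:

```python
def design_variant(mutations, escape, stability_cost, budget, max_cost):
    """Greedily pick up to `budget` mutations, ranked by escape utility per
    unit of stability penalty, while keeping the summed penalty under
    `max_cost` -- a toy version of balancing antibody recognition against
    antigen stability during design."""
    chosen, cost = [], 0.0
    for m in sorted(mutations, key=lambda m: escape[m] / stability_cost[m],
                    reverse=True):
        if len(chosen) < budget and cost + stability_cost[m] <= max_cost:
            chosen.append(m)
            cost += stability_cost[m]
    return chosen
```

Running such a search repeatedly with different constraints would yield a panel of complementary variants rather than a single design.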