Ensemble deep learning: A review
Ensemble learning combines several individual models to obtain better
generalization performance. Currently, deep learning models with multilayer
processing architectures are showing better performance than shallow or
traditional classification models. Deep ensemble learning models combine the
advantages of both deep learning and ensemble learning, so that the final
model has better generalization performance. This paper reviews
state-of-the-art deep ensemble models and hence serves as an extensive
summary for researchers. The ensemble models are broadly categorised into
bagging, boosting and stacking ensembles; negative-correlation-based deep
ensemble models; explicit/implicit ensembles; homogeneous/heterogeneous
ensembles; decision fusion strategies; and unsupervised, semi-supervised,
reinforcement learning, online/incremental and multilabel-based deep
ensemble models. Applications of deep ensemble models in different domains
are also briefly discussed. Finally, we conclude with some future
recommendations and research directions.
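As an aside, the bagging strategy surveyed above is easy to sketch: train each base model on a bootstrap resample of the training data and combine predictions by majority vote. The toy below uses one-dimensional threshold "stumps" purely for illustration; it is not drawn from the paper.

```python
import random
random.seed(0)

def bagging_predict(train, test_x, n_models=25):
    """Bagging sketch: fit one threshold 'stump' per bootstrap resample
    and combine their predictions by majority vote (illustrative toy,
    not the review's implementation)."""
    stumps = []
    for _ in range(n_models):
        boot = [random.choice(train) for _ in train]  # bootstrap resample
        # fit a 1-D stump: threshold at the midpoint between class means
        m0 = sum(x for x, y in boot if y == 0) / max(1, sum(1 for _, y in boot if y == 0))
        m1 = sum(x for x, y in boot if y == 1) / max(1, sum(1 for _, y in boot if y == 1))
        stumps.append((m0 + m1) / 2)
    votes = sum(1 for t in stumps if test_x > t)
    return 1 if votes > n_models // 2 else 0

train = [(0.1, 0), (0.3, 0), (0.2, 0), (0.8, 1), (0.9, 1), (0.7, 1)]
print(bagging_predict(train, 0.85))  # a clearly class-1 point
```

Boosting and stacking differ only in how the base models are trained and combined: boosting reweights examples sequentially, and stacking learns a meta-model over base-model outputs.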
Multivariate Analysis of Tumour Gene Expression Profiles Applying Regularisation and Bayesian Variable Selection Techniques
High-throughput microarray technology is here to stay, e.g. in oncology for tumour classification
and gene expression profiling to predict cancer pathology and clinical outcome. The global
objective of this thesis is to investigate multivariate methods that are suitable for this task.
After introducing the problem and the biological background, an overview of multivariate
regularisation methods is given in Chapter 3 and the binary classification problem is outlined
(Chapter 4). The focus of applications presented in Chapters 5 to 7 is on sparse binary classifiers
that are both parsimonious and interpretable. Particular emphasis is on sparse penalised
likelihood and Bayesian variable selection models, all in the context of logistic regression. The
thesis concludes with a final discussion chapter.
The variable selection problem is particularly challenging here, since the number of variables
is much larger than the sample size, which results in an ill-conditioned problem with
many equally good solutions. Thus, one open problem is the stability of gene expression profiles.
In a resampling study, various characteristics including stability are compared between a
variety of classifiers applied to five gene expression data sets and validated on two independent
data sets.
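A stability comparison of the kind described above can be illustrated with a toy resampling scheme: select a top gene set on each bootstrap resample and measure the average pairwise Jaccard overlap between the selected sets. Everything below (the mean-difference ranking, the synthetic data, the gene count k=5) is an illustrative assumption, not the thesis's protocol.

```python
import numpy as np
rng = np.random.default_rng(4)

# toy expression data: 40 samples x 100 genes, only gene 0 informative
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 0] += 2

def top_genes(X, y, k=5):
    # rank genes by absolute mean difference between the two classes
    diff = np.abs(X[y == 1].mean(0) - X[y == 0].mean(0))
    return set(np.argsort(-diff)[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b)

# stability: average pairwise Jaccard overlap across bootstrap resamples
sels = []
for _ in range(10):
    idx = rng.integers(0, 40, 40)          # bootstrap resample of samples
    sels.append(top_genes(X[idx], y[idx]))
pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)]
stability = np.mean([jaccard(sels[i], sels[j]) for i, j in pairs])
print(round(float(stability), 2))
```

Higher overlap means the selected gene signature is more reproducible under perturbation of the training samples.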
Bayesian variable selection provides an alternative to resampling for estimating the uncertainty
in the selection of genes. MCMC methods are used for model space exploration, but
because of the high dimensionality standard algorithms are computationally expensive and/or
result in poor Markov chain mixing. A novel MCMC algorithm is presented that uses the
dependence structure between input variables for finding blocks of variables to be updated together.
This drastically improves mixing while keeping the computational burden acceptable.
Several algorithms are compared in a simulation study. In an ovarian cancer application in
Chapter 7, the best-performing MCMC algorithms are combined with parallel tempering and
compared with an alternative method.
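For intuition about MCMC model space exploration, a minimal single-flip Metropolis sampler over inclusion indicators is sketched below, with a crude BIC-flavoured score standing in for a proper marginal likelihood. The thesis's novel algorithm instead proposes blocked updates over groups of correlated variables, which this toy does not implement.

```python
import numpy as np
rng = np.random.default_rng(1)

# toy data: 10 candidate predictors, only the first two truly relevant
n, p = 60, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n)

def log_score(gamma):
    """Crude model score: residual fit minus a per-variable penalty
    (an assumed stand-in for a proper marginal likelihood)."""
    k = gamma.sum()
    if k == 0:
        rss = (y ** 2).sum()
    else:
        Xg = X[:, gamma.astype(bool)]
        beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        rss = ((y - Xg @ beta) ** 2).sum()
    return -0.5 * n * np.log(rss / n) - 2.0 * k

gamma = np.zeros(p, dtype=int)
current = log_score(gamma)
counts = np.zeros(p)
for _ in range(2000):
    prop = gamma.copy()
    j = rng.integers(p)
    prop[j] = 1 - prop[j]                  # single-flip proposal
    new = log_score(prop)
    if np.log(rng.uniform()) < new - current:  # Metropolis acceptance
        gamma, current = prop, new
    counts += gamma
incl = counts / 2000                        # posterior inclusion frequencies
print(np.round(incl[:3], 2))
```

With p in the thousands, single-flip chains like this mix poorly, which is exactly the problem the blocked-update algorithm addresses.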
Integration and visualisation of clinical-omics datasets for medical knowledge discovery
In recent decades, the rise of various omics fields has flooded life sciences with unprecedented amounts of high-throughput data, which have transformed the way biomedical research is conducted. This trend will only intensify in the coming decades, as the cost of data acquisition will continue to decrease. Therefore, there is a pressing need to find novel ways to turn this ocean of raw data into waves of information and finally distil those into drops of translational medical knowledge. This is particularly challenging because of the incredible richness of these datasets, the humbling complexity of biological systems and the growing abundance of clinical metadata, which makes the integration of disparate data sources even more difficult.
Data integration has proven to be a promising avenue for knowledge discovery in biomedical research. Multi-omics studies allow us to examine a biological problem through different lenses using more than one analytical platform. These studies not only present tremendous opportunities for the deep and systematic understanding of health and disease, but they also pose new statistical and computational challenges. The work presented in this thesis aims to alleviate this problem with a novel pipeline for omics data integration.
Modern omics datasets are extremely feature rich and in multi-omics studies this complexity is compounded by a second or even third dataset. However, many of these features might be completely irrelevant to the studied biological problem or redundant in the context of others. Therefore, in this thesis, clinical metadata driven feature selection is proposed as a viable option for narrowing down the focus of analyses in biomedical research.
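A minimal sketch of clinical-metadata-driven feature selection follows, assuming the simplest possible ranking criterion (absolute Pearson correlation with a single clinical variable); the thesis's actual pipeline is considerably richer.

```python
import numpy as np
rng = np.random.default_rng(7)

# toy omics matrix: 50 samples x 200 features; outcome driven by feature 0
n, p = 50, 200
omics = rng.normal(size=(n, p))
outcome = 3 * omics[:, 0] + rng.normal(size=n)  # clinical metadata variable

def metadata_driven_selection(X, meta, k=10):
    """Rank features by absolute Pearson correlation with a clinical
    metadata variable and keep the top k (illustrative sketch)."""
    Xc = X - X.mean(0)
    mc = meta - meta.mean()
    r = (Xc * mc[:, None]).sum(0) / (
        np.sqrt((Xc ** 2).sum(0)) * np.sqrt((mc ** 2).sum()))
    return np.argsort(-np.abs(r))[:k]

selected = metadata_driven_selection(omics, outcome)
print(selected[:3])
```

The point of the sketch is the shape of the operation: the clinical variable, not the omics data alone, decides which features survive.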
Our visual cortex has been fine-tuned through millions of years to become an outstanding pattern recognition machine. To leverage this incredible resource of the human brain, we need to develop advanced visualisation software that enables researchers to explore these vast biological datasets through illuminating charts and interactivity. Accordingly, a substantial portion of this PhD was dedicated to implementing truly novel visualisation methods for multi-omics studies.
Robust and efficient approach to feature selection with machine learning
Most statistical analyses or modelling studies must deal with the discrepancy between the measured aspects of the analysed phenomena and their true nature. Hence, they are often preceded by a step that alters the data representation into one better suited to the methods that follow. This thesis deals with feature selection, a narrow yet important subset of representation-altering methodologies. Feature selection is applied to an information system, i.e., data existing in tabular form as a group of objects characterised by the values of some set of attributes (also called features or variables), and is defined as the process of finding a strict subset of them which fulfils some criterion. There are two essential classes of feature selection methods: minimal optimal, which aim to find the smallest subset of features that optimises the accuracy of a certain modelling method, and all relevant, which aim to find the entire set of features potentially usable for modelling. The first class is mostly used in practice, as it adheres to a well-known optimisation problem and has a direct connection to final model performance. However, I argue that there exists a wide and significant class of applications in which only all relevant approaches may yield usable results, while minimal optimal methods are not only ineffective but can even lead to wrong conclusions. Moreover, the all relevant class substantially overlaps with the set of actual research problems in which feature selection is an important result on its own, sometimes even more important than the resulting black-box model.
In particular, this applies to p>>n problems, i.e., those for which the number of attributes is large and substantially exceeds the number of objects; for instance, such data are produced by high-throughput biological experiments, which currently serve as the most powerful tool of molecular biology and a foundation of the emerging individualised medicine. In the main part of the thesis I present Boruta, a heuristic, all relevant feature selection method. It is based on the concept of shadows: by-design irrelevant attributes, created from the original features by randomly permuting their values, which are incorporated into the information system as a reference for the relevance of the original features in the context of the whole structure of the analysed data. The variable importance itself is assessed using the Random Forest method, a popular ensemble classifier. As the computational performance of the Boruta method turns out to be unsatisfactory for some important applications, the following chapters of the thesis are devoted to Random Ferns, an ensemble classifier with a structure similar to Random Forest but of substantially higher computational efficiency. In the thesis, I propose a substantial generalisation of this method, capable of training on generic data and calculating feature importance scores. Finally, I assess both the Boruta method and its Random Ferns-based derivative on a series of p>>n problems of biological origin. In particular, I focus on the stability of feature selection, for which I propose a novel assessment methodology based on bootstrap resampling and self-consistency.
The results I obtain empirically confirm the validity of the aforementioned effects characteristic of minimal optimal selection, as well as the efficiency of the proposed heuristics for all relevant selection. The thesis is completed with a study of the applicability of Random Ferns in musical information retrieval, showing the usefulness of this method in other contexts and proposing its generalisation to multi-label classification problems.
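The shadow idea behind Boruta can be sketched in a few lines: append permuted copies of the features, score the importance of everything, and keep only the original features that beat the best shadow. The correlation-based importance below is a stand-in assumption; Boruta proper uses Random Forest importance and repeats the test iteratively.

```python
import numpy as np
rng = np.random.default_rng(0)

# toy data: 4 real features, only the first informative
n = 200
X = rng.normal(size=(n, 4))
y = (X[:, 0] > 0).astype(int)

def importance(X, y):
    """Stand-in importance score: |correlation| with the class label.
    Boruta itself uses Random Forest importance instead."""
    Xc = X - X.mean(0)
    yc = y - y.mean()
    return np.abs((Xc * yc[:, None]).sum(0) /
                  (np.sqrt((Xc ** 2).sum(0)) * np.sqrt((yc ** 2).sum())))

# shadow features: column-wise permuted copies, irrelevant by design
shadows = rng.permuted(X, axis=0)
imp = importance(np.hstack([X, shadows]), y)
threshold = imp[4:].max()                 # importance of the best shadow
relevant = np.where(imp[:4] > threshold)[0]
print(relevant)
```

Because the shadows are irrelevant by construction, any original feature that cannot outscore them has no demonstrated relevance.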
Developing statistical and bioinformatic analysis of genomic data from tumours
Previous prognostic signatures for melanoma based on tumour transcriptomic data were developed predominantly on cohorts of AJCC (American Joint Committee on Cancer) stages III and IV melanoma. Since 92% of melanoma patients are diagnosed at AJCC stages I and II, there is an urgent need for better prognostic biomarkers to allow patient stratification for receiving early adjuvant therapies.
This study uses genome-wide tumour gene expression levels and clinico-histopathological characteristics of patients from the Leeds Melanoma Cohort (LMC). Several unsupervised and supervised classification approaches were applied to the transcriptomic data, to identify biological classes of melanoma, and to develop prognostic classification models respectively.
Unsupervised clustering identified six biologically distinct primary melanoma classes (LMC classes). Unlike previous molecular classes of melanoma, the LMC classes were prognostic in both the whole LMC dataset and in stage I tumours. The prognostic value of the LMC classes was replicated in an independent dataset, but insufficient data were available to replicate in an AJCC stage I subset.
Supervised classification using the Random Forest (RF) approach performed better when adjustments were made to deal with class imbalance, whereas such adjustments did not improve the performance of the Support Vector Machine (SVM). Overall, however, RF and SVM gave similar results, with RF only marginally better. Combining clinical and transcriptomic information in the RF further improved the prediction model compared with using clinical information alone. Finally, the agnostically derived LMC classes and the supervised RF model showed convergence in their association with outcome in some groups of patients, but not in others.
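One common class-imbalance adjustment of the kind referred to above is inverse-frequency class weighting, which upweights the rarer outcome during training. The sketch below is generic and not necessarily the exact scheme used in this study.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency class weights: weight(c) = n / (k * count(c)),
    so the rarer class receives a proportionally larger weight
    (a generic sketch, not this study's specific adjustment)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

print(balanced_class_weights(["good"] * 90 + ["poor"] * 10))
# the minority class ("poor") gets a much larger weight than "good"
```

These weights are then passed to the classifier's loss so that misclassifying a minority-class sample costs more.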
In conclusion, this study reports six molecular classes of primary melanoma with prognostic value in stage I disease and overall, and a prognostic classification model that predicts outcome in primary melanoma.
Novel methods for multi-view learning with applications in cyber security
Modern data is complex. It exists in many different forms, shapes and kinds. Vectors, graphs, histograms, sets, intervals, etc.: they each have distinct and varied structural properties. Tailoring models to the characteristics of various feature representations has been the subject of considerable research. In this thesis, we address the challenge of learning from data that is described by multiple heterogeneous feature representations.
This situation arises often in cyber security contexts. Data from a computer network can be represented by a graph of user authentications, a time series of network traffic, a tree of process events, etc. Each representation provides a complementary view of the holistic state of the network, and so data of this type is referred to as multi-view data. Our motivating problem in cyber security is anomaly detection: identifying unusual observations in a joint feature space, which may not appear anomalous marginally.
Our contributions include the development of novel supervised and unsupervised methods, which are applicable not only to cyber security but to multi-view data in general. We extend the generalised linear model to operate in a vector-valued reproducing kernel Hilbert space implied by an operator-valued kernel function, which can be tailored to the structural characteristics of multiple views of data. This is a highly flexible algorithm, able to predict a wide variety of response types. A distinguishing feature is the ability to simultaneously identify outlier observations with respect to the fitted model. Our proposed unsupervised learning model extends multidimensional scaling to directly map multi-view data into a shared latent space. This vector embedding captures both commonalities and disparities that exist between multiple views of the data. Throughout the thesis, we demonstrate our models using real-world cyber security datasets.
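For a flavour of MDS-based multi-view embedding, the sketch below embeds the average of two per-view distance matrices with classical MDS. Averaging the distances is a naive fusion assumption; the thesis develops a principled extension of MDS rather than this shortcut.

```python
import numpy as np
rng = np.random.default_rng(2)

def classical_mds(D, dim=2):
    """Classical MDS: double-centre the squared distance matrix and
    embed objects via its top eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# two 'views' of the same 10 objects, each with its own distance matrix
view1 = rng.normal(size=(10, 3))
view2 = rng.normal(size=(10, 5))
D1 = np.linalg.norm(view1[:, None] - view1[None], axis=-1)
D2 = np.linalg.norm(view2[:, None] - view2[None], axis=-1)

# naive multi-view fusion: embed the averaged distance matrix
Z = classical_mds((D1 + D2) / 2)
print(Z.shape)
```

Each row of `Z` is a shared-latent-space coordinate for one object, informed by both views at once.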
Handling Class Imbalance Using Swarm Intelligence Techniques, Hybrid Data and Algorithmic Level Solutions
This research focuses mainly on the binary class imbalance problem in data mining. It investigates the use of combined approaches of data and algorithmic level solutions. Moreover, it examines the use of swarm intelligence and population-based techniques to combat the class imbalance problem at all levels, including at the data, algorithmic, and feature level. It also introduces various solutions to the class imbalance problem, in which swarm intelligence techniques like Stochastic Diffusion Search (SDS) and Dispersive Flies Optimisation (DFO) are used. The algorithms were evaluated using experiments on imbalanced datasets, in which the Support Vector Machine (SVM) was used as a classifier. SDS was used to perform informed undersampling of the majority class to balance the dataset. The results indicate that this algorithm improves the classifier performance and can be used on imbalanced datasets. Moreover, SDS was extended further to perform feature selection on high dimensional datasets. Experimental results show that SDS can be used to perform feature selection and improve the classifier performance on imbalanced datasets. Further experiments evaluated DFO as an algorithmic level solution to optimise the SVM kernel parameters when learning from imbalanced datasets. Based on the promising results of DFO in these experiments, the novel approach was extended further to provide a hybrid algorithm that simultaneously optimises the kernel parameters and performs feature selection
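An informed undersampling step of the kind described can be sketched as follows: shrink the majority class to the minority size while preferring majority points that lie near the class boundary. The nearest-neighbour distance criterion below is a simple stand-in for the SDS-guided search used in this research.

```python
import numpy as np
rng = np.random.default_rng(3)

def informed_undersample(X, y, majority=0):
    """Undersample the majority class down to the minority size, keeping
    the majority points nearest the minority class (a stand-in for the
    SDS-guided selection described above)."""
    maj, mino = X[y == majority], X[y != majority]
    # distance from each majority point to its nearest minority point
    d = np.sqrt(((maj[:, None, :] - mino[None, :, :]) ** 2).sum(-1)).min(1)
    keep = np.argsort(d)[: len(mino)]     # keep the closest majority points
    Xb = np.vstack([maj[keep], mino])
    yb = np.array([majority] * len(keep) + [1 - majority] * len(mino))
    return Xb, yb

# toy imbalanced data: 80 majority points, 20 shifted minority points
X = rng.normal(size=(100, 2))
X[80:] += 3
y = np.array([0] * 80 + [1] * 20)
Xb, yb = informed_undersample(X, y)
print(Xb.shape)  # balanced: 20 + 20 samples
```

The balanced set `(Xb, yb)` would then be handed to the classifier, e.g. an SVM as in the experiments above.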
Computational investigation of systemic pathway responses in severe pneumonia among the Gambian children and infants
Pneumonia remains the leading cause of infectious mortality in under-five children,
and the burden is highest in sub-Saharan Africa. To mitigate this burden, further
knowledge is required to accelerate the development of innovative and cost-effective
approaches. To gain a deeper insight into the pathogenesis of pneumonia,
I investigated the central hypothesis that systemic pathway (cellular and molecular)
responses underpin the development of severe pneumonia outcomes.
Mainly, I compared whole blood transcriptomes between severe pneumonia cases
(clinically stratified as mild, severe and very severe) and non-pneumonia community
controls (prospectively matched by age and sex). In total, 803 whole blood RNA
samples were collected from Gambian children (aged 2-59 months) between 2007
and 2010, of which 518 passed laboratory quality control criteria for the
microarray analysis. After data cleaning, the final database was reduced to
503 samples, comprising training (n=345) and independent validation (n=158)
data sets.
To investigate the cellular responses, I applied computational deconvolution
analysis to assess the variations of immune cell type proportions with pneumonia
severity. To further enhance the computational performance, I applied a data fusion
approach on 3,475 immune marker genes from different resources to derive an
optimal and integrated blood marker list (IBML, m=277) for neutrophil,
monocyte, NK, dendritic, B and T cell types, which robustly performed better
than the existing individual resources. Using the IBML resource, pneumonia
severity was significantly associated with the depletion of B, T, dendritic
and NK cell types, and with the elevation of monocyte and neutrophil
proportions (P-value<0.001).
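Computational deconvolution of this kind models bulk expression as a weighted mixture of cell-type signatures and solves for the mixing proportions. The sketch below uses unconstrained least squares with clipping and renormalisation, a deliberately crude stand-in for the properly constrained solvers used in practice; the signature matrix and proportions are synthetic.

```python
import numpy as np
rng = np.random.default_rng(5)

# toy signature matrix: 30 marker genes x 3 cell types
S = np.abs(rng.normal(2, 1, size=(30, 3)))
true_p = np.array([0.6, 0.3, 0.1])               # true cell-type proportions
bulk = S @ true_p + 0.01 * rng.normal(size=30)   # noisy bulk expression

# crude deconvolution sketch: unconstrained least squares, then clip to
# non-negative values and renormalise to sum to one (published tools solve
# a properly constrained problem instead)
p_hat, *_ = np.linalg.lstsq(S, bulk, rcond=None)
p_hat = np.clip(p_hat, 0, None)
p_hat /= p_hat.sum()
print(np.round(p_hat, 2))
```

Tracking how `p_hat` shifts between patient groups is what links cell-type proportions to pneumonia severity.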
At the molecular level, pneumonia severity was associated (false discovery
rate<0.05) with a battery of systemic pathway (innate, adaptive and metabolic)
responses in a range of biomedical databases. While the up-regulation of
inflammatory innate responses was also observed in mild cases, severe pneumonia
cases were predominantly associated with the co-inhibition of the cells of the
adaptive immune response (B and T) and Natural killer cells, and the up-regulation
of fatty acid and lipid metabolism. While most of these findings were anticipated, the
involvement of NK cells was unexpected, and potentially presents a novel immune-modulation
target for mitigating the burden of pneumonia. Together, the cellular and
molecular pathway responses consistently support the central hypothesis that
systemic pathway responses contribute significantly to the development of
severe pneumonia outcomes.
Clinically, the identification and appropriate treatment of patients at
higher risk of developing severe pneumonia outcomes remains a major
challenge. To address this, I applied supervised machine-learning approaches
to cellular pathway based transcriptomic features and derived a 33-gene
classifier (representing the NK, T and neutrophil cell types), which
accurately detected severe pneumonia cases in
both the training (leave-one-out cross-validated accuracy=99%) and independent
validation (accuracy=98%) datasets. Independently, similar performance (98% in
each dataset) was associated with a subset (m=18) of the validated 52-gene
neonatal sepsis classifier. Conversely, at least 75% of the cellular biomarkers were
differentially expressed (false discovery rate<0.05) in bacterial neonatal sepsis.
Further, very severe pneumonia cases were predominantly associated with
antibacterial responses; and mild pneumonia cases with blood-culture-confirmed
positivity were also associated with an increased frequency of differentially
expressed genes. These findings suggest the significant contribution of bacterial
septicaemia in the development of serious pneumonia outcomes. Together, this
study highlights the future potential of host-derived systemic biomarkers for early
identification and novel treatment modalities of high-risk cases presenting at a
resource-constrained clinic with mild pneumonia. However, further validation studies
are required.
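The leave-one-out cross-validation used to evaluate the classifiers above holds out each sample in turn, refits on the rest, and scores the held-out prediction. The sketch below pairs LOOCV with a nearest-centroid rule on synthetic expression data; both the classifier and the data are illustrative assumptions, not the 33-gene model itself.

```python
import numpy as np
rng = np.random.default_rng(9)

# toy expression data: 40 samples x 5 genes, two outcome groups
X = rng.normal(size=(40, 5))
X[20:, 0] += 3.0                       # group 2 shifted on gene 0
y = np.array([0] * 20 + [1] * 20)

def loo_accuracy(X, y):
    """Leave-one-out cross-validation of a nearest-centroid classifier
    (a minimal stand-in for the supervised models used in the study)."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out sample i
        c0 = X[mask & (y == 0)].mean(0)        # refit centroids without i
        c1 = X[mask & (y == 1)].mean(0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        correct += pred == y[i]
    return correct / len(y)

print(loo_accuracy(X, y))
```

Because each prediction is made on a sample the model never saw, LOOCV accuracy is a nearly unbiased estimate of out-of-sample performance on small cohorts.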