Towards large scale continuous EDA: a random matrix theory perspective
Estimation of distribution algorithms (EDAs) are a major branch of evolutionary algorithms (EAs) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging, and as a result existing EDAs lose their strengths in large scale problems.
Large scale continuous global optimisation is key to many modern real-world problems. Scaling up EAs to large scale problems has become one of the biggest challenges of the field.
This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective EDA-type algorithms for large scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections of the set of fittest search points to low dimensions as a basis for developing a new and generic divide-and-conquer methodology. This is rooted in the theory of random projections developed in theoretical computer science, and will exploit recent advances in non-asymptotic random matrix theory.
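The projection step described above can be illustrated with a minimal sketch. It assumes Gaussian projection matrices with N(0, 1/d) entries and a plain sample covariance in each low-dimensional space; the function names, the scaling, and the toy `sphere` fitness are illustrative assumptions, not the authors' implementation.

```python
import random

def sphere(x):
    """Toy fitness to minimise (an assumption used only for illustration)."""
    return sum(v * v for v in x)

def select_fittest(population, fitness, n_keep):
    """Keep the n_keep points with the lowest fitness value."""
    return sorted(population, key=fitness)[:n_keep]

def random_projection_matrix(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/d) entries (assumed scaling)."""
    scale = (1.0 / d) ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(k)]

def project(R, x):
    """Map a d-dimensional point x to k dimensions via R."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]

def mean_and_covariance(points):
    """Sample mean and unbiased sample covariance of a list of points."""
    n, k = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(k)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
            for j in range(k)] for i in range(k)]
    return mean, cov

def ensemble_low_dim_models(fittest, k, n_projections, seed=0):
    """Fit one small k-dimensional Gaussian model per random projection."""
    rng = random.Random(seed)
    d = len(fittest[0])
    models = []
    for _ in range(n_projections):
        R = random_projection_matrix(k, d, rng)
        low = [project(R, x) for x in fittest]
        mean, cov = mean_and_covariance(low)
        models.append((R, mean, cov))
    return models
```

Each model needs only a k x k covariance estimate instead of a d x d one, which is the motivation for the divide-and-conquer view; how the low-dimensional models are combined back into a search distribution is left out of this sketch.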
Robust adaptive Lasso in high-dimensional logistic regression with an application to genomic classification of cancer patients
Penalized logistic regression is extremely useful for binary classification with a large number of covariates (significantly higher than the sample size), having several real life applications, including genomic disease classification. However, the existing methods based on the likelihood based loss function are sensitive to data contamination and other noise and, hence, robust methods are needed for stable and more accurate inference. In this paper, we propose a family of robust estimators for sparse logistic models utilizing the popular density power divergence based loss function and the general adaptively weighted LASSO penalties. We study the local robustness of the proposed estimators through their influence function and also derive their oracle properties and asymptotic distribution. With extensive empirical illustrations, we clearly demonstrate the significantly improved performance of our proposed estimators over the existing ones with particular gain in robustness. Our proposal is finally applied to analyse four different real datasets for cancer classification, obtaining robust and accurate models that simultaneously perform gene selection and patient classification.
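As a rough illustration of the kind of objective involved, the sketch below evaluates a standard form of the density power divergence (DPD) loss for a Bernoulli response plus a weighted L1 penalty. The names `dpd_loss`, `penalized_objective`, `alpha`, `lam` and `weights` are assumptions made for this example, the intercept is omitted, and the paper's exact parameterization may differ.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dpd_loss(y, p, alpha):
    """DPD loss for one Bernoulli observation y with success probability p.

    Requires alpha > 0. As alpha -> 0 this tends to the negative
    log-likelihood; larger alpha down-weights badly fitted points.
    """
    f = p if y == 1 else 1.0 - p  # model probability of the observed label
    return (p ** (1 + alpha) + (1 - p) ** (1 + alpha)
            - (1 + 1 / alpha) * f ** alpha + 1 / alpha)

def penalized_objective(beta, X, y, alpha, lam, weights):
    """Mean DPD loss plus an adaptively weighted L1 penalty (no intercept)."""
    n = len(X)
    fit = sum(
        dpd_loss(yi, sigmoid(sum(b * v for b, v in zip(beta, xi))), alpha)
        for xi, yi in zip(X, y)
    ) / n
    penalty = lam * sum(w * abs(b) for w, b in zip(weights, beta))
    return fit + penalty
```

The robustness shows up in the boundedness of the loss: for a grossly mislabeled point (observed label with fitted probability near zero) the negative log-likelihood grows without bound, whereas the DPD loss stays finite for any fixed `alpha > 0`.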
Review of processing and analysis methods for DNA methylation array data
The promise of epigenome-wide association studies (EWAS) and cancer-specific somatic changes in improving our understanding of cancer, coupled with the decreasing cost and increasing coverage of DNA methylation microarrays, has brought about a surge in the use of these technologies. Here, we aim to review issues encountered in the processing and analysis of array-based DNA methylation data and to summarize the advantages of recent approaches proposed for handling those issues, focusing on approaches publicly available in open-source environments such as R and Bioconductor. We hope that the processing tools and analysis flowchart described here will help researchers use these powerful DNA methylation array-based platforms effectively, thereby advancing our understanding of human health and disease.
Keywords: Processing, Microarray, Analysis, DNA methylation, Bioconductor and R package
Computational investigation of systemic pathway responses in severe pneumonia among the Gambian children and infants
Pneumonia remains the leading cause of infectious mortality in under-five children,
and the burden is highest in sub-Saharan Africa. To mitigate this burden, further
knowledge is required to accelerate the development of innovative and cost-effective
approaches. To gain a deeper insight into the pathogenesis of pneumonia,
I investigated the central hypothesis that systemic pathway (cellular and molecular)
responses underpin the development of severe pneumonia outcomes.
Mainly, I compared whole blood transcriptomes between severe pneumonia cases
(clinically stratified as mild, severe and very severe) and non-pneumonia community
controls (prospectively matched by age and sex). In total, 803 whole blood RNA
samples were collected from Gambian children (aged 2-59 months) between 2007
and 2010, of which, 518 passed laboratory quality control criteria for the microarray
analysis. After data cleaning, the final database was reduced to 503 samples, including
the training (n=345) and independent validation (n=158) data sets.
To investigate the cellular responses, I applied computational deconvolution
analysis to assess the variations of immune cell type proportions with pneumonia
severity. To further enhance the computational performance, I applied a data fusion
approach to 3,475 immune marker genes from different resources to derive an
optimal and integrated blood marker list (IBML, m=277) for neutrophil, monocyte,
NK, dendritic, B and T cell types, which robustly performed better than the existing
individual resources. Using the IBML resource, pneumonia severity was significantly
associated with the depletion of B, T, dendritic and NK cell types, and the elevation
of monocyte and neutrophil proportions (P-value<0.001).
At the molecular level, pneumonia severity was associated (false discovery
rate<0.05) with a battery of systemic pathway (innate, adaptive and metabolic)
responses in a range of biomedical databases. While the up-regulation of
inflammatory innate responses was also observed in mild cases, severe pneumonia
cases were predominantly associated with the co-inhibition of the cells of the
adaptive immune response (B and T) and Natural killer cells, and the up-regulation
of fatty acid and lipid metabolism. While most of these findings were anticipated, the
involvement of NK cells was unexpected, and potentially presents a novel immune-modulation
target for mitigating the burden of pneumonia. Together, the cellular and
molecular pathway responses consistently support the central hypothesis that
systemic pathway responses contribute significantly to the development of severe
pneumonia outcomes.
Clinically, the identification and appropriate treatment of patients at higher risk of
developing severe pneumonia outcomes remain a major challenge. To address
that, I applied supervised machine-learning approaches to cellular pathway based
transcriptomic features, and derived a 33-gene classifier (representing the NK, T,
and neutrophil cell types), which accurately detected severe pneumonia cases in
both the training (leave-one-out cross-validated accuracy=99%) and independent
validation (accuracy=98%) datasets. Independently, similar performance (98% in
each dataset) was associated with a subset (m=18) of the validated 52-gene
neonatal sepsis classifier. Conversely, at least 75% of the cellular biomarkers were
differentially expressed (false discovery rate<0.05) in bacterial neonatal sepsis.
Further, very severe pneumonia cases were predominantly associated with
antibacterial responses; and mild pneumonia cases with blood-culture-confirmed
positivity were also associated with an increased frequency of differentially
expressed genes. These findings suggest the significant contribution of bacterial
septicaemia in the development of serious pneumonia outcomes. Together, this
study highlights the future potential of host-derived systemic biomarkers for early
identification and novel treatment modalities of high-risk cases presenting at a
resource-constrained clinic with mild pneumonia. However, further validation studies
are required.
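
The leave-one-out cross-validated accuracy quoted above for the training set can be illustrated with a short sketch. The abstract does not name the classifier, so a nearest-centroid rule is swapped in here purely as a stand-in; all function names and the toy data in the usage note are hypothetical.

```python
def nearest_centroid_fit(X, y):
    """Compute one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda label: sqdist(centroids[label]))

def loocv_accuracy(X, y):
    """Hold out each sample once, train on the rest, and score the holdout."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = nearest_centroid_fit(X_train, y_train)
        if nearest_centroid_predict(model, X[i]) == y[i]:
            correct += 1
    return correct / len(X)
```

For example, `loocv_accuracy([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], ["control"] * 3 + ["case"] * 3)` scores two well-separated toy groups. In a real transcriptomic analysis, any gene selection step would need to be repeated inside the loop, otherwise the held-out sample leaks into the model and the estimate is optimistic.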
Enhanced label noise filtering with multiple voting
© 2019 by the authors. Label noise exists in many applications, and its presence can degrade learning performance. Researchers usually use filters to identify and eliminate mislabeled instances prior to training. The ensemble learning based filter (EnFilter) is the most widely used filter. According to the voting mechanism, EnFilter is mainly divided into two types: single-voting based (SVFilter) and multiple-voting based (MVFilter). In general, MVFilter is more often preferred because multiple voting can address the intrinsic limitations of single voting. However, the most important unsolved issue in MVFilter is how to determine the optimal decision point (ODP). Conceptually, the decision point is a threshold value, which determines the noise detection performance. To maximize the performance of MVFilter, we propose a novel approach to compute the optimal decision point. Our approach is data driven and cost sensitive: it determines the ODP based on the given noisy training dataset and the noise misrecognition cost matrix. The core idea of our approach is to estimate the mislabeled data probability distributions, from which the expected cost of each possible decision point can be inferred. Experimental results on a set of benchmark datasets illustrate the utility of our proposed approach.
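The idea of choosing a decision point by expected cost can be sketched as follows. This is a simplified illustration, not the authors' estimator: it assumes the probability that an instance is mislabeled given its vote count (`p_noisy`) is already available, and the names `count_hist`, `cost_fp` and `cost_fn` are hypothetical.

```python
def expected_cost(threshold, count_hist, p_noisy, cost_fp, cost_fn):
    """Expected cost of flagging every instance with votes >= threshold.

    count_hist[c]: number of instances that received c 'mislabeled' votes.
    p_noisy[c]:    assumed probability that such an instance is mislabeled.
    cost_fp:       cost of discarding a correctly labeled instance.
    cost_fn:       cost of keeping a mislabeled instance.
    """
    total = 0.0
    for votes, n in enumerate(count_hist):
        if votes >= threshold:
            total += n * (1.0 - p_noisy[votes]) * cost_fp  # clean but removed
        else:
            total += n * p_noisy[votes] * cost_fn          # noisy but kept
    return total

def optimal_decision_point(count_hist, p_noisy, cost_fp, cost_fn):
    """Return the threshold with the lowest expected cost."""
    candidates = range(len(count_hist) + 1)  # last value flags nothing
    return min(
        candidates,
        key=lambda t: expected_cost(t, count_hist, p_noisy, cost_fp, cost_fn),
    )
```

With equal costs the chosen threshold flags exactly those vote counts where an instance is more likely mislabeled than not; raising `cost_fn` relative to `cost_fp` pushes the threshold down, so the filter becomes more aggressive.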
Mathematical and statistical methods for single cell data
The availability of single-cell data has increased rapidly in recent years and presents interesting new challenges in the analysis of such data and the modelling of the processes that generate it. In this thesis, we attempt to deal with some of those challenges by developing and exploring mathematical and statistical models for the evolution of population distributions over time, and methods for using aggregated single-cell data from individual patients in predictive diagnostic models of disease. In the first part of the thesis, we explore structured population models, a class of partial differential equations for describing the evolution of individual-level cell properties in a population over time. We begin by analysing an age-structured model of cell growth in which rates of proliferation and cell death are controlled by an external resource. We follow this with a method for extracting properties of a more general class of structured population models directly from single-cell data. In the final part of the thesis, we develop a flexible Bayesian statistical framework for building predictive models from possibly high-dimensional data collected from patients using single-cell technologies and find that the performance is promising compared to a number of existing methods.