27 research outputs found

    Towards large scale continuous EDA: a random matrix theory perspective

    Estimation of distribution algorithms (EDAs) are a major branch of evolutionary algorithms (EAs) with some unique advantages in principle. They can exploit correlation structure to drive the search more efficiently, and they can provide insights into the structure of the search space. However, model building in high dimensions is extremely challenging, and as a result existing EDAs lose their strengths in large-scale problems. Large-scale continuous global optimisation is key to many modern real-world problems, and scaling up EAs to large-scale problems has become one of the biggest challenges of the field. This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective EDA-type algorithms for large-scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections of the set of fittest search points to low dimensions as a basis for developing a new and generic divide-and-conquer methodology. This is rooted in the theory of random projections developed in theoretical computer science, and will exploit recent advances in non-asymptotic random matrix theory.
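The idea of an ensemble of random projections of the fittest points can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name, the Gaussian projection matrices, the back-projection by the transpose, and the `scale` parameter are all assumptions made for illustration. Each projection maps the centred fittest points to `k` dimensions, a small `k x k` covariance is estimated there, and the low-dimensional draws are mapped back and averaged.

```python
import numpy as np

def sample_via_random_projections(fittest, n_samples, k=3, n_projections=10,
                                  scale=1.0, rng=None):
    """Draw new search points from an ensemble of low-dimensional Gaussians.

    Hypothetical helper: `scale` stands in for whatever variance correction
    a real algorithm would apply; the abstract does not specify one.
    """
    rng = np.random.default_rng(rng)
    n, d = fittest.shape
    mean = fittest.mean(axis=0)
    centred = fittest - mean
    total = np.zeros((n_samples, d))
    for _ in range(n_projections):
        # Random projection matrix with i.i.d. N(0, 1/d) entries (assumption).
        R = rng.normal(0.0, 1.0 / np.sqrt(d), size=(k, d))
        low = centred @ R.T                              # shape (n, k)
        # Covariance is estimated in k dimensions only, never in d.
        cov = np.atleast_2d(np.cov(low, rowvar=False))   # shape (k, k)
        draws = rng.multivariate_normal(np.zeros(k), cov, size=n_samples)
        total += draws @ R                               # back to d dimensions
    # Averaging over projections shrinks the variance; `scale` can offset that.
    return mean + scale * total / n_projections
```

Only `k x k` covariance matrices are ever estimated, which is the point of the divide-and-conquer idea: the number of fittest points can be far smaller than the ambient dimension `d`.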

    Robust adaptive Lasso in high-dimensional logistic regression with an application to genomic classification of cancer patients

    Penalized logistic regression is extremely useful for binary classification with a large number of covariates (significantly higher than the sample size), and has several real-life applications, including genomic disease classification. However, existing methods based on the likelihood-based loss function are sensitive to data contamination and other noise; hence, robust methods are needed for stable and more accurate inference. In this paper, we propose a family of robust estimators for sparse logistic models utilizing the popular density power divergence based loss function and the general adaptively weighted LASSO penalties. We study the local robustness of the proposed estimators through their influence function and also derive their oracle properties and asymptotic distribution. With extensive empirical illustrations, we clearly demonstrate the significantly improved performance of our proposed estimators over the existing ones, with particular gain in robustness. Our proposal is finally applied to analyse four different real datasets for cancer classification, obtaining robust and accurate models that simultaneously perform gene selection and patient classification.
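To make the combination of a density power divergence (DPD) loss with an adaptively weighted LASSO penalty concrete, here is a minimal sketch of such an objective for a Bernoulli response. It uses the standard form of the DPD loss with tuning parameter `alpha > 0`; the paper's exact parametrisation may differ. The function name and the way the penalty `weights` are supplied are assumptions for illustration, and no optimiser is included.

```python
import numpy as np

def dpd_adaptive_lasso_objective(beta, X, y, alpha, lam, weights):
    """DPD loss for logistic regression plus a weighted L1 penalty.

    Hypothetical helper; `weights` would typically come from an initial
    estimate of beta, which is not computed here.
    """
    beta = np.asarray(beta, dtype=float)
    y = np.asarray(y, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # P(y = 1 | x)
    # Sum over both outcomes of the model probability to the power 1 + alpha.
    integral_term = p ** (1 + alpha) + (1 - p) ** (1 + alpha)
    # Model probability of the observed outcome, to the power alpha.
    data_term = y * p ** alpha + (1 - y) * (1 - p) ** alpha
    loss = np.mean(integral_term - (1 + 1 / alpha) * data_term)
    penalty = lam * np.sum(np.asarray(weights) * np.abs(beta))
    return loss + penalty
```

Larger `alpha` down-weights observations that the model finds improbable, which is the source of the robustness, while `alpha` close to zero behaves more like the likelihood-based loss.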

    Computational investigation of systemic pathway responses in severe pneumonia among the Gambian children and infants

    Pneumonia remains the leading cause of infectious mortality in under-five children, and the burden is highest in sub-Saharan Africa. To mitigate this burden, further knowledge is required to accelerate the development of innovative and cost-effective approaches. To gain a deeper insight into the pathogenesis of pneumonia, I investigated the central hypothesis that systemic pathway (cellular and molecular) responses underpin the development of severe pneumonia outcomes. Mainly, I compared whole blood transcriptomes between severe pneumonia cases (clinically stratified as mild, severe and very severe) and non-pneumonia community controls (prospectively matched by age and sex). In total, 803 whole blood RNA samples were collected from Gambian children (aged 2-59 months) between 2007 and 2010, of which 518 passed laboratory quality control criteria for the microarray analysis. After data cleaning, the final database was reduced to 503 samples, including the training (n=345) and independent validation (n=158) data sets. To investigate the cellular responses, I applied computational deconvolution analysis to assess the variations of immune cell type proportions with pneumonia severity. To further enhance the computational performance, I applied a data fusion approach on 3,475 immune marker genes from different resources to derive an optimal and integrated blood marker list (IBML, m=277) for neutrophil, monocyte, NK, dendritic, B and T cell types, which robustly performed better than the existing individual resources. Using the IBML resource, pneumonia severity was significantly associated with the depletion of B, T, dendritic and NK cell types and the elevation of monocyte and neutrophil proportions (P-value<0.001). At the molecular level, pneumonia severity was associated (false discovery rate<0.05) with a battery of systemic pathway (innate, adaptive and metabolic) responses in a range of biomedical databases.
While the up-regulation of inflammatory innate responses was also observed in mild cases, severe pneumonia cases were predominantly associated with the co-inhibition of the cells of the adaptive immune response (B and T) and natural killer cells, and the up-regulation of fatty acid and lipid metabolism. While most of these findings were anticipated, the involvement of NK cells was unexpected, and potentially presents a novel immune-modulation target for mitigating the burden of pneumonia. Together, the cellular and molecular pathway responses consistently support the central hypothesis that systemic pathway responses contribute significantly to the development of severe pneumonia outcomes. Clinically, the identification and appropriate treatment of patients at higher risk of developing severe pneumonia outcomes remain the major challenge. To address that, I applied supervised machine-learning approaches to cellular-pathway-based transcriptomic features and derived a 33-gene classifier (representing the NK, T, and neutrophil cell types), which accurately detected severe pneumonia cases in both the training (leave-one-out cross-validated accuracy=99%) and independent validation (accuracy=98%) datasets. Independently, similar performance (98% in each dataset) was associated with a subset (m=18) of the validated 52-gene neonatal sepsis classifier. Conversely, at least 75% of the cellular biomarkers were differentially expressed (false discovery rate<0.05) in bacterial neonatal sepsis. Further, very severe pneumonia cases were predominantly associated with antibacterial responses; and mild pneumonia cases with blood-culture-confirmed positivity were also associated with an increased frequency of differentially expressed genes. These findings suggest the significant contribution of bacterial septicaemia to the development of serious pneumonia outcomes.
Together, this study highlights the future potential of host-derived systemic biomarkers for early identification and novel treatment modalities of high-risk cases presenting at a resource-constrained clinic with mild pneumonia. However, further validation studies are required.

    Enhanced label noise filtering with multiple voting

    © 2019 by the authors. Label noise exists in many applications, and its presence can degrade learning performance. Researchers usually use filters to identify and eliminate mislabeled instances prior to training. The ensemble learning based filter (EnFilter) is the most widely used filter. According to the voting mechanism, EnFilter is mainly divided into two types: single-voting based (SVFilter) and multiple-voting based (MVFilter). In general, MVFilter is more often preferred because multiple voting can address the intrinsic limitations of single voting. However, the most important unsolved issue in MVFilter is how to determine the optimal decision point (ODP). Conceptually, the decision point is a threshold value that determines the noise detection performance. To maximize the performance of MVFilter, we propose a novel approach to compute the optimal decision point. Our approach is data driven and cost sensitive: it determines the ODP based on the given noisy training dataset and a noise misrecognition cost matrix. The core idea of our approach is to estimate the probability distributions of the mislabeled data, from which the expected cost of each possible decision point can be inferred. Experimental results on a set of benchmark datasets illustrate the utility of our proposed approach.
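The cost-sensitive choice of a decision point can be sketched as follows. This is an illustration under strong assumptions, not the paper's estimator: the vote-count distributions for mislabeled and clean instances (`p_votes_noisy`, `p_votes_clean`), the `noise_rate`, and the two costs are taken as given inputs, whereas the paper estimates the distributions from the noisy training data. An instance is flagged as noise when it collects at least `t` votes.

```python
import numpy as np

def expected_cost(t, p_votes_noisy, p_votes_clean, noise_rate, cost_fp, cost_fn):
    """Expected cost per instance when flagging instances with >= t votes.

    p_votes_*[v] is the assumed probability of receiving exactly v votes.
    """
    missed_noise = np.sum(p_votes_noisy[:t])   # mislabeled but not flagged
    false_alarm = np.sum(p_votes_clean[t:])    # clean but flagged
    return (noise_rate * cost_fn * missed_noise
            + (1 - noise_rate) * cost_fp * false_alarm)

def optimal_decision_point(p_votes_noisy, p_votes_clean, noise_rate,
                           cost_fp=1.0, cost_fn=1.0):
    """Return the threshold t in 1..n_rounds with the lowest expected cost."""
    p_votes_noisy = np.asarray(p_votes_noisy, dtype=float)
    p_votes_clean = np.asarray(p_votes_clean, dtype=float)
    candidates = list(range(1, len(p_votes_noisy)))
    costs = [expected_cost(t, p_votes_noisy, p_votes_clean,
                           noise_rate, cost_fp, cost_fn) for t in candidates]
    return candidates[int(np.argmin(costs))]
```

Raising `cost_fn` (the cost of keeping a mislabeled instance) pushes the chosen threshold down, so the filter removes more aggressively; raising `cost_fp` has the opposite effect.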

    Mathematical and statistical methods for single cell data

    The availability of single-cell data has increased rapidly in recent years and presents interesting new challenges in the analysis of such data and the modelling of the processes that generate them. In this thesis, we attempt to deal with some of those challenges by developing and exploring mathematical and statistical models for the evolution of population distributions over time, and methods for using aggregated single-cell data from individual patients in predictive diagnostic models of disease. In the first part of the thesis, we explore structured population models – a class of partial differential equations for describing the evolution of individual-level cell properties in a population over time. We begin by analysing an age-structured model of cell growth in which rates of proliferation and cell death are controlled by an external resource. We follow this with a method for extracting properties of a more general class of structured population models directly from single-cell data. In the final part of the thesis, we develop a flexible Bayesian statistical framework for building predictive models from possibly high-dimensional data collected from patients using single-cell technologies and find that the performance is promising compared to a number of existing methods.