Evolutionary and deep mining models for effective biomarker discovery
With the advent of high-throughput biology, large amounts of molecular data are available for purposeful analysis and evaluation. Extracting relevant knowledge from high-throughput biomedical datasets has become a common goal of current approaches to personalised cancer medicine and to understanding cancer genotype and phenotype. However, the datasets are characterised by high dimensionality, relatively small sample sizes and low signal-to-noise ratios. Extracting and interpreting relevant knowledge from such complex datasets therefore remains a significant challenge for the fields of machine learning and data mining. This is evidenced by the limited success these methods have had in detecting robust and reliable biomarkers for cancers and other complicated diseases. It could also explain why few generic biomarkers are found among the genes published for identical diseases or clinical conditions.
This thesis proposes and evaluates the efficacy of two novel feature mining models, built on the evolutionary computation and deep learning paradigms, that frame and solve biomarker discovery as an optimisation problem. Deep learning methods lack the transparency and interpretability found in the evolutionary paradigm. To overcome the poor explanatory power inherent in deep learning, this research also introduces a novel deep mining model that deconstructs the internal state of such deep learning models to reveal the key determinants underlying their latent representations and so aid feature selection. As a result, salient biomarkers for breast cancer and for the positivity of the Estrogen and Progesterone receptors are discovered robustly and validated reliably across a wide range of independently generated breast cancer data samples.
Identification of Single- and Multiple-Class Specific Signature Genes from Gene Expression Profiles by Group Marker Index
Informative genes from microarray data can be used to construct prediction models and investigate biological mechanisms. Differentially expressed genes, the main targets of most gene selection methods, can be classified as single- and multiple-class specific signature genes. Here, we present a novel gene selection algorithm based on a Group Marker Index (GMI), which is intuitive, of low computational complexity, and efficient in identifying both types of genes. Most gene selection methods identify only single-class specific signature genes and cannot easily identify multiple-class specific signature genes. Our algorithm can detect de novo certain conditions of multiple-class specificity of a gene and makes use of a novel non-parametric indicator to assess the discrimination ability between classes. Our method is effective even when the sample size is small and when the class sizes are significantly different. To compare effectiveness and robustness, we formulate an intuitive template-based method and use four well-known datasets. We demonstrate that our algorithm outperforms the template-based method in difficult cases with unbalanced distribution. Moreover, the multiple-class specific genes are good biomarkers and play important roles in biological pathways. Our literature survey supports that the proposed method identifies unique multiple-class specific marker genes (not reported earlier to be related to cancer) in the Central Nervous System data. It also discovers unique biomarkers indicating the intrinsic difference between subtypes of lung cancer. We also associate the pathway information with the multiple-class specific signature genes and cross-reference it to published studies. We find that, in the leukemia data, the identified genes participate in pathways directly involved in cancer development.
Our method offers a promising way to find genes that may be involved in the pathways of multiple diseases, and hence opens up the possibility of using an existing drug on other diseases as well as designing a single drug for multiple diseases.
Gene Regulatory Network Analysis and Web-based Application Development
Microarray data is a valuable source for gene regulatory network analysis. Using earthworm microarray data analysis as an example, this dissertation demonstrates that a bioinformatics-guided reverse engineering approach can be applied to analyze time-series data to uncover the underlying molecular mechanism. My network reconstruction results reinforce previous findings that certain neurotransmitter pathways are the target of two chemicals - carbaryl and RDX. This study also concludes that perturbations to these pathways by sublethal concentrations of these two chemicals were temporary, and earthworms were capable of fully recovering. Moreover, differential networks (DNs) analysis indicates that many pathways other than those related to synaptic and neuronal activities were altered during the exposure phase.
A novel differential networks (DNs) approach is developed in this dissertation to connect pathway perturbation with toxicity threshold setting from Live Cell Array (LCA) data. Findings from this proof-of-concept study suggest that this DNs approach has great potential to provide a novel and sensitive tool for threshold setting in chemical risk assessment. In addition, a web-based tool “Web-BLOM” was developed for the reconstruction of gene regulatory networks from time-series gene expression profiles including microarray and LCA data. This tool consists of several modular components: a database, the gene network reconstruction model and a user interface. The Bayesian Learning and Optimization Model (BLOM), originally implemented in MATLAB, was adopted by Web-BLOM to provide online reconstruction of large-scale gene regulatory networks. Compared to other network reconstruction models, BLOM can infer larger networks with comparable accuracy, can identify hub genes, and is much more computationally efficient.
Navigating the Human Metabolome for Biomarker Identification and Design of Pharmaceutical Molecules
Metabolomics is a rapidly evolving discipline that involves the systematic study of endogenous small molecules that characterize the metabolic pathways of biological systems. The study of metabolism at a global level has the potential to contribute significantly to biomedical research, clinical medical practice, and drug discovery. In this paper, we present the most up-to-date metabolite and metabolic pathway resources, and we summarize the statistical and machine-learning tools used for the analysis of data from clinical metabolomics. Through specific applications to cancer, diabetes, neurological and other diseases, we demonstrate how these tools can facilitate diagnosis and the identification of potential biomarkers for use in disease diagnosis. Additionally, we discuss the increasing importance of integrating metabolomics data into drug discovery. In a case study based on the Human Metabolome Database (HMDB) and the Chinese Natural Product Database (CNPD), we demonstrate the close relatedness of the two data sets of compounds, and we further illustrate how structural similarity with human metabolites could assist in the design of novel pharmaceuticals and the elucidation of the molecular mechanisms of medicinal plants.
Large-scale dimensionality reduction using perturbation theory and singular vectors
Massive volumes of high-dimensional data have become pervasive, with the number
of features significantly exceeding the number of samples in many applications.
This has resulted in a bottleneck for data mining applications and amplified the
computational burden of machine learning algorithms that perform classification or
pattern recognition. Dimensionality reduction can handle this problem in two ways,
i.e. feature selection (FS) and feature extraction. In this thesis, we focus on FS, because,
in many applications like bioinformatics, the domain experts need to validate
a set of original features to corroborate the hypothesis of the prediction models. In
processing the high-dimensional data, FS mainly involves detecting a limited number
of important features among tens/hundreds of thousands of irrelevant and redundant
features.
We start by filtering out the irrelevant features using our proposed Sparse Least
Squares (SLS) method, in which a score is assigned to each feature and the low-scoring
features are removed using a soft threshold. To demonstrate the effectiveness of SLS,
we used it to augment well-known FS methods, thereby achieving substantially
reduced running times while improving or at least maintaining the prediction accuracy
of the models.
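The score-then-threshold idea described above can be illustrated with a minimal sketch. It is not the thesis's SLS implementation: the scoring rule (absolute coefficient in a minimum-norm least-squares fit on standardised features), the relative threshold, and the names `least_squares_feature_scores`, `filter_features` and `keep_ratio` are all assumptions made for illustration.

```python
import numpy as np

def least_squares_feature_scores(X, y):
    """Score each feature by the magnitude of its least-squares coefficient.

    Assumption: this scoring rule is an illustrative stand-in, not the
    scoring used by the SLS method in the thesis.
    """
    # Standardise columns so coefficient magnitudes are comparable.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant features
    Xs = (X - mu) / sigma
    # lstsq returns the minimum-norm solution when the system is underdetermined.
    w, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
    return np.abs(w)

def filter_features(X, y, keep_ratio=0.1):
    """Keep features whose score is at least keep_ratio times the top score.

    Assumption: a threshold relative to the maximum score is a hypothetical
    choice standing in for the thesis's soft threshold.
    """
    scores = least_squares_feature_scores(X, y)
    threshold = keep_ratio * scores.max()
    kept = np.flatnonzero(scores >= threshold)
    return kept, scores
```

Calling `filter_features(X, y)` on a samples-by-features array `X` and a numeric target `y` returns the indices of the retained features together with all scores, so a downstream FS method could be run on `X[:, kept]` only.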
We developed a linear FS method (DRPT) which, upon data reduction by SLS,
clusters the reduced data using the perturbation theory to detect correlations between
the remaining features. Important features are ultimately selected from each cluster,
discarding the redundant features.
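A rough sketch of the cluster-then-select step follows. It deliberately swaps in a much simpler technique than the one named above: features are grouped greedily by absolute Pearson correlation rather than by perturbation theory, and the functions `group_correlated_features` and `select_representatives`, along with the `corr_threshold` parameter, are hypothetical names introduced only for this illustration.

```python
import numpy as np

def group_correlated_features(X, corr_threshold=0.9):
    """Greedily group features that are strongly correlated with a seed feature.

    Assumption: plain correlation grouping replaces the perturbation-theory
    clustering of DRPT. Constant columns yield NaN correlations and therefore
    end up in singleton groups.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(corr.shape[0]))
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        # Collect every remaining feature that is highly correlated with the seed.
        members = [seed] + [j for j in unassigned if corr[seed, j] >= corr_threshold]
        unassigned = [j for j in unassigned if j not in members]
        groups.append(members)
    return groups

def select_representatives(X, y, groups):
    """From each group, keep the feature most correlated with the target."""
    selected = []
    for members in groups:
        relevance = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in members]
        selected.append(members[int(np.argmax(relevance))])
    return selected
```

Keeping one representative per group is what discards the redundant features; the choice of "most correlated with the target" as the selection rule is again only an assumption of this sketch.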
To extend the applicability of clustering to grouping the redundant features, we
proposed a new Singular Vectors FS (SVFS) method that is capable of both removing
the irrelevant features and effectively clustering the remaining features. As such,
the features in each cluster are correlated only with one another. The
important features selected independently from the different clusters make up the final
ranking. Devising thresholds for filtering irrelevant and redundant features has made
our model adaptable to the particular needs of various applications.
A comprehensive evaluation based on benchmark biological and image datasets
shows the superiority of our proposed methods over state-of-the-art FS
methods in terms of classification accuracy, running time, and memory usage.
Promises and pitfalls of deep neural networks in neuroimaging-based psychiatric research
By promising more accurate diagnostics and individual treatment
recommendations, deep neural networks, and in particular convolutional neural
networks, have become a powerful tool in medical imaging. Here, we first
give an introduction to key methodological concepts and the resulting
methodological promises, including representation and transfer learning, as well
as modelling domain-specific priors. After reviewing recent applications within
neuroimaging-based psychiatric research, such as the diagnosis of psychiatric
diseases, delineation of disease subtypes, normative modeling, and the
development of neuroimaging biomarkers, we discuss current challenges. These
include, for example, the difficulty of training models on small, heterogeneous
and biased data sets, the lack of validity of clinical labels, algorithmic
bias, and the influence of confounding variables.