1,222 research outputs found

    Elephant Search with Deep Learning for Microarray Data Analysis

    Full text link
    Even though there is a plethora of research in Microarray gene expression data analysis, still, it poses challenges for researchers to effectively and efficiently analyze the large yet complex expression of genes. The feature (gene) selection method is of paramount importance for understanding the differences in biological and non-biological variation between samples. In order to address this problem, a novel elephant search (ES) based optimization is proposed to select best gene expressions from the large volume of microarray data. Further, a promising machine learning method is envisioned to leverage such high dimensional and complex microarray dataset for extracting hidden patterns inside to make a meaningful prediction and most accurate classification. In particular, stochastic gradient descent based Deep learning (DL) with softmax activation function is then used on the reduced features (genes) for better classification of different samples according to their gene expression levels. The experiments are carried out on nine most popular Cancer microarray gene selection datasets, obtained from UCI machine learning repository. The empirical results obtained by the proposed elephant search based deep learning (ESDL) approach are compared with most recent published article for its suitability in future Bioinformatics research.Comment: 12 pages, 5 Tabl

    Algebraic Comparison of Partial Lists in Bioinformatics

    Get PDF
    The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling biological phenotype in terms of a classification or regression model. Due to resampling protocols or just within a meta-analysis comparison, instead of one list it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained. Here we introduce a method, based on the algebraic theory of symmetric groups, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated first on synthetic data in a gene filtering task and then for finding gene profiles on a recent prostate cancer dataset

    Feature selection and modelling methods for microarray data from acute coronary syndrome

    Get PDF
    Acute coronary syndrome (ACS) represents a leading cause of mortality and morbidity worldwide. Providing better diagnostic solutions and developing therapeutic strategies customized to the individual patient represent societal and economical urgencies. Progressive improvement in diagnosis and treatment procedures require a thorough understanding of the underlying genetic mechanisms of the disease. Recent advances in microarray technologies together with the decreasing costs of the specialized equipment enabled affordable harvesting of time-course gene expression data. The high-dimensional data generated demands for computational tools able to extract the underlying biological knowledge. This thesis is concerned with developing new methods for analysing time-course gene expression data, focused on identifying differentially expressed genes, deconvolving heterogeneous gene expression measurements and inferring dynamic gene regulatory interactions. The main contributions include: a novel multi-stage feature selection method, a new deconvolution approach for estimating cell-type specific signatures and quantifying the contribution of each cell type to the variance of the gene expression patters, a novel approach to identify the cellular sources of differential gene expression, a new approach to model gene expression dynamics using sums of exponentials and a novel method to estimate stable linear dynamical systems from noisy and unequally spaced time series data. The performance of the proposed methods was demonstrated on a time-course dataset consisting of microarray gene expression levels collected from the blood samples of patients with ACS and associated blood count measurements. The results of the feature selection study are of significant biological relevance. For the first time is was reported high diagnostic performance of the ACS subtypes up to three months after hospital admission. The deconvolution study exposed features of within and between groups variation in expression measurements and identified potential cell type markers and cellular sources of differential gene expression. It was shown that the dynamics of post-admission gene expression data can be accurately modelled using sums of exponentials, suggesting that gene expression levels undergo a transient response to the ACS events before returning to equilibrium. The linear dynamical models capturing the gene regulatory interactions exhibit high predictive performance and can serve as platforms for system-level analysis, numerical simulations and intervention studies

    Unbalanced data processing using oversampling: machine Learning

    Get PDF
    Nowadays, the DL algorithms show good results when used in the solution of different problems which present similar characteristics as the great amount of data and high dimensionality. However, one of the main challenges that currently arises is the classification of high dimensionality databases, with very few samples and high-class imbalance. Biomedical databases of gene expression microarrays present the characteristics mentioned above, presenting problems of class imbalance, with few samples and high dimensionality. The problem of class imbalance arises when the set of samples belonging to one class is much larger than the set of samples of the other class or classes. This problem has been identified as one of the main challenges of the algorithms applied in the context of Big Data. The objective of this research is the study of genetic expression databases, using conventional methods of sub and oversampling for the balance of classes such as RUS, ROS and SMOTE. The databases were modified by applying an increase in their imbalance and in another case generating artificial noise

    Statistical techniques to construct assays for identifying likely responders to a treatment under evaluation from cell line genomic data

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Developing the right drugs for the right patients has become a mantra of drug development. In practice, it is very difficult to identify subsets of patients who will respond to a drug under evaluation. Most of the time, no single diagnostic will be available, and more complex decision rules will be required to define a sensitive population, using, for instance, mRNA expression, protein expression or DNA copy number. Moreover, diagnostic development will often begin with in-vitro cell-line data and a high-dimensional exploratory platform, only later to be transferred to a diagnostic assay for use with patient samples. In this manuscript, we present a novel approach to developing robust genomic predictors that are not only capable of generalizing from in-vitro to patient, but are also amenable to clinically validated assays such as qRT-PCR.</p> <p>Methods</p> <p>Using our approach, we constructed a predictor of sensitivity to dacetuzumab, an investigational drug for CD40-expressing malignancies such as lymphoma using genomic measurements of cell lines treated with dacetuzumab. Additionally, we evaluated several state-of-the-art prediction methods by independently pairing the feature selection and classification components of the predictor. In this way, we constructed several predictors that we validated on an independent DLBCL patient dataset. Similar analyses were performed on genomic measurements of breast cancer cell lines and patients to construct a predictor of estrogen receptor (ER) status.</p> <p>Results</p> <p>The best dacetuzumab sensitivity predictors involved ten or fewer genes and accurately classified lymphoma patients by their survival and known prognostic subtypes. The best ER status classifiers involved one or two genes and led to accurate ER status predictions more than 85% of the time. The novel method we proposed performed as well or better than other methods evaluated.</p> <p>Conclusions</p> <p>We demonstrated the feasibility of combining feature selection techniques with classification methods to develop assays using cell line genomic measurements that performed well in patient data. In both case studies, we constructed parsimonious models that generalized well from cell lines to patients.</p

    Computational Methods for the Analysis of Genomic Data and Biological Processes

    Get PDF
    In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality

    Computational Hybrid Systems for Identifying Prognostic Gene Markers of Lung Cancer

    Get PDF
    Lung cancer is the most fatal cancer around the world. Current lung cancer prognosis and treatment is based on tumor stage population statistics and could not reliably assess the risk for developing recurrence in individual patients. Biomarkers enable treatment options to be tailored to individual patients based on their tumor molecular characteristics. To date, there is no clinically applied molecular prognostic model for lung cancer. Statistics and feature selection methods identify gene candidates by ranking the association between gene expression and disease outcome, but do not account for the interactions among genes. Computational network methods could model interactions, but have not been used for gene selection due to computational inefficiency. Moreover, the curse of dimensionality in human genome data imposes more computational challenges to these methods.;We proposed two hybrid systems for the identification of prognostic gene signatures for lung cancer using gene expressions measured with DNA microarray. The first hybrid system combined t-tests, Statistical Analysis of Microarray (SAM), and Relief feature selections in multiple gene filtering layers. This combinatorial system identified a 12-gene signature with better prognostic performance than published signatures in treatment selection for stage I and II patients (log-rank P\u3c0.04, Kaplan-Meier analyses). The 12-gene signature is a more significant prognostic factor (hazard ratio=4.19, 95% CI: [2.08, 8.46], P\u3c0.00006) than other clinical covariates. The signature genes were found to be involved in tumorigenesis in functional pathway analyses.;The second proposed system employed a novel computational network model, i.e., implication networks based on prediction logic. This network-based system utilizes gene coexpression networks and concurrent coregulation with signaling pathways for biomarker identification. The first application of the system modeled disease-mediated genome-wide coexpression networks. The entire genomic space were extensively explored and 21 gene signatures were discovered with better prognostic performance than all published signatures in stage I patients not receiving chemotherapy (hazard ratio\u3e1, CPE\u3e0.5, P \u3c 0.05). These signatures could potentially be used for selecting patients for adjuvant chemotherapy. The second application of the system modeled the smoking-mediated coexpression networks and identified a smoking-associated 7-gene signature. The 7-gene signature generated significant prognostication specific to smoking lung cancer patients (log-rank P\u3c0.05, Kaplan-Meier analyses), with implications in diagnostic screening of lung cancer risk in smokers (overall accuracy=74%, P\u3c0.006). The coexpression patterns derived from the implication networks in both applications were successfully validated with molecular interactions reported in the literature (FDR\u3c0.1).;Our studies demonstrated that hybrid systems with multiple gene selection layers outperform traditional methods. Moreover, implication networks could efficiently model genome-scale disease-mediated coexpression networks and crosstalk with signaling pathways, leading to the identification of clinically important gene signatures

    A Machine Learning Classifier Trained on Cancer Transcriptomes Detects NF1 Inactivation Signal in Glioblastoma

    Get PDF
    We have identified molecules that exhibit synthetic lethality in cells with loss of the neurofibromin 1 (NF1) tumor suppressor gene. However, recognizing tumors that have inactivation of the NF1 tumor suppressor function is challenging because the loss may occur via mechanisms that do not involve mutation of the genomic locus. Degradation of the NF1 protein, independent of NF1 mutation status, phenocopies inactivating mutations to drive tumors in human glioma cell lines. NF1 inactivation may alter the transcriptional landscape of a tumor and allow a machine learning classifier to detect which tumors will benefit from synthetic lethal molecules. We developed a strategy to predict tumors with low NF1 activity and hence tumors that may respond to treatments that target cells lacking NF1. Using RNAseq data from The Cancer Genome Atlas (TCGA), we trained an ensemble of 500 logistic regression classifiers that integrates mutation status with whole transcriptomes to predict NF1 inactivation in glioblastoma (GBM)

    Deep Learning Reveals Key Immunosuppression Genes and Distinct Immunotypes in Periodontitis

    Get PDF
    Background: Periodontitis is a chronic immuno-inflammatory disease characterized by inflammatory destruction of tooth-supporting tissues. Its pathogenesis involves a dysregulated local host immune response that is ineffective in combating microbial challenges. An integrated investigation of genes involved in mediating immune response suppression in periodontitis, based on multiple studies, can reveal genes pivotal to periodontitis pathogenesis. Here, we aimed to apply a deep learning (DL)-based autoencoder (AE) for predicting immunosuppression genes involved in periodontitis by integrating multiples omics datasets. Methods: Two periodontitis-related GEO transcriptomic datasets (GSE16134 and GSE10334) and immunosuppression genes identified from DisGeNET and HisgAtlas were included. Immunosuppression genes related to periodontitis in GSE16134 were used as input to build an AE, to identify the top disease-representative immunosuppression gene features. Using K-means clustering and ANOVA, immune subtype labels were assigned to disease samples and a support vector machine (SVM) classifier was constructed. This classifier was applied to a validation set (Immunosuppression genes related to periodontitis in GSE10334) for predicting sample labels, evaluating the accuracy of the AE. In addition, differentially expressed genes (DEGs), signaling pathways, and transcription factors (TFs) involved in immunosuppression and periodontitis were determined with an array of bioinformatics analysis. Shared DEGs common to DEGs differentiating periodontitis from controls and those differentiating the immune subtypes were considered as the key immunosuppression genes in periodontitis. Results: We produced representative molecular features and identified two immune subtypes in periodontitis using an AE. Two subtypes were also predicted in the validation set with the SVM classifier. Three “master” immunosuppression genes, PECAM1, FCGR3A, and FOS were identified as candidates pivotal to immunosuppressive mechanisms in periodontitis. Six transcription factors, NFKB1, FOS, JUN, HIF1A, STAT5B, and STAT4, were identified as central to the TFs-DEGs interaction network. The two immune subtypes were distinct in terms of their regulating pathways. Conclusion: This study applied a DL-based AE for the first time to identify immune subtypes of periodontitis and pivotal immunosuppression genes that discriminated periodontitis from the healthy. Key signaling pathways and TF-target DEGs that putatively mediate immune suppression in periodontitis were identified. PECAM1, FCGR3A, and FOS emerged as high-value biomarkers and candidate therapeutic targets for periodontitis
    corecore