
    Feature selection and modelling methods for microarray data from acute coronary syndrome

    Acute coronary syndrome (ACS) is a leading cause of mortality and morbidity worldwide. Providing better diagnostic solutions and developing therapeutic strategies customized to the individual patient are societal and economic priorities. Progressive improvement in diagnosis and treatment procedures requires a thorough understanding of the underlying genetic mechanisms of the disease. Recent advances in microarray technologies, together with the decreasing cost of the specialized equipment, have made it affordable to collect time-course gene expression data. The high-dimensional data generated demand computational tools able to extract the underlying biological knowledge. This thesis is concerned with developing new methods for analysing time-course gene expression data, focused on identifying differentially expressed genes, deconvolving heterogeneous gene expression measurements and inferring dynamic gene regulatory interactions. The main contributions include: a novel multi-stage feature selection method; a new deconvolution approach for estimating cell-type-specific signatures and quantifying the contribution of each cell type to the variance of the gene expression patterns; a novel approach to identify the cellular sources of differential gene expression; a new approach to model gene expression dynamics using sums of exponentials; and a novel method to estimate stable linear dynamical systems from noisy and unequally spaced time series data. The performance of the proposed methods was demonstrated on a time-course dataset consisting of microarray gene expression levels collected from the blood samples of patients with ACS and associated blood count measurements. The results of the feature selection study are of significant biological relevance. For the first time, high diagnostic performance for the ACS subtypes was reported up to three months after hospital admission. The deconvolution study exposed features of within-group and between-group variation in expression measurements and identified potential cell type markers and cellular sources of differential gene expression. It was shown that the dynamics of post-admission gene expression data can be accurately modelled using sums of exponentials, suggesting that gene expression levels undergo a transient response to the ACS events before returning to equilibrium. The linear dynamical models capturing the gene regulatory interactions exhibit high predictive performance and can serve as platforms for system-level analysis, numerical simulations and intervention studies.
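The abstract above does not give the exact form of the sum-of-exponentials model. One common parameterization, shown here only as an illustration, makes the "transient response before returning to equilibrium" reading explicit. The symbols $x_g$, $c_{g,k}$, $\lambda_{g,k}$ and $K$ are assumed notation, not taken from the thesis.

```latex
% Assumed notation: x_g(t) is the expression level of gene g at time t after admission.
x_g(t) = c_{g,0} + \sum_{k=1}^{K} c_{g,k}\, e^{-\lambda_{g,k} t},
\qquad \lambda_{g,k} > 0 ,
\qquad\text{so that}\qquad
\lim_{t \to \infty} x_g(t) = c_{g,0} .
```

Under this assumption every exponential term decays, so the constant $c_{g,0}$ plays the role of the equilibrium level and the remaining terms describe the transient.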

    Machine learning and soft computing approaches to microarray differential expression analysis and feature selection.

    Differential expression analysis and feature selection are central to gene expression microarray data analysis. Standard approaches are flawed by the arbitrary assignment of cut-off parameters and the inability to adapt to the particular data set under analysis. Presented in this thesis are three novel approaches to microarray data feature selection and differential expression analysis based on various machine learning and soft computing paradigms. The first approach uses a Separability Index to select ranked genes, making gene selection less arbitrary and more intrinsic to the data. The second approach is a novel gene ranking system, the Fuzzy Gene Filter, which provides a more holistic and adaptive approach to ranking genes. The third approach is based on a Stochastic Search paradigm and uses the Population Based Incremental Learning algorithm to identify an optimal gene set with maximum inter-class distinction. All three approaches were implemented and tested on a number of data sets, and the results were compared to those of standard approaches. The Separability Index approach attained a K-Nearest Neighbour classification accuracy of 92%, outperforming the standard approach, which attained an accuracy of 89.6%. The gene list identified also displayed significant functional enrichment. The Fuzzy Gene Filter also outperformed standard approaches, attaining higher accuracies for all of the classifiers tested, on both data sets (p < 0.0231 for the prostate data set and p < 0.1888 for the lymphoma data set). Population Based Incremental Learning outperformed the Genetic Algorithm, identifying a maximum Separability Index of 97.04% (as opposed to 96.39%). Future developments include incorporating biological knowledge when ranking genes using the Fuzzy Gene Filter, as well as incorporating a functional enrichment assessment in the fitness function of the Population Based Incremental Learning algorithm.
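To make the stochastic-search idea concrete, here is a minimal sketch of Population Based Incremental Learning over gene-selection bit masks. It is not the thesis implementation: the function names `pbil` and `toy_fitness`, the learning rate, the population size and the toy fitness are all assumptions made for illustration. The abstract uses a Separability Index as fitness, which is not defined there, so a stand-in is used.

```python
import random


def pbil(fitness, n_bits, pop_size=30, generations=60, lr=0.1, seed=0):
    """Search for a high-fitness bit mask with a basic PBIL loop.

    fitness: callable taking a list of 0/1 ints and returning a number.
    Returns (best_mask, best_score) over every individual sampled.
    """
    rng = random.Random(seed)
    prob = [0.5] * n_bits  # probability that each bit (gene) is selected
    best_mask, best_score = None, float("-inf")
    for _ in range(generations):
        # Sample a population from the current probability vector.
        population = [[1 if rng.random() < p else 0 for p in prob]
                      for _ in range(pop_size)]
        winner = max(population, key=fitness)
        score = fitness(winner)
        if score > best_score:
            best_mask, best_score = winner, score
        # Shift the probability vector toward this generation's winner.
        prob = [(1 - lr) * p + lr * bit for p, bit in zip(prob, winner)]
    return best_mask, best_score


def toy_fitness(mask):
    # Hypothetical stand-in for the Separability Index: reward the first
    # three "genes" and penalise every extra gene selected.
    return sum(mask[:3]) - 0.5 * sum(mask[3:])


mask, score = pbil(toy_fitness, n_bits=8)
print(mask, score)
```

Unlike a genetic algorithm, this loop keeps no explicit population between generations; only the probability vector carries information forward, which is the main design difference between the two methods compared in the abstract.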

    Case-base retrieval of childhood leukaemia patients using gene expression data

    University of Technology, Sydney. Faculty of Engineering and Information Technology. Acute Lymphoblastic Leukaemia (ALL) is the most common childhood malignancy. Currently, ALL is diagnosed by a full blood count and a bone marrow biopsy. With microarray technology, it is becoming more feasible to look at the problem from a genetic point of view and to perform an assessment for each patient. This thesis proposes a case-base retrieval framework for ALL using a nearest neighbour classifier that can retrieve previously treated patients based on their gene expression data. However, the wealth of gene expression values generated by high-throughput microarray technologies leads to complex high-dimensional datasets, and there is a critical need to apply data-mining and computational intelligence techniques to analyse these datasets efficiently. Gene expression datasets are typically noisy and have very high dimensionality. Moreover, gene expression microarray datasets often consist of a limited number of observations relative to the large number of gene expression values (thousands of genes). These characteristics adversely affect the analysis of microarray datasets and pose a challenge for building an efficient gene-based similarity model. Four problems are associated with calculating the similarity between cancer patients on the basis of their gene expression data: feature selection, dimensionality reduction, feature weighting and imbalanced classes. The main contributions of this thesis are: (i) a case-base retrieval framework, (ii) a Balanced Iterative Random Forest algorithm for feature selection, (iii) a Local Principal Component algorithm for dimensionality reduction and visualization and (iv) a Weight Learning Genetic algorithm for feature weighting. This thesis introduces the Balanced Iterative Random Forest (BIRF) algorithm for selecting the features most relevant to the disease and discarding the non-relevant genes. BIRF is applied to four cancer microarray datasets: the Childhood Leukaemia dataset, the Golub Leukaemia dataset, the Colon dataset and the Lung cancer dataset. The Childhood Leukaemia dataset is the main target of this project and was collected from The Children's Hospital at Westmead. The classification tasks are: cancer risk type for the Childhood Leukaemia dataset (medium, standard and high risk); cancer vs. normal for the Colon dataset; acute lymphoblastic leukaemia vs. acute myeloid leukaemia for the Golub Leukaemia dataset; and malignant pleural mesothelioma vs. adenocarcinoma for the Lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE) and Naive Bayes (NB) classifiers. The BIRF results are competitive with these state-of-the-art methods and better in some cases. The Local Principal Component (LPC) algorithm introduced in this thesis for visualization is validated on three datasets: the Childhood Leukaemia, Swiss-roll and Iris datasets. Significant results are achieved with the LPC algorithm in comparison to other methods, including local linear embedding and principal component analysis. This thesis introduces a Weight Learning Genetic algorithm, based on genetic algorithms, for feature weighting in the nearest neighbour classifier. The results show that a weighted nearest neighbour classifier with weights generated by the Weight Learning Genetic algorithm produces better results than the unweighted nearest neighbour algorithm. This thesis also applies the synthetic minority over-sampling technique (SMOTE) to increase the number of points in the minority classes and reduce the effect of imbalanced classes. The results show that the minority class becomes recognised by the nearest neighbour classifier. SMOTE also reduces the effect of imbalanced classes in predicting the class of new queries, especially if the query sample should be assigned to the minority class.
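The over-sampling step mentioned above can be sketched as follows. This is a simplified illustration of the basic SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the code used in the thesis. The function name `smote`, its parameters and the toy data are assumptions, and the input needs at least two minority samples.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy minority class with two "genes" per sample (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies, which is why a nearest neighbour classifier starts to recognise that class.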

    Feature selection of microarray data using genetic algorithms and artificial neural networks

    Microarrays, which allow for the measurement of thousands of gene expression levels in parallel, have created a wealth of data not previously available to biologists, along with new computational challenges. Microarray studies are characterized by a low sample number and a large feature space with many features irrelevant to the problem being studied. This makes feature selection a necessary pre-processing step for many analyses, particularly classification. A Genetic Algorithm-Artificial Neural Network (ANN) wrapper approach is implemented to find the highest scoring set of features for an ANN classifier. Each generation relies on the performance of an ANN trained on a set of features for fitness evaluation. A publicly available leukemia microarray data set (Golub et al., 1999), consisting of 25 AML and 47 ALL leukemia samples, each with 7129 features, is used to evaluate this approach. Results show an increased performance over Golub's initial findings.

    Double feature selection and cluster analyses in mining of microarray data from cotton

    Background: Cotton fiber is a single-celled seed trichome of major biological and economic importance. In recent years, genomic approaches such as microarray-based expression profiling were used to study fiber growth and development to understand the developmental mechanisms of fiber at the molecular level. The vast volume of microarray expression data generated requires a sophisticated means of data mining in order to extract novel information that addresses fundamental questions of biological interest. One of the ways to approach microarray data mining is to increase the number of dimensions/levels to the analysis, such as comparing independent studies from different genotypes. However, adding dimensions also creates a challenge in finding novel ways for analyzing multi-dimensional microarray data. Results: Mining of independent microarray studies from Pima and Upland (TM1) cotton using double feature selection and cluster analyses identified species-specific and stage-specific gene transcripts that argue in favor of discrete genetic mechanisms that govern developmental programming of cotton fiber morphogenesis in these two cultivated species. Double feature selection analysis identified the highest number of differentially expressed genes that distinguish the fiber transcriptomes of developing Pima and TM1 fibers. These results were based on the finding that differences in fibers harvested between 17 and 24 days post-anthesis (dpa) represent the greatest expressional distance between the two species. This powerful selection method identified a subset of genes expressed during primary (PCW) and secondary (SCW) cell wall biogenesis in Pima fibers that exhibits an expression pattern that is generally reversed in TM1 at the same developmental stage. Cluster and functional analyses revealed that this subset of genes is primarily regulated during the transition stage that overlaps the termination of PCW and onset of SCW biogenesis, suggesting that these particular genes play a major role in the genetic mechanism that underlies the phenotypic differences in fiber traits between Pima and TM1. Conclusion: The novel application of double feature selection analysis led to the discovery of species- and stage-specific genetic expression patterns, which are biologically relevant to the genetic programs that underlie the differences in the fiber phenotypes in Pima and TM1. These results promise to have profound impacts on the ongoing efforts to improve cotton fiber traits.

    Quantitative model for inferring dynamic regulation of the tumour suppressor gene p53

    Background: The availability of various "omics" datasets creates the prospect of studying genome-wide genetic regulatory networks. However, one of the major challenges of using mathematical models to infer genetic regulation from microarray datasets is the lack of information on protein concentrations and activities. Most previous research was based on the assumption that the mRNA levels of a gene are consistent with its protein activities, though this is not always the case. Therefore, a more sophisticated modelling framework, together with the corresponding inference methods, is needed to accurately estimate genetic regulation from "omics" datasets. Results: This work developed a novel approach, based on a nonlinear mathematical model, to infer genetic regulation from microarray gene expression data. Using the p53 network as a test system, we used the nonlinear model to estimate the activities of the transcription factor (TF) p53 from the expression levels of its target genes, and to identify whether p53 activates or inhibits each of its target genes. The predicted top 317 putative p53 target genes were supported by DNA sequence analysis. A comparison between our prediction and other published predictions of p53 targets suggests that most of the putative p53 targets may share a common depleted or enriched sequence signal in their upstream non-coding region. Conclusions: The proposed quantitative model can be used not only to infer the regulatory relationship between a TF and its downstream genes, but also to estimate the protein activities of a TF from the expression levels of its target genes.

    Identification of disease-causing genes using microarray data mining and gene ontology

    Background: One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. One of the shortcomings of microarray data is that they provide a small number of samples relative to the number of genes. This problem reduces the classification accuracy of the methods, so gene selection is essential to improve the predictive accuracy and to identify potential marker genes for a disease. Among the numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading methods, but its performance can be reduced by the small sample size, noisy data and the fact that the method does not remove redundant genes. Methods: We propose a novel framework for gene selection which uses the advantageous features of conventional methods and addresses their weaknesses. Specifically, we have combined the Fisher method and SVMRFE to utilize the advantages of a filtering method as well as an embedded method. Furthermore, we have added a redundancy reduction stage to address the weakness of the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology, which is a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as having a small number of samples and erroneous measurement results. Results: The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth. Conclusions: The proposed method addresses the weakness of conventional methods by adding a redundancy reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL and prostate cancer with high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for cancers.
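The filtering stage of such a pipeline can be illustrated with a two-class Fisher score. The abstract does not state which variant of the Fisher criterion the authors use, so the formula below (squared difference of class means divided by the sum of class variances) is an assumption. The function name `fisher_scores` and the expression values are hypothetical. The SVMRFE, redundancy reduction and Gene Ontology stages are not shown.

```python
from statistics import mean, pvariance


def fisher_scores(expr, labels):
    """Score each gene by (mean difference)^2 / (sum of class variances).

    expr: list of samples, each a list of expression values, one per gene.
    labels: list of 0/1 class labels, one per sample.
    """
    n_genes = len(expr[0])
    scores = []
    for j in range(n_genes):
        a = [row[j] for row, y in zip(expr, labels) if y == 0]
        b = [row[j] for row, y in zip(expr, labels) if y == 1]
        spread = pvariance(a) + pvariance(b)
        # Small constant guards against division by zero for flat genes.
        scores.append((mean(a) - mean(b)) ** 2 / (spread + 1e-12))
    return scores


# Made-up data: 6 samples x 3 genes; only gene 0 separates the classes well.
expr = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.9],
    [0.9, 6.0, 0.1],
    [3.0, 5.5, 0.8],
    [3.1, 4.5, 0.3],
    [2.9, 5.0, 0.7],
]
labels = [0, 0, 0, 1, 1, 1]
scores = fisher_scores(expr, labels)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(ranking)  # genes ordered from most to least discriminative
```

A filter like this scores each gene independently, so two highly correlated genes both rank high; that is the redundancy problem the abstract's extra stage is meant to address.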

    Elephant Search with Deep Learning for Microarray Data Analysis

    Even though there is a plethora of research on microarray gene expression data analysis, it remains challenging for researchers to analyze the large and complex expression of genes effectively and efficiently. The feature (gene) selection method is of paramount importance for understanding the differences in biological and non-biological variation between samples. To address this problem, a novel elephant search (ES) based optimization is proposed to select the best gene expressions from the large volume of microarray data. Further, a promising machine learning method is envisioned to leverage such high-dimensional and complex microarray datasets, extracting hidden patterns to make meaningful predictions and accurate classifications. In particular, stochastic gradient descent based deep learning (DL) with a softmax activation function is then used on the reduced features (genes) for better classification of different samples according to their gene expression levels. The experiments are carried out on nine of the most popular cancer microarray gene selection datasets, obtained from the UCI machine learning repository. The empirical results obtained by the proposed elephant search based deep learning (ESDL) approach are compared with the most recent published article to assess its suitability for future bioinformatics research. Comment: 12 pages, 5 Tabl

    Expression profiles of genes regulating dairy cow fertility: recent findings, ongoing activities and future possibilities

    Subfertility has negative effects on dairy farm profitability, animal welfare and sustainability of animal production. Increasing herd sizes and economic pressures restrict the amount of time that farmers can spend on counteractive management. Genetic improvement will become increasingly important to restore reproductive performance. Complementary to traditional breeding value estimation procedures, genomic selection based on genome-wide information will become more widely applied. Functional genomics, including transcriptomics (gene expression profiling), produces the information needed to understand the consequences of selection, as it helps to unravel physiological mechanisms underlying female fertility traits. Insight into the latter is needed to develop new effective management strategies to combat subfertility. Here, the importance of functional genomics for dairy cow reproduction so far and in the near future is evaluated. Recent gene profiling studies in the field of dairy cow fertility are reviewed, and new data are presented on genes that are expressed in the brains of dairy cows and that are involved in dairy cow oestrus (behaviour). Fast-developing new research areas in the field of functional genomics, such as epigenetics, RNA interference, variable copy numbers and nutrigenomics, are discussed, including their promising future value for dairy cow fertility.

    Identification of an Efficient Gene Expression Panel for Glioblastoma Classification.

    We present here a novel genetic algorithm-based random forest (GARF) modeling technique that enables a reduction in the complexity of large gene disease signatures to highly accurate, greatly simplified gene panels. When applied to 803 glioblastoma multiforme samples, this method allowed the 840-gene Verhaak et al. gene panel (the standard in the field) to be reduced to a 48-gene classifier, while retaining 90.91% classification accuracy, and outperforming the best available alternative methods. Additionally, using this approach we produced a 32-gene panel which allows for better consistency between RNA-seq and microarray-based classifications, improving cross-platform classification retention from 69.67% to 86.07%. A webpage producing these classifications is available at http://simplegbm.semel.ucla.edu