62 research outputs found

    Integrating data from heterogeneous DNA microarray platforms

    DNA microarrays are one of the most widely used technologies for gene expression measurement. However, there are several distinct microarray platforms, from different manufacturers, each with its own measurement protocol, resulting in data that can hardly be compared or directly integrated. Data integration from multiple sources aims to improve the assertiveness of statistical tests, reducing the data dimensionality problem. The integration of heterogeneous DNA microarray platforms comprises a set of tasks that range from the re-annotation of the features used on gene expression, to data normalization and batch effect elimination. In this work, a complete methodology for gene expression data integration and application is proposed, which comprises a transcript-based re-annotation process and several methods for batch effect attenuation. The integrated data will be used to select the best feature set and learning algorithm for a brain tumor classification case study. The integration will consider data from heterogeneous Agilent and Affymetrix platforms, collected from public gene expression databases, such as The Cancer Genome Atlas and Gene Expression Omnibus. The authors thank the FCT Strategic Project of UID/BIO/04469/2013 unit, the project RECI/BBBEBI/0179/2012 (FCOMP-01-0124-FEDER-027462) and the project "BioInd - Biotechnology and Bioengineering for improved Industrial and Agro-Food processes", REF.NORTE-07-0124FEDER-000028, co-funded by the Programa Operacional Regional do Norte (ON.2 O Novo Norte), QREN, FEDER.

    An Analysis of Global Gene Expression Resulting from Exposure to Energetic Materials

    A dissertation presented for the Doctor of Philosophy degree, University of Tennessee, Knoxville, by Vernon LaShawn McIntosh Jr., August 2010. Dedication: This dissertation is dedicated to my family. My mother and father, Debra and Vernon McIntosh, instilled in me the respect for academic excellence and the drive to maximize my potential. Early on, my younger brother Kyle started showing signs of a shared interest in biology; thus my desire to be a positive role model for him kept me motivated. Last but certainly not least, my loving wife and best friend Nichole has been there to offer love and support throughout my entire undergraduate and graduate degrees. It's difficult to imagine making it this far without her (and that's not just because she paid the bills). Abstract: Characteristic transcriptional biomarkers have been identified for microbial cultures exposed to 2,4,6-trinitrotoluene (TNT), 2,6-dinitrotoluene (DNT), or triacetone-triperoxide (TATP). This study describes the generation of expression profiles for exposure to each compound, the functional significance of each response, and the identification of the characteristic alterations in gene expression associated with exposure to each compound. Expression profiles were generated from a total of three different candidate organisms: Escherichia coli, Saccharomyces cerevisiae, and Pseudomonas putida. Common to all three organisms, TNT exposure resulted in increased expression of genes involved in toxin resistance and drug efflux systems. The S. cerevisiae and E. coli expression profiles were both characterized by increased expression of genes involved in iron-sulfur cluster assembly, sulfur-containing amino acids, sulfate transport and assimilation, and the metabolism of nitrogen compounds.
Only E. coli and S. cerevisiae were used to generate DNT-induced expression profiles; both profiles exhibited high degrees of similarity with each organism's respective TNT profile. This was especially true of the E. coli profile, where 25 of the 30 alterations were also observed after exposure to TNT. A computational discriminant functional analysis was performed to identify characteristic biomarkers for each exposure. For each compound a set of transcriptional biomarkers (10 or fewer) was developed. An additional set of biomarkers was developed encompassing both TNT and DNT exposure. These sets of genes serve as a transcriptional fingerprint for exposure to each respective compound. The sensitivity and specificity of each transcriptional fingerprint are sufficient to correctly identify exposure to energetic materials against a background of non-energetic compound exposures. This study makes several novel contributions to the greater body of scientific knowledge: (1) this is the first documented study of the interactions of TATP in any biological system; (2) this is the first comprehensive gene expression study of the TNT response by P. putida or E. coli; (3) this is the first application of computational class prediction in the development of biomarkers for exposure to energetic material

    The classification for High-dimension low-sample size data

    A large number of applications in various fields, such as gene expression analysis and computer vision, involve high-dimensional low-sample-size (HDLSS) data sets, which pose great challenges for standard statistical and modern machine learning methods. In this paper, we propose a novel classification criterion for HDLSS data, tolerance similarity, which emphasizes the maximization of within-class variance on the premise of class separability. According to this criterion, a novel linear binary classifier is designed, denoted the No-separated Data Maximum Dispersion classifier (NPDMD). The objective of NPDMD is to find a projecting direction w along which all training samples scatter over as large an interval as possible. NPDMD has several characteristics compared to state-of-the-art classification methods. First, it works well on HDLSS data. Second, it combines the sample statistical information and local structural information (support vectors) into the objective function to find the projecting direction in the whole feature space. Third, it computes the inverse of a high-dimensional matrix in a low-dimensional space. Fourth, it is relatively simple to implement based on quadratic programming. Fifth, it is robust to model specification across various real applications. The theoretical properties of NPDMD are deduced. We conduct a series of evaluations on one simulated and six real-world benchmark data sets, including face classification and mRNA classification. NPDMD outperforms widely used approaches in most cases, or at least obtains comparable results. Comment: arXiv admin note: text overlap with arXiv:1901.0137
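
The abstract's third point, inverting a high-dimensional matrix in a low-dimensional space, can be illustrated with a standard matrix identity. This is a general linear-algebra fact and only an illustration; the abstract does not say which derivation NPDMD itself uses. The symbols X (an n-by-p data matrix with n much smaller than p) and the regularization constant \lambda > 0 are assumptions introduced here.

```latex
% Assumptions: X is an n-by-p data matrix with n << p, and \lambda > 0.
% The left side inverts a p-by-p matrix; the right side inverts only an n-by-n matrix.
\[
  (X^{\top} X + \lambda I_p)^{-1} X^{\top} \;=\; X^{\top} (X X^{\top} + \lambda I_n)^{-1}
\]
% Justification: X^{\top} (X X^{\top} + \lambda I_n) = (X^{\top} X + \lambda I_p) X^{\top};
% multiply on the left by (X^{\top} X + \lambda I_p)^{-1} and on the right by (X X^{\top} + \lambda I_n)^{-1}.
```

With thousands of genes (p) and tens of samples (n), the right-hand side replaces a p-by-p inversion with an n-by-n one.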

    A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data

    Batch effects are the systematic non-biological differences between batches (groups) of samples in microarray experiments due to various causes such as differences in sample preparation and hybridization protocols. Previous work focused mainly on the development of methods for effective batch effect removal. However, the impact of these methods on cross-batch prediction performance, which is one of the most important goals in microarray-based applications, has not been addressed. This paper uses a broad selection of data sets from the Microarray Quality Control Phase II (MAQC-II) effort, generated on three microarray platforms with different causes of batch effects, to assess the efficacy of their removal. Two data sets from cross-tissue and cross-platform experiments are also included. Of the 120 cases studied using support vector machines (SVM) and k-nearest neighbors (KNN) as classifiers and the Matthews correlation coefficient (MCC) as performance metric, we find that the Ratio-G, Ratio-A, EJLR, mean-centering and standardization methods perform better than or equivalent to no batch effect removal in 89, 85, 83, 79 and 75% of the cases, respectively, suggesting that the application of these methods is generally advisable and that ratio-based methods are preferred.
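
As a minimal sketch of two ingredients named in this abstract, the code below applies per-batch mean-centering to one gene's expression values and computes the Matthews correlation coefficient for binary predictions. The function names and toy inputs are hypothetical, and this is not the paper's implementation; the ratio-based and EJLR methods are not shown.

```python
import math
from collections import defaultdict


def mean_center_by_batch(values, batches):
    """Subtract each batch's mean from the values of one gene measured in that batch.

    Hypothetical helper: `values` and `batches` are parallel lists.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, batch in zip(values, batches):
        sums[batch] += value
        counts[batch] += 1
    means = {batch: sums[batch] / counts[batch] for batch in sums}
    return [value - means[batch] for value, batch in zip(values, batches)]


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy example: two batches whose means differ by a constant offset.
centered = mean_center_by_batch([1.0, 3.0, 10.0, 14.0], ["a", "a", "b", "b"])
score = matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0])
```

After centering, both batches have mean zero, so a classifier trained on one batch is no longer misled by a constant offset between batches.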

    Integrative disease classification based on cross-platform microarray data

    Background: Disease classification has been an important application of microarray technology. However, most microarray-based classifiers can only handle data generated within the same study, since microarray data generated by different laboratories or with different platforms cannot be compared directly due to systematic variations. This issue has severely limited the practical use of microarray-based disease classification. Results: In this study, we tested the feasibility of disease classification by integrating the large number of heterogeneous microarray datasets from public microarray repositories. Cross-platform data compatibility is created by deriving expression log-rank ratios within datasets. One may then compare vectors of log-rank ratios across datasets. In addition, we systematically map textual annotations of datasets to concepts in the Unified Medical Language System (UMLS), permitting quantitative analysis of the phenotype "distance" between datasets and automated construction of disease classes. We design a new classification approach named ManiSVM, which integrates manifold data transformation with SVM learning to exploit the data properties. Using leave-one-dataset-out cross-validation, ManiSVM achieved an overall accuracy of 70.7% (68.6% precision and 76.9% recall), with many disease classes achieving an accuracy higher than 80%. Conclusion: Our results not only demonstrated the feasibility of the integrated disease classification approach, but also showed that the classification accuracy increases with the number of homogeneous training datasets. Thus, the power of the integrative approach will increase with the continuous accumulation of microarray data in public repositories. Our study shows that automated disease diagnosis can be an important and promising application of the enormous amount of costly to generate, yet freely available, public microarray data.
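
The leave-one-dataset-out cross-validation mentioned in this abstract can be sketched as follows: every sample from one dataset is held out for testing while all other datasets are used for training, and this is repeated for each dataset. The function name and the dataset identifiers are hypothetical, and the sketch covers only the splitting step, not ManiSVM itself.

```python
def leave_one_dataset_out(dataset_ids):
    """Yield (held_out_id, train_indices, test_indices) once per distinct dataset.

    `dataset_ids[i]` is the source dataset of sample i (hypothetical labels).
    """
    for held_out in sorted(set(dataset_ids)):
        train = [i for i, d in enumerate(dataset_ids) if d != held_out]
        test = [i for i, d in enumerate(dataset_ids) if d == held_out]
        yield held_out, train, test


# Five samples drawn from three made-up datasets.
sample_sources = ["set_A", "set_A", "set_B", "set_C", "set_C"]
splits = list(leave_one_dataset_out(sample_sources))
```

Holding out whole datasets, rather than random samples, keeps samples from the same study out of both the training and the test side of a split, which is what makes the estimate a cross-study one.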

    Genome-wide inference of regulatory networks in Streptomyces coelicolor

    Background: The onset of antibiotic production in Streptomyces species is co-ordinated with differentiation events. An understanding of the genetic circuits that regulate these coupled biological phenomena is essential to discover and engineer the pharmacologically important natural products made by these species. The availability of genomic tools and access to a large warehouse of transcriptome data for the model organism, Streptomyces coelicolor, provide an incentive to decipher the intricacies of the regulatory cascades and develop biologically meaningful hypotheses. Results: In this study, more than 500 samples of genome-wide temporal transcriptome data, comprising wild-type and more than 25 regulatory gene mutants of Streptomyces coelicolor probed across multiple stress and medium conditions, were investigated. Information based on transcript and functional similarity was used to update a previously predicted whole-genome operon map and further applied to predict transcriptional networks constituting modules enriched in diverse functions such as secondary metabolism, and sigma factor. The predicted network displays a scale-free architecture with a small-world property observed in many biological networks. The networks were further investigated to identify functionally relevant modules that exhibit functional coherence and a consensus motif in the promoter elements indicative of DNA-binding elements. Conclusions: Despite the enormous experimental as well as computational challenges, a systems approach for integrating diverse genome-scale datasets to elucidate complex regulatory networks is beginning to emerge. We present an integrated analysis of transcriptome data and genomic features to refine a whole-genome operon map and to construct regulatory networks at the cistron level in Streptomyces coelicolor. The functionally relevant modules identified in this study are potential targets for further studies and verification.

    Inferring a Transcriptional Regulatory Network from Gene Expression Data Using Nonlinear Manifold Embedding

    Transcriptional networks consist of multiple regulatory layers corresponding to the activity of global regulators, specialized repressors and activators of transcription, as well as proteins and enzymes shaping the DNA template. Such intrinsic multi-dimensionality makes uncovering connectivity patterns difficult and unreliable, and it calls for the adoption of methodologies commensurate with the underlying organization of the data source. Here we present a new computational method that predicts interactions between transcription factors and target genes using a compendium of microarray gene expression data and the knowledge of known interactions between genes and transcription factors. The proposed method, called Kernel Embedding of REgulatory Networks (KEREN), is based on the concept of gene-regulon association, and it captures hidden geometric patterns of the network via manifold embedding. We applied KEREN to reconstruct gene regulatory interactions in the model bacterium E. coli on a genome-wide scale. Our method not only yields accurate predictions of verifiable interactions, outperforming comparable methodologies on certain metrics, but also demonstrates the utility of a geometric approach to the analysis of high-dimensional biological data. We also describe the general application of kernel embedding techniques to some other function and network discovery algorithms.

    Non-parametric algorithms for evaluating gene expression in cancer using DNA microarray technology

    Microarray technology has transformed the field of cancer biology by enabling the simultaneous evaluation of tens of thousands of mRNA expression levels in a single experiment. This technology has been applied to medical science in order to find gene expression markers that cluster diseased and normal tissues, genes affected by treatments, and gene network interactions. All methods of microarray data analysis can be summarized as a study of differential gene expression. This study addresses three questions: 1) the roles of selectively expressed genes for the classification of cancer, 2) issues of accounting for both experimental and biological noise, and 3) issues of comparing data derived from different research groups using the Affymetrix GeneChip™ platform. A key finding of this study is that selectively expressed genes are very powerful when used for disease classification. A model was designed to reduce noise and eliminate false positives from true results. With this approach, data from different research groups can be integrated to increase information and enable a better understanding of cancer.