2,511 research outputs found

    Predicting gene ontology from a global meta-analysis of 1-color microarray experiments

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Global meta-analysis (GMA) of microarray data to identify genes with highly similar co-expression profiles is emerging as an accurate method to predict gene function and phenotype, even in the absence of published data on the gene(s) being analyzed. With a third of human genes still uncharacterized, this approach is a promising way to direct experiments and rapidly understand the biological roles of genes. To predict function for genes of interest, GMA relies on a guilt-by-association approach to identify sets of genes with known functions that are consistently co-expressed with it across different experimental conditions, suggesting coordinated regulation for a specific biological purpose. Our goal here is to define how sample, dataset size and ranking parameters affect prediction performance.</p> <p>Results</p> <p>13,000 human 1-color microarrays were downloaded from GEO for GMA analysis. Prediction performance was benchmarked by calculating the distance within the Gene Ontology (GO) tree between predicted function and annotated function for sets of 100 randomly selected genes. We find the number of new predicted functions rises as more datasets are added, but begins to saturate at a sample size of approximately 2,000 experiments. For the gene set used to predict function, we find precision to be higher with smaller set sizes, yet with correspondingly poor recall and, as set size is increased, recall and F-measure also tend to increase but at the cost of precision.</p> <p>Conclusions</p> <p>Of the 20,813 genes expressed in 50 or more experiments, at least one predicted GO category was found for 72.5% of them. Of the 5,720 genes without GO annotation, 4,189 had at least one predicted ontology using top 40 co-expressed genes for prediction analysis. For the remaining 1,531 genes without GO predictions or annotations, ~17% (257 genes) had sufficient co-expression data yet no statistically significantly overrepresented ontologies, suggesting their regulation may be more complex.</p

    High-throughput processing and normalization of one-color microarrays for transcriptional meta-analyses

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Microarray experiments are becoming increasingly common in biomedical research, as is their deposition in publicly accessible repositories, such as Gene Expression Omnibus (GEO). As such, there has been a surge in interest to use this microarray data for meta-analytic approaches, whether to increase sample size for a more powerful analysis of a specific disease (e.g. lung cancer) or to re-examine experiments for reasons different than those examined in the initial, publishing study that generated them. For the average biomedical researcher, there are a number of practical barriers to conducting such meta-analyses such as manually aggregating, filtering and formatting the data. Methods to automatically process large repositories of microarray data into a standardized, directly comparable format will enable easier and more reliable access to microarray data to conduct meta-analyses.</p> <p>Methods</p> <p>We present a straightforward, simple but robust against potential outliers method for automatic quality control and pre-processing of tens of thousands of single-channel microarray data files. GEO GDS files are quality checked by comparing parametric distributions and quantile normalized to enable direct comparison of expression level for subsequent meta-analyses.</p> <p>Results</p> <p>13,000 human 1-color experiments were processed to create a single gene expression matrix that subsets can be extracted from to conduct meta-analyses. Interestingly, we found that when conducting a global meta-analysis of gene-gene co-expression patterns across all 13,000 experiments to predict gene function, normalization had minimal improvement over using the raw data.</p> <p>Conclusions</p> <p>Normalization of microarray data appears to be of minimal importance on analyses based on co-expression patterns when the sample size is on the order of thousands microarray datasets. Smaller subsets, however, are more prone to aberrations and artefacts, and effective means of automating normalization procedures not only empowers meta-analytic approaches, but aids in reproducibility by providing a standard way of approaching the problem.</p> <p>Data availability: matrix containing normalized expression of 20,813 genes across 13,000 experiments is available for download at . Source code for GDS files pre-processing is available from the authors upon request.</p

    A Semi-Supervised Method for Predicting Transcription Factorā€“Gene Interactions in Escherichia coli

    Get PDF
    While Escherichia coli has one of the most comprehensive datasets of experimentally verified transcriptional regulatory interactions of any organism, it is still far from complete. This presents a problem when trying to combine gene expression and regulatory interactions to model transcriptional regulatory networks. Using the available regulatory interactions to predict new interactions may lead to better coverage and more accurate models. Here, we develop SEREND (SEmi-supervised REgulatory Network Discoverer), a semi-supervised learning method that uses a curated database of verified transcriptional factorā€“gene interactions, DNA sequence binding motifs, and a compendium of gene expression data in order to make thousands of new predictions about transcription factorā€“gene interactions, including whether the transcription factor activates or represses the gene. Using genome-wide binding datasets for several transcription factors, we demonstrate that our semi-supervised classification strategy improves the prediction of targets for a given transcription factor. To further demonstrate the utility of our inferred interactions, we generated a new microarray gene expression dataset for the aerobic to anaerobic shift response in E. coli. We used our inferred interactions with the verified interactions to reconstruct a dynamic regulatory network for this response. The network reconstructed when using our inferred interactions was better able to correctly identify known regulators and suggested additional activators and repressors as having important roles during the aerobicā€“anaerobic shift interface

    Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays

    Full text link
    Volcano plot displays unstandardized signal (e.g. log-fold-change) against noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from the t test). We review the basic and an interactive use of the volcano plot, and its crucial role in understanding the regularized t-statistic. The joint filtering gene selection criterion based on regularized statistics has a curved discriminant line in the volcano plot, as compared to the two perpendicular lines for the "double filtering" criterion. This review attempts to provide an unifying framework for discussions on alternative measures of differential expression, improved methods for estimating variance, and visual display of a microarray analysis result. We also discuss the possibility to apply volcano plots to other fields beyond microarray.Comment: 8 figure

    Heterologous Tissue Culture Expression Signature Predicts Human Breast Cancer Prognosis

    Get PDF
    BACKGROUND: Cancer patients have highly variable clinical outcomes owing to many factors, among which are genes that determine the likelihood of invasion and metastasis. This predisposition can be reflected in the gene expression pattern of the primary tumor, which may predict outcomes and guide the choice of treatment better than other clinical predictors. METHODOLOGY/PRINCIPAL FINDINGS: We developed an mRNA expression-based model that can predict prognosis/outcomes of human breast cancer patients regardless of microarray platform and patient group. Our model was developed using genes differentially expressed in mouse plasma cell tumors growing in vivo versus those growing in vitro. The prediction system was validated using published data from three cohorts of patients for whom microarray and clinical data had been compiled. The model stratified patients into four independent survival groups (BEST, GOOD, BAD, and WORST: log-rank test pā€Š=ā€Š1.7Ɨ10(āˆ’8)). CONCLUSIONS: Our model significantly improved the survival prediction over other expression-based models and permitted recognition of patients with different prognoses within the estrogen receptor-positive group and within a single pathological tumor class. Basing our predictor on a dataset that originated in a different species and a different cell type may have rendered it less sensitive to proliferation differences and endowed it with wide applicability. SIGNIFICANCE: Prognosis prediction for patients with breast cancer is currently based on histopathological typing and estrogen receptor positivity. Yet both assays define groups that are heterogeneous in survival. Gene expression profiling allows subdivision of these groups and recognition of patients whose tumors are very unlikely to be lethal and those with much grimmer outlooks, which can augment the predictive power of conventional tumor analysis and aid the clinician in choosing relaxed vs. aggressive therapy

    Bioinformatics resources for cancer research with an emphasis on gene function and structure prediction tools

    Get PDF
    The immensely popular fields of cancer research and bioinformatics overlap in many different areas, e.g. large data repositories that allow for users to analyze data from many experiments (data handling, databases), pattern mining, microarray data analysis, and interpretation of proteomics data. There are many newly available resources in these areas that may be unfamiliar to most cancer researchers wanting to incorporate bioinformatics tools and analyses into their work, and also to bioinformaticians looking for real data to develop and test algorithms. This review reveals the interdependence of cancer research and bioinformatics, and highlight the most appropriate and useful resources available to cancer researchers. These include not only public databases, but general and specific bioinformatics tools which can be useful to the cancer researcher. The primary foci are function and structure prediction tools of protein genes. The result is a useful reference to cancer researchers and bioinformaticians studying cancer alike

    Predicting genome-wide redundancy using machine learning

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Gene duplication can lead to genetic redundancy, which masks the function of mutated genes in genetic analyses. Methods to increase sensitivity in identifying genetic redundancy can improve the efficiency of reverse genetics and lend insights into the evolutionary outcomes of gene duplication. Machine learning techniques are well suited to classifying gene family members into redundant and non-redundant gene pairs in model species where sufficient genetic and genomic data is available, such as <it>Arabidopsis thaliana</it>, the test case used here.</p> <p>Results</p> <p>Machine learning techniques that combine multiple attributes led to a dramatic improvement in predicting genetic redundancy over single trait classifiers alone, such as BLAST E-values or expression correlation. In withholding analysis, one of the methods used here, Support Vector Machines, was two-fold more precise than single attribute classifiers, reaching a level where the majority of redundant calls were correctly labeled. Using this higher confidence in identifying redundancy, machine learning predicts that about half of all genes in <it>Arabidopsis </it>showed the signature of predicted redundancy with at least one but typically less than three other family members. Interestingly, a large proportion of predicted redundant gene pairs were relatively old duplications (e.g., Ks > 1), suggesting that redundancy is stable over long evolutionary periods.</p> <p>Conclusions</p> <p>Machine learning predicts that most genes will have a functionally redundant paralog but will exhibit redundancy with relatively few genes within a family. The predictions and gene pair attributes for <it>Arabidopsis </it>provide a new resource for research in genetics and genome evolution. These techniques can now be applied to other organisms.</p

    Exploring Gene Regulatory Interaction Networks and predicting therapeutic molecules for Hypopharyngeal Cancer and EGFR-mutated lung adenocarcinoma

    Full text link
    With the advent of Information technology, the Bioinformatics research field is becoming increasingly attractive to researchers and academicians. The recent development of various Bioinformatics toolkits has facilitated the rapid processing and analysis of vast quantities of biological data for human perception. Most studies focus on locating two connected diseases and making some observations to construct diverse gene regulatory interaction networks, a forerunner to general drug design for curing illness. For instance, Hypopharyngeal cancer is a disease that is associated with EGFR-mutated lung adenocarcinoma. In this study, we select EGFR-mutated lung adenocarcinoma and Hypopharyngeal cancer by finding the Lung metastases in hypopharyngeal cancer. To conduct this study, we collect Mircorarray datasets from GEO (Gene Expression Omnibus), an online database controlled by NCBI. Differentially expressed genes, common genes, and hub genes between the selected two diseases are detected for the succeeding move. Our research findings have suggested common therapeutic molecules for the selected diseases based on 10 hub genes with the highest interactions according to the degree topology method and the maximum clique centrality (MCC). Our suggested therapeutic molecules will be fruitful for patients with those two diseases simultaneously.Comment: Accepted In The FEBS OPEN BIO (Q2, SCOPUS, SCIE, IF: 2.6, CS: 4.7), Wiley Journal, On FEB 25, 202
    • ā€¦
    corecore