145 research outputs found
Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation.
BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies.
FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects.
DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects.
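The simulation design described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it draws Gaussian noise instead of resampling a real expression matrix, models the batch effect as a constant shift on every feature, and labels the groups 0 and 1. The function name `simulate` and all of its parameters are hypothetical.

```python
import random

def simulate(n_samples=40, n_features=50, n_de=5, confounding=0.5,
             group_shift=1.0, batch_shift=1.0, seed=0):
    """Return (X, groups, batches) for a two-group, two-batch design.

    confounding = 0.5 -> batch is assigned independently of group
    confounding = 1.0 -> group and batch coincide completely
    """
    rng = random.Random(seed)
    X, groups, batches = [], [], []
    for i in range(n_samples):
        group = i % 2  # alternate between group 0 and group 1
        # With probability `confounding` the sample lands in the batch
        # that carries the same label as its group.
        batch = group if rng.random() < confounding else 1 - group
        # Assumption: Gaussian noise stands in for real expression values.
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        # Only the first n_de features are truly differentially expressed.
        for j in range(n_de):
            row[j] += group_shift * group
        # Assumption: the batch effect shifts every feature by a constant.
        row = [value + batch_shift * batch for value in row]
        X.append(row)
        groups.append(group)
        batches.append(batch)
    return X, groups, batches

X, groups, batches = simulate(confounding=1.0)
```

With `confounding=1.0` every feature separates the groups through the batch shift alone, which is the situation in which a cross-validated estimate is expected to be most optimistic.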
METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, are performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data
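A nested cross-validation scheme of the kind mentioned above might look like the following sketch. It is a simplified illustration under stated assumptions: only a kNN classifier is used, the inner loop tunes just the number of neighbours `k`, and feature selection (Wilcoxon test or lasso in the study) is left out. All function names are hypothetical.

```python
import random

def knn_predict(train_X, train_y, x, k):
    # Rank training samples by squared Euclidean distance to x.
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return int(sum(votes) * 2 > len(votes))  # majority vote over 0/1 labels

def accuracy(train_X, train_y, test_X, test_y, k):
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)

def folds(n, n_folds, rng):
    # Shuffle the indices and deal them into n_folds disjoint folds.
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def split(X, y, test_idx):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(X)) if i not in held_out]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx])

def cv_accuracy(X, y, k, n_folds, rng):
    scores = [accuracy(*split(X, y, fold), k)
              for fold in folds(len(X), n_folds, rng)]
    return sum(scores) / len(scores)

def nested_cv(X, y, k_grid=(1, 3, 5), outer=5, inner=3, seed=0):
    rng = random.Random(seed)
    outer_scores = []
    for fold in folds(len(X), outer, rng):
        tr_X, tr_y, te_X, te_y = split(X, y, fold)
        # Inner loop: choose k using the outer-training samples only,
        # so the outer test fold never influences the tuning.
        best_k = max(k_grid, key=lambda k: cv_accuracy(tr_X, tr_y, k, inner, rng))
        outer_scores.append(accuracy(tr_X, tr_y, te_X, te_y, best_k))
    return sum(outer_scores) / len(outer_scores)

# Toy data: one feature, two well-separated classes.
X = [[float(i % 2) * 10 + (i % 3) * 0.1] for i in range(30)]
y = [i % 2 for i in range(30)]
estimate = nested_cv(X, y)
```

The value returned by `nested_cv` is the cross-validated estimate; the comparison described in the abstract would additionally score the fitted classifier on samples from a batch that took no part in any fold.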
MageComet—web application for harmonizing existing large-scale experiment descriptions
Motivation: Meta-analysis of large gene expression datasets obtained from public repositories requires consistently annotated data. Curation of such experiments, however, is an expert activity which involves repetitive manipulation of text. Existing tools for automated curation are few, which creates a bottleneck in the analysis pipeline
Analysis of gene expression data from non-small cell lung carcinoma cell lines reveals distinct sub-classes from those identified at the phenotype level
Microarray data from cell lines of Non-Small Cell Lung Carcinoma (NSCLC) can be used to look for differences in gene expression between the cell lines derived from different tumour samples, and to investigate if these differences can be used to cluster the cell lines into distinct groups. Dividing the cell lines into classes can help to improve diagnosis and the development of screens for new drug candidates. The microarray data is first subjected to quality control analysis and then normalised using three alternative methods to reduce the chances of differences being artefacts resulting from the normalisation process. The final clustering into sub-classes was carried out in a conservative manner such that sub-classes were consistent across all three normalisation methods. If there is structure in the cell line population, it was expected that this would agree with histological classifications, but this was not found to be the case. To check the biological consistency of the sub-classes, the set of most strongly differentially expressed genes was identified for each pair of clusters to check if the genes that most strongly define sub-classes have biological functions consistent with NSCLC
The Pathway Coexpression Network: Revealing pathway relationships.
A goal of genomics is to understand the relationships between biological processes. Pathways contribute to functional interplay within biological processes through complex but poorly understood interactions. However, limited functional references for global pathway relationships exist. Pathways from databases such as KEGG and Reactome provide discrete annotations of biological processes. Their relationships are currently either inferred from gene set enrichment within specific experiments, or by simple overlap, linking pathway annotations that have genes in common. Here, we provide a unifying interpretation of functional interaction between pathways by systematically quantifying coexpression between 1,330 canonical pathways from the Molecular Signatures Database (MSigDB) to establish the Pathway Coexpression Network (PCxN). We estimated the correlation between canonical pathways valid in a broad context using a curated collection of 3,207 microarrays from 72 normal human tissues. PCxN accounts for shared genes between annotations to estimate significant correlations between pathways with related functions rather than with similar annotations. We demonstrate that PCxN provides novel insight into mechanisms of complex diseases using an Alzheimer's Disease (AD) case study. PCxN retrieved pathways significantly correlated with an expert curated AD gene list. These pathways have known associations with AD and were significantly enriched for genes independently associated with AD. As a further step, we show how PCxN complements the results of gene set enrichment methods by revealing relationships between enriched pathways, and by identifying additional highly correlated pathways. PCxN revealed that correlated pathways from an AD expression profiling study include functional clusters involved in cell adhesion and oxidative stress. PCxN provides expanded connections to pathways from the extracellular matrix. 
PCxN provides a powerful new framework for interrogation of global pathway relationships. Comprehensive exploration of PCxN can be performed at http://pcxn.org/
A gene expression atlas of the domestic pig
Background: This work describes the first genome-wide analysis of the transcriptional landscape of the pig. A new porcine Affymetrix expression array was designed in order to provide comprehensive coverage of the known pig transcriptome. The new array was used to generate a genome-wide expression atlas of pig tissues derived from 62 tissue/cell types. These data were subjected to network correlation analysis and clustering.
Results: The analysis presented here provides a detailed functional clustering of the pig transcriptome where transcripts are grouped according to their expression pattern, so one can infer the function of an uncharacterized gene from the company it keeps and the locations in which it is expressed. We describe the overall transcriptional signatures present in the tissue atlas, where possible assigning those signatures to specific cell populations or pathways. In particular, we discuss the expression signatures associated with the gastrointestinal tract, an organ that was sampled at 15 sites along its length and whose biology in the pig is similar to human. We identify sets of genes that define specialized cellular compartments and region-specific digestive functions. Finally, we performed a network analysis of the transcription factors expressed in the gastrointestinal tract and demonstrate how they sub-divide into functional groups that may control cellular gastrointestinal development.
Conclusions: As an important livestock animal with a physiology that is more similar than mouse to man, we provide a major new resource for understanding gene expression with respect to the known physiology of mammalian tissues and cells. The data and analyses are available on the websites http://biogps.org and http://www.macrophages.com/pig-atlas.
ArrayExpress—a public database of microarray experiments and gene expression profiles
ArrayExpress is a public database for high throughput functional genomics data. ArrayExpress consists of two parts: the ArrayExpress Repository, which is a MIAME-supportive public archive of microarray data, and the ArrayExpress Data Warehouse, which is a database of gene expression profiles selected from the repository and consistently re-annotated. Archived experiments can be queried by experiment attributes, such as keywords, species, array platform, authors, journals or accession numbers. Gene expression profiles can be queried by gene names and properties, such as Gene Ontology terms, and gene expression profiles can be visualized. ArrayExpress is a rapidly growing database; it currently contains data from >50 000 hybridizations and >1 500 000 individual expression profiles. ArrayExpress supports community standards, including MIAME, MAGE-ML and, more recently, the proposal for a spreadsheet-based data exchange format: MAGE-TAB. Availability:
Fast approximate hierarchical clustering using similarity heuristics
© 2008 Kull and Vilo; licensee BioMed Central Ltd
ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments
The ArrayExpress Archive (http://www.ebi.ac.uk/arrayexpress) is one of the three international public repositories of functional genomics data supporting publications. It includes data generated by sequencing or array-based technologies. Data are submitted by users and imported directly from the NCBI Gene Expression Omnibus. The ArrayExpress Archive is closely integrated with the Gene Expression Atlas and the sequence databases at the European Bioinformatics Institute. Advanced queries provided via ontology enabled interfaces include queries based on technology and sample attributes such as disease, cell types and anatomy
Aberrant methylation of tRNAs links cellular stress to neuro-developmental disorders.
Mutations in the cytosine-5 RNA methyltransferase NSun2 cause microcephaly and other neurological abnormalities in mice and humans. How post-transcriptional methylation contributes to the human disease is currently unknown. By comparing gene expression data with global cytosine-5 RNA methylomes in patient fibroblasts and NSun2-deficient mice, we find that loss of cytosine-5 RNA methylation increases the angiogenin-mediated endonucleolytic cleavage of transfer RNAs (tRNA), leading to an accumulation of 5' tRNA-derived small RNA fragments. Accumulation of 5' tRNA fragments in the absence of NSun2 reduces protein translation rates and activates stress pathways, leading to reduced cell size and increased apoptosis of cortical, hippocampal and striatal neurons. Mechanistically, we demonstrate that angiogenin binds with higher affinity to tRNAs lacking site-specific NSun2-mediated methylation and that the presence of 5' tRNA fragments is sufficient and required to trigger cellular stress responses. Furthermore, the enhanced sensitivity of NSun2-deficient brains to oxidative stress can be rescued through inhibition of angiogenin during embryogenesis. In conclusion, failure in NSun2-mediated tRNA methylation contributes to human diseases via stress-induced RNA cleavage
A global insight into a cancer transcriptional space using pancreatic data: importance, findings and flaws
Despite the increasing wealth of available data, the structure of cancer transcriptional space remains largely unknown. Analysis of this space would provide novel insights into the complexity of cancer, assess relative implications in complex biological processes and responses, evaluate the effectiveness of cancer models and help uncover vital facets of cancer biology not apparent from current small-scale studies. We conducted a comprehensive analysis of pancreatic cancer-expression space by integrating data from otherwise disparate studies. We found (i) a clear separation of profiles based on experimental type, with patient tissue samples, cell lines and xenograft models forming distinct groups; (ii) three subgroups within the normal samples adjacent to cancer showing disruptions to biofunctions previously linked to cancer; and (iii) that ectopic subcutaneous xenografts and cell line models do not effectively represent changes occurring in pancreatic cancer. All findings are available from our online resource for independent interrogation. The most comprehensive analysis of pancreatic cancer to date, our study primarily serves to highlight limitations inherent with a lack of raw data availability, insufficient clinical/histopathological information and ambiguous data processing. It stresses the importance of a global-systems approach to assess and maximise findings from expression profiling of malignant and non-malignant diseases
