7 research outputs found

    Utility and Limitations of Using Gene Expression Data to Identify Functional Associations

    <div><p>Gene co-expression has been widely used to hypothesize gene function through guilt-by-association. However, it is not clear to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the type of dataset impacts inferences about gene functions. Here our goal is to assess the utility and limitations of using co-expression as a criterion to recover functional associations between genes. Using <i>Arabidopsis thaliana</i> as an example, we determined the percentage of gene pairs in a metabolic pathway with significant expression correlation and found that many genes in the same pathway do not have similar transcript profiles, and that the choice of dataset, annotation quality, gene function, expression similarity measure, and clustering approach significantly impact the ability to recover functional associations between genes. Some datasets are more informative in capturing coordinated expression profiles, and larger datasets are not always better. In addition, to recover the maximum number of known pathways and identify candidate genes with similar functions, it is important to explore multiple dataset combinations, similarity measures, clustering algorithms and parameters rather exhaustively. Finally, we validated the biological relevance of co-expression cluster memberships with an independent phenomics dataset and found that genes that consistently cluster with leucine degradation genes tend to have similar leucine levels in mutants. This study provides a framework for obtaining gene functional associations by maximizing the information that can be obtained from gene expression datasets.</p></div>

    Impact of datasets on pathway EC percentile.

    <p><b>(A)</b> Relationship between pathway EC percentiles calculated using the combined stress gene expression dataset and those calculated based on one of the individual stress datasets, abiotic/shoot. <b>(B)</b> Relationship between pathway EC percentiles calculated using the combined light, development and stress dataset and those calculated based on one individual dataset, stress. In (A) and (B) the dashed line represents <i>y</i> = <i>x</i>, and each dot represents a pathway. <b>(C)</b> Individual datasets and combinations of datasets used to determine pathway EC percentiles. *: NASCArray, consisting of all the datasets listed here as well as additional datasets (~700 samples). The columns in (C) correspond to those in (D) and (E). <b>(D)</b> Bar plot of the percentage of high EC pathways using different expression datasets. <b>(E)</b> Heat map of pathway EC percentiles from 13 gene expression datasets. Dark red: EC percentiles ≥ 95. Orange: 95 > EC percentiles ≥ 75. Yellow: 75 > EC percentiles ≥ 50. Blue: 50 > EC percentiles ≥ 0. <b>(F)</b> Histogram of the numbers of datasets leading to high EC values for each pathway. Example pathways are labeled with an arrow.</p>

    Performance of clusters in predicting pathways.

    <p><b>(A)</b> Histogram of the maximum scores (-log(<i>q</i>)) for over-representation of pathways within clusters. <b>(B)</b> Histogram of the maximum F measures for prediction of pathway membership based on cluster membership. <b>(C)</b> Relationship between precision and recall for clusters. In (A-C), clusters were generated using <i>k</i>-means with <i>k</i> = 100. <b>(D)</b> Heat map of over-representation scores obtained from different individual and combined clustering algorithms (top) and cluster numbers (bottom). Color represents over-representation scores (-log(<i>q</i>)) from 0 to 12. Scores less than 1.3 are indicated by dark blue. Scores more than 1.3 are represented by a spectrum of light blue to red. Pathways in the heat map are sorted by the number of times that they are over-represented in the clusters, high to low. <b>(E)</b> Bar plot showing, for each pathway, the difference between the overall maximum over-representation score (the highest score from any single cluster) and the over-representation score from clusters generated using <i>k</i>-means, <i>k</i> = 100. <b>(F)</b> Bar plot showing, for each pathway, the difference between the overall maximum F measure (the highest score from any single cluster) and the F measure from clusters generated using <i>k</i>-means, <i>k</i> = 100. <b>(G)</b> Bar plot showing, for each pathway, the difference between the maximum precision (the highest score from any single cluster) and the precision from clusters generated using <i>k</i>-means, <i>k</i> = 100. Arrow: performance values for the leucine degradation pathway.</p>
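The precision, recall and F measure named in this caption can be illustrated with a minimal sketch. It assumes the F measure is the standard harmonic mean of precision and recall (F1), which the caption does not state; the function name and the gene identifiers are hypothetical.

```python
def cluster_pathway_scores(cluster_genes, pathway_genes):
    """Score how well membership in one cluster predicts membership in one pathway.

    Assumption: F measure = harmonic mean of precision and recall (F1).
    """
    cluster = set(cluster_genes)
    pathway = set(pathway_genes)
    shared = len(cluster & pathway)  # pathway genes that fall in the cluster

    # Precision: fraction of cluster genes that belong to the pathway.
    precision = shared / len(cluster) if cluster else 0.0
    # Recall: fraction of pathway genes recovered by the cluster.
    recall = shared / len(pathway) if pathway else 0.0
    # F measure is defined as 0 when nothing is shared.
    f_measure = (2 * precision * recall / (precision + recall)) if shared else 0.0
    return precision, recall, f_measure


# Hypothetical gene identifiers, for illustration only.
precision, recall, f_measure = cluster_pathway_scores(
    ["geneA", "geneB", "geneC", "geneD"],
    ["geneA", "geneB", "geneE"],
)
```

Taking the maximum of each score over all clusters would give a per-pathway value comparable to the "maximum" quantities in panels (B) and (E-G), again under the assumptions above.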

    Relationship between pathway ECs, annotation quality and similarity measures.

    <p><b>(A)</b> Relationship between the EC calculated for pathway genes that are annotated based on experimental evidence (ECexp) and the EC calculated for pathway genes that are annotated only based on computational evidence (ECcomp). The genes used to calculate ECexp and ECcomp do not overlap. Each dot represents one pathway. Dashed line: <i>y</i> = <i>x</i> line. <b>(B)</b> Heatmap of correlations between pathway EC percentiles calculated with: partial correlations estimated with the corpcor method, Spearman’s rank correlation coefficient (Spearman), Pearson Correlation Coefficient (PCC), adjusted and normalized Mutual Information (MI), partial correlation calculated with the partialcorr method, and transformed <i>p</i>-values of Bayesian Network (BN). <b>(C)</b> Percent pathways that have high EC using different similarity measures. <b>(D)</b> Heatmap of pathway EC percentiles calculated using different similarity measures. Color represents EC percentiles. White dotted rectangles: high EC pathways that are specific to one measure.</p>

    Impact of pathway size and other factors on EC.

    <p><b>(A)</b> Relationship between ECexp of a pathway and pathway size (the number of genes assigned to a pathway). <b>(B)</b> ECexp value distribution for pathway genes with products that have subcellular location annotations. PM: Plasma membrane. ER: Endoplasmic reticulum. <b>(C)</b> ECexp value distribution for different pathway classes (general pathway categories). <b>(D)</b> Datasets used to determine pathway ECs. A “<b>+</b>” indicates that the dataset in question was used (either individually or in combination) for the analyses depicted by bar graphs in (E) and (F). The columns in (D) correspond to those in (E) and (F). <b>(E)</b> The 95th percentile PCC values (PCC95) in the null distributions for each dataset or combination of datasets. PCC95 of combined datasets (stress fold change and light (L)+stress (S)+development (D) absolute intensity) are labeled in the bar plot. <b>(F)</b> Number of pathways with high EC for each dataset and/or combination of datasets. Green: fold change values were used to calculate ECs. Orange: absolute intensity values were used to calculate ECs.</p>

    Co-expression of <i>A</i>. <i>thaliana</i> pathway genes under stress.

    <p><b>(A)</b> Boxplots of expression correlations (Pearson’s Correlation Coefficient, PCC) between pairs of genes in each <i>A</i>. <i>thaliana</i> metabolic pathway (left sub-figure) and random gene pairs (right sub-figure). The pathways are sorted by median PCC. Light blue boxes: interquartile range. Blue line: median PCCs. Red dashed line: the 95<sup>th</sup> percentile PCC value (PCC95 = 0.41) of the random gene pair PCC distribution. Black dashed line: the median PCC of the random gene pair PCC distribution. <b>(B)</b> Bar plot indicating ECs for <i>A</i>. <i>thaliana</i> pathways (left sub-figure) and random gene pairs (right sub-figure). The pathways are in the same order as in (A). The inset graph shows the number of pathways that have significantly higher ECs than randomly expected (black) and those that are not significant (white). The percentile thresholds on the x-axis are based on the random EC distribution (right sub-figure). The red dashed line designates the 95<sup>th</sup> percentile of the random EC distribution.</p>
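The PCC95 threshold and the pathway EC described in this caption can be sketched as follows. This is an illustration only: it assumes that a gene pair counts as significantly correlated when its PCC exceeds PCC95, and that EC is the percentage of such pairs within a pathway. The function names, the toy matrix and the random-pair sampling scheme are hypothetical and not taken from the study.

```python
import numpy as np


def pairwise_pcc(expr):
    """PCC for every unordered pair of rows (genes) in a genes x samples matrix."""
    corr = np.corrcoef(expr)
    upper = np.triu_indices(corr.shape[0], k=1)  # each pair once, no self-pairs
    return corr[upper]


def pcc95_from_random_pairs(expr, n_pairs, rng):
    """95th percentile of PCCs between randomly drawn gene pairs (null distribution)."""
    n_genes = expr.shape[0]
    i = rng.integers(0, n_genes, size=n_pairs)
    j = rng.integers(0, n_genes, size=n_pairs)
    keep = i != j  # drop pairs of a gene with itself
    pccs = [np.corrcoef(expr[a], expr[b])[0, 1] for a, b in zip(i[keep], j[keep])]
    return float(np.percentile(pccs, 95))


def pathway_ec(expr, pathway_rows, threshold):
    """Percentage of within-pathway gene pairs whose PCC exceeds the threshold."""
    pccs = pairwise_pcc(expr[pathway_rows])
    return 100.0 * float(np.mean(pccs > threshold))


# Toy expression matrix: rows are genes, columns are samples.
toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
])
ec = pathway_ec(toy, [0, 1, 2], threshold=0.41)  # 0.41 is the PCC95 quoted in the caption
```

In the toy matrix only one of the three gene pairs is positively correlated, so the pathway's EC is one third of its pairs, expressed as a percentage.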

    Assessing the validity of co-expression associations with leucine measurement data.

    <p><b>(A)</b> Log<sub>2</sub> of the number of leucine degradation (LeuDeg) gene mutants, random gene mutants, and wild-type (WT) control plants that were included in the analysis of leucine levels. <b>(B)</b> The absolute leucine levels (nM/gFW) in the same three types of genetic background as in (A). <b>(C)</b> Log-odds values (log ratio of the proportion of leucine degradation genes in a cluster to the proportion of non-leucine degradation genes in the same cluster) of clusters that are enriched in leucine degradation genes identified using different algorithm-size parameter combinations. <b>(D)</b> Log<sub>2</sub> of the number of genes that cluster with leucine degradation pathway genes (over-representation score >1.3) for each algorithm-size parameter combination. <b>(E)</b> Box plot showing the absolute seed leucine levels (nM/gFW) in plants with T-DNA insertions in genes clustered with leucine degradation pathway genes (enrichment score >1.3) for each algorithm-cluster size parameter combination. *: the groups of genes where mutant leucine levels are significantly greater than in wild type. <b>(F)</b> The absolute seed leucine levels (nM/gFW) of T-DNA insertion mutants of genes that cluster with leucine degradation genes. Binning on the x-axis depends on the number of times that each gene clusters with leucine degradation genes across all clustering results.</p>
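The log-odds value in panel (C) is defined only briefly, so the sketch below rests on one possible reading: the proportion of all leucine degradation genes that fall in the cluster, divided by the proportion of all other genes that fall in the same cluster, on a natural-log scale. Both the reading and the log base are assumptions, and the function name and the counts are hypothetical.

```python
import math


def cluster_log_odds(leudeg_in_cluster, leudeg_total, other_in_cluster, other_total):
    """Log ratio of two proportions for one cluster.

    Assumed reading: (LeuDeg genes in cluster / all LeuDeg genes) divided by
    (non-LeuDeg genes in cluster / all non-LeuDeg genes), natural log.
    """
    if min(leudeg_in_cluster, leudeg_total, other_in_cluster, other_total) <= 0:
        raise ValueError("all counts must be positive for the log ratio to be defined")
    leudeg_prop = leudeg_in_cluster / leudeg_total
    other_prop = other_in_cluster / other_total
    return math.log(leudeg_prop / other_prop)


# Hypothetical counts: 4 of 8 LeuDeg genes and 10 of 1000 other genes in the cluster.
score = cluster_log_odds(4, 8, 10, 1000)
```

A positive value means the cluster holds a larger share of leucine degradation genes than of other genes; a value of zero means the two shares are equal.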