19 research outputs found

    MODMatcher: Multi-Omics Data Matcher for Integrative Genomic Analysis

    Errors in sample annotation or labeling often occur in large-scale genetic or genomic studies and are difficult to avoid completely during data generation and management. For integrative genomic studies, it is critical to identify and correct these errors. Different types of genetic and genomic data are inter-connected by cis-regulations. On that basis, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors in multiple types of molecular data, which can then be used in further integrative analysis. Our results indicate that inspecting sample annotation and labeling errors is an indispensable data quality assurance step. Applied to a large lung genomic study, MODMatcher increased statistically significant genetic associations and genomic correlations by more than two-fold. In a simulation study, MODMatcher provided more robust results when using three types of omics data rather than two. We further demonstrate that MODMatcher can be broadly applied to large genomic data sets containing multiple types of omics data, such as The Cancer Genome Atlas (TCGA) data sets.

    Gender prediction based on expression of the Y-chromosome-specific gene RPS4Y1.

    The log2-transformed values of RPS4Y1 expression are clearly separated between male and female samples in both the control (CTRL) set and the set of patients with COPD (>10 in male samples and <10 in female samples). There were no gender-mismatched samples in the CTRL set and 5 mismatched samples (2 in females and 3 in males) in the COPD set (error rate of 1.5%).
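The threshold rule in this caption can be sketched in a few lines of Python. This is only an illustration: the function names, the sample tuples, and the toy expression values are hypothetical, and the cutoff of 10 is taken directly from the caption.

```python
def predict_sex_from_rps4y1(log2_expr, threshold=10.0):
    """Return 'M' if log2 RPS4Y1 expression exceeds the threshold, else 'F'."""
    return "M" if log2_expr > threshold else "F"


def find_sex_mismatches(samples, threshold=10.0):
    """Compare annotated sex with expression-based prediction.

    samples: iterable of (sample_id, annotated_sex, log2_rps4y1) tuples.
    Returns a list of (sample_id, annotated_sex, predicted_sex) for mismatches.
    """
    mismatches = []
    for sample_id, annotated, log2_expr in samples:
        predicted = predict_sex_from_rps4y1(log2_expr, threshold)
        if predicted != annotated:
            mismatches.append((sample_id, annotated, predicted))
    return mismatches


# Toy data (made up for illustration, not from the study).
toy_samples = [
    ("S1", "M", 12.4),
    ("S2", "F", 6.1),
    ("S3", "F", 11.8),  # annotated female, expression looks male
    ("S4", "M", 5.9),   # annotated male, expression looks female
]
mismatches = find_sex_mismatches(toy_samples)
error_rate = len(mismatches) / len(toy_samples)
```

A flagged sample is only a candidate labeling error; the annotation, the expression profile, or both could be wrong.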

    Sample alignment with MODMatcher.

    Initial labels of samples are used to determine cis pairs, which are then used to calculate similarity scores. Based on the similarity scores determined with three data types, the molecular data are matched with each other (1) by gender, (2) by cis-eSNPs, (3) by cis-mSNPs, (4) by cis mRNA-methylation pairs, and (5) by all trio mapping. Then, updated sample pairs are used to calculate new cis pairs for another round of alignment. Rounds of alignment are repeated until there are no further changes.

    Examples of sample alignment in the TCGA BRCA data set.

    (A) A similarity score distribution of a correctly labeled profile. The red star indicates the similarity score between self-matched profile pairs (gene expression and methylation data profiles are labeled as pertaining to the same sample). (B) Similarity scores of self-matched pairs (red stars) between gene expression and methylation profiles for two samples are lower than the similarity scores of cross-matched pairs (blue stars).
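Panel (B) describes the signature of a likely label swap: the self-matched score is lower than a cross-matched score. One simple way to act on that, shown here as a sketch, is to accept a pairing only when two profiles are each other's best match. The reciprocal-best-match rule, the function name, and the toy similarity matrix are assumptions made for illustration; the caption does not state the exact matching rule used.

```python
def reciprocal_best_matches(sim):
    """Find reciprocal best matches in a similarity matrix.

    sim[i][j] is the similarity between expression profile i and
    methylation profile j. A pair (i, j) is kept only if j is the best
    column for row i and i is the best row for column j.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best_col = [max(range(n_cols), key=lambda j: sim[i][j]) for i in range(n_rows)]
    best_row = [max(range(n_rows), key=lambda i: sim[i][j]) for j in range(n_cols)]
    return {i: j for i, j in enumerate(best_col) if best_row[j] == i}


# Toy similarity matrix (made up): profiles 1 and 2 appear to be swapped,
# because their self-matched scores (diagonal) are lower than the
# cross-matched scores.
toy_sim = [
    [0.62, 0.05, 0.01],
    [0.03, 0.08, 0.55],
    [0.02, 0.60, 0.04],
]
matches = reciprocal_best_matches(toy_sim)
# Pairs whose best match is not the profile carrying the same label.
suspected_swaps = {i: j for i, j in matches.items() if i != j}
```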

    Sample similarity measurement based on cis methylation-mRNA pairs.

    After cis methylation-mRNA pairs are identified, the methylation and gene expression levels are rank-transformed. In this figure, there are M samples and i cis pairs. The Pearson correlation is then calculated between one methylation profile and each gene expression profile and used as the sample similarity. If both methylation and gene expression profiles are from the same individual, the self-self correlation coefficient is expected to be significantly higher than the correlation coefficients with other samples.
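The similarity measure described here (rank transformation followed by Pearson correlation across cis pairs) can be sketched with the standard library alone. Several details are assumptions, not statements from the caption: ranks are taken across samples for each cis pair, all toy cis pairs are positively correlated (negatively correlated pairs would need a sign flip first), and all function names and values are hypothetical.

```python
from math import sqrt


def rank_transform(values):
    """Return 1-based ranks of values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def rank_by_feature(matrix):
    """matrix[s][p] is sample s at cis pair p; rank each pair across samples."""
    n_samples, n_pairs = len(matrix), len(matrix[0])
    cols = [rank_transform([matrix[s][p] for s in range(n_samples)])
            for p in range(n_pairs)]
    return [[cols[p][s] for p in range(n_pairs)] for s in range(n_samples)]


def similarity_matrix(meth, expr):
    """sim[a][b]: correlation of methylation profile a with expression profile b."""
    meth_r, expr_r = rank_by_feature(meth), rank_by_feature(expr)
    return [[pearson(m, e) for e in expr_r] for m in meth_r]


# Toy data: 3 samples x 4 cis pairs. Methylation rows 1 and 2 are
# deliberately swapped relative to the expression rows.
expr = [[1.0, 5.0, 9.0, 2.0], [2.0, 9.0, 1.0, 5.0], [3.0, 1.0, 5.0, 9.0]]
meth = [[0.1, 0.5, 0.9, 0.2], [0.3, 0.1, 0.5, 0.9], [0.2, 0.9, 0.1, 0.5]]
sim = similarity_matrix(meth, expr)
best_match = [max(range(len(row)), key=row.__getitem__) for row in sim]
```

In the toy data the best-matching expression profile for methylation rows 1 and 2 is not the row with the same index, which is the pattern a label swap would produce.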

    Comparison of sample alignment procedures based on three or two data types in simulated datasets.

    A total of 65 COPD samples with all three types of data (gene expression, genotype, and methylation) were used. The mislabeling error rate was fixed at 3% between gene expression and genotypes. The number of misaligned pairs was varied from 0 to 24 (corresponding to error rates of 0% to 37%). Two sample alignment procedures were applied to the simulated data sets, and the final aligned pairs were compared with the true alignment. Triangles, duo-alignment results; circles, trio-alignment results. Numbers inside triangles or circles indicate the number of misaligned samples in each simulation. Coverage is defined as the number of correctly aligned pairs divided by 65 (the number of original pairs). The true positive rate is defined as the number of correctly aligned pairs divided by the number of all aligned pairs.
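The two evaluation metrics defined in this caption translate directly into code. The sketch below uses hypothetical names and a made-up five-sample example; only the two definitions (coverage and true positive rate) come from the caption.

```python
def alignment_metrics(aligned, truth, n_original):
    """Compute (coverage, true_positive_rate) for a sample alignment.

    aligned:    dict mapping profile ID -> matched profile ID (procedure output)
    truth:      dict mapping profile ID -> true matching profile ID
    n_original: number of original pairs (65 in the simulation described)
    """
    correct = sum(1 for key, value in aligned.items() if truth.get(key) == value)
    coverage = correct / n_original
    true_positive_rate = correct / len(aligned) if aligned else 0.0
    return coverage, true_positive_rate


# Toy example (made up): 5 original pairs; the procedure aligns 4 of them,
# and 3 of those 4 are correct.
truth = {"e1": "m1", "e2": "m2", "e3": "m3", "e4": "m4", "e5": "m5"}
aligned = {"e1": "m1", "e2": "m2", "e3": "m4", "e5": "m5"}
coverage, tpr = alignment_metrics(aligned, truth, n_original=len(truth))
```

Coverage penalizes pairs the procedure leaves unaligned, while the true positive rate only penalizes wrong alignments, so a conservative procedure can score a high true positive rate with low coverage.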

    Gender prediction based on genotype data.

    The inbreeding coefficient F, estimated from X-chromosome heterozygosity, is used to infer the gender of samples. F is around 0 in most female samples and around 1 in most male samples. For 9 samples, the inferred genders were inconsistent with the clinically annotated genders (error rate of 3.5%).
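A minimal sketch of this inference follows, assuming F has already been estimated for each sample. The cutoffs of 0.2 and 0.8 are illustrative assumptions and are not taken from the caption, which only says F is near 0 in females and near 1 in males; the function names and toy values are likewise hypothetical.

```python
def infer_sex_from_f(f_value, female_max=0.2, male_min=0.8):
    """Infer sex from the X-chromosome inbreeding coefficient F.

    Returns 'F' for low F, 'M' for high F, and None when F falls in the
    ambiguous range between the two (assumed) cutoffs.
    """
    if f_value < female_max:
        return "F"
    if f_value > male_min:
        return "M"
    return None


def check_annotated_sex(samples):
    """samples: iterable of (sample_id, annotated_sex, f_value) tuples.

    Returns (mismatches, ambiguous), where mismatches lists
    (sample_id, annotated_sex, inferred_sex) and ambiguous lists sample IDs
    whose F value does not support a call either way.
    """
    mismatches, ambiguous = [], []
    for sample_id, annotated, f_value in samples:
        inferred = infer_sex_from_f(f_value)
        if inferred is None:
            ambiguous.append(sample_id)
        elif inferred != annotated:
            mismatches.append((sample_id, annotated, inferred))
    return mismatches, ambiguous


# Toy values (made up for illustration).
toy_genotypes = [
    ("G1", "F", 0.02),
    ("G2", "M", 0.98),
    ("G3", "M", 0.05),  # annotated male, genotype looks female
    ("G4", "F", 0.51),  # ambiguous F value
]
mismatches, ambiguous = check_annotated_sex(toy_genotypes)
```

Keeping an explicit ambiguous category avoids forcing a call on samples whose F value sits between the two clusters.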

    Assessment of sample alignment quality.

    The number of cis pairs is counted after each round of alignment. The number of cis pairs increased markedly after alignment in both the CTRL and COPD sets. The exact numbers of cis pairs are listed in Table S2 (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003790#pcbi.1003790.s006). (A) cis-eQTLs. (B) cis-mQTLs. (C) cis mRNA-methylation pairs.

    Gender prediction based on methylation intensity.

    The raw intensity of a Y-chromosome methylation probe corresponding to FAM197Y2P is clearly different between genders. One error was identified in the CTRL set and 15 errors were identified in the COPD set (6 in females, 9 in males) (error rate of 6.4%).