    Identification of gene-gene interaction using principal components

    After more than 200 genome-wide association studies, there have been some successful identifications of a single novel locus. Thus, the identification of single-nucleotide polymorphisms (SNP) with interaction effects is of interest. Using the Genetic Analysis Workshop 16 data from the North American Rheumatoid Arthritis Consortium, we propose an approach to screen for SNP-SNP interaction using a two-stage method and an approach for detecting gene-gene interactions using principal components. We selected a set of 17 rheumatoid arthritis candidate genes to assess both approaches. Our approach using principal components holds promise in detecting gene-gene interactions. However, further study is needed to evaluate the power and the feasibility for a whole genome-wide association analysis using the principal components approach

    Gene Expression Profiling Predicts Survival in Conventional Renal Cell Carcinoma

    BACKGROUND: Conventional renal cell carcinoma (cRCC) accounts for most of the deaths due to kidney cancer. Tumor stage, grade, and patient performance status are used currently to predict survival after surgery. Our goal was to identify gene expression features, using comprehensive gene expression profiling, that correlate with survival. METHODS AND FINDINGS: Gene expression profiles were determined in 177 primary cRCCs using DNA microarrays. Unsupervised hierarchical clustering analysis segregated cRCC into five gene expression subgroups. Expression subgroup was correlated with survival in long-term follow-up and was independent of grade, stage, and performance status. The tumors were then divided evenly into training and test sets that were balanced for grade, stage, performance status, and length of follow-up. A semisupervised learning algorithm (supervised principal components analysis) was applied to identify transcripts whose expression was associated with survival in the training set, and the performance of this gene expression-based survival predictor was assessed using the test set. With this method, we identified 259 genes that accurately predicted disease-specific survival among patients in the independent validation group (p < 0.001). In multivariate analysis, the gene expression predictor was a strong predictor of survival independent of tumor stage, grade, and performance status (p < 0.001). CONCLUSIONS: cRCC displays molecular heterogeneity and can be separated into gene expression subgroups that correlate with survival after surgery. We have identified a set of 259 genes that predict survival after surgery independent of clinical prognostic factors
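
The train/test workflow described in this abstract can be sketched roughly as follows. This is a minimal illustration of the general supervised principal components idea, not the authors' pipeline: it assumes a continuous, uncensored outcome and screens features by absolute Pearson correlation (a real survival analysis would typically use a censoring-aware score), and the data, function names, and `n_keep` parameter are all hypothetical.

```python
import numpy as np

def fit_supervised_pc(X_train, y_train, n_keep):
    """Keep the n_keep features most correlated with y, then extract their first PC."""
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    yc = y_train - y_train.mean()
    # Univariate screening score: absolute Pearson correlation with the outcome.
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    score = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, np.inf)
    keep = np.argsort(score)[::-1][:n_keep]
    # First right singular vector of the screened, centered matrix = PC1 loadings.
    _, _, vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    loading = vt[0]
    # Orient the component so that higher scores go with higher outcomes.
    if (Xc[:, keep] @ loading) @ yc < 0:
        loading = -loading
    return mean, keep, loading

def predict_supervised_pc(model, X):
    """Project new samples onto the component learned from the training set."""
    mean, keep, loading = model
    return (X[:, keep] - mean[keep]) @ loading

# Synthetic stand-in data (assumption): 10 informative "genes" out of 200.
rng = np.random.default_rng(0)
n_samples, n_genes, n_informative = 120, 200, 10
latent = rng.normal(size=n_samples)
X = rng.normal(size=(n_samples, n_genes))
X[:, :n_informative] = latent[:, None] + 0.5 * rng.normal(size=(n_samples, n_informative))
y = latent + 0.5 * rng.normal(size=n_samples)

# Even split into training and test halves; the predictor is built on training data only.
train, test = slice(0, 60), slice(60, 120)
model = fit_supervised_pc(X[train], y[train], n_keep=n_informative)
test_score = predict_supervised_pc(model, X[test])
test_corr = np.corrcoef(test_score, y[test])[0, 1]
print(sorted(model[1].tolist()), round(float(test_corr), 2))
```

The screening step and the component are both learned on the training half, so the correlation on the test half is an honest check of the predictor rather than a refit.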

    Robust imputation method for missing values in microarray data

    Background: When analyzing microarray gene expression data, missing values are often encountered. Most multivariate statistical methods proposed for microarray data analysis cannot be applied when the data have missing values. Numerous imputation algorithms have been proposed to estimate the missing values. In this study, we develop a robust least squares estimation with principal components (RLSP) method by extending the local least square imputation (LLSimpute) method. The basic idea of our method is to employ quantile regression to estimate the missing values, using the estimated principal components of a selected set of similar genes. Results: Using the normalized root mean squares error, the performance of the proposed method was evaluated and compared with other previously proposed imputation methods. The proposed RLSP method clearly outperformed the weighted k-nearest neighbors imputation (kNNimpute) method and LLSimpute method, and showed competitive results with Bayesian principal component analysis (BPCA) method. Conclusion: Adapting the principal components of the selected genes and employing the quantile regression model improved the robustness and accuracy of missing value imputation. Thus, the proposed RLSP method is, according to our empirical studies, more robust and accurate than the widely used kNNimpute and LLSimpute methods.

    SVD-based Anatomy of Gene Expressions for Correlation Analysis in Arabidopsis thaliana

    Gene co-expression analysis has been widely used in recent years for predicting unknown gene function and its regulatory mechanisms. The predictive accuracy depends on the quality and the diversity of the data set used. In this report, we applied singular value decomposition (SVD) to array experiments in public databases and found that co-expression linkage could be estimated from a much smaller number of arrays. Correlations of co-expressed genes were assessed using two regulatory mechanisms (feedback loop of the fundamental circadian clock and a global transcription factor Myb28), as well as metabolic pathways in the AraCyc database. Our conclusion is that a smaller number of informative arrays across tissues can suffice to reproduce results comparable with those of a state-of-the-art co-expression software tool. In our SVD analysis of the Arabidopsis data set, array experiments that contributed most to the principal components included stamen development, germinating seed and stress responses on leaf
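
One way to picture ranking arrays by their contribution to the leading components is the sketch below. It is an illustration under stated assumptions, not the procedure used in the paper: the scoring rule (squared right-singular-vector entries weighted by squared singular values), the function name `rank_arrays_by_svd`, and the synthetic genes-by-arrays matrix are all hypothetical.

```python
import numpy as np

def rank_arrays_by_svd(expr, n_components):
    """expr is genes x arrays. Rank arrays by their weight in the leading components."""
    # Center each gene across arrays so the SVD captures variation, not mean level.
    centered = expr - expr.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Rows of vt live in array space; weight each squared entry by its component's variance.
    weights = s[:n_components] ** 2
    contribution = (weights[:, None] * vt[:n_components] ** 2).sum(axis=0)
    return np.argsort(contribution)[::-1], contribution

# Synthetic stand-in data (assumption): only the first 5 of 40 arrays carry a shared pattern.
rng = np.random.default_rng(1)
n_genes, n_arrays = 50, 40
pattern = rng.normal(size=n_genes)
expr = 0.3 * rng.normal(size=(n_genes, n_arrays))
expr[:, :5] += np.outer(pattern, [3.0, -3.0, 3.0, -3.0, 3.0])

order, contribution = rank_arrays_by_svd(expr, n_components=2)
informative = sorted(order[:5].tolist())

# Compare gene-gene correlations from the 5 top-ranked arrays with those from all arrays.
upper = np.triu_indices(n_genes, 1)
full_corr = np.corrcoef(expr)[upper]
subset_corr = np.corrcoef(expr[:, order[:5]])[upper]
agreement = np.corrcoef(full_corr, subset_corr)[0, 1]
print(informative, round(float(agreement), 2))
```

In this toy setting the top-ranked arrays are the ones that were given the shared pattern, and `agreement` measures how closely correlations computed from that small subset track correlations computed from every array.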

    Leukemia and small round blue-cell tumor cancer detection using microarray gene expression data set: Combining data dimension reduction and variable selection technique

    Gene expression data play an important role in cancer classification and in solving fundamental problems of cancer diagnosis. Because gene expression data for healthy and patient samples are high-throughput, a variable selection method can be applied to reduce model complexity and improve classification performance. Since variable selection procedures pose a risk of over-fitting when the number of variables is large relative to the number of samples, we propose a method that couples data dimension reduction with variable selection. This approach applies variable clustering to the original data set. Only the significant components of the local principal component analysis models are retained from each cluster. The variable selection algorithm is then performed on these locally derived principal component variables. The proposed algorithm was evaluated on two gene expression data sets, namely acute leukemia and small round blue-cell tumor (SRBCT). Our results confirmed that the classification models built on the reduced data were better than those obtained on the entire microarray gene expression profile

    IonFlow: a galaxy tool for the analysis of ionomics data sets.

    INTRODUCTION: Inductively coupled plasma mass spectrometry (ICP-MS) experiments generate complex multi-dimensional data sets that require specialist data analysis tools. OBJECTIVE: Here we describe tools to facilitate analysis of the ionome composed of high-throughput elemental profiling data. METHODS: IonFlow is a Galaxy tool written in R for ionomics data analysis and is freely accessible at https://github.com/wanchanglin/ionflow. It is designed as a pipeline that can process raw data to enable exploration and interpretation using multivariate statistical techniques and network-based algorithms, including principal components analysis, hierarchical clustering, relevance network extraction and analysis, and gene set enrichment analysis. RESULTS AND CONCLUSION: The pipeline is described and tested on two benchmark data sets of the haploid S. cerevisiae ionome and of the human HeLa cell ionome

    A novel computational approach for predicting complex phenotypes in Drosophila (starvation-sensitive and sterile) by deriving their gene expression signatures from public data

    Many research teams perform numerous genetic, transcriptomic, proteomic and other types of omic experiments to understand molecular, cellular and physiological mechanisms of disease and health. Often (but not always), the results of these experiments are deposited in publicly available repository databases. These data records often include phenotypic characteristics following genetic and environmental perturbations, with the aim of discovering underlying molecular mechanisms leading to the phenotypic responses. A constrained set of phenotypic characteristics is usually recorded, and these are mostly hypothesis driven or possible to record within financial or practical constraints. We present a novel proof-of-principle computational approach for combining publicly available gene-expression data from control/mutant animal experiments that exhibit a particular phenotype, and we use this approach to predict unobserved phenotypic characteristics in new experiments (data derived from EBI’s ArrayExpress and ExpressionAtlas respectively). We utilised available microarray gene-expression data for two phenotypes (starvation-sensitive and sterile) in Drosophila. The data were combined using a linear mixed-effects model with the inclusion of consecutive principal components to account for variability between experiments, in conjunction with Gene Ontology enrichment analysis. We show how available data can be ranked according to the likelihood of exhibiting these two phenotypes using random forest. The results from our study show that it is possible to integrate seemingly different gene-expression microarray data and predict a potential phenotypic manifestation with a relatively high degree of confidence (>80% AUC). This provides thus far unexplored opportunities for inferring unknown and unbiased phenotypic characteristics from already performed experiments, in order to identify studies for future analyses.
    Molecular mechanisms associated with gene and environment perturbations are intrinsically linked and give rise to a variety of phenotypic manifestations. Therefore, unravelling the phenotypic spectrum can help to gain insights into disease mechanisms associated with gene and environmental perturbations. Our approach uses public data that are set to increase in volume, thus providing value for money

    ERBB3 is a marker of a ganglioneuroblastoma/ganglioneuroma-like expression profile in neuroblastic tumours

    Background: Neuroblastoma (NB) tumours are commonly divided into three cytogenetic subgroups. However, by unsupervised principal components analysis of gene expression profiles we recently identified four distinct subgroups, r1-r4. In the current study we characterized these different subgroups in more detail, with a specific focus on the fourth divergent tumour subgroup (r4). Methods: Expression microarray data from four international studies corresponding to 148 neuroblastic tumour cases were divided into four expression subgroups using a previously described 6-gene signature. Differentially expressed genes between groups were identified using Significance Analysis of Microarray (SAM). Next, gene expression network modelling was performed to map signalling pathways and cellular processes representing each subgroup. Findings were validated at the protein level by immunohistochemistry and immunoblot analyses. Results: We identified several significantly up-regulated genes in the r4 subgroup, of which the tyrosine kinase receptor ERBB3 was most prominent (fold change: 132–240). By gene set enrichment analysis (GSEA) the constructed gene network of ERBB3 (n = 38 network partners) was significantly enriched in the r4 subgroup in all four independent data sets. ERBB3 was also positively correlated with the ErbB family members EGFR and ERBB2 in all data sets, and a concurrent overexpression was seen in the r4 subgroup. Further studies of histopathology categories using a fifth data set of 110 neuroblastic tumours showed a striking similarity between the expression profile of r4 and that of ganglioneuroblastoma (GNB) and ganglioneuroma (GN) tumours. In contrast, the NB histopathological subtype was dominated by mitotic regulating genes, characterizing unfavourable NB subgroups in particular. The high ErbB3 expression in GN tumour types was verified at the protein level, and expression was seen mainly in the mature ganglion cells.
    Conclusions: This study demonstrates the importance of performing unsupervised clustering and subtype discovery of data sets prior to analyses to avoid a mixture of tumour subtypes, which may otherwise give distorted results and lead to incorrect conclusions. The current study identifies ERBB3 as a clear-cut marker of a GNB/GN-like expression profile, and we suggest a 7-gene expression signature (including ERBB3) as a complement to histopathology analysis of neuroblastic tumours. Further studies of ErbB3 and other ErbB family members and their role in neuroblastic differentiation and pathogenesis are warranted

    Exploring pleiotropy using principal components

    A standard multivariate principal components (PCs) method was utilized to identify clusters of variables that may be controlled by a common gene or genes (pleiotropy). Heritability estimates were obtained and linkage analyses performed on six individual traits (total cholesterol (Chol), high and low density lipoproteins, triglycerides (TG), body mass index (BMI), and systolic blood pressure (SBP)) and on each PC to compare our ability to identify major gene effects. Using the simulated data from Genetic Analysis Workshop 13 (Cohort 1 and 2 data for year 11), the quantitative traits were first adjusted for age, sex, and smoking (cigarettes per day). Adjusted variables were standardized and PCs calculated followed by orthogonal transformation (varimax rotation). Rotated PCs were then subjected to heritability and quantitative multipoint linkage analysis. The first three PCs explained 73% of the total phenotypic variance. Heritability estimates were above 0.60 for all three PCs. We performed linkage analyses on the PCs as well as the individual traits. The majority of pleiotropic and trait-specific genes were not identified. Standard PCs analysis methods did not facilitate the identification of pleiotropic genes affecting the six traits examined in the simulated data set. In addition, genes contributing 20% of the variance in traits with over 0.60 heritability estimates could not be identified in this simulated data set using traditional quantitative trait linkage analyses. Lack of identification of pleiotropic and trait-specific genes in some cases may reflect their low contribution to the traits/PCs examined or more importantly, characteristics of the sample group analyzed, and not simply a failure of the PC approach itself
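
The "standardize, compute PCs, then varimax-rotate" step described in this abstract can be sketched as below. This is a minimal illustration assuming six synthetic traits driven by two latent factors; it omits the covariate adjustment, heritability, and linkage steps entirely, and the function names and data are hypothetical. The rotation uses the standard SVD-based varimax iteration.

```python
import numpy as np

def pca_loadings(data, n_components):
    """PCA on the correlation matrix (i.e. on standardized traits)."""
    corr = np.corrcoef(data, rowvar=False)
    eigval, eigvec = np.linalg.eigh(corr)           # eigh returns ascending eigenvalues
    top = np.argsort(eigval)[::-1][:n_components]
    loadings = eigvec[:, top] * np.sqrt(eigval[top])
    explained = eigval[top].sum() / corr.shape[0]   # fraction of total variance
    return loadings, explained

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (traits x components) loading matrix."""
    n_traits, n_comp = loadings.shape
    rotation = np.eye(n_comp)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)
        gradient = loadings.T @ (rotated ** 3 - rotated * col_ss / n_traits)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt                            # closest orthogonal matrix
        if s.sum() < objective * (1 + tol):          # stop when the criterion stalls
            break
        objective = s.sum()
    return loadings @ rotation, rotation

# Synthetic stand-in data (assumption): traits 0-2 share one factor, traits 3-5 another.
rng = np.random.default_rng(2)
n_people = 500
f1, f2 = rng.normal(size=n_people), rng.normal(size=n_people)
data = np.column_stack(
    [f1 + 0.6 * rng.normal(size=n_people) for _ in range(3)]
    + [f2 + 0.6 * rng.normal(size=n_people) for _ in range(3)]
)

loadings, explained = pca_loadings(data, n_components=3)
rotated, rotation = varimax(loadings)
print(round(float(explained), 2))
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, it redistributes variance among the retained components without changing the total they explain or each trait's communality; it only makes the loading pattern easier to read.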