310 research outputs found

    Phenotype Prediction Using Regularized Regression on Genetic Data in the DREAM5 Systems Genetics B Challenge

    A major goal of large-scale genomics projects is to enable the use of data from high-throughput experimental methods to predict complex phenotypes such as disease susceptibility. The DREAM5 Systems Genetics B Challenge solicited algorithms to predict soybean plant resistance to the pathogen Phytophthora sojae from training sets including phenotype, genotype, and gene expression data. The challenge test set was divided into three subcategories, one requiring prediction based on only genotype data, another on only gene expression data, and the third on both genotype and gene expression data. Here we present our approach, primarily using regularized regression, which received the best-performer award for subchallenge B2 (gene expression only). We found that despite the availability of 941 genotype markers and 28,395 gene expression features, optimal models determined by cross-validation experiments typically used fewer than ten predictors, underscoring the importance of strong regularization in noisy datasets with far more features than samples. We also present substantial analysis of the training and test setup of the challenge, identifying high variance in performance on the gold standard test sets. Funding: National Science Foundation (U.S.) Graduate Research Fellowship Program; National Defense Science and Engineering Graduate Fellowship.
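The abstract above does not say which penalty was used, so the following is only a minimal sketch of one common form of regularized regression: an L1-penalized (lasso) fit by coordinate descent, with the penalty strength chosen by cross-validation. The function names, the absence of an intercept (features and response are assumed to be centered), the interleaved fold assignment, and the toy data are all assumptions made for illustration, not the authors' method.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by coordinate descent.

    Assumes centered data, so no intercept is fitted (illustrative choice).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero feature carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


def cv_error(X, y, lam, n_folds=2):
    """Mean squared prediction error over interleaved folds (hypothetical split)."""
    n = len(X)
    total = 0.0
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            pred = sum(b * x for b, x in zip(beta, X[i]))
            total += (y[i] - pred) ** 2
    return total / n


def select_lambda(X, y, grid, n_folds=2):
    """Pick the penalty in grid with the lowest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(X, y, lam, n_folds))


# Toy data: only the first feature drives the response.
TOY_X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
TOY_Y = [2.1, 1.9, -2.0, -2.0]
print(lasso_fit(TOY_X, TOY_Y, lam=0.5))
print(select_lambda(TOY_X, TOY_Y, grid=[0.01, 0.1, 0.5]))
```

On the toy data, a strong penalty sets the two uninformative coefficients exactly to zero while keeping a shrunken coefficient on the informative feature, which is the behavior that lets a heavily regularized model end up with only a handful of predictors.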

    Gene expression patterns associated with p53 status in breast cancer

    BACKGROUND: Breast cancer subtypes identified in genomic studies have different underlying genetic defects. Mutations in the tumor suppressor p53 occur more frequently in estrogen receptor (ER) negative, basal-like and HER2-amplified tumors than in luminal, ER positive tumors. Thus, because p53 mutation status is tightly linked to other characteristics of prognostic importance, it is difficult to identify p53's independent prognostic effects. The relation between p53 status and subtype can be better studied by combining data from primary tumors with data from isogenic cell line pairs (with and without p53 function). METHODS: The p53-dependent gene expression signatures of four cell lines (MCF-7, ZR-75-1, and two immortalized human mammary epithelial cell lines) were identified by comparing p53-RNAi transduced cell lines to their parent cell lines. Cell lines were treated with vehicle only or doxorubicin to identify p53 responses in both non-induced and induced states. The cell line signatures were compared with p53-mutation associated genes in breast tumors. RESULTS: Each cell line displayed distinct patterns of p53-dependent gene expression, but cell type specific (basal vs. luminal) commonalities were evident. Further, a common gene expression signature associated with p53 loss across all four cell lines was identified. This signature showed overlap with the signature of p53 loss/mutation status in primary breast tumors. Moreover, the common cell-line tumor signature excluded genes that were breast cancer subtype-associated, but not downstream of p53. To validate the biological relevance of the common signature, we demonstrated that this gene set predicted relapse-free, disease-specific, and overall survival in independent test data. CONCLUSION: In the presence of breast cancer heterogeneity, experimental and biologically-based methods for assessing gene expression in relation to p53 status provide prognostic and biologically-relevant gene lists. 
    Our biologically-based refinements excluded genes that were associated with subtype but not downstream of p53 signaling, and identified a signature for p53 loss that is shared across breast cancer subtypes.

    A CD8+ T cell transcription signature predicts prognosis in autoimmune disease.

    Autoimmune diseases are common and debilitating, but their severe manifestations could be reduced if biomarkers were available to allow individual tailoring of potentially toxic immunosuppressive therapy. Gene expression-based biomarkers facilitating such tailoring of chemotherapy in cancer, but not autoimmunity, have been identified and translated into clinical practice. We show that transcriptional profiling of purified CD8+ T cells, which avoids the confounding influences of unseparated cells, identifies two distinct subject subgroups predicting long-term prognosis in two autoimmune diseases: antineutrophil cytoplasmic antibody (ANCA)-associated vasculitis (AAV), a chronic, severe disease characterized by inflammation of medium-sized and small blood vessels, and systemic lupus erythematosus (SLE), characterized by autoantibodies, immune complex deposition and diverse clinical manifestations ranging from glomerulonephritis to neurological dysfunction. We show that the subset of genes defining the poor prognostic group is enriched for genes involved in the interleukin-7 receptor (IL-7R) pathway and T cell receptor (TCR) signaling and those expressed by memory T cells. Furthermore, the poor prognostic group is associated with an expanded CD8+ T cell memory population. These subgroups, which are also found in the normal population and can be identified by measuring expression of only three genes, raise the prospect of individualized therapy and suggest new potential therapeutic targets in autoimmunity.

    E2F1 and KIAA0191 expression predicts breast cancer patient survival

    BACKGROUND: Gene expression profiling of human breast tumors has uncovered several molecular signatures that can divide breast cancer patients into good and poor outcome groups. However, these signatures typically comprise many genes (~50-100), and the prognostic tests associated with identifying these signatures in patient tumor specimens require complicated methods, which are not routinely available in most hospital pathology laboratories, thus limiting their use. Hence, there is a need for more practical methods to predict patient survival. METHODS: We modified a feature selection algorithm and used survival analysis to derive a 2-gene signature that accurately predicts breast cancer patient survival. RESULTS: We developed a tree-based decision method that segregated patients into various risk groups using KIAA0191 expression in the context of E2F1 expression levels. This approach led to highly accurate survival predictions in a large cohort of breast cancer patients using only a 2-gene signature. CONCLUSIONS: Our observations suggest a possible relationship between E2F1 and KIAA0191 expression that is relevant to the pathogenesis of breast cancer. Furthermore, our findings raise the prospect that the practicality of patient prognosis methods may be improved by reducing the number of genes required for analysis. Indeed, our E2F1/KIAA0191 2-gene signature would be highly amenable to an immunohistochemistry-based test, which is commonly used in hospital laboratories.

    Comparison of EGFR and K-RAS gene status between primary tumours and corresponding metastases in NSCLC

    In non-small-cell lung cancer (NSCLC), epidermal growth factor receptor (EGFR) and K-RAS mutations of the primary tumour are associated with responsiveness and resistance to tyrosine kinase inhibitors (TKIs), respectively. However, the EGFR and K-RAS mutation status in metastases is not well studied. We compared the mutation status of these genes between the primary tumours and the corresponding metastases of 25 patients. EGFR and K-RAS mutation status differed between primary tumours and corresponding metastases in 7 (28%) and 6 (24%) of the 25 patients, respectively. Among the 25 primary tumours, three 'hotspot' and two non-classical EGFR mutations were found; none of the corresponding metastases had the same mutation pattern. Among the five (20%) K-RAS mutations detected in the primary tumours, two were maintained in the corresponding metastasis. EGFR and K-RAS mutations were detected in the metastatic tumours of three (12%) and five (20%) patients, respectively. Expression of EGFR and of phosphorylated EGFR showed 10% and 50% discordance, respectively. We conclude that there is substantial discordance in EGFR and K-RAS mutational status between the primary tumours and corresponding metastases in patients with NSCLC, and this might have therapeutic implications when treatment with TKIs is considered.

    Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction

    BACKGROUND: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers. RESULTS: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. Compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher) or recursive feature elimination embedded in SVM (RFE), TSP becomes increasingly more effective as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets. CONCLUSIONS: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.
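The pair ranking that the abstract above refers to as "relative expression ordering of gene pairs" can be sketched as follows. This is a minimal illustration rather than the authors' implementation: each gene pair (i, j) is scored by how much the frequency of "gene i is expressed below gene j" differs between the two classes, and the top k pairs are kept. The function names, the tie-breaking rule, and the toy data are assumptions, and reducing the data to the genes in the selected pairs is only one plausible reading of "dimensionally reduced subspace".

```python
from itertools import combinations


def tsp_scores(X, y):
    """Score each gene pair (i, j) by |P(x_i < x_j | class A) - P(x_i < x_j | class B)|."""
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch expects exactly two classes")
    n_genes = len(X[0])
    scores = {}
    for i, j in combinations(range(n_genes), 2):
        freqs = []
        for c in classes:
            rows = [x for x, label in zip(X, y) if label == c]
            freqs.append(sum(x[i] < x[j] for x in rows) / len(rows))
        scores[(i, j)] = abs(freqs[0] - freqs[1])
    return scores


def top_pairs(X, y, k):
    """Return the k highest-scoring pairs; ties are broken by pair index (assumption)."""
    scores = tsp_scores(X, y)
    return sorted(scores, key=lambda pair: (-scores[pair], pair))[:k]


def reduce_to_pair_genes(X, pairs):
    """Keep only the columns of genes that appear in the selected pairs."""
    genes = sorted({g for pair in pairs for g in pair})
    return [[x[g] for g in genes] for x in X]


# Toy expression matrix: 4 samples x 3 genes, two samples per class.
TOY_X = [[1, 5, 3], [2, 6, 1], [7, 3, 2], [8, 2, 9]]
TOY_Y = [0, 0, 1, 1]
best = top_pairs(TOY_X, TOY_Y, k=1)
print(best)
print(reduce_to_pair_genes(TOY_X, best))
```

The reduced matrix could then be passed to any downstream classifier, such as an SVM, in place of the full expression matrix. Because the score depends only on within-sample orderings, it is unaffected by monotone per-sample normalization.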

    Classification of heterogeneous microarray data by maximum entropy kernel

    BACKGROUND: There is a large amount of microarray data accumulating in public databases, providing various data waiting to be analyzed jointly. Powerful kernel-based methods are commonly used in microarray analyses with support vector machines (SVMs) to approach a wide range of classification problems. However, the standard vectorial data kernel family (linear, RBF, etc.), which takes vectorial data as input, often fails in prediction if the data come from different platforms or laboratories, due to the low gene overlaps or consistencies between the different datasets. RESULTS: We introduce a new type of kernel called the maximum entropy (ME) kernel, which has no pre-defined function but is generated by kernel entropy maximization with sample distance matrices as constraints, into the field of SVM classification of microarray data. We assessed the performance of the ME kernel with three different datasets: heterogeneous kidney carcinoma, noise-introduced leukemia, and heterogeneous oral cavity carcinoma metastasis data. The results clearly show that the ME kernel is very robust for heterogeneous data containing missing values and high noise, and gives higher prediction accuracies than the standard kernels, namely, linear, polynomial and RBF. CONCLUSION: The results demonstrate its utility in effectively analyzing promiscuous microarray data of rare specimens, e.g., minor diseases or species, that present difficulty in compiling homogeneous data in a single laboratory.

    Assessment of a 44 Gene Classifier for the Evaluation of Chronic Fatigue Syndrome from Peripheral Blood Mononuclear Cell Gene Expression

    Chronic fatigue syndrome (CFS) is a clinically defined illness estimated to affect millions of people worldwide, causing significant morbidity and an annual cost of billions of dollars. Currently there are no laboratory-based diagnostic methods for CFS. However, differences in gene expression profiles between CFS patients and healthy persons have been reported in the literature. Using mRNA relative quantities for 44 previously identified reporter genes taken from a large dataset comprising both CFS patients and healthy volunteers, we derived a gene profile scoring metric to accurately classify CFS and healthy samples. This metric out-performed any of the reporter genes used individually as a classifier of CFS. To determine whether the reporter genes were robust across populations, we applied this metric to a separate blind dataset of mRNA relative quantities from a new population of CFS patients and healthy persons, with limited success. Although the metric correctly classified roughly two-thirds of both CFS and healthy samples, the level of misclassification was high. We conclude that many of the previously identified reporter genes are study-specific and thus cannot be used as a broad CFS diagnostic.

    Core module biomarker identification with network exploration for breast cancer metastasis

    BACKGROUND: In a complex disease, the expression of many genes can be significantly altered, leading to the appearance of a differentially expressed "disease module". Some of these genes directly correspond to the disease phenotype (i.e. "driver" genes), while others represent closely-related first-degree neighbours in gene interaction space. The remaining genes consist of further removed "passenger" genes, which are often not directly related to the original cause of the disease. For prognostic and diagnostic purposes, it is crucial to be able to separate the group of "driver" genes and their first-degree neighbours (i.e. the "core module") from the general "disease module". RESULTS: We have developed COMBINER: COre Module Biomarker Identification with Network ExploRation. COMBINER is a novel pathway-based approach for selecting highly reproducible discriminative biomarkers. We applied COMBINER to three benchmark breast cancer datasets for identifying prognostic biomarkers. COMBINER-derived biomarkers exhibited 10-fold higher reproducibility than other methods, with up to 30-fold greater enrichment for known cancer-related genes, and 4-fold enrichment for known breast cancer susceptible genes. More than 50% and 40% of the resulting biomarkers were cancer and breast cancer specific, respectively. The identified modules were overlaid onto a map of intracellular pathways that comprehensively highlighted the hallmarks of cancer. Furthermore, we constructed a global regulatory network intertwining several functional clusters and uncovered 13 confident "driver" genes of breast cancer metastasis. CONCLUSIONS: COMBINER can efficiently and robustly identify disease core module genes and construct their associated regulatory network. In the same way, it is potentially applicable in the characterization of any disease that can be probed with microarrays.

    Regulation of mammary gland branching morphogenesis by the extracellular matrix and its remodeling enzymes.

    A considerable body of research indicates that mammary gland branching morphogenesis is dependent, in part, on the extracellular matrix (ECM), ECM receptors, including integrins, and ECM-degrading enzymes, including matrix metalloproteinases (MMPs) and their inhibitors, tissue inhibitors of metalloproteinases (TIMPs). There is some evidence that these ECM cues affect one or more of the following processes: cell survival, polarity, proliferation, differentiation, adhesion, and migration. Both three-dimensional culture models and genetic manipulations of the mouse mammary gland have been used to study the signaling pathways that affect these processes. However, the precise mechanisms of ECM-directed mammary morphogenesis are not well understood. Mammary morphogenesis involves epithelial 'invasion' of adipose tissue, a process akin to invasion by breast cancer cells, although the former is a highly regulated developmental process. How these morphogenic pathways are integrated in the normal gland and how they become dysregulated and subverted in the progression of breast cancer also remain largely unanswered questions.