2,015 research outputs found

    Feature Selection and Dimensionality Reduction in Genomics and Proteomics

    Finding reliable, meaningful patterns in data with high numbers of attributes can be extremely difficult. Feature selection helps us decide which attributes or combinations of attributes are most important for finding these patterns. In this chapter, we study feature selection methods for building classification models from high-throughput genomic (microarray) and proteomic (mass spectrometry) data sets. Thousands of feature candidates must be analyzed, compared and combined in such data sets. We describe the basics of four different approaches used for feature selection and illustrate their effects on an MS cancer proteomic data set. The closing discussion offers guidance on analyzing high-dimensional genomic and proteomic data.

    Efficiency Analysis of Competing Tests for Finding Differentially Expressed Genes in Lung Adenocarcinoma

    In this study, we introduce and use Efficiency Analysis to compare differences in the apparent internal and external consistency of competing normalization methods and tests for identifying differentially expressed genes. Two publicly available lung adenocarcinoma datasets were analyzed using caGEDA (http://bioinformatics2.pitt.edu/GE2/GEDA.html) to measure the degree of differential expression of genes between two populations. The datasets were randomly split into at least two subsets, each subset was analyzed for differentially expressed genes between the two sample groups, and the resulting gene lists were compared for overlapping genes. Efficiency Analysis is an intuitive method that compares the differences in the percentage of overlap of genes from two or more data subsets, found by the same test, over a range of testing methods. Tests that yield consistent gene lists across independently analyzed splits are preferred to those that yield less consistent inferences. For example, a method that exhibits 50% overlap in the top 100 genes from two studies should be preferred to a method that exhibits 5% overlap in the top 100 genes. The same procedure was performed using all normalization and transformation methods available through caGEDA. The ‘best’ test was then further evaluated using internal cross-validation to estimate generalizable sample classification errors using a Naïve Bayes classification algorithm. A novel test, termed D1 (a derivative of the J5 test), was found to be the most consistent and to exhibit the lowest overall classification error and the highest sensitivity and specificity. The D1 test relaxes the assumption that few genes are differentially expressed. Efficiency Analysis can be misleading if the tests exhibit a bias in any particular dimension (e.g. expression intensity); we therefore explored intensity-scaled and segmented J5 tests using data in which all genes are scaled to share the same intensity distribution range. Efficiency Analysis correctly predicted the ‘best’ test and normalization method using the Beer dataset and also performed well with the Bhattacharjee dataset based on both efficiency and classification accuracy criteria.
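
The overlap measure at the heart of the abstract above can be sketched in a few lines of Python. This is only an illustration of "percentage of overlap in the top N genes from two data subsets", not caGEDA's implementation: the function name `top_n_overlap`, the gene names, and the scores are all hypothetical, and the sketch assumes that a higher score means stronger evidence of differential expression.

```python
def top_n_overlap(scores_a, scores_b, n):
    """Percentage of genes shared by the top-n lists of two data subsets.

    scores_a and scores_b map gene name -> test score (assumed: higher
    score means stronger evidence of differential expression).
    """
    # Rank genes by descending score within each subset and keep the top n.
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:n])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:n])
    # Shared genes as a percentage of the list length n.
    return 100.0 * len(top_a & top_b) / n


# Hypothetical scores for six genes in two random splits of one dataset.
split_1 = {"g1": 9.1, "g2": 7.4, "g3": 6.8, "g4": 2.2, "g5": 1.9, "g6": 0.3}
split_2 = {"g1": 8.7, "g2": 1.1, "g3": 7.9, "g4": 6.5, "g5": 0.8, "g6": 0.2}

overlap = top_n_overlap(split_1, split_2, n=3)
print(f"top-3 overlap: {overlap:.1f}%")
```

Under this reading, a test would be scored by repeating the calculation over many random splits and comparing the resulting overlap percentages across tests.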

    Null model selection, compositional bias, character state bias, and the limits of phylogenetic information

    Evolutionary trends and processes can distort phylogenetic information in sequences such that they do not reliably reflect the evolutionary processes that generate them. This fact of molecular evolution has a ubiquitous influence on the ability of researchers to adequately reconstruct genealogical relationships and histories of the processes of molecular evolution. This feature of phylogenetic inference can limit the capacity of researchers to adequately specify a relevant null hypothesis for testing hypotheses of relationships, data informativeness, and processes of molecular evolution. We show how this feature of historical inference also influences the exactness of the relative apparent synapomorphy analysis (RASA) test for phylogenetic signal and demonstrate how a permutation modification of the null hypothesis can improve the robustness of the underlying distributional assumption of the test. The RASA test (using either null model) was found not only to appropriately reject the combinability of independent lines of evidence for the relationships among the Physalaemus pustulosus frog species group, but also to be more appropriately sensitive to individual uninformative data sets than commonly used tree-based measures of signal, including the consistency index, the retention index, and the permutation tail probability test statistic.

    Tests for finding complex patterns of differential expression in cancers: towards individualized medicine

    BACKGROUND: Microarray studies in cancer compare expression levels between two or more sample groups on thousands of genes. Data analysis follows a population-level approach (e.g., comparison of sample means) to identify differentially expressed genes. This leads to the discovery of 'population-level' markers, i.e., genes with the expression patterns A > B and B > A. We introduce the PPST test, which identifies genes where a significantly large subset of cases exhibits expression values beyond upper and lower thresholds observed in the control samples. RESULTS: Interestingly, the test identifies A > B and B < A pattern genes that are missed by population-level approaches, such as the t-test, and many genes that exhibit both significant overexpression and significant underexpression in statistically significantly large subsets of cancer patients (ABA pattern genes). These patterns tend to show distributions that are unique to individual genes, and are aptly visualized in a 'gene expression pattern grid'. The low degree of among-gene correlations in these genes suggests unique underlying genomic pathologies and a high degree of unique tumor-specific differential expression. We compare the PPST and the ABA test to the parametric and non-parametric t-test by analyzing two independently published data sets from studies of progression in astrocytoma. CONCLUSIONS: The PPST test yielded findings similar to those of the nonparametric t-test, with higher self-consistency. These tests and the gene expression pattern grid may be useful for the identification of therapeutic targets and diagnostic or prognostic markers that are present only in subsets of cancer patients, and provide a more complete portrait of differential expression in cancer.

    Clinical decision modeling system

    Background: Decision analysis techniques can be applied in complex situations involving uncertainty and the consideration of multiple objectives. Classical decision modeling techniques require elicitation of too many parameter estimates and their conditional (joint) probabilities, and have therefore not been applied to the problem of identifying high-performance, cost-effective combinations of clinical options for diagnosis or treatment where many of the objectives are unknown or even unspecified. Methods: We designed a Java-based software resource, the Clinical Decision Modeling System (CDMS), to implement Naïve Decision Modeling, and provide a use case based on published performance evaluation measures of various strategies for breast and lung cancer detection. Because cost estimates for many of the newer methods are not yet available, we assume equal cost. Our use case reveals numerous potentially high-performance combinations of clinical options for the detection of breast and lung cancer. Results: Naïve Decision Modeling is a highly practical applied strategy which guides investigators through the process of establishing evidence-based integrative translational clinical research priorities. CDMS is not designed for clinical decision support. Inputs include performance evaluation measures and costs of various clinical options. The software finds trees with expected emergent performance characteristics and average cost per patient that meet stated filtering criteria. Key to the utility of the software are its sophisticated graphical elements, including a tree browser, a receiver-operator characteristic surface plot, and a histogram of expected average cost per patient. The analysis pinpoints the potentially most relevant pairs of clinical options ('critical pairs') for which empirical estimates of conditional dependence may be critical. The assumption of independence can be tested with retrospective studies prior to the initiation of clinical trials designed to estimate clinical impact. High-performance combinations of clinical options may exist for breast and lung cancer detection. Conclusion: The software could be useful in simplifying the objective-driven planning of complex integrative clinical studies without requiring a multi-attribute utility function, and it could lead to efficient integrative translational clinical study designs that move beyond simple pairwise competitive studies. Collaborators, who traditionally might compete to prioritize their own individual clinical options, can use the software as a common framework and guide to work together to improve understanding of the benefits of using alternative clinical combinations to affect strategic and cost-effective clinical workflows.

    The degree of segmental aneuploidy measured by total copy number abnormalities predicts survival and recurrence in superficial gastroesophageal adenocarcinoma

    Background: Prognostic biomarkers are needed for superficial gastroesophageal adenocarcinoma (EAC) to predict clinical outcomes and select therapy. Although recurrent mutations have been characterized in EAC, little is known about their clinical and prognostic significance. Aneuploidy is predictive of clinical outcome in many malignancies but has not been evaluated in superficial EAC. Methods: We quantified copy number changes in 41 superficial EAC using Affymetrix SNP 6.0 arrays. We identified recurrent chromosomal gains and losses and calculated the total copy number abnormality (CNA) count for each tumor as a measure of aneuploidy. We correlated CNA count with overall survival and time to first recurrence in univariate and multivariate analyses. Results: Recurrent segmental gains and losses involved multiple genes, including: HER2, EGFR, MET, CDK6, KRAS (recurrent gains); and FHIT, WWOX, CDKN2A/B, SMAD4, RUNX1 (recurrent losses). There was a 40-fold variation in CNA count across all cases. Tumors with the lowest and highest quartile CNA count had significantly better overall survival (p = 0.032) and time to first recurrence (p = 0.010) compared to those with intermediate CNA counts. These associations persisted when controlling for other prognostic variables. Significance: SNP arrays facilitate the assessment of recurrent chromosomal gain and loss and allow high resolution, quantitative assessment of segmental aneuploidy (total CNA count). The non-monotonic association of segmental aneuploidy with survival has been described in other tumors. The degree of aneuploidy is a promising prognostic biomarker in a potentially curable form of EAC. © 2014 Davison et al
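
The quartile grouping described in the abstract above (lowest and highest quartile CNA count versus intermediate counts) could be sketched as follows. This is a rough illustration, not the authors' analysis: the function name `cna_groups` and the CNA counts are made up, and the quartile cut points use the default method of Python's `statistics.quantiles`, which may differ from whatever definition the study used.

```python
import statistics


def cna_groups(counts):
    """Label each tumor 'extreme' (lowest or highest quartile of CNA
    count) or 'intermediate' (the middle two quartiles)."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(counts, n=4)
    return [
        "extreme" if c <= q1 or c >= q3 else "intermediate"
        for c in counts
    ]


# Made-up CNA counts for eight tumors, for illustration only.
counts = [5, 12, 20, 33, 41, 58, 76, 120]
groups = cna_groups(counts)

for count, group in zip(counts, groups):
    print(count, group)
```

Comparing survival between the two labels would then call for a proper time-to-event analysis, which this sketch does not attempt.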