    Feature Selection and Dimensionality Reduction in Genomics and Proteomics

    Finding reliable, meaningful patterns in data with large numbers of attributes can be extremely difficult. Feature selection helps us decide which attributes or combinations of attributes are most important for finding these patterns. In this chapter, we study feature selection methods for building classification models from high-throughput genomic (microarray) and proteomic (mass spectrometry) data sets. Thousands of feature candidates must be analyzed, compared and combined in such data sets. We describe the basics of four different approaches used for feature selection and illustrate their effects on an MS cancer proteomic data set. The closing discussion provides guidance on performing an analysis of high-dimensional genomic and proteomic data.
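
The abstract does not name the four approaches it covers. As one common illustration of the general idea, the following is a minimal sketch of a filter-style selection that ranks each feature by the absolute Welch t-statistic between two classes and keeps the top k. The function names, the choice of statistic, and the toy data are assumptions made for this example, not the chapter's method.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t-statistic for two lists of values (each of length >= 2)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    # A zero standard error means both groups are constant; treat as no signal.
    return 0.0 if se == 0 else (mean(a) - mean(b)) / se


def rank_features(samples, labels, k):
    """Return the indices of the k features with the largest |t| between classes 0 and 1.

    samples: list of rows, one row of feature values per sample (hypothetical layout).
    labels:  list of 0/1 class labels, one per sample.
    """
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row, y in zip(samples, labels) if y == 0]
        b = [row[j] for row, y in zip(samples, labels) if y == 1]
        scores.append((abs(welch_t(a, b)), j))
    # Sort by descending score; break ties by feature index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data: only the middle feature separates the two classes.
samples = [
    [1.0, 10.0, 5.0], [1.1, 10.5, 4.0], [0.9, 9.5, 6.0],
    [1.0, 20.0, 5.0], [1.2, 20.5, 4.5], [0.8, 19.5, 5.5],
]
labels = [0, 0, 0, 1, 1, 1]
selected = rank_features(samples, labels, k=1)
```

With thousands of candidate features, a ranking like this would normally be computed inside each cross-validation fold so that the selection step does not leak information from held-out samples.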

    Efficiency Analysis of Competing Tests for Finding Differentially Expressed Genes in Lung Adenocarcinoma

    In this study, we introduce and use Efficiency Analysis to compare differences in the apparent internal and external consistency of competing normalization methods and tests for identifying differentially expressed genes. Using publicly available data, two lung adenocarcinoma datasets were analyzed using caGEDA (http://bioinformatics2.pitt.edu/GE2/GEDA.html) to measure the degree of differential expression of genes between two populations. The datasets were randomly split into at least two subsets, each analyzed for differentially expressed genes between the two sample groups, and the gene lists compared for overlapping genes. Efficiency Analysis is an intuitive method that compares the differences in the percentage of overlap of genes from two or more data subsets, found by the same test over a range of testing methods. Tests that yield consistent gene lists across independently analyzed splits are preferred to those that yield less consistent inferences. For example, a method that exhibits 50% overlap in the top 100 genes from two studies should be preferred to a method that exhibits 5% overlap in the top 100 genes. The same procedure was performed using all normalization and transformation methods available through caGEDA. The ‘best’ test was then further evaluated using internal cross-validation to estimate generalizable sample classification errors using a Naïve Bayes classification algorithm. A novel test, termed D1 (a derivative of the J5 test), was found to be the most consistent, and to exhibit the lowest overall classification error and the highest sensitivity and specificity. The D1 test relaxes the assumption that few genes are differentially expressed. Efficiency Analysis can be misleading if the tests exhibit a bias in any particular dimension (e.g. expression intensity); we therefore explored intensity-scaled and segmented J5 tests using data in which all genes are scaled to share the same intensity distribution range. Efficiency Analysis correctly predicted the ‘best’ test and normalization method using the Beer dataset and also performed well with the Bhattacharjee dataset based on both efficiency and classification accuracy criteria.
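
The overlap comparison at the heart of Efficiency Analysis can be sketched as follows. This is a minimal illustration, assuming each test has already produced a ranked gene list for each of two data splits; the function names, the test names, and the gene identifiers are hypothetical and are not part of caGEDA.

```python
def top_n_overlap(ranked_a, ranked_b, n):
    """Percentage of the top-n genes shared by two ranked gene lists."""
    if n <= 0:
        raise ValueError("n must be positive")
    shared = set(ranked_a[:n]) & set(ranked_b[:n])
    return 100.0 * len(shared) / n


def most_consistent_test(rankings_by_test, n):
    """Pick the test whose top-n lists agree most across two splits.

    rankings_by_test maps a test name to a pair
    (ranked genes from split 1, ranked genes from split 2).
    Returns (best test name, {test name: overlap percentage}).
    """
    overlaps = {
        name: top_n_overlap(a, b, n) for name, (a, b) in rankings_by_test.items()
    }
    best = max(overlaps, key=overlaps.get)
    return best, overlaps


# Toy example mirroring the 50% vs. 5% comparison in the abstract.
split1 = [f"gene{i}" for i in range(100)]
half_shared = split1[:50] + [f"other{i}" for i in range(50)]
few_shared = split1[:5] + [f"noise{i}" for i in range(95)]

best, overlaps = most_consistent_test(
    {"test_A": (split1, half_shared), "test_B": (split1, few_shared)}, n=100
)
```

In this toy case `test_A` would be preferred because its top-100 lists overlap far more across the two splits. As the abstract cautions, a high overlap can also reflect a shared bias (for example in expression intensity), so the overlap alone does not establish that a test is correct.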

    Tests for finding complex patterns of differential expression in cancers: towards individualized medicine

    BACKGROUND: Microarray studies in cancer compare expression levels between two or more sample groups on thousands of genes. Data analysis follows a population-level approach (e.g., comparison of sample means) to identify differentially expressed genes. This leads to the discovery of 'population-level' markers, i.e., genes with the expression patterns A > B and B > A. We introduce the PPST test, which identifies genes where a significantly large subset of cases exhibits expression values beyond upper and lower thresholds observed in the control samples. RESULTS: Interestingly, the test identifies A > B and B > A pattern genes that are missed by population-level approaches, such as the t-test, and many genes that exhibit both significant overexpression and significant underexpression in statistically significantly large subsets of cancer patients (ABA pattern genes). These patterns tend to show distributions that are unique to individual genes, and are aptly visualized in a 'gene expression pattern grid'. The low degree of among-gene correlations in these genes suggests unique underlying genomic pathologies and a high degree of unique tumor-specific differential expression. We compare the PPST and the ABA test to the parametric and non-parametric t-test by analyzing two independently published data sets from studies of progression in astrocytoma. CONCLUSIONS: The PPST test yielded findings similar to the non-parametric t-test, with higher self-consistency. These tests and the gene expression pattern grid may be useful for the identification of therapeutic targets and diagnostic or prognostic markers that are present only in subsets of cancer patients, and provide a more complete portrait of differential expression in cancer.
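
The abstract gives only the outline of the threshold idea, so the following is a rough sketch rather than the published PPST procedure. It assumes the thresholds are simply the control minimum and maximum and that significance is judged by a label permutation; both choices, and all names and data below, are assumptions made for illustration.

```python
import random


def count_beyond(cases, controls):
    """Count cases above the control maximum and below the control minimum."""
    upper, lower = max(controls), min(controls)
    above = sum(1 for x in cases if x > upper)
    below = sum(1 for x in cases if x < lower)
    return above, below


def permutation_p_value(cases, controls, n_perm=2000, seed=0):
    """One-sided permutation p-value for the number of cases beyond the control range.

    Case/control labels are shuffled n_perm times; the p-value is the share of
    shuffles whose count is at least as large as the observed count
    (with the usual +1 correction so the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    observed = sum(count_beyond(cases, controls))
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_cases, perm_controls = pooled[:n_cases], pooled[n_cases:]
        if sum(count_beyond(perm_cases, perm_controls)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy gene: some cases are overexpressed and some underexpressed
# relative to the control range, while the group means are similar.
controls = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0]
cases = [5.0, 8.0, 2.0, 5.1, 7.5, 1.5, 4.9, 5.0]

above, below = count_beyond(cases, controls)
p_value = permutation_p_value(cases, controls)
```

A gene with both `above` and `below` greater than zero resembles the ABA pattern described in the abstract: a comparison of means could miss it because the over- and underexpressed cases cancel out.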

    The colonization of land by animals: molecular phylogeny and divergence times among arthropods

    BACKGROUND: The earliest fossil evidence of terrestrial animal activity is from the Ordovician, ~450 million years ago (Ma). However, there are earlier animal fossils, and most molecular clocks suggest a deep origin of animal phyla in the Precambrian, leaving open the possibility that animals colonized land much earlier than the Ordovician. To further investigate the time of colonization of land by animals, we sequenced two nuclear genes, glyceraldehyde-3-phosphate dehydrogenase and enolase, in representative arthropods and conducted phylogenetic and molecular clock analyses of those and other available DNA and protein sequence data. To assess the robustness of animal molecular clocks, we estimated the deuterostome-arthropod divergence using the arthropod fossil record for calibration and tunicate instead of vertebrate sequences to represent Deuterostomia. Nine nuclear and 15 mitochondrial genes were used in phylogenetic analyses and 61 genes were used in molecular clock analyses. RESULTS: Significant support was found for the unconventional pairing of myriapods (millipedes and centipedes) with chelicerates (spiders, scorpions, horseshoe crabs, etc.) using nuclear and mitochondrial genes. Our estimated time for the divergence of millipedes (Diplopoda) and centipedes (Chilopoda) was 442 ± 50 Ma, and the divergence of insects and crustaceans was estimated as 666 ± 58 Ma. Our results also agree with previous studies suggesting a deep divergence (~1100–900 Ma) for arthropods and deuterostomes, considerably predating the Cambrian Explosion seen in the animal fossil record. CONCLUSIONS: The consistent support for a close relationship between myriapods and chelicerates, using mitochondrial and nuclear genes and different methods of analysis, suggests that this unexpected result is not an artefact of analysis. We propose the name Myriochelata for this group of animals, which includes many that immobilize prey with venom. Our molecular clock analyses using arthropod fossil calibrations support earlier studies using vertebrate calibrations in finding that deuterostomes and arthropods diverged hundreds of millions of years before the Cambrian explosion. However, our molecular time estimate for the divergence of millipedes and centipedes is close to the divergence time inferred from fossils. This suggests that arthropods may have adapted to the terrestrial environment relatively late in their evolutionary history.

    Null model selection, compositional bias, character state bias, and the limits of phylogenetic information

    Evolutionary trends and processes can distort phylogenetic information in sequences such that they do not reliably reflect the evolutionary processes that generate them. This fact of molecular evolution has a ubiquitous influence on the ability of researchers to adequately reconstruct genealogical relationships and histories of the processes of molecular evolution. This feature of phylogenetic inference can limit the capacity of researchers to adequately specify a relevant null hypothesis for testing hypotheses of relationships, data informativeness, and processes of molecular evolution. We show how this feature of historical inference also influences the exactness of the relative apparent synapomorphy analysis (RASA) test for phylogenetic signal and demonstrate how a permutation modification of the null hypothesis can improve the robustness of the underlying distributional assumption of the test. The RASA test (using either null model) was found not only to appropriately reject the combinability of independent lines of evidence for the relationships among the Physalaemus pustulosus frog species group, but also to be more appropriately sensitive to individual uninformative data sets than commonly used tree-based measures of signal, including the consistency index, the retention index, and the permutation tail probability test statistic.

    A novel SNP analysis method to detect copy number alterations with an unbiased reference signal directly from tumor samples

    BACKGROUND: Genomic instability in cancer leads to abnormal genome copy number alterations (CNA) as a mechanism underlying tumorigenesis. Using microarrays and other technologies, tumor CNA are detected by comparing tumor sample CN to normal reference sample CN. While advances in microarray technology have improved detection of copy number alterations, the increase in the number of measured signals, noise from array probes, variations in signal-to-noise ratio across batches and disparity across laboratories lead to significant limitations for the accurate identification of CNA regions when comparing tumor and normal samples. METHODS: To address these limitations, we designed a novel "Virtual Normal" algorithm (VN), which allowed for construction of an unbiased reference signal directly from test samples within an experiment using any publicly available normal reference set as a baseline, thus eliminating the need for an in-lab normal reference set. RESULTS: The algorithm was tested using an optimal, paired tumor/normal data set as well as previously uncharacterized pediatric malignant gliomas for which a normal reference set was not available. Using Affymetrix 250K Sty microarrays, we demonstrated improved signal-to-noise ratio and detected significant copy number alterations using the VN algorithm that were validated by independent PCR analysis of the target CNA regions. CONCLUSIONS: We developed and validated an algorithm to provide a virtual normal reference signal directly from tumor samples and minimize noise in the derivation of the raw CN signal. The algorithm reduces the variability of assays performed across different reagent and array batches, methods of sample preservation, multiple personnel, and among different laboratories. This approach may be valuable when matched normal samples are unavailable or the paired normal specimens have been subjected to variations in methods of preservation.

    Virtual karyotyping with SNP microarrays reduces uncertainty in the diagnosis of renal epithelial tumors

    BACKGROUND: Renal epithelial tumors are morphologically, biologically, and clinically heterogeneous. Different morphologic subtypes require specific management due to markedly different prognosis and response to therapy. Each common subtype has characteristic chromosomal gains and losses, including some with prognostic value. However, copy number information has not been readily accessible for clinical purposes and thus has not been routinely used in the diagnostic evaluation of these tumors. This information can be useful for classification of tumors with complex or challenging morphology. 'Virtual karyotypes' generated using SNP arrays can readily detect characteristic chromosomal lesions in paraffin embedded renal tumors and can be used to correctly categorize the common subtypes with performance characteristics that are amenable for routine clinical use. METHODS: To investigate the use of virtual karyotypes for diagnostically challenging renal epithelial tumors, we evaluated 25 archived renal neoplasms where sub-classification could not be definitively rendered based on morphology and other ancillary studies. We generated virtual karyotypes with the Affymetrix 10 K 2.0 mapping array platform and identified the presence of genomic lesions across all 22 autosomes. RESULTS: In 91% of challenging cases the virtual karyotype unambiguously detected the presence or absence of chromosomal aberrations characteristic of one of the common subtypes of renal epithelial tumors, while immunohistochemistry and fluorescent in situ hybridization had no or limited utility in the diagnosis of these tumors. CONCLUSION: These results show that virtual karyotypes generated by SNP arrays can be used as a practical ancillary study for the classification of renal epithelial tumors with complex or ambiguous morphology.