17 research outputs found

    Predicting rice phenotypes with meta and multi-target learning

    The features in some machine learning datasets can naturally be divided into groups. This is the case with genomic data, where features can be grouped by chromosome. In many applications it is common for these groupings to be ignored, as interactions may exist between features belonging to different groups. However, including a group that does not influence a response introduces noise when fitting a model, leading to suboptimal predictive accuracy. Here we present two general frameworks for the generation and combination of meta-features when feature groupings are present. Furthermore, we make comparisons to multi-target learning, given that one is typically interested in predicting multiple phenotypes. We evaluated the frameworks and multi-target learning approaches on a genomic rice dataset where the regression task is to predict plant phenotype. Our results demonstrate that there are use cases for both the meta and multi-target approaches, given that overall, they significantly outperform the base case.

    GC3 biology in corn, rice, sorghum and other grasses

    Background: The third, or wobble, position in a codon provides a high degree of possible degeneracy and is an elegant fault-tolerance mechanism. Nucleotide biases between organisms at the wobble position have been documented and correlated with the abundances of the complementary tRNAs. We and others have noticed a bias for cytosine and guanine at the third position in a subset of transcripts within a single organism. The bias is present in some plant species and warm-blooded vertebrates but not in all plants, or in invertebrates or cold-blooded vertebrates. Results: Here we demonstrate that in certain organisms the amount of GC at the wobble position (GC3) can be used to distinguish two classes of genes. We highlight the following features of genes with high GC3 content: they (1) provide more targets for methylation, (2) exhibit more variable expression, (3) more frequently possess upstream TATA boxes, (4) are predominant in certain classes of genes (e.g., stress responsive genes) and (5) have a GC3 content that increases from 5' to 3'. These observations led us to formulate a hypothesis to explain GC3 bimodality in grasses. Conclusions: Our findings suggest that high levels of GC3 typify a class of genes whose expression is regulated through DNA methylation or are a legacy of accelerated evolution through gene conversion. We discuss the three most probable explanations for GC3 bimodality: biased gene conversion, transcriptional and translational advantage and gene methylation.
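
    The GC3 quantity in the abstract above is the share of codons whose third (wobble) base is G or C. A minimal sketch of that calculation follows, assuming an in-frame coding sequence given as a plain string; the function name `gc3_content` is hypothetical, and the abstract does not say whether the authors exclude stop codons or ambiguous bases, so this version simply counts every complete codon.

```python
def gc3_content(cds: str) -> float:
    """Return the fraction of complete codons whose third base is G or C.

    Assumes `cds` is an in-frame coding sequence starting at codon position 1.
    Trailing bases that do not form a complete codon are ignored.
    """
    seq = cds.upper()
    # Index 2, 5, 8, ... is the third base of each complete codon.
    third_bases = seq[2::3]
    if not third_bases:
        raise ValueError("sequence is shorter than one codon")
    gc_count = sum(base in "GC" for base in third_bases)
    return gc_count / len(third_bases)
```

    Computing this value for every transcript in a genome and plotting the distribution is one way to look for the two gene classes the abstract describes.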

    Insights into corn genes derived from large-scale cDNA sequencing

    We present a large portion of the transcriptome of Zea mays, including ESTs representing 484,032 cDNA clones from 53 libraries and 36,565 fully sequenced cDNA clones, out of which 31,552 clones are non-redundant. These and other previously sequenced transcripts have been aligned with available genome sequences and have provided new insights into the characteristics of gene structures and promoters within this major crop species. We found that although the average number of introns per gene is about the same in corn and Arabidopsis, corn genes have more alternatively spliced isoforms. Examination of the nucleotide composition of coding regions reveals that corn genes, as well as genes of other Poaceae (grass family), can be divided into two classes according to the GC content at the third position in the amino acid encoding codons. Many of the transcripts that have lower GC content at the third position have dicot homologs but the high GC content transcripts tend to be more specific to the grasses. The high GC content class is also enriched with intronless genes. Together this suggests that an identifiable class of genes in plants is associated with the Poaceae divergence. Furthermore, because many of these genes appear to be derived from ancestral genes that do not contain introns, this evolutionary divergence may be the result of horizontal gene transfer from species not only with different codon usage but possibly that did not have introns, perhaps outside of the plant kingdom. By comparing the cDNAs described herein with the non-redundant set of corn mRNAs in GenBank, we estimate that there are about 50,000 different protein coding genes in Zea. All of the sequence data from this study have been submitted to DDBJ/GenBank/EMBL under accession numbers EU940701–EU977132 (FLI cDNA) and FK944382–FL482108 (EST).

    Fast Protein Fold Recognition via Sequence to Structure Alignment and Contact Capacity Potentials

    We propose new empirical scoring potentials and associated alignment procedures for optimally aligning protein sequences to protein structures. The method has two main applications: first, the recognition of a plausible fold for a protein sequence of unknown structure out of a database of representative protein structures and, second, the improvement of sequence alignments by using structural information in order to find a better starting point for homology based modelling. The empirical scoring function is derived from an analysis of a non-redundant database of known structures by converting relative frequencies into pseudoenergies using a normalization according to the inverse Boltzmann law. These so-called contact capacity potentials turn out to be discriminative enough to detect structural folds in the absence of significant sequence similarity and at the same time simple enough to allow for a very fast optimization in an alignment procedure.
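
    The conversion of relative frequencies into pseudoenergies mentioned in the abstract above usually takes the generic inverse Boltzmann form shown below. This is the textbook expression, not necessarily the authors' exact normalization: the symbols f_obs (observed relative frequency of a feature x in the structure database), f_ref (its frequency in a chosen reference state) and the constant kT are assumptions introduced here for illustration.

```latex
E(x) = -kT \,\ln \frac{f_{\mathrm{obs}}(x)}{f_{\mathrm{ref}}(x)}
```

    Under this form, features observed more often than the reference state predicts receive negative (favourable) pseudoenergies, and rarer ones receive positive values.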

    Discovery of genomic variants associated with genebank historical traits for rice improvement: SNP and indel data, phenotypic data, and GWAS results

    This dataset provides supporting information for Sanciangco et al. (submitted) consisting of: A) file list, tables of phenotypes for quantitative and categorical traits and trait descriptions, and tables of SNP/indel numbers for Filtered, LD-pruned and subpopulation datasets (7 files named as "00_*"); B) plink files for Filtered and LD-pruned SNP/indel datasets for all genotypes and for indica, japonica and aus subsets (15 files named as "01_*"); C) EMMAX results on Filtered dataset for 12 quantitative traits on All, Aus, Indica, and Japonica genotypes and corresponding Manhattan and QQ plots (144 files named as "0[2345]_*"); D) EMMAX results on LD-pruned dataset for 12 quantitative traits on All, Aus, Indica, and Japonica genotypes and corresponding Manhattan and QQ plots (72 files named as "0[6789]_*"); E) EMMAX results on LD-pruned dataset for 20 categorical traits treated as numeric on All genotypes and corresponding Manhattan and QQ plots (60 files named as "10_*"); F) ANOVA results obtained on numerically transformed LD-pruned dataset for 20 categorical traits on All genotypes and corresponding Manhattan plots (40 files named as "11_*").