Robustness of Random Forest-based gene selection methods
Gene selection is an important part of microarray data analysis because it
provides information that can lead to a better mechanistic understanding of an
investigated phenomenon. At the same time, gene selection is very difficult
because of the noisy nature of microarray data. As a consequence, gene
selection is often performed with machine learning methods. The Random Forest
method is particularly well suited for this purpose. In this work, four
state-of-the-art Random Forest-based feature selection methods were compared in
a gene selection context. The analysis focused on the stability of selection
because, although it is necessary for determining the significance of results,
it is often ignored in similar studies.
The comparison of post-selection accuracy in the validation of Random Forest
classifiers revealed that all investigated methods were equivalent in this
context. However, the methods substantially differed with respect to the number
of selected genes and the stability of selection. Of the analysed methods, the
Boruta algorithm predicted the most genes as potentially important.
The post-selection classifier error rate, which is a frequently used measure,
was found to be a potentially deceptive measure of gene selection quality. When
the number of consistently selected genes was considered, the Boruta algorithm
was clearly the best. Although it was also the most computationally intensive
method, the Boruta algorithm's computational demands could be reduced to levels
comparable to those of other algorithms by replacing the Random Forest
importance with a comparable measure from Random Ferns (a similar but
simplified classifier). Despite their design assumptions, the minimal optimal
selection methods were found to select a high fraction of false positives.
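As an illustration of what "stability of selection" can mean in practice, the following minimal sketch compares the gene sets returned by repeated runs of a selector on resampled data. The abstract does not state which stability measure the authors used; the mean pairwise Jaccard index and the `min_fraction` threshold here are assumptions chosen for illustration, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two gene sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(selections):
    """Mean pairwise Jaccard index over gene sets from repeated runs.

    `selections` is a list of iterables of gene identifiers, one per
    resampled run of some feature selector (assumed, not from the paper).
    """
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consistently_selected(selections, min_fraction=0.8):
    """Genes chosen in at least `min_fraction` of the runs."""
    counts = {}
    for selected in selections:
        for gene in set(selected):
            counts[gene] = counts.get(gene, 0) + 1
    n_runs = len(selections)
    return {g for g, c in counts.items() if c / n_runs >= min_fraction}
```

A selector can reach a low post-selection error rate while returning very different gene sets from run to run, which is why a count of consistently selected genes is a useful complement to classifier accuracy.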
Positive selection underlies Faster-Z evolution of gene expression in birds.
The elevated rate of evolution for genes on sex chromosomes compared to autosomes (Fast-X or Fast-Z evolution) can result either from positive selection in the heterogametic sex, or from non-adaptive consequences of reduced relative effective population size. Recent work in birds suggests that Fast-Z of coding sequence is primarily due to relaxed purifying selection resulting from reduced relative effective population size. However, gene sequence and gene expression are often subject to distinct evolutionary pressures; therefore, we tested for Fast-Z in gene expression using next-generation RNA-sequencing data from multiple avian species. Similar to studies of Fast-Z in coding sequence, we recover clear signatures of Fast-Z in gene expression; however, in contrast to coding sequence, our data indicate that Fast-Z in expression is due to positive selection acting primarily in females. In the soma, where gene expression is highly correlated between the sexes, we detected Fast-Z in both sexes, although at a higher rate in females, suggesting that many positively selected expression changes in females are also expressed in males. In the gonad, where inter-sexual correlations in expression are much lower, we detected Fast-Z for female gene expression but, crucially, not for male gene expression. This suggests that a large amount of expression variation is sex-specific in its effects within the gonad. Taken together, our results indicate that Fast-Z evolution of gene expression is the product of positive selection acting on recessive beneficial alleles in the heterogametic sex. More broadly, our analysis suggests that the adaptive potential of Z chromosome gene expression may be much greater than that of gene sequence, results which have important implications for the role of sex chromosomes in speciation and sexual selection.
The Roles of Gene Duplication, Gene Conversion and Positive Selection in Rodent Esp and Mup Pheromone Gene Families with Comparison to the Abp Family
Three proteinaceous pheromone families, the androgen-binding proteins (ABPs), the exocrine-gland secreting peptides (ESPs) and the major urinary proteins (MUPs) are encoded by large gene families in the genomes of Mus musculus and Rattus norvegicus. We studied the evolutionary histories of the Mup and Esp genes and compared them with what is known about the Abp genes. Apparently gene conversion has played little if any role in the expansion of the mouse Class A and Class B Mup genes and pseudogenes, and the rat Mups. By contrast, we found evidence of extensive gene conversion in many Esp genes although not in all of them. Our studies of selection identified at least two amino acid sites in β-sheets as having evolved under positive selection in the mouse Class A and Class B MUPs and in rat MUPs. We show that selection may have acted on the ESPs by determining Ka/Ks for Exon 3 sequences with and without the converted sequence segment. While it appears that purifying selection acted on the ESP signal peptides, the secreted portions of the ESPs probably have undergone much more rapid evolution. When the inner gene converted fragment sequences were removed, eleven Esp paralogs were present in two or more pairs with Ka/Ks > 1.0 and thus we propose that positive selection is detectable by this means in at least some mouse Esp paralogs. We compare and contrast the evolutionary histories of all three mouse pheromone gene families in light of their proposed functions in mouse communication.
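The Ka/Ks criterion mentioned above can be sketched as a simple decision rule: Ka is the rate of nonsynonymous substitutions per nonsynonymous site, Ks is the rate of synonymous substitutions per synonymous site, and a ratio above 1.0 is conventionally read as a signal of positive selection. The function below is a hypothetical helper for illustration only; it assumes Ka and Ks have already been estimated by some other tool, and it is not the authors' pipeline.

```python
def classify_ka_ks(ka, ks):
    """Return (ratio, label) for precomputed Ka and Ks estimates.

    Labels follow the conventional reading of the ratio:
    > 1.0 positive selection, < 1.0 purifying selection, == 1.0 neutral.
    """
    if ks <= 0:
        # Ks of zero makes the ratio undefined (no synonymous changes seen).
        raise ValueError("Ks must be positive to compute Ka/Ks")
    ratio = ka / ks
    if ratio > 1.0:
        label = "positive"
    elif ratio < 1.0:
        label = "purifying"
    else:
        label = "neutral"
    return ratio, label
```

In practice a ratio only slightly above 1.0 is weak evidence on its own, so a statistical test on the estimate is usually needed before drawing conclusions.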
Rapid Evolution of BRCA1 and BRCA2 in Humans and Other Primates
The maintenance of chromosomal integrity is an essential task of every living organism and cellular repair mechanisms exist to guard against insults to DNA. Given the importance of this process, it is expected that DNA repair proteins would be evolutionarily conserved, exhibiting very minimal sequence change over time. However, BRCA1, an essential gene involved in DNA repair, has been reported to be evolving rapidly despite the fact that many protein-altering mutations within this gene convey a significantly elevated risk for breast and ovarian cancers. Results: To obtain a deeper understanding of the evolutionary trajectory of BRCA1, we analyzed complete BRCA1 gene sequences from 23 primate species. We show that specific amino acid sites have experienced repeated selection for amino acid replacement over primate evolution. This selection has been focused specifically on humans and our closest living relatives, chimpanzees (Pan troglodytes) and bonobos (Pan paniscus). After examining BRCA1 polymorphisms in 7 bonobo, 44 chimpanzee, and 44 rhesus macaque (Macaca mulatta) individuals, we find considerable variation within each of these species and evidence for recent selection in chimpanzee populations. Finally, we also sequenced and analyzed BRCA2 from 24 primate species and find that this gene has also evolved under positive selection. Conclusions: While mutations leading to truncated forms of BRCA1 are clearly linked to cancer phenotypes in humans, there is also an underlying selective pressure in favor of amino acid-altering substitutions in this gene. A hypothesis where viruses are the drivers of this natural selection is discussed. National Institutes of Health R01-GM-093086, 8U42OD011197-13; National Science Foundation BCS-07115972; Burroughs Wellcome Fund; Molecular Bioscience
Exploiting the accumulated evidence for gene selection in microarray gene expression data
Machine Learning methods have recently made significant efforts toward solving multidisciplinary problems in the field of cancer classification using microarray gene expression data. Feature subset selection methods can play an important role in the modeling process, since these tasks are characterized by a large number of features and few observations, making the modeling a non-trivial undertaking. In this particular scenario, it is extremely important to select genes by taking into account the possible interactions with other gene subsets. This paper shows that, by accumulating the evidence in favour of (or against) each gene along the search process, the obtained gene subsets may constitute better solutions, either in terms of predictive accuracy or gene size, or in both. The proposed technique is extremely simple and applicable at a negligible overhead in cost.
