1,649 research outputs found

    Large Margin Random Forests On Mixed Type Data

    Incorporating various sources of biological information is important for biological discovery. For example, genes have a multi-view representation. They can be represented by features such as sequence length and physical-chemical properties. They can also be represented by pairwise similarities, gene expression levels, and phylogenetic position. Hence, the data types range from numerical to categorical features. An efficient way of learning from observations with a multi-view representation of mixed-type data is thus important. We propose a large margin random forests classification approach based on random forests proximity. Random forests accommodate mixed data types naturally. Large margin classifiers are obtained from the random forests proximity kernel or its derivative kernels. We test the approach on four biological datasets. The performance is promising compared with other state-of-the-art methods, including support vector machines (SVMs) and random forests classifiers. It demonstrates high potential in the discovery of functional roles of genes and proteins. We also examine the effects of mixed-type data on the algorithms used.
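
The proximity kernel mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the usual definition of random forest proximity (the fraction of trees in which two samples fall into the same leaf); the abstract does not spell out the authors' exact kernel or its derivative kernels. The function name `proximity_kernel` and the toy leaf assignments are hypothetical.

```python
from typing import List, Sequence


def proximity_kernel(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Return the n x n random-forest proximity matrix.

    leaves[i][t] is the index of the leaf that sample i reaches in tree t.
    Entry (i, j) is the fraction of trees in which samples i and j share a leaf.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Count the trees where both samples end up in the same leaf.
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            kernel[i][j] = kernel[j][i] = shared / n_trees
    return kernel


# Hypothetical leaf assignments: 3 samples routed through 4 trees.
toy_leaves = [
    [0, 2, 1, 3],
    [0, 2, 5, 3],
    [4, 1, 5, 0],
]
K = proximity_kernel(toy_leaves)
```

In practice the leaf indices could come from a fitted forest (for example, scikit-learn's `RandomForestClassifier.apply`), and the resulting matrix could be passed to a large margin classifier that accepts a precomputed kernel. Both of these are illustrative choices rather than details given in the abstract.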

    Classification of clinical outcomes using high-throughput and clinical informatics.

    It is widely recognized that many cancer therapies are effective only for a subset of patients. However, clinical studies are most often powered to detect an overall treatment effect. To address this issue, classification methods are increasingly being used to predict a subset of patients that responds differently to treatment. This study begins with a brief history of classification methods with an emphasis on applications involving melanoma. Nonparametric methods suitable for predicting subsets of patients responding differently to treatment are then reviewed. Each method has different ways of incorporating continuous, categorical, clinical and high-throughput covariates. For nonparametric and parametric methods, distance measures specific to the method are used to make classification decisions. Approaches are outlined which employ these distances to measure treatment interactions and predict patients more sensitive to treatment. Simulations are also carried out to examine the empirical power of some of these classification methods in an adaptive signature design. Results were compared with logistic regression models. It was found that parametric and nonparametric methods performed reasonably well. Relative performance of the methods depends on the simulation scenario. Finally, a method was developed to evaluate the power and sample size needed for an adaptive signature design in order to predict the subset of patients sensitive to treatment. It is hoped that this study will stimulate more development of nonparametric and parametric methods to predict subsets of patients responding differently to treatment.

    Laurel Wilt Disease: Early Detection through Canine Olfaction and Omics Insights into Disease Progression

    Laurel wilt disease is a vascular wilt affecting the xylem and water conductivity in trees belonging to the family Lauraceae. The disease was introduced by an invasive species of ambrosia beetle, Xyleborus glabratus. The beetle, together with its newly described fungal symbiont Raffaelea lauricola (pathogenic to host trees), has led to the devastation and destruction of over 300 million wild redbay trees in southeastern forests. Ambrosia beetles make up a unique clade of beetles and share a co-evolved obligatory mutualistic relationship with their partner fungi. Rather than consuming host tree material, the beetles excavate galleries or canals within the trees. These galleries serve two purposes: reproduction and fungal gardening. The beetles house fungal spores within specialized sacs, mycangia, and essentially inoculate host trees with the pathogenic agent. They actively grow and cultivate gardens of the fungus in galleries to serve as their sole food source. Once the fungus reaches the xylem vessels of the host tree, it thrives and leads to the blockage of water flow, both because of fungal accumulation and because of the host response of secreting gels, gums and tyloses to occlude vessels in an attempt to quarantine the fungus. This disease spreads rapidly, and as a result, once symptoms become visible to the naked eye, it is already too late to save the tree, and the disease has likely already spread to adjacent ones. This work presents the first documented study involving the early detection of disease from deep within a tree through the use of scent-discriminating canines. In addition, the study has led to the development of a novel sample collection device enabling the non-destructive sampling of beetle galleries. Finally, a metabolomics approach revealed key biochemical pathway modifications in the disease state, as well as potential clues to disease development.

    Transcription factor expression as a predictor of colon cancer prognosis: a machine learning practice

    Background: Colon cancer is one of the leading causes of cancer deaths in the USA and around the world. Molecular-level characteristics, such as gene expression levels and mutations, may provide valuable information for precision treatment beyond pathological indicators. Transcription factors function as critical regulators in all aspects of cell life, but transcription factor-based biomarkers for colon cancer prognosis remain rare and are needed. Methods: We implemented an innovative process to select the transcription factor variables and evaluate their prognostic prediction power by combining the Cox PH model with the random forest algorithm. We picked five top-ranked transcription factors and built a prediction model by using Cox PH regression. Using Kaplan-Meier analysis, we validated our predictive model on four independent publicly available datasets (GSE39582, GSE17536, GSE37892, and GSE17537) from the GEO database, consisting of 925 colon cancer patients. Results: A predictive model for colon cancer prognosis based on five transcription factors has been developed by using TCGA colon cancer patient data. The five transcription factors identified for the predictive model are HOXC9, ZNF556, HEYL, HOXC4 and HOXC6. The prediction power of the model is validated with four GEO datasets consisting of 1584 patient samples. Kaplan-Meier curves and log-rank tests were computed on both training and validation datasets; the difference in overall survival time between predicted low-risk and high-risk groups can be clearly observed. Gene set enrichment analysis was performed to further investigate the difference between low-risk and high-risk groups at the gene pathway level. The biological meaning was interpreted. Overall, our results show that our prediction model has strong prediction power for colon cancer prognosis. Conclusions: Transcription factors can be used to construct colon cancer prognostic signatures with strong prediction power. The variable selection process used in this study has the potential to be implemented in the prognostic signature discovery of other cancer types. Our five-TF-based predictive model would help with understanding the hidden relationship between colon cancer patient survival and transcription factor activities. It will also provide more insights into the precision treatment of colon cancer patients from a genomic information perspective.
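
The Kaplan-Meier analysis named in the abstract above can be sketched as follows. This is a minimal, generic product-limit estimator, not the authors' pipeline; the function name `kaplan_meier` and the toy follow-up data are hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return [(event_time, survival_probability)] for each distinct event time.

    times[i] is the follow-up time of patient i; events[i] is True if the
    event (death) was observed and False if the patient was censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Everyone with time t (event or censored) leaves the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up data: five patients, two of them censored.
toy_times = [2, 3, 3, 5, 8]
toy_events = [True, True, False, True, False]
km_curve = kaplan_meier(toy_times, toy_events)
```

Computing one such curve per predicted risk group and comparing the curves with a log-rank test would mirror the validation step described above.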

    Testing Conditional Independence in Supervised Learning Algorithms

    We propose the conditional predictive impact (CPI), a consistent and unbiased estimator of the association between one or several features and a given outcome, conditional on a reduced feature set. Building on the knockoff framework of Candès et al. (2018), we develop a novel testing procedure that works in conjunction with any valid knockoff sampler, supervised learning algorithm, and loss function. The CPI can be efficiently computed for high-dimensional data without any sparsity constraints. We demonstrate convergence criteria for the CPI and develop statistical inference procedures for evaluating its magnitude, significance, and precision. These tests aid in feature and model selection, extending traditional frequentist and Bayesian techniques to general supervised learning tasks. The CPI may also be applied in causal discovery to identify underlying multivariate graph structures. We test our method using various algorithms, including linear regression, neural networks, random forests, and support vector machines. Empirical results show that the CPI compares favorably to alternative variable importance measures and other nonparametric tests of conditional independence on a diverse array of real and simulated datasets. Simulations confirm that our inference procedures successfully control Type I error and achieve nominal coverage probability. Our method has been implemented in an R package, cpi, which can be downloaded from https://github.com/dswatson/cpi
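
The idea in the abstract above can be illustrated with a rough sketch. It assumes the CPI is estimated as the mean per-sample increase in loss when a feature is replaced by its knockoff, with a t-statistic built from the paired differences. That formulation is an assumption made here for illustration, and this is not the API of the `cpi` R package. The function name `cpi_estimate` and the loss values are hypothetical; the knockoff sampler and the fitted model are taken as given.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def cpi_estimate(loss_original: Sequence[float],
                 loss_knockoff: Sequence[float]) -> Tuple[float, float, float]:
    """Return (cpi, standard_error, t_statistic) from paired per-sample losses.

    loss_original[i]: test loss for sample i using the real feature.
    loss_knockoff[i]: test loss for sample i with the feature replaced by
    its knockoff copy (assumed to be supplied by the caller).
    """
    if len(loss_original) != len(loss_knockoff):
        raise ValueError("loss vectors must have the same length")
    # Positive differences mean the model got worse without the real feature.
    deltas = [k - o for o, k in zip(loss_original, loss_knockoff)]
    cpi = mean(deltas)
    se = stdev(deltas) / math.sqrt(len(deltas))
    t_stat = cpi / se if se > 0 else float("nan")
    return cpi, se, t_stat


# Hypothetical per-sample losses on four held-out observations.
toy_original = [0.10, 0.20, 0.15, 0.30]
toy_knockoff = [0.30, 0.25, 0.40, 0.35]
cpi, se, t_stat = cpi_estimate(toy_original, toy_knockoff)
```

A large positive t-statistic would suggest that the feature carries predictive information beyond the remaining features. A one-sided test against the t distribution with n - 1 degrees of freedom is one natural way to turn it into a p-value.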

    The Evolution of Diversity

    Since the beginning of time, the pre-biological and the biological world have seen a steady increase in complexity of form and function based on a process of combination and re-combination. The current modern synthesis of evolution, known as the neo-Darwinian theory, emphasises population genetics and does not satisfactorily explain all other occurrences of evolutionary novelty. The authors suggest that symbiosis and hybridisation, and more obscure processes such as polyploidy, chimerism and lateral transfer, are mostly overlooked and not featured sufficiently within evolutionary theory. They suggest, therefore, a revision of the existing theory, including its language, to accommodate the scientific findings of recent decades.

    Potential transgenic routes to increase tree biomass

    Biomass is a prime target for genetic engineering in forestry because increased biomass yield will benefit most downstream applications such as timber, fiber, pulp, paper, and bioenergy production. Transgenesis can increase biomass by improving resource acquisition and product utilization and by enhancing competitive ability for solar energy, water, and mineral nutrients. Transgenes that affect juvenility, winter dormancy, and flowering have been shown to influence biomass as well. Transgenic approaches have increased yield potential by mitigating the adverse effects of prevailing stress factors in the environment. Simultaneous introduction of multiple genes for resistance to various stress factors into trees may help forest trees cope with multiple or changing environments. We propose multi-trait engineering for tree crops, simultaneously deploying multiple independent genes to address a set of genetically uncorrelated traits that are important for crop improvement. This strategy increases the probability of unpredictable (synergistic or detrimental) interactions that may substantially affect the overall phenotype and its long-term performance. The very limited ability to predict the physiological processes that may be impacted by such a strategy requires vigilance and care during implementation. Hence, we recommend close monitoring of the resultant transgenic genotypes in multi-year, multi-location field trials.

    Inter-sexual and inter-seasonal differences in the chemical signalling strategies of brown bears

    The brown bear (Ursus arctos) is a species which, due to its solitary, dominance hierarchy social system and large home range, is thought to rely heavily on chemical signals as a means of communication. Using camera traps oriented towards bear ‘rub trees’ over a two-year period, we assessed the proportional contribution of scent marking in different seasons by different age-sex classes, and gained insights into the role of chemical signalling in maintaining social structure. We found, during the breeding season (June-July), that both adult males (n=38 P<…) and females with cubs >1 year (n=11 P=0.003) scent marked trees significantly more often than expected, whereas lone adult females (n=7) and subadults (n=3) marked less than expected. Outside of the breeding season (August-October), adult males (n=70) marked in an expected proportion, females with cubs (all ages) marked significantly more than expected (n=71 P<0.001), and lone adult females (n=11) and subadults (n=15) marked less than expected. During both the breeding season (n=7 P=0.026) and the fall (n=11 P<0.001), adult females marked trees significantly less than their occurrence on bear trails would predict, as did subadults during the breeding season (n=3 P=0.026) but not during the fall (n=15). Adult males marked at significantly high frequencies both during and outside of the breeding season, potentially to communicate dominance between males, an interpretation supported by the low frequency of scent marking by subadults. We observed a total avoidance of bear trails containing active rub trees by females with cubs <1 year during the breeding season, a possible counterstrategy to sexually selected infanticide given the strong male bias in scent marking during the breeding season. We hypothesize that scent marking in brown bears is taught by the mother, beginning with cubs <1 year outside of the breeding season at a relatively ‘safe’ time of year.