
    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    ABC-sampling for balancing imbalanced datasets based on artificial bee colony algorithm

    © 2015 IEEE. Class-imbalanced data is a common problem for predictive modelling in domains such as bioinformatics. It occurs when the distribution of classes is not uniform among samples, and it biases the learned model's predictions towards the majority classes. In this study, we propose the ABC-Sampling algorithm, based on a swarm optimization method called Artificial Bee Colony, which models the natural foraging behaviour of honeybees. Our algorithm lessens the effects of imbalanced classes by selecting the most informative majority samples using a forward search and storing them in a ranked subset. We then construct a balanced dataset using a planned undersampling strategy that extracts the most frequent majority samples from the top-ranked subset and combines them with all minority samples. Our algorithm outperforms a state-of-the-art method on nine benchmark datasets with various imbalance ratios.
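
The final balancing step described above (keep every minority sample, add an equal number of top-ranked majority samples) might be sketched as follows. This is only an illustration under stated assumptions: the `scores` argument is a hypothetical per-sample ranking supplied by the caller, and the Artificial Bee Colony search that ABC-Sampling uses to produce its ranking is not reproduced here.

```python
def undersample_by_rank(samples, labels, minority_label, scores):
    """Keep all minority samples plus the top-scoring majority samples.

    `scores` is a hypothetical informativeness score per sample (higher is
    better). It stands in for the ranked subset that ABC-Sampling builds;
    the bee-colony search itself is NOT implemented here.
    """
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]

    # Rank majority samples from most to least informative.
    ranked = sorted(majority, key=lambda i: scores[i], reverse=True)

    # Take as many majority samples as there are minority samples.
    keep = sorted(minority + ranked[:len(minority)])
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy data: 2 minority samples (label 1) and 8 majority samples (label 0).
X = list(range(10))
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.0, 0.0, 0.1, 0.9, 0.2, 0.8, 0.3, 0.4, 0.5, 0.6]
balanced_X, balanced_y = undersample_by_rank(X, y, 1, scores)
```

With the toy scores above, the two highest-ranked majority samples are retained alongside both minority samples, so the returned labels contain two of each class.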

    Predicting Pancreatic Cancer Using Support Vector Machine

    This report presents an approach to predicting pancreatic cancer using the Support Vector Machine classification algorithm. The research objective of this project is to predict pancreatic cancer from genomic data alone, from clinical data alone, and from a combination of genomic and clinical data. We used real genomic data having 22,763 samples and 154 features per sample. We also created synthetic clinical data having 400 samples and 7 features per sample in order to measure prediction accuracy on clinical data alone. To validate the hypothesis, we combined the synthetic clinical data with a subset of features from the real genomic data. In our results, prediction accuracy, precision, and recall with genomic data alone were 80.77%, 20%, and 4%, respectively. With synthetic clinical data alone, they were 93.33%, 95%, and 30%. For the combination of real genomic and synthetic clinical data, they were 90.83%, 10%, and 5%. Combining real genomic and synthetic clinical data decreased the accuracy since the genomic data is weakly correlated. Thus we conclude that the combination of genomic and clinical data does not improve pancreatic cancer prediction accuracy. A dataset with more significant genomic features might help to predict pancreatic cancer more accurately.
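
The gap between high accuracy and low recall reported above is typical when positive cases are rare. A minimal sketch of how the three metrics follow from confusion-matrix counts is given below; the counts are hypothetical and chosen only for illustration, not taken from the report.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical test set: 100 cases, of which 20 are truly positive.
# The classifier labels almost everything negative.
acc, prec, rec = classification_metrics(tp=1, fp=4, fn=19, tn=76)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%}")
```

In this toy case most of the accuracy comes from the 76 correctly rejected negatives, while only 1 of the 20 positives is found, which is why accuracy alone can be misleading on imbalanced data.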

    Exploring issues of balanced versus imbalanced samples in mapping grass community in the telperion reserve using high resolution images and selected machine learning algorithms

    Accurate vegetation mapping is essential for a number of reasons, one of which is conservation. The main objective of this research was to map different grass communities in the game reserve using RapidEye and Sentinel-2 MSI images and two machine learning classifiers, support vector machine (SVM) and random forest (RF). The study tested the impact of balanced and imbalanced training data on the performance and accuracy of SVM and RF in mapping the grass communities, and the sensitivity of pixel resolution to balanced and imbalanced training data in image classification. The imbalanced and balanced data sets were obtained through field data collection. The results show that RF and SVM produce a high overall accuracy for Sentinel-2 imagery for both the balanced and imbalanced data sets. The RF classifier yielded an overall accuracy of 79.45% and kappa of 74.38% using imbalanced training data, and an overall accuracy of 76.19% and kappa of 73.21% using balanced training data. The SVM classifier yielded an overall accuracy of 82.54% and kappa of 80.36% using imbalanced training data, and an overall accuracy of 82.21% and kappa of 78.33% using balanced training data. For the RapidEye imagery, the balanced data set reduced the overall accuracy of both the RF and SVM algorithms. The RF algorithm had an overall accuracy that dropped by 6% (from 63.24% to 57.94%) while the SVM dropped by 7% (from 57.31% to 50.79%). The results thereby show that the imbalanced data set is a better option than the balanced data set for image classification of vegetation species. The study recommends implementing ways of handling misclassification among the different grass species to improve classification in future research. Further research can be carried out on other types of high-resolution multispectral imagery using different advanced algorithms and different training sample sizes.
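
Overall accuracy and kappa, the two figures quoted above, are both derived from a classification confusion matrix. The sketch below shows the standard Cohen's kappa calculation on a small hypothetical two-class matrix; the numbers are illustrative only and are not taken from the study.

```python
def overall_accuracy_and_kappa(matrix):
    """Compute overall accuracy and Cohen's kappa from a square confusion matrix.

    matrix[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)

    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total

    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)

    if expected == 1:  # degenerate case: only one class present
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical two-class confusion matrix (rows: reference, columns: predicted).
oa, kappa = overall_accuracy_and_kappa([[40, 10], [5, 45]])
```

Kappa discounts the agreement that would be expected by chance, so it is always at or below the overall accuracy for a classifier that does better than chance.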

    Refining, Testing, and Applying Thermal Species Distribution Models to Enhance Ecological Assessments

    The temperature of streams and rivers is changing rapidly in response to a variety of human activities. This rapid change is concerning because the abundances and distributions of many aquatic species in streams and rivers are strongly associated with temperature. Linking observations of temperature effects on species distributions with observations of temperature effects on fitness is important for improving confidence that temperature (and not some other variable) is causing the distributions we observe. Furthermore, producing accurate models of temperature effects on species distributions may allow us to develop tools to diagnose whether or not thermal pollution has impaired aquatic life. Such a diagnostic tool could help us better target management efforts on the specific stressors impairing aquatic life. In chapter two, I describe several laboratory experiments designed to link the effects of temperature observed in the field with the effects of temperature observed in the laboratory. I found that the effects of temperature on survival were correlated with the thermal limits inferred from species distributions, which supports the hypothesis that temperature influences distributions by affecting the survival of species. In chapters three and four, I assessed two techniques that could potentially improve our ability to model relationships between temperature and distributions. In chapter three, I show that methods for dealing with imbalanced data broadly improved our ability to model the relationship between predictor variables (temperature and other variables) and species distributions. In chapter four, I evaluated a recently developed technique (deep artificial neural networks) for modeling large complex datasets. I found that deep artificial neural networks did not improve predictions over those of standard artificial neural networks and random forest models.
    In chapter five, I developed and evaluated a diagnostic biotic index for diagnosing the likelihood that temperature has affected macroinvertebrate species in streams and rivers. This index showed that 2.6% of streams across the continental United States had species with thermal tolerances higher than expected compared with thermally undisturbed conditions.