752 research outputs found

    Predicting Protein Contact Map By Bagging Decision Trees

    Proteins' function and structure are intrinsically related. In order to understand proteins' functionality, it is essential for medical and biological researchers to determine proteins' three-dimensional structure. Traditional methods using NMR spectroscopy or X-ray crystallography are inefficient compared to computational methods. Fortunately, substantial progress has been made in the prediction of protein structure in bioinformatics. Despite these achievements, the computational complexity of protein folding remains a challenge. Instead, many methods aim to predict a protein contact map from protein sequence using machine learning algorithms. In this thesis, we introduce a novel ensemble method for protein contact map prediction based on bagging multiple decision trees. A random sampling method is used to address the large class imbalance in contact maps. To generalize the feature space, we further clustered the twenty-letter amino acid alphabet into ten groups. Software was also developed to view a protein contact map at a given threshold and separation. The parameters used in the decision trees are determined experimentally, and the overall results for the first L, L/2 and L/5 predictions for a protein of length L are evaluated.
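The evaluation mentioned at the end of this abstract (precision over the first L, L/2 and L/5 predictions) can be sketched as follows. This is a minimal illustration, not the thesis's own code: the function name `top_fraction_precision`, the default minimum sequence separation of 6, and the toy scores are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # (i, j) residue indices


def top_fraction_precision(
    scores: Dict[Pair, float],
    true_contacts: Set[Pair],
    length: int,
    divisor: int = 1,
    min_separation: int = 6,  # assumed value; the abstract does not state one
) -> float:
    """Precision of the top (length // divisor) predicted contacts."""
    # Keep only residue pairs that are far enough apart in the sequence.
    candidates = [
        (score, pair)
        for pair, score in scores.items()
        if abs(pair[0] - pair[1]) >= min_separation
    ]
    # Rank by predicted contact score, highest first.
    candidates.sort(key=lambda item: item[0], reverse=True)
    k = max(1, length // divisor)
    top = [pair for _, pair in candidates[:k]]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in true_contacts)
    return hits / len(top)


# Hypothetical predictions for a protein of length L = 10.
example_scores = {
    (0, 8): 0.9,
    (1, 9): 0.8,
    (0, 7): 0.7,
    (2, 9): 0.6,
    (3, 9): 0.5,
    (0, 1): 0.99,  # adjacent residues, removed by the separation filter
}
example_contacts = {(0, 8), (0, 7), (3, 9)}

for divisor in (1, 2, 5):
    precision = top_fraction_precision(
        example_scores, example_contacts, length=10, divisor=divisor
    )
    print(f"top L/{divisor}: precision = {precision:.2f}")
```

Filtering by separation before ranking keeps trivially close residue pairs from dominating the top of the list.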

    GPCRTree: online hierarchical classification of GPCR function

    Background: G protein-coupled receptors (GPCRs) play important physiological roles transducing extracellular signals into intracellular responses. Approximately 50% of all marketed drugs target a GPCR. There remains considerable interest in effectively predicting the function of a GPCR from its primary sequence. Findings: Using techniques drawn from data mining and proteochemometrics, an alignment-free approach to GPCR classification has been devised. It uses a simple representation of a protein's physical properties. GPCRTree, a publicly-available internet server, implements an algorithm that classifies GPCRs at the class, sub-family and sub-subfamily level. Conclusion: A selective top-down classifier was developed which assigns sequences within a GPCR hierarchy. Compared to other publicly available GPCR prediction servers, GPCRTree is considerably more accurate at every level of classification. The server has been available online since March 2008 at URL: http://igrid-ext.cryst.bbk.ac.uk/gpcrtree

    An Ensemble Learning Approach to Reverse-Engineering Transcriptional Regulatory Networks from Time-Series Gene Expression Data

    Background: One of the most challenging tasks in the post-genomic era is to reconstruct the transcriptional regulatory networks. The goal is to reveal, for each gene that responds to a certain biological event, which transcription factors affect its expression, and how a set of transcription factors coordinate to accomplish temporal and spatial specific regulations. Results: Here we propose a supervised machine learning approach to address these questions. We focus our study on the gene transcriptional regulation of the cell cycle in the budding yeast, thanks to the large amount of data available and relatively well-understood biology, although the main ideas of our method can be applied to other data as well. Our method starts with building an ensemble of decision trees for each microarray data set to capture the association between the expression levels of yeast genes and the binding of transcription factors to gene promoter regions, as determined by chromatin immunoprecipitation microarray (ChIP-chip) experiments. Cross-validation experiments show that the method is more accurate and reliable than the naive decision tree algorithm and several other ensemble learning methods. From the decision tree ensembles, we extract logical rules that explain how a set of transcription factors act in concert to regulate the expression of their targets. We further compute a profile for each rule to show its regulation strengths at different time points. We also propose a spline interpolation method to integrate the rule profiles learned from several time series expression data sets that measure the same biological process. We then combine these rule profiles to build a transcriptional regulatory network for the yeast cell cycle. Compared to the results in the literature, our method correctly identifies all major known yeast cell cycle transcription factors, and assigns them into appropriate cell cycle phases. 
Our method also identifies many interesting synergetic relationships among these transcription factors, most of which are well known, while many of the rest are supported by other evidence. Conclusion: The high accuracy of our method indicates that our method is valid and robust. As more gene expression and transcription factor binding data become available, we believe that our method is useful for reconstructing large-scale transcriptional regulatory networks in other species as well.

    Machine learning approaches for tomato crop yield prediction in precision agriculture

    Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. The objective of this project was to apply ML techniques to predict processing tomato crop yield given information on soil properties, weather conditions, and applied fertilizers. Besides being robust enough for predicting tomato productivity, the model needed to be interpretable and transparent for the business. The models assessed were Decision Tree Regression, ensemble bagging models like Random Forest Regression, boosting techniques like Gradient Boosting Regression, and Support Vector Regression. Overall, the Gradient Boosting and Support Vector models presented the best performance. To improve the predictive power, we combined the predictions of our two best models in a stacked approach with a Ridge Regression as the final model. The generalization error of the final chosen model on new data was 9.02 ton/ha for the MAE metric, 9.5% for the MAPE, and 13.5 ton/ha for the RMSE. This means that our model can predict tomato crop yield with an approximate error of 9 ton/ha. Even though our final model was complex and not intrinsically interpretable, we were able to apply model-agnostic interpretation methods like the SHAP summary plot to better understand the feature importance and feature effects, and the Accumulated Local Effects (ALE) plot to explain how features influence the outcome of the model on average. In general, the objectives of the project were accomplished and the company was satisfied with the result of the model and its interpretation.
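The three error metrics quoted in this abstract (MAE, MAPE and RMSE) can be computed as in the sketch below. The function name `regression_errors` and the yield values are made up for illustration and are not taken from the report.

```python
import math
from typing import Dict, Sequence


def regression_errors(
    actual: Sequence[float], predicted: Sequence[float]
) -> Dict[str, float]:
    """Return MAE, RMSE (both in the units of the target) and MAPE (in percent)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    residuals = [p - a for a, p in zip(actual, predicted)]
    # Mean absolute error: average magnitude of the residuals.
    mae = sum(abs(r) for r in residuals) / n
    # Root mean squared error: penalises large residuals more heavily.
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    # Mean absolute percentage error: requires non-zero actual values.
    mape = 100.0 * sum(abs(r) / abs(a) for r, a in zip(residuals, actual)) / n
    return {"mae": mae, "rmse": rmse, "mape": mape}


# Hypothetical yields in ton/ha.
example_actual = [100.0, 80.0, 120.0]
example_predicted = [90.0, 88.0, 126.0]
example_errors = regression_errors(example_actual, example_predicted)
print(example_errors)
```

RMSE is never smaller than MAE on the same residuals, which is consistent with the reported 13.5 ton/ha RMSE against 9.02 ton/ha MAE.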

    Detection of Sand Boils from Images using Machine Learning Approaches

    Levees provide protection for vast amounts of commercial and residential property. However, these structures degrade over time due to the impact of severe weather, sand boils, subsidence of land, seepage, etc. In this research, we focus on detecting sand boils. Sand boils occur when water under pressure wells up to the surface through a bed of sand. These make levees especially vulnerable. Object detection is a good approach to confirm the presence of sand boils from satellite or drone imagery, which can be used to assist automated levee monitoring. Since sand boils have distinct features, applying object detection algorithms to them can result in accurate detection. To the best of our knowledge, this research work is the first approach to detect sand boils from images. In this research, we compare some of the latest deep learning methods, the Viola-Jones algorithm, and other non-deep learning methods to determine the best performing one. We also train a stacking-based machine learning method for the accurate prediction of sand boils. The accuracy of our robust model is 95.4%.

    Improving random forests by feature dependence analysis

    Random forests (RFs) have been widely used for supervised learning tasks because of their high prediction accuracy, good model interpretability, and fast training process. However, they are not able to learn from local structures as convolutional neural networks (CNNs) do when there exists high dependency among features. They also cannot utilize features that are jointly dependent on the label but marginally independent of it. In this dissertation, we present two approaches to address these two problems, respectively, by dependence analysis. First, a local feature sampling (LFS) approach is proposed to learn and use the locality information of features to group dependent/correlated features to train each tree. For image data, the local information of features (pixels) is defined by the 2-D grid of the image. For non-image data, we provide multiple ways of estimating this local structure. Our experiments show that RF with LFS has reduced correlation and improved accuracy on multiple UCI datasets. To address the latter issue, we propose a way to categorize features as marginally dependent features and jointly dependent features; the latter are defined by minimum dependence sets (MDS's) or by stronger dependence sets (SDS's). Algorithms to identify MDS's and SDS's are provided. We then present a feature dependence mapping (FDM) approach to map the jointly dependent features to another feature space where they are marginally dependent. We show that by using FDM, decision trees and RFs have improved prediction performance on artificial datasets and a protein expression dataset.
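The distinction between marginally and jointly dependent features in this abstract can be illustrated with the classic XOR case. The sketch below is not the dissertation's MDS, SDS or FDM algorithm: the helper `mutual_information` and the hand-picked XOR mapping are assumptions chosen only to show the effect on toy data.

```python
import math
from collections import Counter
from itertools import product
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical mutual information I(X; Y) in bits."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (count / n) * math.log2((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in joint.items()
    )


# Toy data: the label is the XOR of two binary features.
rows = list(product([0, 1], repeat=2))
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
label = [a ^ b for a, b in rows]

# Each feature alone carries no information about the label...
mi_x1 = mutual_information(x1, label)
mi_x2 = mutual_information(x2, label)
# ...but the pair taken together determines it completely.
mi_joint = mutual_information(rows, label)

# A hand-picked mapping of the pair into a single new feature
# that is marginally dependent on the label.
mapped = [a ^ b for a, b in rows]
mi_mapped = mutual_information(mapped, label)

print(mi_x1, mi_x2, mi_joint, mi_mapped)
```

A split criterion that scores one feature at a time sees no gain from `x1` or `x2` here, which is why a tree can overlook them unless they are mapped into a feature like `mapped`.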

    Data Mining Application for Healthcare Sector: Predictive Analysis of Heart Attacks

    Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. Cardiovascular diseases are the leading cause of death in the world, and heart disease is the deadliest of them, affecting more than 75% of individuals living in low- and middle-income countries. Given the consequences, first for the individual's health but also for the health system and the cost of healthcare (for instance, treatments and medication), and specifically for the treatment of cardiovascular diseases, it has become extremely important to provide quality services through preventive medicine, which focuses on identifying disease risk and then taking the right action when early signs appear. Data Mining (DM) and its techniques make it possible to uncover patterns and relationships among the objects in healthcare data, so that the data can be used more efficiently to produce business intelligence and to extract knowledge that will be crucial for answering future questions about possible diseases and patient treatments. DM is already applied in medical information systems for clinical purposes such as diagnosis and treatment, where predictive models can diagnose certain groups of diseases, in this case heart attacks. This project applies machine learning techniques to develop a predictive model based on a real dataset, in order to detect, through the analysis of patient data, whether a person is likely to have a heart attack. Finally, the best model is found by comparing the different algorithms used, assessing their results, and selecting the one with the best measures. 
The correct identification of early signs of cardiovascular problems through the analysis of patient data can lead to the prevention of heart attacks, to a consequent reduction in the complications and secondary effects that the disease may bring, and, most importantly, to a decrease in the number of deaths in the future. Using Data Mining and analytics in healthcare allows the analysis of high volumes of data, the development of new predictive models, and an understanding of the factors and variables that have the most influence on this disease and to which people should pay attention. Hence, this practical approach is an example of how predictive analytics can have an important impact in the healthcare sector: models learn from collected patient data so that in the future they can predict new, unseen cases of heart attacks with better accuracy. In this way, it contributes to the creation of new models, to the tracking of patients' health data, to the improvement of medical decisions, to faster and more efficient responses, and to the wellbeing of the population, which can be improved if diseases like this can be predicted and avoided. To conclude, this project aims to show how Data Mining techniques are applied in healthcare and medicine, and how they contribute to better knowledge of cardiovascular diseases and to the support of important decisions that will influence the patient's quality of life.

    Constructing gene expression based prognostic models to predict recurrence and lymph node metastasis in colon cancer

    The main goal of this study is to identify molecular signatures to predict lymph node metastases and recurrence in colon cancer patients. Recent advances in microarray technology have facilitated the building of accurate molecular classifiers and an in-depth understanding of disease mechanisms. Lymph node metastasis cannot be accurately estimated by morphological assessment. Molecular markers have the potential to improve prognostic accuracy. The first part of our study presents a novel technique to identify molecular markers for predicting the stage of the disease based on microarray gene expression data. In the first step, random forests were used for variable selection and a 14-gene signature was identified. In the second step, the genes without differential expression in lymph node negative versus positive tumors were removed from the 14-gene signature, leading to the identification of a 9-gene signature. The lymph node status prediction accuracy of the 9-gene signature on an independent colon cancer dataset (n=17) was 82.3%. The area under the curve (AUC) obtained from the time-dependent ROC curves using the 9-gene signature was 0.85 and 0.86 for relapse-free survival and overall survival, respectively. The 9-gene signature significantly stratified patients into low-risk and high-risk groups (log-rank tests, p<0.05, n=73), with distinct relapse-free survival and overall survival. Based on the results, it could be concluded that the 9-gene signature could be used to identify lymph node metastases in patients. We further studied the 9-gene signature using correlation analysis on CGH and RNA expression datasets. It was found that the gene ITGB1 in the 9-gene signature exhibited a strong relationship between DNA copy number and gene expression. Furthermore, genome-wide correlation analysis was done on CGH and RNA data, and three or more consecutive genes with significant correlation of DNA copy number and RNA expression were identified. 
These results might be helpful in identifying the regulators of gene expression. The second part of the study was focused on identifying molecular signatures for patients at high-risk for recurrence who would benefit from adjuvant chemotherapy. The training set (n=36) consisted of patients who remained disease-free for 5 years and patients who experienced recurrence within 5 years. The remaining patients formed the testing set (n=37). A combinatorial scheme was developed to identify gene signatures predicting colon cancer recurrence. In the first step, preprocessing was done to discard undifferentiated genes and missing values were replaced with k=30 and k=20 using the k-nearest neighbors algorithm. Variable selection using the random forests algorithm was applied to obtain gene subsets. In the second step, InfoGain feature selection technique was used to drop lower ranked genes from the gene subsets based on their association with disease outcome. A 3-gene and a 5-gene signature were identified by this technique based on different missing value replacement methods. Both of the recurrence gene signatures stratified patients into low-risk and high-risk groups (log-rank tests, p<0.05, n=73), with distinct relapse-free survival and overall survival. A recurrence prediction model was built using LWL classifier based on the 3-gene signature with an accuracy of 91.7% on the training set (n=36). Another recurrence prediction model was built using the random tree classifier based on the 5-gene signature with an accuracy of 83.3% on the training set (n=36). The prospective predictions obtained on the testing set using these models will be verified when the follow-up information becomes available in the future. The recurrence prediction accuracies of these gene signatures on independent colon cancer datasets were in the range 72.4% to 88.9%. 
These prognostic models might be helpful to clinicians in selecting more appropriate treatments for patients who are at high-risk of developing recurrence. When compared over multiple datasets, the 3-gene signature had improved prediction accuracy over the 5-gene signature. The identified lymph node and recurrence gene signatures were validated on rectal cancer data. Time-dependent ROC and Kaplan-Meier analysis were done producing significant results. These results support the fact that the developed prognostic models could be used to identify patients at high-risk of developing recurrence and get an estimate of the survival times in rectal cancer patients
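The k-nearest-neighbors replacement of missing values described in this abstract can be sketched in simplified form as below. The study's exact procedure is not given, so the distance measure (root-mean-square difference over shared columns), the function names `knn_impute` and `_distance`, and the toy matrix are assumptions for illustration; the small `k` is used only because the example has four rows.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]  # one gene's expression values; None marks a missing entry


def _distance(a: Row, b: Row) -> float:
    """Root-mean-square difference over the columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(data: List[Row], k: int) -> List[Row]:
    """Fill each missing entry with the column mean over the k nearest rows."""
    filled = [list(row) for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows where column j is observed.
            neighbours = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in neighbours[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical expression matrix (rows are genes, columns are samples).
example_matrix: List[Row] = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 40.0],
]
example_filled = knn_impute(example_matrix, k=2)
print(example_filled[0])
```

Distances are computed on the original matrix rather than on partially filled rows, so the result does not depend on the order in which entries are visited.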

    Augmenting Structure/Function Relationship Analysis with Deep Learning for the Classification of Psychoactive Drug Activity at Class A G Protein-Coupled Receptors

    G protein-coupled receptors (GPCRs) initiate intracellular signaling pathways via interaction with external stimuli. [1-5] Despite sharing similar structure and cellular mechanism, GPCRs participate in a uniquely broad range of physiological functions. [6] Due to the size and functional diversity of the GPCR family, these receptors are a major focus for pharmacological applications. [1,7] Current state-of-the-art pharmacology and toxicology research strategies rely on computational methods to efficiently design highly selective, low toxicity compounds. [9], [10] GPCR-targeting therapeutics are associated with low selectivity resulting in increased risk of adverse effects and toxicity. Psychoactive drugs that are active at Class A GPCRs used in the treatment of schizophrenia and other psychiatric disorders display promiscuous binding behavior linked to chronic toxicity and high-risk adverse effects. [16-18] We hypothesized that using a combination of physiochemical feature engineering with a feedforward neural network, predictive models can be trained for these specific GPCR subgroups that are more efficient and accurate than current state-of-the-art methods. We combined normal mode analysis with deep learning to create a novel framework for the prediction of Class A GPCR/psychoactive drug interaction activities. Our deep learning classifier results in high classification accuracy (5-HT F1-score = 0.78; DRD F1-score = 0.93) and achieves a 45% reduction in model training time when structure-based feature selection is applied via guidance from an anisotropic network model (ANM). Additionally, we demonstrate the interpretability and application potential of our framework via evaluation of highly clinically relevant Class A GPCR/psychoactive drug interactions guided by our ANM results and deep learning predictions. 
Our model offers an increased range of applicability as compared to other methods due to accessible data compatibility requirements and low model complexity. While this model can be applied to a multitude of clinical applications, we have presented strong evidence for the impact of machine learning in the development of novel psychiatric therapeutics with improved safety and tolerability