
    Global analysis of SNPs, proteins and protein-protein interactions: approaches for the prioritisation of candidate disease genes.

    PhD thesis. Understanding the etiology of complex disease remains a challenge in biology. In recent years there has been an explosion in biological data; this study investigates machine learning and network analysis methods as tools to aid candidate disease gene prioritisation, specifically relating to hypertension and cardiovascular disease. This thesis comprises four sets of analyses. Firstly, non-synonymous single nucleotide polymorphisms (nsSNPs) were analysed in terms of sequence- and structure-based properties using a classifier to provide a model for predicting deleterious nsSNPs. The degree of sequence conservation at the nsSNP position was found to be the single best attribute, but other sequence and structural attributes in combination were also useful. Predictions for nsSNPs within Ensembl have been made publicly available. Secondly, the problem of predicting function for proteins that lack experimental data or clear similarity to a sequence of known function was addressed. Protein domain attributes based on physicochemical and predicted structural characteristics of the sequence were used as input to classifiers for predicting membership of large and diverse protein superfamilies from the SCOP database. An enrichment method was investigated that involved adding domains to the training dataset that are currently absent from SCOP. This analysis resulted in improved classifier accuracy: optimised classifiers achieved 66.3% for single-domain proteins and 55.6% when including domains from multi-domain proteins. Domains from superfamilies with low sequence similarity share global sequence properties, enabling applications to be developed that complement profile methods for detecting distant sequence relationships. Thirdly, a topological analysis of the human protein interactome was performed. The results were combined with functional annotation and sequence-based properties to build models for predicting hypertension-associated proteins.
The study found that predicted hypertension-related proteins are not generally associated with network hubs and do not exhibit high clustering coefficients. Despite this, they tend to be closer and better connected to other hypertension proteins on the interaction network than would be expected by chance. Classifiers that combined PPI network, amino acid sequence and functional properties produced a range of precision and recall scores according to the applied weights. Finally, interactome properties of proteins implicated in cardiovascular disease and cancer were studied. The analysis quantified the influential (central) nature of each protein and defined characteristics of the functional modules and pathways in which the disease proteins reside. Such proteins were found to be enriched 2-fold among proteins that are influential (p<0.05) in the interactome. Additionally, they cluster in large, complex, highly connected communities, acting as interfaces between multiple processes more often than expected. An approach to prioritising disease candidates based on this analysis was proposed. Each analysis can provide new insights into the effort to identify novel disease-related proteins for cardiovascular disease.
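
The two topological measures named in this abstract, hub status (node degree) and the clustering coefficient, can be illustrated with a minimal sketch. It is not the thesis's code: the toy edge list and the function names are hypothetical, and the interactome is assumed to be an undirected graph without self-interactions.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue  # assumption: self-interactions are ignored
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree(adjacency, node):
    """Number of interaction partners; hubs are the high-degree nodes."""
    return len(adjacency[node])


def clustering_coefficient(adjacency, node):
    """Fraction of a node's neighbour pairs that interact with each other."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbours; 0.0 by convention
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return linked / (k * (k - 1) / 2)


# Hypothetical toy interactome, for illustration only.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("D", "E")]
adjacency = build_adjacency(edges)

for protein in sorted(adjacency):
    print(protein, degree(adjacency, protein),
          round(clustering_coefficient(adjacency, protein), 3))
```

In this toy graph, protein A has the highest degree, yet only one of its three neighbour pairs is connected, so a high degree does not imply a high clustering coefficient.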

    Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records

    Modern sequencing technology has revolutionized genomic research and triggered explosive growth in the number of DNA, RNA, and protein sequences. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated and user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps for analyzing sequences: feature extraction, dimensionality reduction, feature selection, and construction of predictive models based on machine learning and deep learning approaches. We used enhancer, RNA N6-methyladenosine site, and protein-protein interaction datasets to validate the tool. The results show that the tool can effectively perform biological sequence analysis and classification tasks. The application of machine learning algorithms to electronic medical record (EMR) data analysis is also included in this dissertation. Chronic kidney disease (CKD) is prevalent across the world and is well defined by the estimated glomerular filtration rate (eGFR). The progression of kidney disease can be predicted if future eGFR can be accurately estimated using predictive analytics. Thus, I present a prediction model of eGFR that was built using Random Forest regression. The dataset includes demographic, clinical, and laboratory information from a regional primary health care clinic. The final model included eGFR, age, gender, body mass index (BMI), obesity, hypertension, and diabetes, and achieved a mean coefficient of determination of 0.95. The estimated eGFRs were used to classify patients into CKD stages with high macro-averaged and micro-averaged metrics.
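
The final step described above, turning an estimated eGFR into a CKD stage, can be sketched as a simple threshold lookup. The abstract does not say which staging scheme was used; the sketch assumes the widely used KDIGO GFR categories, and the function name and sample values are hypothetical. A GFR category alone does not establish a CKD diagnosis for G1 and G2, which also require other evidence of kidney damage.

```python
# Assumption: KDIGO GFR categories (the abstract does not name its scheme).
# Each entry is (lower bound in mL/min/1.73 m^2, category label).
GFR_CATEGORIES = [
    (90.0, "G1"),
    (60.0, "G2"),
    (45.0, "G3a"),
    (30.0, "G3b"),
    (15.0, "G4"),
    (0.0, "G5"),
]


def gfr_category(egfr):
    """Map an eGFR value to its GFR category under the assumed thresholds."""
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    for lower_bound, label in GFR_CATEGORIES:
        if egfr >= lower_bound:
            return label
    # Reached only for values that fail every comparison, such as NaN.
    raise ValueError("eGFR must be a finite number")


# Hypothetical model outputs, for illustration only.
predicted_egfr = [102.3, 74.0, 58.5, 31.2, 22.0, 9.8]
stages = [gfr_category(value) for value in predicted_egfr]
print(stages)
```

Keeping the thresholds in a table rather than in a chain of conditionals makes it easy to swap in a different staging scheme if the original study used one.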

    Detecting Adverse Drug Events Using a Deep neural network Model

    Adverse drug events represent a key challenge in public health, especially with respect to drug safety profiling and drug surveillance. Drug-drug interactions are one of the most common types of adverse drug events. Most computational approaches to this problem have combined different types of drug-related information with different types of machine learning algorithms to predict potential interactions between drugs. In this work, our focus is on the use of genetic information about the drugs, in particular the protein sequence and protein structure of drug protein targets, to predict potential interactions between drugs. We collected information on drug-drug interactions (DDIs) from the DrugBank database and divided it into multiple datasets based on the type of information, such as chemical structure, protein targets, side effects, pathways, protein-protein interactions, protein structure, and information about indications. We propose a similarity-based neural network framework called the protein sequence-structure similarity network (S3N), and use it to predict novel DDIs. The drug-drug similarities are computed using different categories of drug information based on multiple similarity metrics. We compare the results with those from state-of-the-art methods on this problem. Our results show that the proposed method is quite competitive, at times outperforming the state of the art. Our performance evaluations on different datasets showed the following predictive performance: precision 91%-98%, recall 90%-96%, F1 score 86%-95%, AUC 88%-99%, and accuracy 86%-95%. To further investigate the reliability of the proposed method, we used 158 drugs related to cardiovascular disease to evaluate the performance of our model and to find new interactions among the drugs. Our model showed 90% accuracy in detecting the existing drug interactions and identified 60 new DDIs for the cardiovascular drugs.
Our evaluation demonstrates the effectiveness of S3N in predicting DDIs.
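
For readers unfamiliar with the metrics reported above, the following minimal sketch shows how precision, recall, F1 score, and accuracy are conventionally computed for a binary task such as "interacts" versus "does not interact". It is not the authors' evaluation code; the labels are hypothetical, and AUC is omitted because it requires ranked scores rather than hard predictions.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Zero is returned when a denominator is empty (a common convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical labels: 1 = known interaction, 0 = no known interaction.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two for a single evaluation; ranges reported across several datasets need not line up in the same way.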

    DapBCH: a disease association prediction model Based on Cross-species and Heterogeneous graph embedding

    The study of comorbidity can provide new insights into the pathogenesis of the disease and has important economic significance in the clinical evaluation of treatment difficulty, medical expenses, length of stay, and prognosis of the disease. In this paper, we propose a disease association prediction model DapBCH, which constructs a cross-species biological network and applies heterogeneous graph embedding to predict disease association. First, we combine the human disease–gene network, mouse gene–phenotype network, human–mouse homologous gene network, and human protein–protein interaction network to reconstruct a heterogeneous biological network. Second, we apply heterogeneous graph embedding based on meta-path aggregation to generate the feature vector of disease nodes. Finally, we employ link prediction to obtain the similarity of disease pairs. The experimental results indicate that our model is highly competitive in predicting the disease association and is promising for finding potential disease associations

    Role of network topology based methods in discovering novel gene-phenotype associations

    The cell is governed by complex interactions among various types of biomolecules. Coupled with environmental factors, variations in DNA can cause alterations in normal gene function and lead to a disease condition. Often, such disease phenotypes involve coordinated dysregulation of multiple genes that implicate interconnected pathways. Towards a better understanding and characterization of the mechanisms underlying human diseases, I present GUILD, a network-based disease-gene prioritization framework. GUILD associates genes with diseases using the global topology of the protein-protein interaction network and an initial set of genes known to be implicated in the disease. Furthermore, I investigate the mechanistic relationships between disease genes and explain the robustness emerging from these relationships. I also introduce GUILDify, a user-friendly online tool that prioritizes genes by their association with any user-provided phenotype. Finally, I describe current state-of-the-art systems biology approaches in which network modeling has helped extend our view of diseases such as cancer.

    Computational approaches to study drug resistance mechanisms

    Drug resistance is a major obstacle faced by therapists in treating complex diseases such as cancer, epilepsy, arthritis, and HIV infection. Resistance arises either from protein mutations or from changes in gene expression level that induce resistance to drug treatments. These mutations affect drug binding activity, resulting in treatment failure. This information is stored in PubMed as text data. Extracting useful knowledge from unstructured text is a challenging task for biologists, since the biomedical literature is growing exponentially. Building automated methods for such tasks is gaining much attention among researchers. In this thesis we have developed ZK DrugResist, a disease-categorized database that automatically extracts mutations and expression changes associated with drug resistance from PubMed. The tool also includes semantic relations extracted from biomedical text covering drug resistance, and we established a server that includes both of these features. Our system was tested on three relations, Resistance (R), Intermediate (I) and Susceptible (S), by applying a hybrid feature set. Over the last few decades the focus has shifted to hybrid approaches because they provide better results. In our case this approach combines rule-based methods with machine learning techniques. The results showed 97.7% accuracy with 96% precision, recall and F-measure. These results outperform previously existing relation extraction systems, thus facilitating computational analysis of drug resistance in complex diseases, and the approach can be applied to other areas of biomedicine. The literature on HIV drug resistance is extensive and provides more training data than is available for other diseases; hence we developed a computational method to predict HIV resistance. For this we combined sequence and structural features and applied SVM and Random Forest classifiers.
The model was tested on mutants of HIV-1 protease and reverse transcriptase. Among the features used in our method, the total contact energies among multiple mutations have a strong impact on predicting resistance, as they are crucial for understanding the interactions of HIV mutants. The combination of sequence and structure features offers higher accuracy with support vector machines than with the Random Forest classifier. Both single mutations and the acquisition of multiple mutations are important in predicting HIV resistance to certain drug treatments. We have demonstrated the practicality of these features; hence they can be used in the future to predict resistance in other complex diseases. Another way to deal with drug resistance is drug repurposing. A drug often binds to more than one target, a property known as polypharmacology, which can be exploited for drug repositioning, also referred to as therapeutic switching. Traditional drug discovery and development is a costly and tedious process, making drug repurposing a popular alternative strategy. We have proposed a method based on a similarity scheme that predicts both approved and novel targets for drugs as well as new disease associations. We combined PPI, biological pathways, binding site structural similarities and disease-disease similarity measures. We used sixty drugs for training the algorithm and tested it on eight separate drugs. The results showed 95% accuracy in predicting the approved and novel targets, surpassing existing methods. All these parameters help in elucidating unknown associations between drugs and diseases and in finding new uses for old drugs. Hence repurposing offers novel candidates from the existing pool of drugs, providing a ray of hope in combating drug resistance.

    Molecular Science for Drug Development and Biomedicine

    With the avalanche of biological sequences generated in the postgenomic age, molecular science is facing an unprecedented challenge: how to make timely use of the huge amount of data to benefit human beings. Stimulated by this challenge, molecular science has developed rapidly, particularly in the areas associated with drug development and biomedicine, both experimental and theoretical. The current thematic issue was launched with a focus on the topic of “Molecular Science for Drug Development and Biomedicine”, in the hope of stimulating more useful techniques and findings from various approaches of molecular science for drug development and biomedicine.