52 research outputs found

    Software Defect Prediction Using AWEIG+ADACOST Bayesian Algorithm for Handling High Dimensional Data and Class Imbalance Problem

    Software defect prediction is one of the most important tasks in software engineering: it is the process of predicting which parts of a software system are prone to errors, failures, and faults. Researchers apply machine learning methods to predict software defects, including estimation, association, classification, clustering, and dataset analysis. The NASA Metrics Data Program (NASA MDP) datasets are among the software metrics collections that researchers use to predict software defects. NASA MDP datasets contain imbalanced classes and high-dimensional data, both of which lower classification performance. In this research, class imbalance is addressed with the AdaCost method and high dimensionality with the Average Weight Information Gain (AWEIG) method, while Naïve Bayes is used as the classifier. The proposed method is named AWEIG + AdaCost Bayesian. In the experiments, the AWEIG + AdaCost Bayesian algorithm is compared to the plain Naïve Bayes algorithm. The results show that AWEIG + AdaCost Bayesian achieves a better mean Area Under the Curve (AUC) than Naïve Bayes alone, with mean AUC values of 0.752 and 0.696, respectively.
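
    The abstract names two components on top of Naïve Bayes: AWEIG for high-dimensional data and AdaCost for class imbalance. As a rough illustration of the feature-selection half only, the sketch below ranks software metrics by information gain (approximated here with scikit-learn's mutual-information estimator) and trains a Naïve Bayes model on the top-ranked metrics. It does not reproduce the authors' AWEIG weighting or the AdaCost boosting step, and the data and names are hypothetical.

```python
# Sketch: information-gain feature ranking before Naive Bayes, loosely
# mirroring the feature-selection half of the AWEIG + AdaCost pipeline.
# This is plain per-feature information gain, NOT the authors' AWEIG
# weighting, and the AdaCost cost-sensitive boosting step is omitted.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.naive_bayes import GaussianNB

def top_k_by_information_gain(X, y, k=10):
    """Rank features by mutual information with the defect label."""
    gain = mutual_info_classif(X, y, random_state=0)
    return np.argsort(gain)[::-1][:k]

# Hypothetical usage on a NASA MDP-style metrics matrix X (modules x metrics)
# and binary defect labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = (rng.random(200) < 0.15).astype(int)   # imbalanced, defect-prone minority

selected = top_k_by_information_gain(X, y, k=10)
model = GaussianNB().fit(X[:, selected], y)
print(model.predict_proba(X[:5, selected]))
```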

    Data mining techniques for protein sequence analysis

    This thesis concerns two areas of bioinformatics related by their role in protein structure and function: protein structure prediction and post-translational modification of proteins. The dihedral angles Ψ and Φ are predicted using support vector regression. For the prediction of Ψ dihedral angles, the addition of structural information and the normalisation of Ψ and Φ dihedral angles are examined. An application of the predicted dihedral angles is also investigated: the relationship between dihedral angles and three-bond J couplings determined from NMR experiments is described by the Karplus equation, and we investigate the determination of the correct solution of the Karplus equation using predicted Φ dihedral angles. Glycosylation is an important post-translational modification of proteins involved in many different facets of biology. The work here investigates the prediction of N-linked and O-linked glycosylation sites using the random forest machine learning algorithm and pairwise patterns in the data. This methodology produces more accurate results than state-of-the-art prediction methods. The black-box nature of random forest is addressed by using the Trepan algorithm to generate a decision tree with comprehensible rules that represents the decision-making process of the random forest. The prediction of our program GPP does not distinguish between glycans at a given glycosylation site, so we use farthest-first clustering with the idea of classifying each glycosylation site by the sugar linking the glycan to the protein. This thesis demonstrates the prediction of protein backbone torsion angles and improves the current state of the art for the prediction of glycosylation sites. It also investigates potential applications and the interpretation of these methods.
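
    The thesis links predicted Φ angles to measured three-bond J couplings through the Karplus equation. The sketch below uses the generic Karplus form for ³J(HN,Hα), with representative literature coefficients that are assumed rather than taken from the thesis, and shows why one measured coupling is consistent with several candidate Φ values, the ambiguity that predicted angles can help resolve.

```python
# Sketch: the Karplus relation between the backbone dihedral phi and the
# three-bond 3J(HN,HA) coupling, and the multiple phi solutions for one
# measured J. Coefficients are representative literature values (assumed),
# not necessarily those used in the thesis.
import numpy as np

A, B, C = 6.51, -1.76, 1.60   # Hz, assumed Karplus coefficients

def karplus_j(phi_deg):
    """3J(HN,HA) in Hz for a backbone phi angle in degrees."""
    theta = np.radians(phi_deg - 60.0)
    return A * np.cos(theta) ** 2 + B * np.cos(theta) + C

def phi_candidates(j_obs, tol=0.1):
    """Grid-search phi values whose predicted coupling matches j_obs."""
    grid = np.arange(-180.0, 180.0, 0.5)
    return grid[np.abs(karplus_j(grid) - j_obs) < tol]

# A single observed coupling is typically consistent with several phi
# regions; a predicted phi angle can be used to pick the right one.
print(phi_candidates(7.0))
```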

    Using Random Under Sampling to Handle Class Imbalance in Neural Network-Based Software Defect Prediction

    Reduced software quality and high repair costs can result from errors or defects in software. Software defect prediction is very important in software engineering, especially for addressing problems of effectiveness and efficiency and thereby improving software quality. The Neural Network (NN) is a classification algorithm that has been shown to cope with nonlinear data, is highly sensitive to the data, and can analyse large datasets. The NASA MDP datasets are nonlinear software metrics data commonly used in software defect prediction research; 62 out of 208 studies use NASA datasets. A weakness of NASA MDP is its imbalanced classes, which can degrade the performance of software defect prediction models. Class imbalance in the NASA MDP datasets is handled here with a data-level method, Random Under Sampling (RUS), which is intended to correct the class imbalance. The proposed method for handling class imbalance with a Neural Network (NN) is the application of RUS. The experiments compare the performance of the Neural Network before and after applying RUS, as well as against other models. The mean AUC is 0.80 for NN and 0.82 for NN+RUS. Wilcoxon and Friedman tests show that the AUC of NN+RUS differs significantly from NN, with a Wilcoxon p-value of 0.002 and a Friedman p-value of 0.003 (p < 0.05). According to the Friedman test there is a significant difference in AUC between NN+RUS and NN, NN+SMOTE, NB, and C4.5, since the p-value is < 0.0001. It can therefore be concluded that applying RUS handles the class imbalance problem in neural network-based software defect prediction.
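
    As a minimal sketch of the data-level step described above, the code below randomly under-samples the majority (non-defective) class to the size of the minority class before training a small neural network. It uses scikit-learn's MLPClassifier as a stand-in for the network in the study, and the imbalanced data are synthetic.

```python
# Sketch: Random Under Sampling (RUS) of the majority class before training
# a neural network, as a stand-in for the NN+RUS setup described above.
import numpy as np
from sklearn.neural_network import MLPClassifier

def random_under_sample(X, y, random_state=0):
    """Drop majority-class rows at random until both classes are equal-sized."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_min = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        if c != minority:
            idx = rng.choice(idx, size=n_min, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Synthetic imbalanced data standing in for NASA MDP metrics.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (rng.random(500) < 0.1).astype(int)

X_rus, y_rus = random_under_sample(X, y)
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
nn.fit(X_rus, y_rus)
print(np.bincount(y), "->", np.bincount(y_rus))
```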

    Machine Learning for Kinase Drug Discovery

    Cancer is one of the major public health issues, causing several million deaths every year. Although anti-cancer drugs have been developed and are globally administered, mild to severe side effects are known to occur during treatment. Computer-aided drug discovery has become a cornerstone for unveiling treatments of existing as well as emerging diseases. Computational methods aim not only to speed up the drug design process, but also to reduce time-consuming, costly experiments and in vivo animal testing. In this context, over the last decade especially, deep learning began to play a prominent role in the prediction of molecular activity, property and toxicity. However, there are still major challenges when applying deep learning models in drug discovery. Those challenges include data scarcity for physicochemical tasks, the difficulty of interpreting the predictions made by deep neural networks, and the necessity of open-source and robust workflows to ensure reproducibility and reusability. In this thesis, after reviewing the state of the art in deep learning applied to virtual screening, we address these challenges as follows. Regarding data scarcity in the context of deep learning applied to small molecules, we developed data augmentation techniques based on the SMILES encoding. This linear string notation enumerates the atoms present in a compound by following a path along the molecule graph. Multiple SMILES for a single compound can be obtained by traversing the graph along different paths. We applied the developed augmentation techniques to three different deep learning models, including convolutional and recurrent neural networks, and to four property and activity data sets. The results show that augmentation improves model accuracy independently of the deep learning model and of the data set size. Moreover, we computed the uncertainty of a model by using augmentation at inference time. In this regard, we have shown that the more confident the model is in its prediction, the smaller the error, implying that a given prediction can be trusted and is close to the target value. The software and associated documentation allow predictions to be made for novel compounds and have been made freely available. Blindly trusting predictions from algorithms may have serious consequences in healthcare. In this context, a better understanding of how a neural network classifies a compound based on its input features is highly beneficial, helping to de-risk and optimize compounds. In this research project, we decomposed the inner layers of a deep neural network to identify the toxic substructures, the toxicophores, of a compound that led to the toxicity classification. Using molecular fingerprints, vectors that indicate the presence or absence of a particular atomic environment, we were able to map a toxicity score to each of these substructures. Moreover, we developed a method to visualize the toxicophores within a compound in 2D, the so-called cytotoxicity maps, which could be of great use to medicinal chemists in identifying ways to modify molecules to eliminate toxicity. Not only does the deep learning model reach state-of-the-art results, but the identified toxicophores also confirm known toxic substructures and suggest new potential candidates. In order to speed up the drug discovery process, access to robust and modular workflows is extremely advantageous.
In this context, the fully open-source TeachOpenCADD project was developed. Significant tasks in both cheminformatics and bioinformatics are implemented in a pedagogical fashion, allowing the material to be used for teaching as well as a starting point for novel research. In this framework, a special pipeline is dedicated to kinases, a family of proteins that are known to be involved in diseases such as cancer. The aim is to gain insights into off-targets, i.e. proteins that are unintentionally affected by a compound and that can cause adverse effects in treatments. Four measures of kinase similarity are implemented, taking into account sequence and structural information, as well as protein-ligand interactions and ligand profiling data. The workflow provides clustering of a set of kinases, which can be further analyzed to understand off-target effects of inhibitors. Results show that analyzing kinases from several perspectives is crucial for insight into off-target prediction and for gaining a global view of the kinome. These novel methods can be exploited in the discovery of new drugs, and more specifically for diseases involving the dysregulation of kinases, such as cancer.
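
    The data-scarcity part of the abstract above describes augmenting training data by writing the same compound as several different SMILES strings. Below is a minimal sketch of that idea, assuming RDKit is available; shuffling the atom order and emitting non-canonical SMILES is a standard enumeration trick, not necessarily the exact implementation used in the thesis, and the caffeine example is illustrative only.

```python
# Sketch: SMILES enumeration for data augmentation, in the spirit of the
# augmentation described above. Assumes RDKit is installed; the randomised
# atom-ordering trick is a standard approach, not necessarily the exact
# implementation used in the thesis.
import random
from rdkit import Chem

def enumerate_smiles(smiles, n=5, seed=0):
    """Return up to n alternative SMILES strings for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    rng = random.Random(seed)
    variants = set()
    order = list(range(mol.GetNumAtoms()))
    for _ in range(20 * n):          # cap attempts for small molecules
        if len(variants) >= n:
            break
        rng.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return sorted(variants)

# Example: several non-canonical SMILES for caffeine, all the same molecule.
print(enumerate_smiles("Cn1cnc2c1c(=O)n(C)c(=O)n2C", n=5))
```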

    Transmembrane protein structure prediction using machine learning

    This thesis describes the development and application of machine learning-based methods for the prediction of alpha-helical transmembrane protein structure from sequence alone. It is divided into six chapters. Chapter 1 provides an introduction to membrane structure and dynamics, membrane protein classes and families, and membrane protein structure prediction. Chapter 2 describes a topological study of the transmembrane protein CLN3 using a consensus of bioinformatic approaches constrained by experimental data. Mutations in CLN3 can cause juvenile neuronal ceroid lipofuscinosis, or Batten disease, an inherited neurodegenerative lysosomal storage disease affecting children; such studies are therefore important for directing further experimental work into this incurable illness. Chapter 3 explores the possibility of using biologically meaningful signatures, described as regular expressions, to influence the assignment of inside and outside loop locations during transmembrane topology prediction. Using this approach, it was possible to modify a recent topology prediction method, leading to a 6% improvement in prediction accuracy on a standard data set. Chapter 4 describes the development of a novel support vector machine-based topology predictor that integrates both signal peptide and re-entrant helix prediction, benchmarked with full cross-validation on a novel data set of sequences with known crystal structures. The method achieves state-of-the-art performance in predicting topology and discriminating between globular and transmembrane proteins. We also present the results of applying these tools to a number of complete genomes. Chapter 5 describes a novel approach to predict lipid exposure, residue contacts, helix-helix interactions and, finally, the optimal helical packing arrangement of transmembrane proteins. It is based on two support vector machine classifiers that predict per-residue lipid exposure and residue contacts, which are used to determine helix-helix interactions with up to 65% accuracy. The method is also able to discriminate native from decoy helical packing arrangements with up to 70% accuracy. Finally, a force-directed algorithm is employed to construct the optimal helical packing arrangement, which demonstrates success for proteins containing up to 13 transmembrane helices. The final chapter summarises the major contributions of this thesis to biology, before future perspectives for transmembrane protein structure prediction are discussed.
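
    The per-residue predictors described above (topology, lipid exposure, residue contacts) share a common scheme: classify each residue from a sliding window of sequence context with a support vector machine. The sketch below shows that scheme in its simplest form, with one-hot windows and synthetic labels; real predictors of this kind use evolutionary profiles and curated training sets, so this is an illustration of the setup, not of the thesis' methods.

```python
# Sketch: per-residue classification with an SVM over sliding sequence
# windows, the general scheme behind the transmembrane-topology and
# lipid-exposure predictors described above. Real systems use evolutionary
# profiles (PSSMs) rather than the one-hot encoding used here, and the
# training labels below are synthetic.
import numpy as np
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def window_features(seq, window=15):
    """One-hot encode a symmetric window around each residue (zero-padded)."""
    half = window // 2
    n, d = len(seq), len(AMINO_ACIDS)
    feats = np.zeros((n, window * d))
    for i in range(n):
        for w in range(-half, half + 1):
            j = i + w
            if 0 <= j < n:
                feats[i, (w + half) * d + AA_INDEX[seq[j]]] = 1.0
    return feats

# Toy example: label residues as membrane-like (1) or not (0) and fit an SVM.
seq = "MKTLLILAVLAAALAAGSADEERLVKKLEEAGKQLGISPEELLLLLVVVAIIFWWLKK"
labels = np.array([1 if aa in "LIVFAWM" else 0 for aa in seq])  # crude stand-in
X = window_features(seq)
clf = SVC(kernel="rbf", C=1.0).fit(X, labels)
print(clf.predict(X[:10]))
```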

    Expression data analysis and regulatory network inference by means of correlation patterns

    With the advance of high-throughput techniques, the amount of available data in the bio-molecular field is rapidly growing. It is now possible to measure genome-wide aspects of an entire biological system as a whole. Correlations that emerge due to internal dependency structures of these systems entail the formation of characteristic patterns in the corresponding data. The extraction of these patterns has become an integral part of computational biology. By triggering perturbations and interventions it is possible to induce an alteration of patterns, which may help to derive the dependency structures present in the system. In particular, differential expression experiments may yield alternate patterns that we can use to approximate the actual interplay of regulatory proteins and genetic elements, namely, the regulatory network of a cell. In this work, we examine the detection of correlation patterns from bio-molecular data and we evaluate their applicability in terms of protein contact prediction, experimental artifact removal, the discovery of unexpected expression patterns and genome-scale inference of regulatory networks. Correlation patterns are not limited to expression data. Their analysis in the context of conserved interfaces among proteins is useful to estimate whether these may have co-evolved. Patterns that hint at correlated mutations would then occur in the associated protein sequences as well. We employ a conceptually simple sampling strategy to decide whether or not two pathway elements share a conserved interface and are thus likely to be in physical contact. We successfully apply our method to a system of ABC-transporters and two-component systems from the phylum of Firmicute bacteria. For spatially resolved gene expression data like microarrays, the detection of artifacts, as opposed to noise, corresponds to the extraction of localized patterns that resemble outliers in a given region. We develop a method to detect and remove such artifacts using a sliding-window approach. Our method is very accurate and it is shown to adapt to other platforms like custom arrays as well. Further, we developed Padesco as a way to reveal unexpected expression patterns. We extract frequent and recurring patterns that are conserved across many experiments. For a specific experiment, we predict whether a gene deviates from its expected behaviour. We show that Padesco is an effective approach for selecting promising candidates from differential expression experiments. In Chapter 5, we then focus on the inference of genome-scale regulatory networks from expression data. Here, correlation patterns have proven useful for the data-driven estimation of regulatory interactions. We show that, for reliable eukaryotic network inference, the integration of prior networks is essential. We reveal that this integration leads to an over-estimation of network-wide quality measures and suggest a corrective procedure, CoRe, to counterbalance this effect. CoRe drastically improves the false discovery rate of the originally predicted networks. We further suggest a consensus approach in combination with an extended set of topological features to obtain a more accurate estimate of the eukaryotic regulatory network for yeast. In the course of this work we show how correlation patterns can be detected and how they can be applied to various problem settings in computational molecular biology.
We develop and discuss competitive approaches for the prediction of protein contacts, artifact repair, differential expression analysis, and network inference, and show their applicability in practical setups.
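
    As a minimal sketch of correlation-based network inference in its simplest form, the code below scores gene pairs by expression correlation and keeps the strongest pairs as candidate edges. The thesis goes well beyond this (prior-network integration, the CoRe correction, consensus methods), none of which is reproduced here; the data and gene names are synthetic.

```python
# Sketch: the simplest form of correlation-based network inference discussed
# above: score gene pairs by expression correlation and keep the strongest
# pairs as candidate edges. Real pipelines add prior networks, consensus of
# methods, and FDR control (the thesis' CoRe procedure is not reproduced).
import numpy as np

def correlation_edges(expr, gene_names, threshold=0.8):
    """expr: (genes x samples) matrix. Returns candidate undirected edges."""
    corr = np.corrcoef(expr)
    edges = []
    n = len(gene_names)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:
                edges.append((gene_names[i], gene_names[j], round(corr[i, j], 3)))
    return edges

# Synthetic expression data: gene B tracks gene A, gene C is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=50)
expr = np.vstack([a, a + 0.1 * rng.normal(size=50), rng.normal(size=50)])
print(correlation_edges(expr, ["geneA", "geneB", "geneC"]))
```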

    Mitochondial parts, pathways, and pathogenesis

    Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2009. Cataloged from PDF version of thesis. Includes bibliographical references. In title on title page, the word "Mitochondrial" is spelled "Mitochondial." Mitochondria are cellular compartments that perform essential roles in energy metabolism, ion homeostasis, and apoptosis. Mitochondrial dysfunction causes disease in 1 in 5,000 live births and also has been associated with aging, neurodegeneration, cancer, and diabetes. To systematically explore the function of mitochondria in health and in disease, it is necessary to identify all of the proteins resident in this organelle and to understand how they integrate into pathways. However, traditional molecular and biochemistry methods have identified only half of the estimated 1200 mitochondrial proteins, including the 13 encoded by the tiny mitochondrial genome. Now, newly available genomic technologies make it possible to identify the remainder and explore their roles in cellular pathways and disease. Toward this goal, we performed mass spectrometry, GFP tagging, and machine learning on multiple genomic datasets to create a mitochondrial compendium of 1098 genes and their protein expression across 14 mouse tissues. We linked poorly characterized proteins in this inventory to known mitochondrial pathways by virtue of shared evolutionary history. We additionally used our matched mRNA and protein measurements to demonstrate a widespread role of upstream open reading frames (uORFs) in blunting translation of mitochondrial and other cellular proteins. Next we used the mitochondrial protein inventory to identify genes underlying inherited diseases of mitochondrial dysfunction. In collaboration with clinicians, we identified causal mutations in five genes underlying diseases including hepatocerebral mtDNA depletion syndrome, autosomal dominant mitochondrial myopathy, and several forms of inherited complex I deficiency. These discoveries have enabled the development of diagnostic tests now widely available. More broadly, the mitochondrial compendium provides a foundation for systematically exploring the organelle's contribution to both basic cellular biology and human disease. By Sarah E. Calvo, Ph.D.

    Efficient Learning and Evaluation of Complex Concepts in Inductive Logic Programming

    Inductive Logic Programming (ILP) is a subfield of Machine Learning with foundations in logic programming. In ILP, logic programming, a subset of first-order logic, is used as a uniform representation language for the problem specification and induced theories. ILP has been successfully applied to many real-world problems, especially in the biological domain (e.g. drug design, protein structure prediction), where relational information is of particular importance. The expressiveness of logic programs grants flexibility in specifying the learning task and understandability to the induced theories. However, this flexibility comes at a high computational cost, constraining the applicability of ILP systems. Constructing and evaluating complex concepts remain two of the main issues that prevent ILP systems from tackling many learning problems. These learning problems are interesting both from a research perspective, as they raise the standards for ILP systems, and from an application perspective, where these target concepts naturally occur in many real-world applications. Such complex concepts cannot be constructed or evaluated by parallelizing existing top-down ILP systems or improving the underlying Prolog engine. Novel search strategies and cover algorithms are needed. The main focus of this thesis is on how to efficiently construct and evaluate complex hypotheses in an ILP setting. In order to construct such hypotheses we investigate two approaches. The first, the Top Directed Hypothesis Derivation framework, implemented in the ILP system TopLog, involves the use of a top theory to constrain the hypothesis space. In the second approach we revisit the bottom-up search strategy of Golem, lifting its restriction to determinate clauses, which had rendered Golem inapplicable to many key areas. These developments led to the bottom-up ILP system ProGolem. A challenge that arises with a bottom-up approach is the coverage computation of long, non-determinate clauses, for which Prolog's SLD-resolution is no longer adequate. We developed a new, Prolog-based, theta-subsumption engine which is significantly more efficient than SLD-resolution in computing the coverage of such complex clauses. We provide evidence that ProGolem achieves the goal of learning complex concepts by presenting a protein-hexose binding prediction application. The theory ProGolem induced has statistically significantly better predictive accuracy than that of other learners. More importantly, the biological insights ProGolem's theory provided were judged by domain experts to be relevant and, in some cases, novel.
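
    The coverage problem described above reduces to theta-subsumption: a hypothesis clause covers an example if some substitution maps the clause's variables so that all of its literals appear in the example's ground clause. The brute-force sketch below, written in Python rather than Prolog, only illustrates that definition under simplifying assumptions (function-free literals, variables written as uppercase strings); it is nothing like the optimised engine the thesis describes.

```python
# Sketch: a naive theta-subsumption test. Clause C theta-subsumes clause D
# if some substitution theta maps C's variables so that every literal of
# C*theta appears in D. The thesis' engine is far more efficient; this
# brute-force version only illustrates the definition. Variables are
# uppercase strings, constants lowercase; literals are (predicate, args...).

def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def subsumes(c, d, theta=None):
    """Return True if clause c (list of literals) theta-subsumes clause d."""
    theta = theta or {}
    if not c:
        return True
    first, rest = c[0], c[1:]
    for lit in d:
        if lit[0] != first[0] or len(lit) != len(first):
            continue
        new_theta, ok = dict(theta), True
        for t_c, t_d in zip(first[1:], lit[1:]):
            if is_var(t_c):
                if new_theta.setdefault(t_c, t_d) != t_d:
                    ok = False
                    break
            elif t_c != t_d:
                ok = False
                break
        if ok and subsumes(rest, d, new_theta):
            return True
    return False

# p(X,Y), q(Y) subsumes p(a,b), q(b), r(c) with theta = {X: a, Y: b}.
C = [("p", "X", "Y"), ("q", "Y")]
D = [("p", "a", "b"), ("q", "b"), ("r", "c")]
print(subsumes(C, D))   # True
```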