
    Improving consensus contact prediction via server correlation reduction

    Background: Protein inter-residue contacts play a crucial role in the determination and prediction of protein structures. Previous studies on contact prediction indicate that although template-based consensus methods outperform sequence-based methods on targets with typical templates, such consensus methods perform poorly on new fold targets. However, we find that even for new fold targets, the models generated by threading programs can contain many true contacts. The challenge is how to identify them.
    Results: In this paper, we develop an integer linear programming model for consensus contact prediction. In contrast to the simple majority voting method, which assumes that all individual servers are equally important and independent, the newly developed method evaluates their correlation by maximum likelihood estimation and extracts independent latent servers from them by principal component analysis. An integer linear programming method is then applied to assign a weight to each latent server so as to maximize the difference between true contacts and false ones. The proposed method is tested on the CASP7 data set. If the top L/5 predicted contacts are evaluated, where L is the protein size, the average accuracy is 73%, which is much higher than that of any previously reported study. Moreover, if only the 15 new fold CASP7 targets are considered, our method achieves an average accuracy of 37%, which is much better than that of the majority voting method, SVM-LOMETS, SVM-SEQ, and SAM-T06, whose average accuracies are 13.0%, 10.8%, 25.8% and 21.2%, respectively.
    Conclusion: Reducing server correlation and optimally combining independent latent servers yield a significant improvement over traditional consensus methods. This approach can hopefully provide a powerful tool for protein structure refinement and prediction.
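
    The following is a rough, hypothetical sketch of the de-correlation idea described above: a server-by-contact score matrix is decomposed with PCA (via SVD) to obtain independent latent servers, which are then combined into a single contact score. The toy data, the choice of three latent servers, and the variance-based weights are illustrative placeholders; the paper's maximum likelihood correlation estimation and ILP weight optimization are not reproduced here.

```python
# Hypothetical sketch: de-correlating server contact predictions with PCA
# and scoring contacts with a weighted combination of latent servers.
import numpy as np

rng = np.random.default_rng(0)

# P[s, c] = confidence that server s assigns to candidate contact c
# (toy data standing in for real threading-server outputs).
n_servers, n_contacts = 8, 200
P = rng.random((n_servers, n_contacts))

# Centre each server's scores and extract independent latent servers
# (principal components) from the server-by-contact matrix.
P_centered = P - P.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(P_centered, full_matrices=False)
latent = Vt[:3]                      # top 3 latent servers, shape (3, n_contacts)

# The paper optimises the latent-server weights with an ILP; here we simply
# weight by singular value (explained variance) as a placeholder.
weights = S[:3] / S[:3].sum()
scores = weights @ latent            # combined score per candidate contact

# Report the top L/5 contacts for a protein of (assumed) size L = 100.
L = 100
top = np.argsort(scores)[::-1][: L // 5]
print("top L/5 contact indices:", top[:10])
```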

    Benchmarking Activation Functions of Deep Neural Networks for Protein Secondary Structure Prediction Using the Stuttgart Neural Network Simulator

    Proteins, the macromolecules consisting of amino acid sequences resulting from RNA translation, are the basis for the majority of biological processes that occur in any life form. These amino acid chains, as they extend in three-dimensional space, tend to adopt characteristic conformations that are directly linked to their function. This is supported by the functional similarity between proteins with similar conformations. Techniques able to accurately predict the structure of proteins from their amino acid sequences could provide medicine and many other biological and biotechnological fields with an invaluable tool. This has led to the development of new methodologies, and to the adoption of technological trends already proven in other fields, in order to achieve reliable prediction of structurally uncharacterized proteins. A scientific approach with the potential to succeed in this endeavor is the use of artificial neural networks (ANNs), algorithms that adapt their structure for prediction by learning to recognize patterns in a training set of already known examples in a supervised scenario. These brain-inspired architectures produce solid results in many scientific areas, such as computer vision, and are being tested for possible fit in a plethora of industries. In the current study, the Stuttgart Neural Network Simulator (SNNS), a software simulator for neural networks, was used to create, train, and test for accuracy neural networks of shallow and deep architectures consisting of one, two, and three hidden layers with many combinations of numbers of nodes per layer. The goal was to compare the performance of identical architectures under different activation functions. To this end, SNNS was extended to support the missing activation functions. The tested activation functions were the logistic, the rectified linear unit (ReLU), and the leaky rectified linear unit (LReLU). The best overall networks were those using the ReLU activation function with one hidden layer and a high number of parameters, which confirmed our initial hypothesis that ReLU would outperform the other activation functions. The best-performing network consisted of one hidden layer with 140 nodes using the ReLU activation function, achieving an accuracy of 67.85% on a novel test set.
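
    For reference, a minimal numpy sketch of the three activation functions compared in the thesis is given below. The actual work extended SNNS itself; the leak slope of 0.01 for LReLU is a common default, not a value taken from the thesis.

```python
# Minimal definitions of the three benchmarked activation functions.
import numpy as np

def logistic(x):
    # Standard logistic (sigmoid) function.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: small negative slope alpha instead of a hard zero.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(logistic(x))
print(relu(x))
print(leaky_relu(x))
```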

    Learning-Based Modeling of Weather and Climate Events Related To El Niño Phenomenon via Differentiable Programming and Empirical Decompositions

    This dissertation is the culmination of applying adaptive, empirical learning-based methods to the study and characterization of the El Niño Southern Oscillation (ENSO). Specifically, it focuses on ENSO's effects on rainfall and drought conditions in two major regions shown to be linked through the strength of their climate's dependence on ENSO: 1) the southern Pacific Coast of the United States and 2) the Nile River Basin. In both regions, drought and rainfall are tied to deep economic and social factors. The principal aim of this dissertation is to establish, with scientific rigor, an epistemological and foundational justification of adaptive learning models and their utility in both modeling and understanding a wide-reaching climate phenomenon such as ENSO. It explores a scientific justification for their proven accuracy in prediction and their utility as an aid in deriving a deeper understanding of climate phenomena. In the application of drought forecasting for Southern California, adaptive learning methods were able to forecast the drought severity of the 2015-2016 winter with greater accuracy than established models. Expanding this analysis yields novel ways to analyze and understand the underlying processes driving California drought. The pursuit of adaptive learning as a guiding tool also led to the discovery of significant extractable components of ENSO strength variation, which are used in the analysis of Nile River Basin precipitation and the flow of the Nile River, and in the prediction of Nile River yield to p=0.038. The dissertation explores the duality of modeling and understanding, and discusses why adaptive learning methods are uniquely suited to the study of climate phenomena like ENSO in ways that traditional methods are not. The main methods explored are 1) differentiable programming, as a means of constructing novel self-learning models in which the meaningfulness of parameters arises as an emergent phenomenon, and 2) empirical decompositions, which are driven by an adaptive rather than rigid component-extraction principle and are explored both as a predictive tool and as a tool for gaining insight and constructing models.
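
    A toy sketch of gradient-based model fitting in the spirit of differentiable programming follows: a linear model relating a synthetic ENSO index to a synthetic rainfall anomaly, trained by gradient descent with hand-derived gradients. The data, parameters, and model form are purely illustrative and are not taken from the dissertation.

```python
# Toy differentiable-programming-style fit: learn a linear relation between
# a synthetic ENSO index and a synthetic rainfall anomaly by gradient descent.
import numpy as np

rng = np.random.default_rng(1)
enso = rng.normal(size=200)                            # synthetic ENSO index
rain = 0.7 * enso + 0.2 + 0.1 * rng.normal(size=200)   # synthetic rainfall anomaly

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * enso + b
    err = pred - rain
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(err * enso)
    grad_b = 2.0 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"fitted slope {w:.3f}, intercept {b:.3f}")
```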

    Towards Automating Protein Structure Determination from NMR Data

    Nuclear magnetic resonance (NMR) spectroscopy is becoming increasingly important because of its capability to study protein structures in solution. However, NMR protein structure determination has remained a laborious and costly process, even with the help of currently available computer programs. After the NMR spectra are collected, the main roadblocks to fully automated NMR protein structure determination are peak picking from noisy spectra, resonance assignment from imperfect peak lists, and structure calculation from incomplete assignments and ambiguous nuclear Overhauser enhancement (NOE) constraints. The goal of this dissertation is to propose error-tolerant and highly efficient methods that work well on real, noisy data sets for NMR protein structure determination and the closely related protein structure prediction problems. One major contribution of this dissertation is a fully automated NMR protein structure determination system, AMR, with emphasis on the parts that I contributed. AMR requires only an input set of six NMR spectra. We develop a novel peak picking method, PICKY, to solve the crucial but tricky peak picking problem. PICKY consists of a noise level estimation step, a component forming step, a singular value decomposition-based initial peak picking step, and a peak refinement step. The first systematic study of the peak picking problem is conducted to test the performance of PICKY. An integer linear programming (ILP)-based resonance assignment method, IPASS, is then developed to handle the imperfect peak lists generated by PICKY. IPASS contains an error-tolerant spin system forming method and an ILP-based assignment method. The assignment generated by IPASS is fed into the structure calculation step, FALCON-NMR. FALCON-NMR has a threading module, an ab initio module, an all-atom refinement module, and an NOE constraints-based decoy selection module. The entire system, AMR, is successfully tested on four out of five real proteins with practical NMR spectra, generating structures within 1.25 Å, 1.49 Å, 0.67 Å, and 0.88 Å of the native reference structures, respectively. Another contribution of this dissertation is to propose novel ideas and methods for three protein structure prediction problems closely related to NMR protein structure determination. We develop a novel consensus contact prediction method, able to eliminate server correlations, for the protein inter-residue contact prediction problem. We also propose an ultra-fast side chain packing method, which uses only local backbone information, for the protein side chain packing problem. Finally, two complementary local quality assessment methods are proposed to solve the local quality prediction problem for comparative modeling-based protein structure prediction methods.
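
    The sketch below illustrates, under simplified assumptions, the idea behind SVD-based initial peak picking: compute a low-rank approximation of a synthetic 2D spectrum to suppress noise, then keep points above a crude threshold. PICKY's noise level estimation, component forming, and peak refinement steps are not reproduced; the synthetic spectrum, rank, and threshold are invented for illustration.

```python
# Hedged sketch of low-rank (SVD) denoising followed by naive peak picking
# on a synthetic 2D spectrum.
import numpy as np

rng = np.random.default_rng(2)
spectrum = rng.normal(scale=0.2, size=(64, 64))    # synthetic noise floor
for (i, j) in [(10, 40), (30, 12), (50, 50)]:      # three synthetic peaks
    spectrum[i, j] += 5.0

U, S, Vt = np.linalg.svd(spectrum)
rank = 3
denoised = (U[:, :rank] * S[:rank]) @ Vt[:rank]    # rank-3 approximation

# Crude threshold; PICKY estimates the noise level explicitly instead.
threshold = 0.5 * denoised.max()
peaks = np.argwhere(denoised > threshold)
print("picked peak coordinates:", peaks)
```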

    Misinformation Detection in Social Media

    The pervasive use of social media gives it a crucial role in helping the public perceive reliable information. Meanwhile, the openness and timeliness of social networking sites also allow for the rapid creation and dissemination of misinformation. It becomes increasingly difficult for online users to find accurate and trustworthy information. As witnessed in recent incidents, misinformation escalates quickly, can impact social media users with undesirable consequences, and can wreak havoc instantaneously. In contrast to existing research on misinformation in psychology and the social sciences, social media platforms pose unprecedented challenges for misinformation detection. First, intentional spreaders of misinformation will actively disguise themselves. Second, the content of misinformation may be manipulated to avoid detection, while abundant contextual information may play a vital role in detecting it. Third, not only accuracy but also the earliness of a detection method is important in keeping misinformation from going viral. Fourth, social media platforms have been used as a fundamental data source for various disciplines, and this research may have been conducted in the presence of misinformation. To tackle these challenges, we focus on developing machine learning algorithms that are robust to adversarial manipulation and data scarcity. The main objective of this dissertation is to provide a systematic study of misinformation detection in social media. To tackle the challenge of adversarial attacks, I propose adaptive detection algorithms to deal with the active manipulations of misinformation spreaders via content and networks. To facilitate content-based approaches, I analyze the contextual data of misinformation and propose to incorporate the specific contextual patterns of misinformation into a principled detection framework. Considering its rapidly growing nature, I study how misinformation can be detected at an early stage. In particular, I focus on the challenge of data scarcity and propose a novel framework that enables historical data to be utilized for emerging incidents that are seemingly irrelevant. With misinformation going viral, applications that rely on social media data face the challenge of corrupted data. To this end, I present robust statistical relational learning and personalization algorithms to minimize the negative effects of misinformation.

    Performance Problem Diagnostics by Systematic Experimentation

    Diagnosing performance problems requires deep expertise in performance engineering and entails high manual effort. As a consequence, performance evaluations are postponed to the last minute of the development process. In this thesis, we introduce an automatic, experiment-based approach for performance problem diagnostics in enterprise software systems. With this approach, performance engineers can concentrate on their core competences instead of conducting repetitive tasks.

    Performance Problem Diagnostics by Systematic Experimentation

    In this book, we introduce an automatic, experiment-based approach for performance problem diagnostics in enterprise software systems. The proposed approach systematically searches for the root causes of detected performance problems by executing series of systematic performance tests. The approach is evaluated in various case studies showing that it is applicable to a wide range of contexts.
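
    As a purely illustrative sketch (not the book's implementation), an experiment-driven root-cause search can be pictured as a descent through a hierarchy of candidate problems, where each hypothesis is confirmed or ruled out by a targeted test before its refinements are explored. The hierarchy and the detect() stub below are hypothetical.

```python
# Hypothetical sketch of a systematic, experiment-driven root-cause search.

def detect(hypothesis: str) -> bool:
    """Stub for running a targeted performance test for one hypothesis."""
    measurements = {"high response time": True,
                    "database bottleneck": True,
                    "missing index": True,
                    "excessive GC": False}
    return measurements.get(hypothesis, False)

# Candidate performance problems, from symptom down to concrete root causes.
hierarchy = {
    "high response time": ["database bottleneck", "excessive GC"],
    "database bottleneck": ["missing index"],
    "excessive GC": [],
    "missing index": [],
}

def diagnose(hypothesis: str, depth: int = 0) -> None:
    confirmed = detect(hypothesis)
    print("  " * depth + f"{hypothesis}: {'confirmed' if confirmed else 'ruled out'}")
    if confirmed:
        # Only refine hypotheses that the experiment confirmed.
        for child in hierarchy[hypothesis]:
            diagnose(child, depth + 1)

diagnose("high response time")
```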

    Evaluation of Text Document Clustering Using K-Means

    The fundamentals of human communication are language and written texts. Social media is an essential source of data on the Internet, but email and text messages are also considered to be among the main sources of textual data. The processing and analysis of text data is conducted using text mining methods. Text mining is the extension of data mining to text files, with the goal of extracting relevant information from large amounts of text data and recognizing patterns. Cluster analysis is one of the most important text mining methods. Its goal is the automatic partitioning of a number of objects into a finite set of homogeneous groups (clusters). The objects within a group should be as similar as possible, whereas objects from different groups should have different characteristics. The starting point of cluster analysis is a precise definition of the task and the selection of representative data objects. A challenge regarding text documents is their unstructured form, which requires extensive pre-processing. Natural Language Processing (NLP) is used for the automated processing of natural language. The conversion of text files into numerical form can be performed using the Bag-of-Words (BoW) approach or neural networks. Each data object can finally be represented as a point in a finite-dimensional space, where the dimension corresponds to the number of unique tokens, here words. Prior to the actual cluster analysis, a measure must also be defined to determine the similarity or dissimilarity between objects. To measure dissimilarity, metrics such as the Euclidean distance are used. Then clustering methods are applied. Clustering methods can be divided into different categories. On the one hand, there are methods that form a hierarchical system, also called hierarchical clustering methods. On the other hand, there are techniques that produce a division into groups by determining a grouping on the basis of an optimal homogeneity measure, whereby the number of groups is predetermined. The procedures of this class are called partitioning methods. An important representative is the k-Means method, which is used in this thesis. The results are finally evaluated and interpreted. In this thesis, the different methods used in the individual cluster analysis steps are introduced. In order to make a statement about which method seems most suitable for clustering documents, a practical investigation was carried out on the basis of three different data sets.
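
    A minimal scikit-learn sketch of the pipeline described above follows: bag-of-words vectorisation followed by k-Means clustering with a predetermined number of groups. The toy corpus and the choice of two clusters are illustrative and are not taken from the thesis, which evaluates three different data sets.

```python
# Bag-of-Words vectorisation + k-Means clustering of a toy document set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

documents = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors worry about rising interest rates",
]

# Each document becomes a vector of token counts (one dimension per unique word).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Partition the documents into a predetermined number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```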

    Bioinformatics for personal genomics: development and application of bioinformatic procedures for the analysis of genomic data

    In the last decade, the huge decrease in sequencing costs brought about by the development of high-throughput technologies has completely changed the way genetic problems are approached. In particular, whole exome and whole genome sequencing are contributing to the extraordinary progress in the study of human variants, opening up new perspectives in personalized medicine. Being a relatively new and fast-developing field, it requires appropriate tools and specialized knowledge for efficient data production and analysis. In line with the times, in 2014 the University of Padua funded the BioInfoGen Strategic Project with the goal of developing technology and expertise in bioinformatics and molecular biology applied to personal genomics. The aim of my PhD was to contribute to this challenge by implementing a series of innovative tools and by applying them to investigate and possibly solve the case studies included in the project. I first developed an automated pipeline for dealing with Illumina data, able to sequentially perform each step necessary to pass from raw reads to somatic or germline variant detection. The system's performance has been tested by means of internal controls and by its application to a cohort of patients affected by gastric cancer, obtaining interesting results. Once variants are called, they have to be annotated in order to define their properties, such as their position at the transcript and protein level, their impact on the protein sequence, their pathogenicity, and more. As most of the publicly available annotators are affected by systematic errors causing low consistency in the final annotation, I implemented VarPred, a new tool for variant annotation, which guarantees the best accuracy (>99%) compared to state-of-the-art programs while also showing good processing times. To make VarPred easy to use, I equipped it with an intuitive web interface that allows not only graphical result evaluation but also a simple filtration strategy. Furthermore, for valuable user-driven prioritization of human genetic variations, I developed QueryOR, a web platform suitable for searching among known candidate genes as well as for finding novel gene-disease associations. QueryOR combines several innovative features that make it comprehensive, flexible, and easy to use. The prioritization is achieved by a global positive selection process that promotes the emergence of the most reliable variants, rather than filtering out those not satisfying the applied criteria. QueryOR has been used to analyze the two case studies framed within the BioInfoGen project. In particular, it allowed the detection of causative variants in patients affected by lysosomal storage diseases, also highlighting the efficacy of the designed sequencing panel. On the other hand, QueryOR simplified the recognition of the LRP2 gene as a possible candidate for explaining subjects with a Dent disease-like phenotype but with no mutation in the previously identified disease-associated genes, CLCN5 and OCRL. As a final corollary, an extensive analysis of recurrent exome variants was performed, showing that their origin can be explained mainly by inaccuracies in the reference genome, including misassembled regions and uncorrected bases, rather than by platform-specific errors.
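
    The sketch below illustrates, in a purely hypothetical form, the "global positive selection" idea described for QueryOR: instead of discarding variants that fail a criterion, each variant accumulates a score for every criterion it satisfies, so the most reliable variants emerge at the top. The variant records and criteria are invented for illustration and are not QueryOR's actual data model.

```python
# Hypothetical positive-selection-style prioritization of variants:
# score by the number of satisfied criteria instead of hard filtering.
variants = [
    {"id": "var1", "impact": "missense",   "frequency": 0.0001, "in_panel": True},
    {"id": "var2", "impact": "synonymous", "frequency": 0.20,   "in_panel": False},
    {"id": "var3", "impact": "frameshift", "frequency": 0.0005, "in_panel": True},
]

criteria = [
    lambda v: v["impact"] in ("missense", "frameshift", "nonsense"),  # coding impact
    lambda v: v["frequency"] < 0.01,                                  # rare in the population
    lambda v: v["in_panel"],                                          # falls in a candidate gene
]

def score(variant):
    # Each satisfied criterion adds one point; nothing is filtered out.
    return sum(1 for criterion in criteria if criterion(variant))

for v in sorted(variants, key=score, reverse=True):
    print(v["id"], score(v))
```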