
    Discriminating between surfaces of peripheral membrane proteins and reference proteins using machine learning algorithms

    In biology, the cell membrane is an essential component of the cell and usually works as a “fence” separating the inside of the cell from the outside. Its key role is to protect the cell from interference by its surroundings by controlling which molecules may enter. However, cells must keep communicating with their surroundings to acquire nutrients and other necessary molecules in order to stay alive and grow. For this reason, membrane proteins serve as molecular carriers that participate in molecular communication and regulate biological activities. There are two kinds of membrane proteins: integral and peripheral. In this project, we focus only on the latter. Unlike integral membrane proteins, which span the whole membrane, peripheral membrane proteins attach to the surface of the membrane through various interactions. Because peripheral proteins are also soluble, it is difficult to distinguish them from other kinds of proteins (i.e. non-membrane-binding ones) from sequence or structure. In this project, we develop a method to predict from its structure whether a protein is membrane-binding or not, based on two machine learning algorithms: k-nearest neighbors (KNN) and support vector machines (SVM). We use them to train on the data and create two models, which are then used to classify new proteins and to compare their performance. By, for example, collecting different protein features, adjusting the parameters of the algorithms, or changing the size and composition of the dataset, we can improve the performance of the algorithms and predict the protein type more accurately. We also use ROC curves and AUC to summarize performance, and cross-validation to verify the results.
For the problems in this field, several challenges must be considered as well, such as feature collection, the analysis and handling of highly varied data, and the choice of machine learning algorithms based on functional requirements, data structure, efficiency, and other factors. In this project, we encounter these challenges and address them with effective methods.
Master's Thesis in Informatics INF39
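The KNN half of the pipeline described above can be illustrated with a minimal, self-contained sketch. The toy 2-D points stand in for real protein surface features, and the function name and data are illustrative, not taken from the thesis:

```python
import math
from collections import Counter

def knn_predict(train, labels, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i], point))
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

# toy 2-D data: class 0 clustered near the origin, class 1 near (5, 5)
train = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(train, labels, (0.5, 0.5)))  # -> 0
print(knn_predict(train, labels, (5.5, 5.5)))  # -> 1
```

In practice the parameter k and the distance metric would be tuned by the cross-validation step the abstract mentions.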

    A framework of face recognition with set of testing images

    We propose a novel framework to solve the face recognition problem based on sets of testing images. Our framework can handle the case in which there is no pose overlap between the training set and the query set. The main techniques used in this framework are manifold alignment, face normalization, and discriminant learning. Experiments on different databases show that our system outperforms several state-of-the-art methods.

    Interpretable Network Representations

    Networks (or, interchangeably, graphs) are ubiquitous across the globe and within science and engineering: social networks, collaboration networks, protein-protein interaction networks, and infrastructure networks, among many others. Machine learning on graphs, especially network representation learning, has shown remarkable performance in network-based applications such as node/graph classification, graph clustering, and link prediction. Beyond performance, it is equally crucial for individuals to understand the behavior of machine learning models and to be able to explain how these models arrive at a certain decision. Such needs have motivated many studies on interpretability in machine learning. For example, in social network analysis, we may need to know why certain users (or groups) are classified or clustered together by the machine learning models, or why a friend recommendation system considers some users similar enough to recommend that they connect with each other. Therefore, an interpretable network representation is necessary, and it should carry the graph information at a level understandable by humans. Here, we first introduce our method for interpretable network representations: the network shape. It provides a framework to represent a network with a 3-dimensional shape, and one can customize network shapes by choosing various graph sampling methods, 3D network embedding methods, and shape-fitting methods. In this thesis, we introduce two types of network shape: the Kronecker hull, which represents a network as a 3D convex polyhedron using stochastic Kronecker graphs as the network embedding method, and the Spectral Path, which represents a network as a 3D path connecting the spectral moments of the network and its subgraphs. We demonstrate that network shapes can capture various properties not only of the network, but also of its subgraphs.
For instance, they can provide the distribution of subgraphs within a network, e.g., what proportion of subgraphs are structurally similar to the whole network? Network shapes are interpretable on different levels, so one can quickly understand the structural properties of a network and its subgraphs from its network shape. Using experiments on real-world networks, we demonstrate that network shapes can be used in various applications, including (1) network visualization, the most intuitive way for users to understand a graph; (2) network categorization (e.g., is this a social or a biological network?); and (3) computing the similarity between two graphs. Moreover, we utilize network shapes to extend biometrics studies to network data by solving two problems: network identification (given an anonymized graph, can we identify the network from which it was collected, e.g., is this anonymized graph sampled from Twitter or Facebook?) and network authentication (if one claims the graph is sampled from a certain network, can we verify this claim?). The overall objective of the thesis is to provide a compact, interpretable, visualizable, comparable, and efficient representation of networks.
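The spectral-moment ingredient of the Spectral Path can be sketched in a few lines, assuming moments are taken as averages of powers of the adjacency-matrix eigenvalues (the exact normalization used in the thesis may differ; the function name and example are illustrative):

```python
import numpy as np

def spectral_moments(adj, k=3):
    """First k spectral moments of a graph: averages of the p-th powers
    of the adjacency-matrix eigenvalues, for p = 1..k."""
    eig = np.linalg.eigvalsh(np.asarray(adj, dtype=float))
    return [float(np.mean(eig ** p)) for p in range(1, k + 1)]

# triangle graph: adjacency eigenvalues are 2, -1, -1
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
print(spectral_moments(A))  # close to [0.0, 2.0, 2.0]
```

Computing these moments for a network and a family of its subgraphs, then connecting the resulting 3D points, yields a path-like shape of the kind the thesis describes.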

    Restricting Supervised Learning: Feature Selection and Feature Space Partition

    Many supervised learning problems are considered difficult to solve either because of redundant features or because of the structural complexity of the generative function. Redundant features increase the learning noise and therefore decrease prediction performance. Additionally, a number of problems in applications such as bioinformatics or image processing, whose data are sampled in a high-dimensional space, suffer from the curse of dimensionality: there are not enough observations to obtain good estimates. It is therefore necessary to reduce the features under consideration. Another issue in supervised learning is caused by the complexity of an unknown generative model. To obtain a low-variance predictor, linear or other simple functions are normally suggested, but they usually result in high bias. Hence, a possible solution is to partition the feature space into multiple non-overlapping regions such that each region is simple enough to be classified easily. In this dissertation, we propose several novel techniques for restricting supervised learning problems with respect to either feature selection or feature space partition. Among feature selection methods, 1-norm regularization is advocated by many researchers because it incorporates feature selection into the learning process itself. We give special focus here to ranking problems, because very little work has been done on ranking with an L1 penalty. We present a 1-norm support vector machine method that simultaneously finds a linear ranking function and performs feature subset selection in ranking problems. Additionally, because ranking is formulated as a classification task over pair-wise data, the computational complexity grows from linear to quadratic in the sample size. We also propose a convex hull reduction method to reduce this impact.
The method was tested on one artificial data set and two benchmark real data sets: the concrete compressive strength set and the Abalone data set. Theoretically, by tuning the trade-off parameter between the 1-norm penalty and the empirical error, any desired feature subset size could be achieved, but computing the whole solution path over the trade-off parameter is extremely difficult. Therefore, 1-norm regularization alone may not yield a feature subset of small size. We propose a recursive feature selection method based on 1-norm regularization which handles the multi-class setting effectively and efficiently. The selection is performed iteratively: in each iteration, a linear multi-class classifier is trained using 1-norm regularization, which leads to sparse weight vectors, i.e., many feature weights are exactly zero; those zero-weight features are eliminated in the next iteration. The selection process converges quickly. We tested our method on an earthworm microarray data set, and the empirical results demonstrate that the selected features (genes) have very competitive discriminative power. Feature space partition separates a complex learning problem into multiple non-overlapping simple sub-problems. It is normally implemented in a hierarchical fashion. Unlike in a decision tree, a leaf node of this hierarchical structure does not represent a single decision, but a region (sub-problem) that is solvable with linear or other simple functions. In our work, we incorporate domain knowledge into the feature space partition process. We consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task.
However, it is not trivial to select the discrete or categorical attribute that maximally simplifies the learning task. A naive approach exhaustively searches all possible restructured problems, which is computationally prohibitive when the number of discrete or categorical attributes is large. We describe a metric that ranks attributes by their potential to reduce the uncertainty of a classification task, quantified as the conditional entropy achieved by a set of optimal classifiers, each of which is built for a sub-problem defined by the attribute under consideration. To avoid the high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. This approach was tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem; i.e., the performance of the classifier built for the restructured problem always beats that of the original problem. Restricting supervised learning always amounts to building simple learning functions from a limited number of features. The Top Selected Pair (TSP) method builds simple classifiers based on very few (for example, two) features with simple arithmetic calculations. However, the traditional TSP method only deals with static data. In this dissertation, we propose classification methods for time series data that depend on only a few pairs of features. Based on different comparison strategies, we developed the following approaches: TSP based on average, TSP based on trend, and TSP based on trend and absolute difference amount. In addition, inspired by the idea of using two features, we propose a time series classification method based on a few feature pairs using dynamic time warping and nearest neighbors.
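The sparsity mechanism that drives the recursive 1-norm selection can be sketched with a small iterative soft-thresholding (ISTA) solver. This is generic L1-regularized least squares for illustration, not the dissertation's 1-norm SVM, and all names and data are made up:

```python
import numpy as np

def lasso_ista(X, y, lam=0.5, lr=0.01, steps=2000):
    """L1-regularized least squares via iterative soft-thresholding (ISTA).

    The L1 penalty drives irrelevant feature weights to exactly zero,
    which is the property the recursive selection scheme exploits."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)       # gradient of the squared error
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox of L1
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1]                   # only features 0 and 1 matter
w = lasso_ista(X, y)
selected = [i for i, wi in enumerate(w) if abs(wi) > 1e-6]
print(selected)                                 # indices that survive the L1 penalty
```

In the recursive scheme described above, the zero-weight features would be dropped and the model retrained on the survivors, iterating until the subset stabilizes.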

    A topological approach for protein classification

    Protein function and dynamics are closely related to protein sequence and structure. However, the prediction of protein function and dynamics from sequence and structure remains a fundamental challenge in molecular biology. Protein classification, which is typically done by measuring the similarity between proteins based on sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics. Persistent homology is a new branch of algebraic topology that has found success in topological data analysis across a variety of disciplines, including molecular biology. The present work explores the potential of using persistent homology as an independent tool for protein classification. To this end, we propose a molecular topological fingerprint based support vector machine (MTF-SVM) classifier. Specifically, we construct machine learning feature vectors solely from protein topological fingerprints, which are topological invariants generated during the filtration process. To validate the present MTF-SVM approach, we consider four types of problems. First, we study protein-drug binding using the M2 channel protein of influenza A virus, achieving 96% accuracy in discriminating drug-bound and unbound M2 channels. Additionally, we examine the use of MTF-SVM for the classification of hemoglobin molecules in their relaxed and taut forms and obtain about 80% accuracy. The identification of all-alpha, all-beta, and alpha-beta protein domains is carried out in our next study using 900 proteins, with an 85% success rate. Finally, we apply the present technique to 55 classification tasks of protein superfamilies over 1357 samples, attaining an average accuracy of 82%. The present study establishes computational topology as an independent and effective alternative for protein classification.
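The simplest ingredient of such topological fingerprints can be sketched directly: 0-dimensional persistence, i.e., the scales at which connected components of a point cloud merge under a Vietoris-Rips filtration, computed with union-find over sorted pairwise distances. This illustrates only the filtration idea; the MTF-SVM classifier uses richer invariants over actual atomic coordinates:

```python
import itertools
import math

def h0_barcode(points):
    """0-dimensional persistence of a point cloud: every component is born
    at scale 0 and dies at the distance where it merges into another
    (single-linkage / Vietoris-Rips filtration, via union-find)."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in itertools.combinations(range(len(points)), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            deaths.append(d)   # one component dies at this filtration scale
    return deaths              # one finite bar per merge; one component persists

# two tight clusters far apart: three short bars, one long-lived bar
pts = [(0, 0), (0.1, 0), (0, 0.1), (10, 10), (10.1, 10)]
print(h0_barcode(pts))
```

The long bar records that the cloud has two well-separated clusters; feature vectors for a classifier can be built from bar lengths and counts across dimensions.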

    Improving the resolution of interaction maps: A middleground between high-resolution complexes and genome-wide interactomes

    Protein-protein interactions are ubiquitous in biology and therefore central to understanding living organisms. In recent years, large-scale studies have been undertaken to describe, at least partially, protein-protein interaction maps, or interactomes, for a number of relevant organisms, including human. Although the analysis of interaction networks is proving useful, current interactomes provide a blurry and granular picture of the molecular machinery: unless the structure of the protein complex is known, the molecular details of the interaction are missing, and sometimes it is not even possible to know whether the interaction between the proteins is direct (i.e., a physical interaction) or part of a functional, not necessarily direct, association. Unfortunately, the determination of the structures of protein complexes cannot keep pace with the discovery of new protein-protein interactions, resulting in a large, and increasing, gap between the number of complexes that are thought to exist and the number for which 3D structures are available. The aim of the thesis was to tackle this problem by implementing computational approaches to derive structural models of protein complexes and thus reduce this existing gap. Over the course of the thesis, a novel modelling algorithm to predict the structure of protein complexes, V-D2OCK, was implemented. This new algorithm combines structure-based prediction of protein binding sites, by means of novel algorithms also developed during the thesis (VORFFIP and M-VORFFIP), with data-driven docking and energy minimization. The algorithm was used to improve the coverage and structural content of the human interactome, compiled from different sources of interactomic data to ensure the most comprehensive interactome.
Finally, the human interactome and the structural models were compiled in a database, V-D2OCK DB, which offers easy and user-friendly access to the human interactome, including a bespoke graphical molecular viewer to facilitate the analysis of the structural models of protein complexes. Furthermore, organisms other than human were also included, providing a useful resource for the study of all known interactomes.

    Machine Learning Approaches to Modeling the Physiochemical Properties of Small Peptides

    Peptide and protein sequences are most commonly represented as strings: a series of letters selected from the twenty-character alphabet of abbreviations for the naturally occurring amino acids. Here, we experiment with representations of small peptide sequences that incorporate more physiochemical information. Specifically, we develop three different physiochemical representations for a set of roughly 700 HIV-I protease substrates. These different representations are used as input to an array of six different machine learning models, which are used to predict whether or not a given peptide is likely to be an acceptable substrate for the protease. Our results show that, in general, higher-dimensional physiochemical representations tend to perform better than representations incorporating fewer dimensions selected on the basis of high information content. We contend that such representations are more biologically relevant than simple string-based representations and are likely to capture more accurately the peptide characteristics that are functionally important.
Singapore-MIT Alliance (SMA)
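One simple physiochemical encoding of the kind described is mapping each residue letter to its value on a standard amino-acid property scale, such as the Kyte-Doolittle hydropathy index. This is a generic illustration; the three representations actually used in the work may differ:

```python
# Kyte-Doolittle hydropathy index (one standard physiochemical scale)
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def encode(peptide):
    """Turn a peptide string into a numeric vector of per-residue hydropathy."""
    return [KD[aa] for aa in peptide]

print(encode("GILF"))  # -> [-0.4, 4.5, 3.8, 2.8]
```

Stacking several such scales (hydropathy, charge, volume, etc.) per residue yields the higher-dimensional representations the abstract reports as performing best.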