4 research outputs found

    Graph Kernels and Applications in Bioinformatics

    In recent years, machine learning has emerged as an important discipline. However, despite the popularity of machine learning techniques, data in the form of discrete structures are not fully exploited. For example, when data appear as graphs, the common choice is the transformation of such structures into feature vectors. This procedure, though convenient, does not always effectively capture topological relationships inherent to the data; therefore, the power of the learning process may be insufficient. In this context, the use of kernel functions for graphs arises as an attractive way to deal with such structured objects. On the other hand, several entities in computational biology applications, such as gene products or proteins, may be naturally represented by graphs. Hence, the demanding need for algorithms that can deal with structured data poses the question of whether the use of kernels for graphs can outperform existing methods to solve specific computational biology problems. In this dissertation, we address the challenges involved in solving two specific problems in computational biology, in which the data are represented by graphs. First, we propose a novel approach for protein function prediction by modeling proteins as graphs. For each of the vertices in a protein graph, we propose the calculation of evolutionary profiles, which are derived from multiple sequence alignments from the amino acid residues within each vertex. We then use a shortest path graph kernel in conjunction with a support vector machine to predict protein function. We evaluate our approach under two instances of protein function prediction, namely, the discrimination of proteins as enzymes, and the recognition of DNA binding proteins. In both cases, our proposed approach achieves better prediction performance than existing methods. Second, we propose two novel semantic similarity measures for proteins based on the gene ontology. 
The first measure works directly on the gene ontology by combining the pairwise semantic similarity scores between the sets of annotating terms for a pair of input proteins. The second measure estimates protein semantic similarity using a shortest path graph kernel to take advantage of the rich semantic knowledge contained within ontologies. Our comparison with other methods shows that our proposed semantic similarity measures are highly competitive and that the latter outperforms state-of-the-art methods. Furthermore, our two methods are intrinsic to the gene ontology, in the sense that they do not rely on external sources to calculate similarities.
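As a rough illustration of the shortest path graph kernel mentioned in this abstract, the sketch below compares two graphs by counting pairs of shortest paths that have the same length. It is a simplified assumption-laden version: graphs are unlabeled, undirected, and unweighted, and the function names and the `(n, edges)` graph encoding are hypothetical. The dissertation's kernel additionally uses vertex information (evolutionary profiles), which is not modeled here.

```python
from collections import Counter
from itertools import combinations

INF = float("inf")


def shortest_path_lengths(n, edges):
    """All-pairs shortest path lengths via Floyd-Warshall.

    The graph has vertices 0..n-1 and is undirected and unweighted.
    """
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def path_length_histogram(n, edges):
    """Count how many connected vertex pairs lie at each shortest path length."""
    dist = shortest_path_lengths(n, edges)
    return Counter(
        dist[i][j] for i, j in combinations(range(n), 2) if dist[i][j] < INF
    )


def shortest_path_kernel(g1, g2):
    """Number of (path in g1, path in g2) pairs with equal length.

    This is a delta kernel on path lengths only; vertex labels are ignored
    (an assumption made to keep the sketch short).
    """
    h1 = path_length_histogram(*g1)
    h2 = path_length_histogram(*g2)
    return sum(count * h2[length] for length, count in h1.items())


# Toy graphs, each given as (number of vertices, edge list).
path = (3, [(0, 1), (1, 2)])
triangle = (3, [(0, 1), (1, 2), (0, 2)])
print(shortest_path_kernel(path, triangle))
```

A Gram matrix built from such kernel values can then be passed to a support vector machine that accepts precomputed kernels, which is the general pattern the abstract describes.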

    Empirische Risiko-Minimierung für dynamische Datenstrukturen (Empirical Risk Minimization for Dynamic Data Structures)

    We consider the task of finding a functional relationship in a set of data. Given an appropriate set of functions to choose from, this leads to the minimization of an expected loss, i.e. a risk, with respect to a suitable measure. When the underlying probability distribution is unknown, the empirical risk is a natural estimator to use in the minimization problem. This empirical risk minimization principle (ERM principle) has good consistency properties for independent and identically distributed observations under only weak assumptions. Together with structural risk minimization, which builds on it, this theory is the basis for various methods of statistical learning theory, such as Support Vector Machines (SVM). The limiting assumptions of the ERM principle do not permit applying the SVM to data with dependence structures. However, the analysis of dynamic, usually temporal, structures is becoming increasingly important in modern data analysis, so applying empirical risk minimization to such data is desirable. Extending the principle to time-dependent data requires accounting for the dynamic structure in the data. The dynamics are not represented directly; instead, the errors in the data are modeled as a dynamic stochastic process. Consistency of the ERM principle under these more general assumptions is proven using consistency theorems for martingales and, above all, mixingales, so that many different dependence structures in the errors are possible. In addition, an exponential convergence rate is crucial for applying the ERM principle and for developing appropriate algorithms. Suitable exponential bounds are proven for both martingale and mixingale structures, which guarantee fast convergence. Thus, empirical risk minimization constitutes a general principle that also holds for mixingale and martingale structures in the data, and the conceptual theoretical part of statistical learning theory in the sense of Vapnik can be used for dynamic data structures as well as for independent data.
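To make the ERM principle in this abstract concrete, here is a minimal sketch for the classical i.i.d. setting: the empirical risk is the average loss over the sample, and ERM picks the hypothesis with the smallest empirical risk. The threshold classifiers, the 0-1 loss, and all function names are illustrative assumptions, and the sketch does not model the martingale or mixingale error structures the thesis studies.

```python
def zero_one_loss(prediction, label):
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return 0 if prediction == label else 1


def threshold_classifier(t):
    """Hypothesis h_t(x) = 1 if x >= t, else 0 (an illustrative function class)."""
    return lambda x: 1 if x >= t else 0


def empirical_risk(h, sample, loss=zero_one_loss):
    """Average loss of hypothesis h over the observed (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)


def erm_threshold(thresholds, sample, loss=zero_one_loss):
    """Return the threshold whose classifier minimizes the empirical risk."""
    return min(
        thresholds,
        key=lambda t: empirical_risk(threshold_classifier(t), sample, loss),
    )


# Toy sample of (x, y) pairs and a small finite set of candidate thresholds.
sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
candidates = [0.0, 0.25, 0.5, 0.75]
print(erm_threshold(candidates, sample))
```

Consistency results of the kind described above concern whether the risk of the hypothesis chosen this way approaches the best achievable risk in the function class as the sample grows.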

    Statistical Learning Theory, Capacity and Complexity

    We give an exposition of the ideas of statistical learning theory, followed by a discussion of how a reinterpretation of the insights of learning theory could potentially also benefit our understanding of a certain notion of complexity.