38 research outputs found

    Combinatorial algorithm for counting small induced graphs and orbits

    Graphlet analysis is an approach to network analysis that is particularly popular in bioinformatics. We show how to set up a system of linear equations that relate the orbit counts and can be used in an algorithm that is significantly faster than the existing approaches based on direct enumeration of graphlets. The algorithm requires the existence of a vertex with certain properties; we show that such a vertex exists for graphlets of arbitrary size, except for complete graphs and C_4, which are treated separately. Empirical analysis of running time agrees with the theoretical results.

    Computation of Graphlet Orbits for Nodes and Edges in Sparse Graphs

    Graphlet analysis is a useful tool for describing local network topology around individual nodes or edges. A node or an edge can be described by a vector containing the counts of different kinds of graphlets (small induced subgraphs) in which it appears, or the "roles" (orbits) it has within these graphlets. We implemented an R package with functions for fast computation of such counts on sparse graphs. Instead of enumerating all induced graphlets, our algorithm is based on derived relations between the counts, which decreases the time complexity by an order of magnitude compared with past approaches.
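
    To make the idea of an edge's count vector concrete, here is a minimal Python sketch restricted to 3-node graphlets. It is not the R package's code or API: the function name `edge_orbit_counts` and the edge-list input are assumptions made for illustration, and the graph is assumed to be simple (no self-loops or repeated edges). For each edge it derives the number of induced 3-node paths from the endpoint degrees and the number of triangles, rather than enumerating the paths.

```python
def edge_orbit_counts(edges):
    """Map each edge to (induced 3-node paths, triangles) that contain it.

    Hypothetical helper for illustration; assumes a simple undirected graph.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    counts = {}
    for u, v in edges:
        # A common neighbour of u and v closes a triangle on this edge.
        triangles = len(adj[u] & adj[v])
        # Every other neighbour of u or of v extends the edge to a
        # 3-node subgraph; common neighbours appear once on each side
        # and form triangles, so they are subtracted twice.
        paths = (len(adj[u]) - 1) + (len(adj[v]) - 1) - 2 * triangles
        counts[(u, v)] = (paths, triangles)
    return counts
```

    On a triangle 0-1-2 with an extra node 3 attached to node 2, the edge (2, 3) lies on two induced paths and no triangle, while the edge (0, 1) lies on one triangle and no induced path.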

    Counting small patterns in networks

    Networks are an often employed tool that can help us visualize and analyze binary relationships by representing the entities as a set of nodes and the relations between them as edges in the network. One type of relations in the field of bioinformatics that is often modeled by networks are interactions between pairs of proteins. Recent studies have focused on analyzing the local structure of such networks by observing small connected patterns consisting of 4 or 5 nodes, which are also known as graphlets. The nodes of graphlets are further divided into orbits by their "roles" or symmetries. The number of times a node from the network participates in each orbit forms a signature of the node's local network topology. Working under the assumption that the node's local topology is correlated with its function in the network, researchers have successfully used graphlets to predict new protein functions. The bottleneck of graphlet-based approaches is usually in the time required to count them. This restriction is becoming even more pronounced with a growing amount of available data. This dissertation focuses on improving existing graphlet counting techniques that are based on simple exhaustive enumeration. We present the algorithm Orca that counts graphlets and their orbits instead of enumerating them. It exploits relations between orbit counts to construct a system of equations that can be set up efficiently. Orca achieves this by enumerating (k-1)-node graphlets to count k-node graphlets, effectively obtaining a speed-up by a factor proportional to the maximum degree of a node in the network. In practical terms, it counts graphlets in larger protein-protein interaction networks about 50-100 times faster. Orca was designed for counting graphlets with 4 and 5 nodes. However, we adapt the approach to counting edge-orbits in addition to the original node-orbits with the same gains in run time. 
We also show that this approach can be generalized to graphlets of arbitrary size by identifying the necessary conditions and proving that these conditions can be fulfilled even for larger graphlets. Finally, we consider the problem of generating random graphs with prescribed graphlet distributions. This motivated the adaptation of Orca for dynamic or changing networks, where edges can be added or removed. These changes can be a consequence of the procedure for generating a random graph or can be inherent in the network and the process it models. The generated graphs closely match the desired graphlet counts and as a consequence approximate other structural measures as well. The developed algorithm is a valuable tool for graphlet-based network analysis and a significant stepping stone towards analyzing larger and denser networks. As the fastest graphlet counting method it also presents a basis for further development of efficient pattern counting methods in graphs. This doctoral dissertation is based on three published papers that together with a chapter containing some unpublished work form the core of the dissertation.
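
    The principle of counting through relations instead of enumeration can be illustrated on the smallest case, 3-node graphlets. The Python sketch below is not Orca itself (which targets 4- and 5-node graphlets); the function name `node_orbit_counts` and the edge-list input are assumptions, and the graph is assumed to be simple. Only triangles are counted directly. The two path orbits then follow from two counting identities: every pair of neighbours of v forms either a path centred at v or a triangle, and every two-step walk from v ends either in a path with v at its end or back inside a triangle.

```python
def node_orbit_counts(edges):
    """Map each node to (degree, path-end, path-centre, triangle) counts.

    Hypothetical helper for illustration; assumes a simple undirected graph.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    result = {}
    for v, nbrs in adj.items():
        deg = len(nbrs)
        # Directly counted orbit: each triangle at v is seen once from
        # each of its two other nodes, hence the division by two.
        tri = sum(len(nbrs & adj[u]) for u in nbrs) // 2
        # Relation 1: pairs of neighbours = centred paths + triangles.
        centre = deg * (deg - 1) // 2 - tri
        # Relation 2: two-step walks v-u-w (w != v) = end paths
        # + 2 * triangles, since each triangle yields two such walks.
        end = sum(len(adj[u]) - 1 for u in nbrs) - 2 * tri
        result[v] = (deg, end, centre, tri)
    return result
```

    For a triangle 0-1-2 with an extra node 3 attached to node 2, node 2 gets (3, 0, 2, 1) and node 3 gets (1, 2, 0, 0), without any 3-node path being listed explicitly.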

    Monitoring and analysis of spontaneous afforestation of Karst landscape in GIS environment

    This article presents a study of spontaneous afforestation of the Karst landscape. Remote sensing techniques, multitemporal analyses of satellite images in the GIS environment and statistical regression models were used in the research. Since 1935, forest cover in this area has increased from 50.4% to 67.9%. About 71% of the variability was explained with a regression model. Spontaneous afforestation is strongly influenced by the following factors: altitude, distance to the forest edge, share of afforested area in previous time periods, share of agricultural land, and two variables describing the intensity of agricultural use. Although various demographic, socio-economic and agro-structural factors were analysed in this study, their influence on the process of spontaneous afforestation could not be established. If the processes of spontaneous afforestation do not change significantly, forest cover can be expected to increase to 72.5% by the year 2020.

    Conformal Prediction with Orange

    Conformal predictors estimate the reliability of outcomes made by supervised machine learning models. Instead of a point value, conformal prediction defines an outcome region that meets a user-specified reliability threshold. Provided that the data are independently and identically distributed, the user can control the level of prediction errors and adjust it to the requirements of a given application. The quality of conformal predictions often depends on the choice of nonconformity estimate for a given machine learning method. To promote the selection of a successful approach, we have developed Orange3-Conformal, a Python library that provides a range of conformal prediction methods for classification and regression. The library also implements several nonconformity scores. It has a modular design and can be extended to add new conformal prediction methods and nonconformity scores.
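
    As a rough illustration of how an outcome region is formed, the following plain-Python sketch implements one common variant, inductive (split) conformal classification. It does not use or reproduce the Orange3-Conformal API: the function name `conformal_region`, its arguments, and the choice of nonconformity score (one minus the model's probability for a label) are all assumptions. The calibration scores are taken to be the nonconformity scores of held-out examples with their true labels.

```python
def conformal_region(calibration_scores, class_probs, eps):
    """Return the labels whose conformal p-value exceeds eps.

    calibration_scores: nonconformity scores of held-out examples
    class_probs: mapping label -> model probability for the test example
    eps: tolerated error rate (significance level)
    Hypothetical helper for illustration only.
    """
    n = len(calibration_scores)
    region = []
    for label, prob in class_probs.items():
        # Assumed nonconformity score: low probability = strange label.
        score = 1.0 - prob
        # p-value: share of examples (calibration set plus the test
        # example itself) that are at least as nonconforming.
        at_least_as_strange = sum(s >= score for s in calibration_scores)
        p_value = (at_least_as_strange + 1) / (n + 1)
        if p_value > eps:
            region.append(label)
    return region
```

    Lowering `eps` asks for higher reliability and therefore tends to enlarge the region; with a loose threshold the region may shrink to a single label.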

    Counting small patterns in networks (Štetje majhnih vzorcev v omrežjih)

    Networks are an often employed tool that can help us visualize and analyze binary relationships by representing the entities as a set of nodes and the relations between them as edges in the network. One type of relations in the field of bioinformatics that is often modeled by networks are interactions between pairs of proteins. Recent studies have focused on analyzing the local structure of such networks by observing small connected patterns consisting of 4 or 5 nodes, which are also known as graphlets. The nodes of graphlets are further divided into orbits by their "roles" or symmetries. The number of times a node from the network participates in each orbit forms a signature of the node's local network topology. Working under the assumption that the node's local topology is correlated with its function in the network, researchers have successfully used graphlets to predict new protein functions. The bottleneck of graphlet-based approaches is usually in the time required to count them.
This restriction is becoming even more pronounced with a growing amount of available data. This dissertation focuses on improving existing graphlet counting techniques that are based on simple exhaustive enumeration. We present the algorithm Orca that counts graphlets and their orbits instead of enumerating them. It exploits relations between orbit counts to construct a system of equations that can be set up efficiently. Orca achieves this by enumerating (k-1)-node graphlets to count k-node graphlets, effectively obtaining a speed-up by a factor proportional to the maximum degree of a node in the network. In practical terms, it counts graphlets in larger protein-protein interaction networks about 50-100 times faster. Orca was designed for counting graphlets with 4 and 5 nodes. However, we adapt the approach to counting edge-orbits in addition to the original node-orbits with the same gains in run time. We also show that this approach can be generalized to graphlets of arbitrary size by identifying the necessary conditions and proving that these conditions can be fulfilled even for larger graphlets. Finally, we consider the problem of generating random graphs with prescribed graphlet distributions. This motivated the adaptation of Orca for dynamic or changing networks, where edges can be added or removed. These changes can be a consequence of the procedure for generating a random graph or can be inherent in the network and the process it models. The generated graphs closely match the desired graphlet counts and as a consequence approximate other structural measures as well. The developed algorithm is a valuable tool for graphlet-based network analysis and a significant stepping stone towards analyzing larger and denser networks. As the fastest graphlet counting method it also presents a basis for further development of efficient pattern counting methods in graphs. This doctoral dissertation is based on three published papers that together with a chapter containing some unpublished work form the core of the dissertation.

    Parallel string matching algorithms

    This thesis presents different string searching algorithms. The string searching or string matching problem is one of the most basic problems on strings. It resurfaced with the development of bioinformatics and the need for DNA sequence analysis. It also presents a foundation for solving other, more complex string problems. First, we describe classical algorithms with linear time complexity such as Knuth-Morris-Pratt and Rabin-Karp. Then we turn our attention to the possibilities of parallelization. We present a simple parallelization scheme which finds the pattern in a string of size N in O(sqrt(N)) time on O(sqrt(N)) processors and the more complex Vishkin algorithm, which solves it in O(log(N)) time. The algorithms were compared on real as well as on degenerate test cases. They were implemented in C++, using the OpenMP library for parallelization. We determined that the basic algorithms were sufficient for most practical purposes. The degenerate cases can also be solved efficiently with sequential linear time algorithms. However, the experiments on parallelization showed that today's multicore computers differ too much from the theoretical computation models to reach the expected theoretical speedup.
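
    For reference, the sequential Knuth-Morris-Pratt algorithm named in the abstract can be sketched as follows. This is a textbook-style Python version written for illustration, not the thesis's C++ implementation; the function name `kmp_search` is an assumption. The failure table lets the search resume after a mismatch without re-reading text characters, which is what yields linear running time.

```python
def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    if not pattern:
        return []

    # failure[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]  # fall back to the next shorter border
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches
```

    Degenerate inputs of the kind mentioned in the abstract, such as a long run of one repeated character, are handled in linear time as well: `kmp_search("aaaa", "aa")` reports the overlapping matches at positions 0, 1 and 2.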