    A Descriptive Tolerance Nearness Measure for Performing Graph Comparison

    Accepted version. This article proposes the tolerance nearness measure (TNM) as a computationally cheaper alternative to the graph edit distance (GED) for performing graph comparisons. The TNM is defined within the context of near set theory, where the central idea is that determining similarity between sets of disjoint objects is at once intuitive and practically applicable. The TNM between two graphs is produced using the Bron-Kerbosch maximal clique enumeration algorithm. The result is that the TNM approach is less computationally complex than the bipartite-based GED algorithm. The contributions of this paper are the application of the TNM to the problem of quantifying the similarity of disjoint graphs, and the finding that the maximal clique enumeration-based TNM produces results comparable to the GED when applied to the problem of content-based image processing, which becomes important as the number of nodes in a graph increases. "This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 418413." https://content.iospress.com/articles/fundamenta-informaticae/fi174
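The maximal clique enumeration step mentioned in the abstract above can be illustrated with a minimal sketch of the Bron-Kerbosch algorithm with pivoting. The function name `bron_kerbosch`, the dictionary-of-sets graph encoding, and the toy graph are assumptions made for illustration; how the article derives the TNM from the resulting cliques is not stated in the abstract and is not shown here.

```python
def bron_kerbosch(adj):
    """Return all maximal cliques of an undirected graph.

    `adj` maps each node to the set of its neighbours (hypothetical encoding).
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already-tried nodes
        if not p and not x:
            cliques.append(frozenset(r))  # nothing can extend r, so r is maximal
            return
        # Pivot on the node with the most neighbours among the candidates
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy graph: a triangle a-b-c with a pendant node d attached to c
example_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
example_cliques = bron_kerbosch(example_graph)
```

On the toy graph the enumeration yields the triangle and the pendant edge as the two maximal cliques.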

    The Maximum Common Subgraph Problem: A Parallel and Multi-Engine Approach

    The maximum common subgraph of two graphs is the largest possible common subgraph, i.e., the common subgraph with as many vertices as possible. Even though this problem is very challenging, as it has long been proven NP-hard, its countless practical applications still motivate the search for exact solutions. This work discusses the possibility of extending an existing, very effective branch-and-bound procedure to parallel multi-core and many-core architectures. We analyze a parallel multi-core implementation that exploits a divide-and-conquer approach based on a thread pool, which does not degrade the original algorithmic efficiency and minimizes data structure repetitions. We also extend the original algorithm to parallel many-core GPU architectures adopting the CUDA programming framework, and we show how to handle the heavy workload imbalance and the massive data dependency. Then, we suggest new heuristics to reorder the adjacency matrix, to deal with “dead-ends”, and to randomize the search with automatic restarts. These heuristics can achieve significant speed-ups on specific instances, even if they may not be competitive with the original strategy on average. Finally, we propose a portfolio approach, which integrates all the different local search algorithms as component tools; such a portfolio, rather than choosing the best tool for a given instance up-front, takes the decision on-line. The proposed approach drastically limits memory bandwidth constraints and avoids other typical portfolio fragility, as CPU and GPU versions often show complementary efficiency and run on separate platforms. Experimental results support the claims and motivate further research to better exploit GPUs in embedded task-intensive and multi-engine parallel applications.

    Dissimilarity-based learning for complex data

    Mokbel B. Dissimilarity-based learning for complex data. Bielefeld: Universität Bielefeld; 2016. Rapid advances in information technology have produced an ever-increasing amount of digital data, which raises the demand for powerful data mining and machine learning tools. Due to modern methods for gathering, preprocessing, and storing information, the collected data become more and more complex: a simple vectorial representation, with comparison in terms of the Euclidean distance, is often no longer appropriate to capture relevant aspects of the data. Instead, problem-adapted similarity or dissimilarity measures refer directly to the given encoding scheme, allowing information constituents to be treated in a relational manner. This thesis addresses several challenges of complex data sets and their representation in the context of machine learning. The goal is to investigate possible remedies and to propose corresponding improvements of established methods, accompanied by examples from various application domains. The main scientific contributions are the following: (I) Many well-established machine learning techniques are restricted to vectorial input data only. Therefore, we propose the extension of two popular prototype-based clustering and classification algorithms to non-negative symmetric dissimilarity matrices. (II) Some dissimilarity measures incorporate a fine-grained parameterization, which allows the comparison scheme to be configured with respect to the given data and the problem at hand. However, finding adequate parameters can be hard or even impossible for human users, due to the intricate effects of parameter changes and the lack of detailed prior knowledge. Therefore, we propose to integrate a metric learning scheme into a dissimilarity-based classifier, which can automatically adapt the parameters of a sequence alignment measure according to the given classification task.
(III) Dimensionality reduction techniques are a valuable instrument for making complex data sets accessible: they can provide an approximate low-dimensional embedding of the given data set and, as a special case, a planar map to visualize the data's neighborhood structure. To assess the reliability of such an embedding, we propose the extension of a well-known quality measure to enable a fine-grained, tractable quantitative analysis, which can be integrated into a visualization. This tool can also help to compare different dissimilarity measures (and parameter settings) if ground truth is not available. (IV) All techniques are demonstrated on real-world examples from a variety of application domains, including bioinformatics, motion capturing, music, and education.

    Neural function approximation on graphs: shape modelling, graph discrimination & compression

    Graphs serve as a versatile mathematical abstraction of real-world phenomena in numerous scientific disciplines. This thesis is part of the Geometric Deep Learning subject area, a family of learning paradigms that capitalise on the increasing volume of non-Euclidean data so as to solve real-world tasks in a data-driven manner. In particular, we focus on the topic of graph function approximation using neural networks, which lies at the heart of many relevant methods. In the first part of the thesis, we contribute to the understanding and design of Graph Neural Networks (GNNs). Initially, we investigate the problem of learning on signals supported on a fixed graph. We show that treating graph signals as general graph spaces is restrictive and that conventional GNNs have limited expressivity. Instead, we expose a more enlightening perspective by drawing parallels between graph signals and signals on Euclidean grids, such as images and audio. Accordingly, we propose a permutation-sensitive GNN based on an operator analogous to shifts in grids and instantiate it on 3D meshes for shape modelling (Spiral Convolutions). Next, we focus on learning on general graph spaces and in particular on functions that are invariant to graph isomorphism. We identify a fundamental trade-off between invariance, expressivity and computational complexity, which we address with a symmetry-breaking mechanism based on substructure encodings (Graph Substructure Networks). Substructures are shown to be a powerful tool that provably improves expressivity while controlling computational complexity, and a useful inductive bias in network science and chemistry. In the second part of the thesis, we discuss the problem of graph compression, where we analyse the information-theoretic principles and the connections with graph generative models. We show that another inevitable trade-off surfaces, now between computational complexity and compression quality, due to graph isomorphism.
We propose a substructure-based dictionary coder, Partition and Code (PnC), with theoretical guarantees that can be adapted to different graph distributions by estimating its parameters from observations. Additionally, contrary to the majority of neural compressors, PnC is parameter and sample efficient and is therefore of wide practical relevance. Finally, within this framework, substructures are further illustrated as a decisive archetype for learning problems on graph spaces.

    Solving hard subgraph problems in parallel

    This thesis improves the state of the art in exact, practical algorithms for finding subgraphs. We study maximum clique, subgraph isomorphism, and maximum common subgraph problems. These are widely applicable: within computing science, subgraph problems arise in document clustering, computer vision, the design of communication protocols, model checking, compiler code generation, malware detection, cryptography, and robotics; beyond computing science, applications occur in biochemistry, electrical engineering, mathematics, law enforcement, fraud detection, fault diagnosis, manufacturing, and sociology. We therefore consider both the "pure" forms of these problems and variants with labels and other domain-specific constraints. Although subgraph-finding should theoretically be hard, the constraint-based search algorithms we discuss can easily solve real-world instances involving graphs with thousands of vertices and millions of edges. We therefore ask: is it possible to generate "really hard" instances for these problems, and if so, what can we learn? By extending research into combinatorial phase transition phenomena, we develop a better understanding of branching heuristics, as well as highlighting a serious flaw in the design of graph database systems. This thesis also demonstrates how to exploit two of the kinds of parallelism offered by current computer hardware. Bit parallelism allows us to carry out operations on whole sets of vertices in a single instruction; this is largely routine. Thread parallelism, to make use of the multiple cores offered by all modern processors, is more complex. We suggest three desirable performance characteristics that we would like when introducing thread parallelism: lack of risk (parallel cannot be exponentially slower than sequential), scalability (adding more processing cores cannot make runtimes worse), and reproducibility (the same instance on the same hardware will take roughly the same time every time it is run).
We then detail the difficulties in guaranteeing these characteristics when using modern algorithmic techniques. Besides ensuring that parallelism cannot make things worse, we also increase the likelihood of it making things better. We compare randomised work stealing to new tailored strategies, and perform experiments to identify the factors contributing to good speedups. We show that whilst load balancing is difficult, the primary factor influencing the results is the interaction between branching heuristics and parallelism. By using parallelism to explicitly offset the commitment made to weak early branching choices, we obtain parallel subgraph solvers which are substantially and consistently better than the best sequential algorithms.
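The bit parallelism described in the abstract above can be sketched by storing each vertex neighbourhood as an integer bitset, so that filtering a whole candidate set reduces to one bitwise AND. This is only an illustrative sketch, not the thesis's solver: the helper names `to_bitsets` and `max_clique_size` and the toy graph are assumptions, and Python's arbitrary-precision integers stand in for the machine words a real implementation would use.

```python
def to_bitsets(n, edges):
    """Encode an undirected graph on vertices 0..n-1 as one integer bitset per vertex."""
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u
    return adj


def max_clique_size(adj):
    """Simple branch and bound for the maximum clique size over bitset candidate sets."""
    best = 0

    def expand(size, candidates):
        nonlocal best
        best = max(best, size)
        while candidates:
            # Bound: even taking every remaining candidate cannot beat the incumbent
            if size + bin(candidates).count("1") <= best:
                return
            v = candidates.bit_length() - 1   # highest-numbered remaining candidate
            candidates &= ~(1 << v)           # remove v from the candidate set
            # One AND keeps exactly the remaining candidates adjacent to v
            expand(size + 1, candidates & adj[v])

    expand(0, (1 << len(adj)) - 1)
    return best


# Toy graph: triangle 0-1-2 plus the edge 2-3
example_adj = to_bitsets(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
example_size = max_clique_size(example_adj)
```

The bound here is the plain candidate count; stronger bounds and branching heuristics, which the abstract identifies as the decisive factor, are deliberately left out.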

    Automatic plan generation and adaptation by observation: supporting complex human planning

    Lainakappaleiden tunnistaminen tiedon tiivistämiseen perustuvia etäisyysmittoja käyttäen (Cover song identification using compression-based distance measures)

    Measuring similarity in music data is a problem with various potential applications. In recent years, the task known as cover song identification has gained widespread attention. In cover song identification, the purpose is to determine whether a piece of music is a different rendition of a previous version of the composition. The task is quite trivial for a human listener, but highly challenging for a computer. This research approaches the problem from an information-theoretic starting point. Assuming that cover versions share musical information with the original performance, we strive to measure the degree of this common information as the amount of computational resources needed to turn one version into another. Using a similarity measure known as normalized compression distance, we approximate the non-computable Kolmogorov complexity as the length of an object when compressed using a real-world data compression algorithm. If two pieces of music share musical information, we should be able to compress one using a model learned from the other. In order to use compression-based similarity measuring, the meaningful musical information needs to be extracted from the raw audio signal data. The most commonly used representation for this task is known as a chromagram: a sequence of real-valued vectors describing the temporal tonal content of the piece of music. Measuring the similarity between two chromagrams effectively with a data compression algorithm requires further processing to extract relevant features and find a more suitable discrete representation for them. Here, the challenge is to process the data without losing the distinguishing characteristics of the music. In this research, we study the difficult nature of cover song identification and search for an effective compression-based system for the task.
Harmonic and melodic features, different representations for them, commonly used data compression algorithms, and several other variables of the problem are addressed thoroughly. The research seeks to shed light on how different choices in the scheme contribute to the performance of the system. Additional attention is paid to combining different features, with several combination strategies studied. Extensive empirical evaluation of the identification system has been performed, using large sets of real-world music data. Evaluations show that the compression-based similarity measuring performs relatively well but fails to achieve the accuracy of the existing solution that measures similarity by using common subsequences. The best compression-based results are obtained by a combination of distances based on two harmonic representations obtained from chromagrams using hidden Markov model chord estimation, and an octave-folded version of the extracted salient melody representation. The most distinct reason for the shortcoming of the compression performance is the scarce amount of data available for a single piece of music. This was partially overcome by internal data duplication. As a whole, the process is solid and provides a practical foundation for an information-theoretic approach to cover song identification. Cover songs are musical performances that are new interpretations, made by a different performer, of the version made by the song's original performer. Sometimes cover songs can be very similar to the original versions; at other times the versions may share similarities only nominally. For humans, identifying cover songs is usually easy if the original performance is familiar. Automatic, algorithm-based identification of cover songs is, however, a considerably more challenging problem, and no fully satisfactory solutions have yet been presented.
A solution to the problem would have several applications of research and commercial potential, such as the automatic detection of plagiarism. The dissertation addresses automatic cover song identification from an information-theoretic starting point. The research examines whether the tonal similarity contained in the pieces can be measured in such a way that different performances can, on that basis, be judged to be fundamentally different interpretations of the same composition. Similarity is measured with a similarity metric based on data compression algorithms, which requires that the compositionally most distinctive features of each piece of music can be extracted and represented. The study is carried out on a large collection of popular music in audio form. The dissertation goes through several stages of the research problem, starting from the parameters related to processing the signal data, proceeding to how the representation extracted from the signal can be converted into string form such that the result of the process still describes the key musical features of the piece, and how the resulting string data can be further processed to improve identification. In addition, the dissertation studies how various musical differences between pieces (tempo, key, arrangements) affect identification and how the effect of these differences on the measurement can be minimized. The study also examines how suitable the most common data compression algorithms are as a measurement method for the problem at hand. Furthermore, the study presents how several different representations extracted from the same piece can be combined to achieve better identification accuracy. As its end result, the dissertation presents a system that uses data compression for cover song identification and discusses its key strengths and weaknesses.
As a result of the research, an assessment is also made of which factors make automatic cover song identification as challenging a problem as it is.
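The normalized compression distance named in the abstract above is commonly written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s. The sketch below is a minimal illustration under stated assumptions: `zlib` is used as the real-world compressor, and the chord-symbol byte strings are hypothetical stand-ins for the discretized chromagram features, not data or settings from the thesis.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of `data` after zlib compression, used as a proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical symbol sequences: a chord loop, a slightly altered "cover", and noise
original = b"CGAmF" * 40
cover = b"CGAmF" * 38 + b"CGEmF" * 2
rng = random.Random(0)  # fixed seed so the example is repeatable
unrelated = bytes(rng.randrange(256) for _ in range(200))

d_cover = ncd(original, cover)
d_unrelated = ncd(original, unrelated)
```

With these inputs the altered loop should score closer to the original than the noise sequence does, which is the behaviour the compression-based approach relies on. The short strings also hint at the data scarcity issue the abstract mentions, since compressor overhead is large relative to such small inputs.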

    Pattern Recognition

    Pattern recognition is a very wide research field. It involves factors as diverse as sensors, feature extraction, pattern classification, decision fusion, applications and others. The signals processed are commonly one-, two- or three-dimensional; the processing is done in real time or takes hours or days; some systems look for one narrow object class, while others search huge databases for entries with at least a small amount of similarity. No single person can claim expertise across the whole field, which develops rapidly, updates its paradigms and encompasses several philosophical approaches. This book reflects this diversity by presenting a selection of recent developments within the area of pattern recognition and related fields. It covers theoretical advances in classification and feature extraction as well as application-oriented works. The authors of these 25 works present and advocate recent achievements of their research related to the field of pattern recognition.

    Convex Mathematical Programs for Relational Matching of Object Views

    Automatic recognition of objects in images is a difficult and challenging task in computer vision which has been tackled in many different ways. Based on the powerful and widely used concept of representing objects and scenes as relational structures, the problem of graph matching, i.e., finding correspondences between two graphs, is a part of the object recognition problem. Belonging to the field of combinatorial optimization, graph matching is considered to be one of the most complex problems in computer vision: it is known to be NP-complete in the general case. In this thesis, two novel approaches to the graph matching problem are proposed and investigated. They are based on recent progress in the mathematical literature on convex programming. Starting out from describing the desired matchings by suitable objective functions in terms of binary variables, relaxations of combinatorial constraints and an adequate adaptation of the objective function lead to continuous convex optimization problems which can be solved without parameter tuning and in polynomial time. A subsequent post-processing step results in feasible, sub-optimal combinatorial solutions to the original decision problem. In the first part of this thesis, the connection between specific graph-matching problems and the quadratic assignment problem is explored. In this case, the convex relaxation leads to a convex quadratic program, which is combined with a linear program for post-processing. Conditions under which the quadratic assignment representation is adequate from the computer vision point of view are investigated, along with attempts to relax these conditions by modifying the approach accordingly. The second part of this work focuses directly on the matching of subgraphs, each representing a model, to a considerably larger scene graph. A bipartite matching is extended with a quadratic regularization term to take into account relations within each set of vertices.
Based on this convex relaxation, post-processing and the application to computer vision are investigated and discussed. Numerical experiments reveal both the power and the limitations of the approach. For problems of the sizes that occur in applications, the approach is quite reasonable and the combinatorially optimal solution is often found. For larger instances, the intrinsic combinatorial nature of the problem emerges and leads to sub-optimal solutions which, however, are still good.