    MeshGit: Diffing and Merging Polygonal Meshes

    This paper presents MeshGit, a practical algorithm for diffing and merging polygonal meshes. Inspired by version control for text editing, we introduce the mesh edit distance as a measure of the dissimilarity between meshes. This distance is defined as the minimum cost of matching the vertices and faces of one mesh to those of another. We propose an iterative greedy algorithm to approximate the mesh edit distance, which scales well with model complexity, providing a practical solution to our problem. We translate the mesh correspondence into a set of mesh editing operations that transforms the first mesh into the second. The editing operations can be displayed directly to provide a meaningful visual difference between meshes. For merging, we compute the difference between two versions and their common ancestor, as sets of editing operations. We robustly detect conflicting operations, automatically apply non-conflicting edits, and allow the user to choose how to merge the conflicting edits. We evaluate MeshGit by diffing and merging a variety of meshes and find it to work well for all

    Malware Classification based on Call Graph Clustering

    Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away and to detect structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes. Comment: This research has been supported by TEKES - the Finnish Funding Agency for Technology and Innovation as part of its ICT SHOK Future Internet research programme, grant 40212/0
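
    The clustering step can be illustrated with a minimal k-medoids sketch that works directly on a matrix of pairwise distances. This is not the authors' implementation: the function names, the naive initialisation (the first k samples), and the toy distance matrix are assumptions made for illustration. In the setting described above, the entries of `dist` would come from approximate graph edit distances between call graphs.

```python
def assign(dist, medoids):
    """Assign every sample to its nearest medoid (ties go to the first medoid)."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist, k, max_iter=100):
    """Alternate between assignment and medoid update until the medoids are stable."""
    medoids = list(range(k))  # assumption: naive initialisation with the first k samples
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep the old medoid if its cluster became empty
                new_medoids.append(m)
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members))
            )
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return assign(dist, medoids)


# Toy data: five samples whose pairwise distances form two obvious groups.
points = [0, 1, 2, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
clusters = k_medoids(dist, k=2)
print(clusters)
```

    Because k-medoids only needs pairwise distances and always picks an actual sample as the cluster representative, it fits data such as graphs, for which a mean is not directly defined.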

    Un algorithme Hongrois pour l'appariement de graphes avec correction d'erreurs

    Bipartite graph matching algorithms have become increasingly popular for solving error-correcting graph matching problems and for approximating the graph edit distance between two graphs. However, the memory requirements and execution time of this method are proportional to (n + m)^2 and (n + m)^3 respectively, where n and m are the orders of the graphs. Subsequent developments reduced these complexities, but the improvements are valid only under some constraints on the parameters of the graph edit distance. In this paper we propose a new formulation of the bipartite graph matching algorithm designed to solve the associated graph edit distance problem efficiently. The resulting algorithm requires O(nm) memory and O(min(n, m)^2 max(n, m)) execution time.
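
    To make the (n + m)^2 memory figure concrete, the following sketch builds the square cost matrix of the classical bipartite formulation and solves the assignment by brute force. It is an illustration under stated assumptions, not the algorithm proposed in the paper: it only considers node costs (edge costs are ignored), the cost functions and labels are invented for the example, and the exhaustive solver stands in for a Hungarian-type solver and is only usable on very small graphs.

```python
from itertools import permutations

INF = float("inf")


def build_cost_matrix(labels1, labels2, c_sub, c_del, c_ins):
    """Build the (n + m) x (n + m) matrix: rows are nodes of G1 followed by
    m dummy rows, columns are nodes of G2 followed by n dummy columns."""
    n, m = len(labels1), len(labels2)
    size = n + m
    cost = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            cost[i][j] = c_sub(labels1[i], labels2[j])  # substitution block
        for j in range(n):
            # Deletion block: node i can only use its own dummy column.
            cost[i][m + j] = c_del(labels1[i]) if i == j else INF
    for i in range(m):
        for j in range(m):
            # Insertion block: node j of G2 can only use its own dummy row.
            cost[n + i][j] = c_ins(labels2[j]) if i == j else INF
    # The bottom-right block (dummy to dummy) stays at zero.
    return cost


def solve_assignment_brute_force(cost):
    """Exhaustive search over all assignments (illustration only)."""
    size = len(cost)
    best_cost, best_perm = INF, None
    for perm in permutations(range(size)):
        total = sum(cost[r][perm[r]] for r in range(size))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_cost, best_perm


# Hypothetical example: node labels only, unit insertion and deletion costs.
cost = build_cost_matrix(
    ["C", "N", "O"], ["C", "O"],
    c_sub=lambda a, b: 0.0 if a == b else 2.0,
    c_del=lambda a: 1.0,
    c_ins=lambda b: 1.0,
)
total, perm = solve_assignment_brute_force(cost)
print(total, perm)
```

    In this example the cheapest assignment matches the two shared labels and sends the extra node to its dummy column (a deletion). Most of the square matrix consists of forbidden or zero-cost entries, which is the redundancy a more compact formulation can avoid.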

    Learning the Consensus of Multiple Correspondences between Data Structures

    In this work, we present a framework to learn the consensus given multiple correspondences. It is assumed that the parties involved generated these correspondences separately, and our system acts as a mechanism that gauges several characteristics and considers different parameters to learn the best mappings and thus form a correspondence with the highest possible accuracy at a reasonable computational cost. The consensus framework is presented gradually, starting from the most basic approaches, which used exclusively well-known concepts or only two correspondences, up to the final model, which is able to consider multiple correspondences and can automatically learn some weighting parameters. Each step of the framework is evaluated using databases of varied nature to demonstrate that it is capable of addressing different matching scenarios. In addition, two supplementary advances related to correspondences are presented in this work. Firstly, a new distance metric for correspondences has been developed, which led to a new strategy for the weighted mean correspondence search. Secondly, a framework specifically designed for correspondence generation in the image registration field has been established, where one of the images is considered to be a full image and the other a small sample of it. The conclusion presents insights into how our consensus framework can be enhanced, and how these two parallel developments can converge with it.

    Cerulean: A hybrid assembly using high throughput short and long reads

    Genome assembly using high throughput data with short reads arguably remains an unresolvable task in repetitive genomes: when the length of a repeat exceeds the read length, it becomes difficult to unambiguously connect the flanking regions. The emergence of third generation sequencing (Pacific Biosciences) with long reads makes it possible to resolve complicated repeats that could not be resolved by the short read data. However, these long reads have a high error rate, and it is an uphill task to assemble the genome without using additional high quality short reads. Recently, Koren et al. 2012 proposed an approach that uses high quality short read data to correct these long reads and, thus, make the assembly from long reads possible. However, due to the large size of both datasets (short and long reads), error correction of these long reads requires excessively high computational resources, even on small bacterial genomes. In this work, instead of error-correcting the long reads, we first assemble the short reads and later map the long reads onto the assembly graph to resolve repeats. Contribution: We present a hybrid assembly approach that is both computationally effective and produces high quality assemblies. Our algorithm first operates with a simplified version of the assembly graph consisting only of long contigs and gradually improves the assembly by adding smaller contigs in each iteration. In contrast to the state-of-the-art long read error correction technique, which requires high computational resources and long running time on a supercomputer even for bacterial genome datasets, our software can produce a comparable assembly using only a standard desktop in a short running time. Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

    Online learning the consensus of multiple correspondences between sets.

    When several subjects solve the assignment problem between two sets, the correspondences they compute may differ. These differences arise from several factors. For example, one subject may give more importance to some of the elements’ attributes than another subject. Another factor is that the assignment problem may be computed through a suboptimal algorithm, so that different non-optimal correspondences appear. In this paper, we present a consensus methodology to deduce the consensus of several correspondences between two sets. Moreover, we present an online learning algorithm to deduce weights that gauge the impact of each initial correspondence on the consensus. In the experimental section, we show the evolution of these parameters together with the evolution of the consensus accuracy. We observe that the learned weights clearly depend on the quality of the initial correspondences. We also observe that in the first iterations of the learning algorithm the consensus accuracy increases drastically and then stabilises.
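
    One way to picture a weighted consensus with online weight learning is the sketch below. It is not the method of the paper, whose details are not given in the abstract. The assumptions are: each correspondence is a permutation stored as a list, the consensus is the permutation with the largest weighted vote total (found here by brute force), and weights are updated multiplicatively once the true match of one element is revealed. All names and the learning rate are hypothetical.

```python
from itertools import permutations


def consensus(correspondences, weights):
    """Return the permutation that collects the largest weighted vote total."""
    n = len(correspondences[0])
    votes = [[0.0] * n for _ in range(n)]
    for corr, w in zip(correspondences, weights):
        for a, b in enumerate(corr):
            votes[a][b] += w  # correspondence votes for mapping a -> b
    return max(
        permutations(range(n)),
        key=lambda p: sum(votes[a][p[a]] for a in range(n)),
    )


def update_weights(correspondences, weights, a, true_b, rate=0.5):
    """Assumed multiplicative update: shrink the weight of every
    correspondence that mapped element a incorrectly, then renormalise."""
    new = [
        w * (1.0 if corr[a] == true_b else 1.0 - rate)
        for corr, w in zip(correspondences, weights)
    ]
    total = sum(new)
    return [w / total for w in new]


# Three hypothetical correspondences between two sets of three elements.
correspondences = [[0, 1, 2], [0, 2, 1], [1, 0, 2]]
weights = [1 / 3, 1 / 3, 1 / 3]

initial = consensus(correspondences, weights)
# Suppose feedback reveals that element 1 truly maps to element 1.
weights = update_weights(correspondences, weights, a=1, true_b=1)
print(initial, weights)
```

    After the update, the only correspondence that agreed with the revealed match carries half of the total weight, so it dominates later consensus computations. This mirrors the reported dependence of the learned weights on the quality of the initial correspondences.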

    Approximating the Graph Edit Distance with Compact Neighborhood Representations

    The graph edit distance is used for comparing graphs in various domains. Due to its high computational complexity it is primarily approximated. Widely-used heuristics search for an optimal assignment of vertices based on the distance between local substructures. While faster ones only consider vertices and their incident edges, leading to poor accuracy, other approaches require computationally intense exact distance computations between subgraphs. Our new method abstracts local substructures to neighborhood trees and compares them using efficient tree matching techniques. This results in a ground distance for mapping vertices that yields high quality approximations of the graph edit distance. By limiting the maximum tree height, our method supports steering between more accurate results and faster execution. We thoroughly analyze the running time of the tree matching method and propose several techniques to accelerate computation in practice. We use compressed tree representations, recognize redundancies by tree canonization and exploit them via caching. Experimentally we show that our method provides a significantly improved trade-off between running time and approximation quality compared to existing state-of-the-art approaches

    Graph médian généralisé via des minimisations alternées.

    Computing a graph prototype may constitute a core element for clustering or classification tasks. However, its computation is an NP-hard problem, even for simple classes of graphs. In this paper, we propose an efficient approach based on block coordinate descent to compute a generalized median graph from a set of graphs. This approach relies on a clear definition of the optimization process and handles labeling on both edges and nodes. This iterative process optimizes the edit operations to perform on a graph, alternating between nodes and edges. Several experiments on different datasets show the efficiency of our approach.

    Deriving Compact Test Suites for Telecommunication Software Using Distance Metrics

    This paper proposes a string edit distance based test selection method to generate compact test sets for telecommunications software. Following the results of previous research, a trace in a test set is considered to be redundant if its edit distance from others is less than a given parameter. The algorithm first determines the minimum cardinality of the target test set in accordance with the provided parameter, then it selects the test set with the highest sum of internal edit distances. The selection problem is reduced to an assignment problem in bipartite graphs
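
    The redundancy criterion can be sketched with a standard Levenshtein distance and a simple greedy filter. This is only an illustration of the criterion: the greedy pass below is an assumption and does not reproduce the two-step selection described above (minimum cardinality first, then maximum sum of internal distances, solved as an assignment problem). The traces and the threshold are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def drop_redundant(traces, threshold):
    """Greedy filter: keep a trace only if it is at least `threshold`
    edits away from every trace kept so far."""
    kept = []
    for trace in traces:
        if all(edit_distance(trace, k) >= threshold for k in kept):
            kept.append(trace)
    return kept


# Hypothetical traces, one character per event.
traces = ["abcd", "abce", "xyz", "abxd"]
compact = drop_redundant(traces, threshold=2)
print(compact)
```

    The result of a greedy pass depends on the order of the traces, which is one reason a global formulation of the selection, such as the assignment problem mentioned above, can be preferable.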