
    A survey of frequent subgraph mining algorithms


    Scripting Language for Java Source Code Recognition

    This paper presents general results on the Java source code snippet detection problem. We propose a tool that uses graph and subgraph isomorphism detection. A number of solutions for these tasks have been proposed in the literature; however, although these solutions are fast, they compare only constant, static trees. Our solution allows an input sample to be entered dynamically, using the Scripthon language, while preserving acceptable speed. We apply several optimizations to keep the number of comparisons performed during the matching algorithm very low.
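
    The core step in such a tool is deciding whether a small pattern graph, built from the user's snippet description, occurs as a subgraph of a larger graph derived from the parsed source. A minimal sketch of that decision, assuming hypothetical AST-style node labels and using networkx's VF2 matcher (the paper's actual Scripthon representation and optimizations are not shown):

```python
# Sketch: does a user-described pattern occur inside a parsed-code graph?
# Node labels here ("Method", "If", ...) are hypothetical AST-style tags.
import networkx as nx
from networkx.algorithms import isomorphism

def build_graph(edges, labels):
    """Directed graph with a 'label' attribute on every node."""
    g = nx.DiGraph()
    for node, label in labels.items():
        g.add_node(node, label=label)
    g.add_edges_from(edges)
    return g

# Program graph: a tiny stand-in for one parsed Java method body.
program = build_graph(
    edges=[(0, 1), (1, 2), (1, 3)],
    labels={0: "Method", 1: "If", 2: "Assign", 3: "Return"},
)
# Pattern graph: the structure the user asked for (an If guarding a Return).
pattern = build_graph(
    edges=[(0, 1)],
    labels={0: "If", 1: "Return"},
)

# VF2 subgraph isomorphism; node labels must agree for nodes to match.
matcher = isomorphism.DiGraphMatcher(
    program, pattern,
    node_match=isomorphism.categorical_node_match("label", None),
)
print(matcher.subgraph_is_isomorphic())  # True: the pattern occurs
```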

    A lightweight, graph-theoretic model of class-based similarity to support object-oriented code reuse.

    The work presented in this thesis is principally concerned with the development of a method and set of tools designed to support the identification of class-based similarity in collections of object-oriented code. Attention is focused on enhancing the potential for software reuse in situations where a reuse process is either absent or informal, and the characteristics of the organisation are unsuitable, or resources unavailable, to promote and sustain a systematic approach to reuse. The approach builds on the definition of a formal, attributed, relational model that captures the inherent structure of class-based, object-oriented code. Based on code-level analysis, it relies solely on the structural characteristics of the code and the peculiarly object-oriented features of the class as an organising principle: classes, the entities comprising a class, and the intra- and inter-class relationships existing between them are significant factors in defining a two-phase similarity measure as a basis for the comparison process. Established graph-theoretic techniques are adapted and applied via this model to the problem of determining similarity between classes. This thesis illustrates a successful transfer of techniques from the domains of molecular chemistry and computer vision; both domains provide an existing template for the analysis and comparison of structures as graphs. The inspiration for representing classes as attributed relational graphs, and for applying graph-theoretic techniques and algorithms to their comparison, arose out of a well-founded intuition that a common basis in graph theory was sufficient to enable a reasonable transfer of these techniques to the problem of determining similarity in object-oriented code. The practical application of this work relates to the identification and indexing of instances of recurring, class-based, common structure present in established and evolving collections of object-oriented code. A classification so generated additionally provides a framework for class-based matching over an existing code base, both from the perspective of newly introduced classes and of search "templates" provided by those incomplete, iteratively constructed and refined classes associated with current and ongoing development. The tools and techniques developed here support enabling and improving shared awareness of reuse opportunity, based on analysing structural similarity in past and ongoing development; they can in turn be seen as part of a process of domain analysis capable of stimulating the evolution of a systematic reuse ethic.
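
    As a rough illustration of the two-phase idea, the sketch below models a class as an attributed relational graph (member nodes with type labels, labelled intra-class relations) and applies a cheap label-overlap filter before a more expensive structural comparison. All names and the scoring scheme are illustrative assumptions, not the thesis's actual measure.

```python
# Sketch of a two-phase, graph-based class comparison. A class is modelled
# as an attributed relational graph: nodes are members (fields, methods)
# with type labels; edges are intra-class relations such as "uses".
from collections import Counter

class ClassGraph:
    def __init__(self, name, node_labels, edges):
        self.name = name
        self.node_labels = node_labels   # node id -> label, e.g. "method:int"
        self.edges = set(edges)          # (src, dst, relation) triples

def phase1_score(a, b):
    """Cheap filter phase: overlap of member-label multisets (0..1)."""
    ca, cb = Counter(a.node_labels.values()), Counter(b.node_labels.values())
    shared = sum((ca & cb).values())
    return shared / max(len(a.node_labels), len(b.node_labels), 1)

def phase2_score(a, b):
    """Costlier structural phase: overlap of labelled relation triples."""
    la = {(a.node_labels[s], a.node_labels[d], r) for s, d, r in a.edges}
    lb = {(b.node_labels[s], b.node_labels[d], r) for s, d, r in b.edges}
    return len(la & lb) / max(len(la), len(lb), 1)

def similarity(a, b, threshold=0.5):
    """Run phase 2 only when the cheap phase-1 score passes the threshold."""
    s1 = phase1_score(a, b)
    if s1 < threshold:
        return s1
    return 0.5 * s1 + 0.5 * phase2_score(a, b)

calc = ClassGraph("Calc", {0: "field:int", 1: "method:int"}, {(1, 0, "uses")})
acc = ClassGraph("Acc", {0: "field:int", 1: "method:int"}, {(1, 0, "uses")})
print(similarity(calc, acc))  # 1.0: identical member labels and relations
```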

    Query engine of novelty in video streams

    Prior research on novelty detection has primarily focused on algorithms to detect novelty in a given application domain. Effective storage, indexing and retrieval of novel events (beyond detection) have largely been ignored as a problem in itself. In light of recent advances in counter-terrorism efforts and link discovery initiatives, effective data management of novel events assumes apparent importance. Automatically detecting novel events in video data streams is an extremely challenging task. The aim of this thesis is to provide evidence that the notion of novelty in video as perceived by a human is extremely subjective and therefore algorithmically ill-defined. Though it comes as no surprise that current machine-based parametric learning systems are far from perfect at mimicking human novelty perception, such systems have recently been very successful at exhaustively capturing novelty in video once the novelty function is well defined by a human expert. So, how effective are these machine-based novelty detection systems compared to human novelty detection? We outline an experimental evaluation of human versus machine-based novelty systems in terms of qualitative performance, and we then quantify this evaluation using a variety of metrics based on the location of novel events, the number of novel events found in the video, and so on. We begin by describing a machine-based system for detecting novel events in video data streams. We then discuss the issues of designing an indexing strategy, or manga (the Japanese term for a comic-book representation), to effectively determine the most representative novel frames for a video sequence. We then evaluate the performance of the machine-based novelty detection system against human novelty detection and present the results. The distance metrics we suggest for novelty comparison may eventually aid a variety of end users in driving the indexing, retrieval and analysis of large video databases. It should also be noted that the techniques we describe are based on low-level features extracted from video, such as color, intensity and focus of attention; the video processing component does not include any semantic processing, such as object detection. We conjecture that such advances, though beyond the scope of this work, would undoubtedly benefit machine-based novelty detection systems, and we validate this experimentally. We believe that developing a novelty detection system that works in conjunction with a human expert will lead to a more user-centered data mining approach for such domains. JPEG 2000 is a new image compression method that outperforms other image formats such as JPEG, GIF and PNG. The main reason this format merits investigation is that it allows metadata to be embedded within the image itself; this metadata can be essentially anything: text, audio, video, other images, and so on. Currently, image annotations are stored and collected alongside the images they describe. Even though this approach is very common, it carries many risks and flaws: imagine that medical images were annotated by doctors to describe a tumor within the brain, and some of those annotations were suddenly lost. Without those annotations, the images themselves would be useless. Embedding the annotations within the image guarantees that description and image are never separated, and the embedded metadata has no influence on the image itself.
    In this thesis we initially develop a metric to index novelty, comparing it to traditional indexing techniques and to human perception. In the second phase of the thesis, we investigate the newly emerging JPEG 2000 technology and show that novelty stored in this format outperforms traditional image structures. One contribution of this thesis is the development of metrics to measure the performance and quality of query results between JPEG 2000 and traditional image formats; since JPEG 2000 is a new technology, no existing metrics measure this kind of performance against traditional images.
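
    As an illustration of the kind of low-level, feature-based detection described above, the following sketch flags frames whose color histogram deviates from a running average of recent frames. The feature, window size and threshold are illustrative assumptions, not the thesis's actual detector.

```python
# Sketch: flag a frame as "novel" when its intensity histogram deviates
# strongly from the running average of recent frames (a crude low-level
# feature standing in for color/intensity/focus-of-attention features).
import numpy as np

def histogram(frame, bins=16):
    """Normalised intensity histogram as a low-level frame feature."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 255))
    return h / max(h.sum(), 1)

def detect_novel_frames(frames, window=10, threshold=0.3):
    """Indices of frames whose L1 feature distance to the average of the
    last `window` frames exceeds the threshold."""
    novel, history = [], []
    for i, frame in enumerate(frames):
        feat = histogram(frame)
        if history:
            avg = np.mean(history[-window:], axis=0)
            if np.abs(feat - avg).sum() > threshold:
                novel.append(i)
        history.append(feat)
    return novel

# Synthetic stream: uniform grey frames with one abrupt bright frame.
frames = [np.full((32, 32), 100, dtype=np.uint8) for _ in range(20)]
frames[12] = np.full((32, 32), 240, dtype=np.uint8)
print(detect_novel_frames(frames))  # [12]
```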

    Frequent subgraph mining algorithms on weighted graphs

    This thesis describes research work undertaken in the field of graph-based knowledge discovery (or graph mining). The objective of the research is to investigate the benefits that the concept of weighted frequent subgraph mining can offer in the context of graph-model-based classification. Weighted graphs are graphs in which some of the vertices/edges are considered to be more significant than others. How to discover frequent substructures with different strengths is the main issue to be resolved in this thesis, and the main approach to addressing it is to integrate weight constraints into the frequent subgraph mining process. It is suggested that weighted frequent subgraph mining generates more discriminative and significant subgraphs, which will have application in, for example, the classification and clustering of graph data.
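
    One simple way to fold a weight constraint into the mining process, sketched below under assumed definitions, is to scale a pattern's plain support by the average weight of the edges it covers, so that structurally frequent but low-weight subgraphs fall below the minimum-support threshold. The weighting scheme is one illustrative choice, not the thesis's exact scheme.

```python
# Sketch: weight-scaled support for a candidate subgraph pattern.
# Patterns are sets of labelled edges; graphs map labelled edges to weights.

def weighted_support(pattern_edges, graphs):
    """pattern_edges: set of (label_src, label_dst) pairs.
    graphs: list of dicts mapping (label_src, label_dst) -> edge weight.
    Returns plain support scaled by the pattern's average edge weight."""
    hits, weight_sum = 0, 0.0
    for g in graphs:
        if all(e in g for e in pattern_edges):   # pattern occurs here
            hits += 1
            weight_sum += sum(g[e] for e in pattern_edges) / len(pattern_edges)
    if hits == 0:
        return 0.0
    avg_weight = weight_sum / hits
    return (hits / len(graphs)) * avg_weight     # support x significance

# Three tiny labelled graphs with edge weights.
graphs = [
    {("A", "B"): 0.9, ("B", "C"): 0.8},
    {("A", "B"): 0.7},
    {("A", "B"): 0.1, ("B", "C"): 0.2},
]
print(weighted_support({("A", "B")}, graphs))  # ~0.57: frequent and heavy
print(weighted_support({("B", "C")}, graphs))  # ~0.33: frequent but lighter
```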

    Efficient Similarity Search in Structured Data

    Modern database applications are characterized by two major aspects: the use of complex data types with internal structure and the need for new data analysis methods. The focus of database users has shifted from simple queries to complex analyses of the data, known as knowledge discovery in databases. Important tasks in this area are the grouping of data objects (clustering), the classification of new data objects, and the detection of exceptional data objects (outlier detection). Most algorithms for solving those problems are based on similarity search in databases. This makes efficient similarity search in large databases of structured objects an important basic operation for modern database applications. In this thesis we develop efficient methods for similarity search in large databases of structured data and improve the efficiency of existing query processing techniques. For the data objects, only a tree or graph structure is assumed, which can be extended with arbitrary attribute information. Starting with an analysis of the demands of two example applications, image retrieval and protein docking, several important requirements for similarity measures are identified. One aspect is the adaptability of the similarity search method to the requirements of the user and the application domain; this can even imply a change of the similarity measure between two successive queries of the same user. An explanation component, which makes clear why objects are considered similar by the system, is a necessary precondition for a purposeful adaptation of the measure. The edit distance, well known from string processing, is therefore a common similarity measure for graph-structured objects: its ability to visualize corresponding substructures and to weight single operations accounts for this popularity. But it turns out that the edit distance and similar measures for tree structures are computationally extremely complex, which makes them unsuitable for today's large and still growing databases. Therefore, we develop a multi-step query processing architecture which reduces the number of necessary distance calculations significantly by employing suitable filter methods. Furthermore, we show that by easing certain restrictions on the similarity measure, a significant performance gain can be obtained without reducing the quality of the measure. To achieve this, matchings of substructures (vertices or edges) of the data objects are determined; an additional cost function for those matchings allows a similarity measure for structured data, called the edge matching distance, to be derived from the cost-optimal matching of the substructures. Even for this new similarity measure, efficiency can be improved significantly by using a multi-step query processing approach, which allows the use of the edge matching distance for knowledge discovery applications in large databases. The properties of our new similarity search methods are demonstrated both theoretically and through experiments.
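
    The edge matching distance described above reduces to a cost-optimal one-to-one assignment between the edge sets of two graphs. A minimal sketch using the Hungarian algorithm (scipy's linear_sum_assignment), with an illustrative endpoint-label cost function and a fixed deletion cost for unmatched edges:

```python
# Sketch: an edge-matching style distance. Edges of two graphs are matched
# one-to-one at minimal total cost; dummy rows/columns let unmatched edges
# pay a fixed deletion cost instead of being forced into bad matches.
import numpy as np
from scipy.optimize import linear_sum_assignment

def edge_cost(e1, e2):
    """Cost of matching two labelled edges (src_label, dst_label): 0, 1 or 2."""
    return (e1[0] != e2[0]) + (e1[1] != e2[1])

def edge_matching_distance(edges_a, edges_b, del_cost=1.0):
    n, m = len(edges_a), len(edges_b)
    size = n + m                              # pad so every edge may be "deleted"
    cost = np.full((size, size), del_cost)
    cost[n:, m:] = 0.0                        # dummy-to-dummy costs nothing
    for i, ea in enumerate(edges_a):
        for j, eb in enumerate(edges_b):
            cost[i, j] = edge_cost(ea, eb)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return cost[rows, cols].sum()

a = [("A", "B"), ("B", "C")]
b = [("A", "B"), ("B", "D")]
print(edge_matching_distance(a, b))           # 1.0: one endpoint label differs
```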

    Pattern-Based Approach to Table Extraction

    In this paper, we address a client-driven approach to automatically extracting the information content of tables in document images. We start with a graph-based representation of a set of key fields selected by clients and perform graph mining over documents in order to learn a model from them. Such models are intended to be used to extract information content in the absence of the client. To avoid the NP-hard general graph matching problem, our matching is based on relation assignment, which checks whether pairs of nodes are semantically identical. We have validated the concept on a real-world industrial problem.
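
    A minimal sketch of relation-assignment matching in this spirit: each pair of key fields is reduced to a coarse spatial relation, and a candidate layout matches the learned model when all pairwise relations agree. The fields and relations are hypothetical stand-ins for the paper's client-selected key fields.

```python
# Sketch: match key-field layouts by pairwise spatial relations instead of
# general (NP-hard) graph matching. Boxes are (x, y, w, h) tuples.

def relation(box_a, box_b):
    """Coarse spatial relation between two field boxes."""
    ax, ay, _, _ = box_a
    bx, by, _, _ = box_b
    horiz = "left-of" if ax < bx else "right-of" if ax > bx else "aligned"
    vert = "above" if ay < by else "below" if ay > by else "same-row"
    return (horiz, vert)

def matches_model(model_boxes, candidate_boxes):
    """Candidate matches when every field pair keeps the model's relation."""
    keys = list(model_boxes)
    return all(
        relation(model_boxes[a], model_boxes[b])
        == relation(candidate_boxes[a], candidate_boxes[b])
        for i, a in enumerate(keys) for b in keys[i + 1:]
        if a in candidate_boxes and b in candidate_boxes
    )

model = {"qty": (10, 50, 30, 10), "price": (100, 50, 30, 10)}
candidate = {"qty": (15, 80, 30, 10), "price": (120, 80, 30, 10)}
print(matches_model(model, candidate))  # True: qty still left of price, same row
```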

    Graph-Based Approaches to Protein Structure Comparison - From Local to Global Similarity

    The comparative analysis of protein structure data is a central aspect of structural bioinformatics. Drawing upon structural information allows the inference of function for unknown proteins, even in cases where no apparent homology can be found at the sequence level. Regarding the function of an enzyme, the overall fold topology might be less important than the specific structural conformation of the catalytic site or the surface region of a protein, where the interaction with other molecules, such as binding partners, substrates and ligands, occurs. A comparison of these regions is thus especially interesting for functional inference, since structural constraints imposed by the demands of the catalyzed biochemical function make them more likely to exhibit structural similarity. Moreover, the comparative analysis of protein binding sites is of special interest in pharmaceutical chemistry, in order to predict cross-reactivities and to gain a deeper understanding of the catalysis mechanism. From an algorithmic point of view, the comparison of structured data, or, more generally, complex objects, can be attempted based on different methodological principles: global methods aim at comparing structures as a whole, while local methods transfer the problem to multiple comparisons of local substructures. In the context of protein structure analysis, it is not a priori clear which strategy is more suitable. In this thesis, several conceptually different algorithmic approaches, based on local, global and semi-global strategies, have been developed for the task of comparing protein structure data, more specifically protein binding pockets. The use of graphs for modeling protein structure data has a long-standing tradition in structural bioinformatics, and recently graphs have been used to model the geometric constraints of protein binding sites. The algorithms developed in this thesis are based on this modeling concept; from a computer scientist's point of view, they can therefore also be regarded as global, local and semi-global approaches to graph comparison. The developed algorithms were designed mainly on the premise of allowing a more approximate comparison of protein binding sites, in order to account for the molecular flexibility of protein structures; a main motivation was to enable the detection of more remote similarities that are not apparent to more rigid methods. The approaches were subsequently applied to different problems typically encountered in the field of structural bioinformatics, in order to assess and compare their performance and suitability, and each of the approaches developed during this work improved upon the performance of existing methods in the field. Another major aspect of the experiments was the question of which methodological concept (local, global, or a combination of both) offers the most benefits for the specific task of protein binding site comparison, a question that is addressed throughout this thesis.
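
    To make the modeling concept concrete, the sketch below represents a binding site as labelled pseudocenters in 3D and scores global similarity as the overlap of distance-binned, labelled point pairs; the labels, binning and Jaccard score are illustrative assumptions, not the thesis's actual measures.

```python
# Sketch: a toy binding-site model (labelled 3D pseudocenters) and a global
# similarity score over distance-binned, labelled point pairs.
import numpy as np

def pair_features(labels, coords, cutoff=5.0):
    """Set of (sorted label pair, binned distance) for close point pairs."""
    feats = set()
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            d = np.linalg.norm(coords[i] - coords[j])
            if d <= cutoff:
                pair = tuple(sorted((labels[i], labels[j])))
                feats.add((pair, round(d)))   # 1 Angstrom distance bins
    return feats

def global_similarity(site_a, site_b):
    """Jaccard overlap of the two sites' pair-feature sets."""
    fa, fb = pair_features(*site_a), pair_features(*site_b)
    return len(fa & fb) / max(len(fa | fb), 1)

site_a = (["donor", "acceptor", "aromatic"],
          np.array([[0.0, 0, 0], [3.0, 0, 0], [0, 4.0, 0]]))
site_b = (["donor", "acceptor", "aromatic"],
          np.array([[0.0, 0, 0], [3.2, 0, 0], [0, 4.1, 0]]))
print(global_similarity(site_a, site_b))  # ~0.67: two of three pair features agree
```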