Search CORE

706 research outputs found

An Incremental Genetic Algorithm for Graph Optimization Problems

Author: 김진현
Publication venue: 서울대학교 대학원
Publication date: 01/08/2016

Thesis (Ph.D.) -- 서울대학교 대학원: Department of Electrical and Computer Engineering, 2016. 8. 문병로.

A combinatorial optimization problem is an optimization problem with a discrete solution space. Many graph problems belong to this category because graphs are discrete objects. Graphs are widely used in various fields, and many real-world combinatorial optimization problems take graphs as their input. For some of these problems, the size of the solution space is exponential in the size of the problem, so efficient search algorithms are required to deal with them. Genetic algorithms are widely used to solve combinatorial optimization problems, and incremental genetic algorithms can be used to solve graph optimization problems efficiently.

We define subproblems and solve them step by step instead of tackling the problems directly. A subproblem solved by an incremental genetic algorithm deals with a restriction of the original graph structure. The subproblems are solved in the intermediate steps, and the size of the subproblem is gradually increased. We apply the same genetic algorithm to each subproblem, initialized with the evolved population of the previous step. We propose incremental genetic algorithms for two different combinatorial optimization problems: the subgraph isomorphism problem and the graph cut optimization problem. We devise an optimal substructure on the subproblem sequence and explain how it is related to the optimality of the process, along with other related factors. We present graph expansion methodologies and vertex reordering schemes to define an appropriate sequence of subproblems. We combine the proposed incremental approach with a hybrid genetic algorithm for the subgraph isomorphism problem, and the algorithm was further developed for nearly perfect results. Based on our analysis, we also propose an incremental genetic algorithm to solve graph cut optimization problems. We tested the implementation of the algorithm on benchmark graph instances for the graph partitioning problem and the maximum cut problem. Through experiments, we investigate and analyze how the sequence of subproblems affects the search space landscape. The performance of a genetic algorithm improves when the incremental approach is applied with an appropriate sequence of subproblems.

Contents:
Chapter I. Introduction
Chapter II. Incremental Genetic Algorithm
  2.1 Overview and Traditional Applications
  2.2 Application on Graph Optimization Problems
    2.2.1 Formalization of the Incremental Process
    2.2.2 Theoretical Background
    2.2.3 Sequence of Subproblems
Chapter III. Subgraph Isomorphism Problem
  3.1 Introduction
  3.2 The Proposed Algorithm
    3.2.1 The Structure of the Incremental Genetic Algorithm
    3.2.2 Design Issues
    3.2.3 Genetic Framework
  3.3 Experimental Results
    3.3.1 Dataset and Evaluation
    3.3.2 Results and Discussions
    3.3.3 Overall Results
  3.4 Further Improvement
    3.4.1 New Operators
    3.4.2 Improvements by New Operators
    3.4.3 Overall Result
Chapter IV. Graph Cut Optimization Problems
  4.1 Introduction
  4.2 The Proposed Algorithm
    4.2.1 Subproblem Structure
    4.2.2 Reordering Schemes
    4.2.3 Genetic Framework
  4.3 Experimental Results
    4.3.1 Dataset and Evaluation
    4.3.2 Results on Graph Partitioning Problem
    4.3.3 Results on Maximum Cut Problem
    4.3.4 Results on Problem Variants
Chapter V. Related Applications
  5.1 Measuring Source Code Similarity with an Incremental Genetic Algorithm
    5.1.1 Introduction
    5.1.2 The Proposed System
    5.1.3 Experimental Results
    5.1.4 Discussion
  5.2 Linear Ordering Problem and an Approximate Fitness Evaluation
    5.2.1 Introduction
    5.2.2 The Proposed Method
    5.2.3 Experimental Results
Chapter VI. Conclusions
Bibliography
Abstract in Korean
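
The incremental process described in this abstract (solve a small restriction of the graph, then grow it and reuse the evolved population) can be pictured with a minimal sketch for the maximum cut problem. This is not the thesis's algorithm: the function names, the fixed vertex order 0..n-1, the random expansion of new vertices, and the tiny steady-state GA with 1-flip local search are all assumptions chosen for illustration.

```python
import random

def cut_value(edges, side, n_active):
    """Count edges among the first n_active vertices whose endpoints lie on different sides."""
    return sum(1 for u, v in edges
               if u < n_active and v < n_active and side[u] != side[v])

def local_search(edges, side, n_active):
    """Flip single vertices for as long as a flip strictly increases the cut."""
    improved = True
    while improved:
        improved = False
        for v in range(n_active):
            before = cut_value(edges, side, n_active)
            side[v] ^= 1
            if cut_value(edges, side, n_active) > before:
                improved = True
            else:
                side[v] ^= 1  # undo a flip that did not help
    return side

def evolve(edges, population, n_active, generations, rng):
    """Small steady-state GA: uniform crossover, one-bit mutation, local search."""
    for _ in range(generations):
        p1, p2 = rng.sample(population, 2)
        child = [rng.choice(pair) for pair in zip(p1, p2)]
        child[rng.randrange(n_active)] ^= 1
        local_search(edges, child, n_active)
        worst = min(range(len(population)),
                    key=lambda i: cut_value(edges, population[i], n_active))
        if cut_value(edges, child, n_active) >= cut_value(edges, population[worst], n_active):
            population[worst] = child
    return population

def incremental_ga(edges, n, step=2, pop_size=8, generations=20, seed=0):
    """Solve max cut on growing vertex prefixes, seeding each stage with the previous population."""
    rng = random.Random(seed)
    n_active = min(step, n)
    population = [[rng.randint(0, 1) for _ in range(n_active)] for _ in range(pop_size)]
    while True:
        population = evolve(edges, population, n_active, generations, rng)
        if n_active == n:
            break
        grow = min(step, n - n_active)
        # Expansion (assumed scheme): keep the evolved bits, append random bits for new vertices.
        population = [ind + [rng.randint(0, 1) for _ in range(grow)] for ind in population]
        n_active += grow
    return max(population, key=lambda ind: cut_value(edges, ind, n))
```

Calling `incremental_ga(edges, n)` with `edges` given as pairs of vertex indices returns a 0/1 side assignment for all `n` vertices. The order in which vertices enter the subproblems is fixed here by their numbering, whereas the abstract names vertex reordering as a design choice in its own right.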

SNU Open Repository and Archive

A lightweight, graph-theoretic model of class-based similarity to support object-oriented code reuse.

Author: MacLean Angus
Publication date: 31/01/2003

The work presented in this thesis is principally concerned with the development of a method and set of tools designed to support the identification of class-based similarity in collections of object-oriented code. Attention is focused on enhancing the potential for software reuse in situations where a reuse process is either absent or informal, and the characteristics of the organisation are unsuitable, or resources unavailable, to promote and sustain a systematic approach to reuse. The approach builds on the definition of a formal, attributed, relational model that captures the inherent structure of class-based, object-oriented code. Based on code-level analysis, it relies solely on the structural characteristics of the code and the peculiarly object-oriented features of the class as an organising principle: classes, those entities comprising a class, and the intra and inter-class relationships existing between them, are significant factors in defining a two-phase similarity measure as a basis for the comparison process. Established graph-theoretic techniques are adapted and applied via this model to the problem of determining similarity between classes. This thesis illustrates a successful transfer of techniques from the domains of molecular chemistry and computer vision. Both domains provide an existing template for the analysis and comparison of structures as graphs. The inspiration for representing classes as attributed relational graphs, and the application of graph-theoretic techniques and algorithms to their comparison, arose out of a well-founded intuition that a common basis in graph-theory was sufficient to enable a reasonable transfer of these techniques to the problem of determining similarity in object-oriented code. The practical application of this work relates to the identification and indexing of instances of recurring, class-based, common structure present in established and evolving collections of object-oriented code. 
A classification so generated additionally provides a framework for class-based matching over an existing code-base, both from the perspective of newly introduced classes, and search "templates" provided by those incomplete, iteratively constructed and refined classes associated with current and ongoing development. The tools and techniques developed here provide support for enabling and improving shared awareness of reuse opportunity, based on analysing structural similarity in past and ongoing development. These tools and techniques can in turn be seen as part of a process of domain analysis, capable of stimulating the evolution of a systematic reuse ethic.

Open Access Institutional Repository at Robert Gordon University

Investigation of an Incremental Genetic Algorithm for the Subgraph Isomorphism Problem

Author: HyukGeun Choi
Publication venue: 서울대학교 대학원
Publication date: 01/02/2019

Thesis (Ph.D.) -- 서울대학교 대학원: College of Engineering, Department of Electrical and Computer Engineering, 2019. 2. 문병로.

The graph is the most representative data structure for modeling relationships between objects, and graph pattern matching is one of the key problems that arise in the many applications where data is expressed as a graph. Although graph pattern matching can be defined semantically, using information such as the labels of vertices or edges, it is generally defined structurally, using only the relationships between vertices and edges, and such pattern matching is expressed as subgraph isomorphism. The algorithms proposed so far to solve the subgraph isomorphism problem fall into two types. The first is an exact method that finds all existing solutions based on a recursive backtracking algorithm. However, since the subgraph isomorphism problem is NP-complete, if all permutations are searched one by one, the running time increases exponentially with the size of the problem. The second is an approximate method based on metaheuristics such as genetic algorithms. These find good-quality solutions within a reasonable amount of time, but most of them do not have enough search capability to cover the large and complex problem space of this problem. The search capability of a metaheuristic algorithm can be improved by designing better operators or local heuristics, but performance can also be improved greatly by changing the fitness function and the search strategy. If the original problem is divided into subproblems whose size suits the search capability of a metaheuristic algorithm, and the subproblems are solved step by step, the problem can be solved more efficiently. Also, if we change the fitness function to make the fitness landscape more convex, the effect of an incremental algorithm will be much greater.

In this thesis, we propose an efficient incremental hybrid genetic algorithm to solve the subgraph isomorphism problem. First, we introduce a new fitness function suited to the subgraph isomorphism problem and examine how the fitness landscape generated with the operator is transformed. We review the shortcomings of the fitness functions used in previous studies and, to address them, design a new function that reflects the vertex degree constraint of subgraph isomorphism and combine it with the conventional function into a multi-objective fitness function. Through experiments, we analyze the characteristics of the new fitness function when combined with a local optimization algorithm, and we investigate the correlation between the fitness value and the average distance of the local optima to explain how the new fitness function transforms the fitness landscape of the subgraph isomorphism problem. We compare the results of the hybrid genetic algorithm using the proposed multi-objective fitness function with those of the conventional genetic algorithm and show how the proposed fitness function improves the search capability of a genetic algorithm.

Second, we introduce a new efficient search strategy, the incremental genetic algorithm, and examine how its design issues are reflected in the process and performance of the algorithm. We divide the original problem into a sequence of successive subproblems with optimal substructure, solve each subproblem with the hybrid genetic algorithm, and then extend the solutions obtained to form the initial solutions of the next subproblem. This process is applied sequentially to develop the solutions of the small problem into those of the original problem. Through experiments, we discuss how to divide the original problem into successive subproblems and analyze how the components of the sequence affect the performance of the incremental genetic algorithm. We also compare the performance of the incremental hybrid genetic algorithm with that of the previous hybrid genetic algorithm on random graph instances, and show the good scalability of the proposed algorithm through experimental results on large real-world data that existing algorithms could not handle.

Contents:
I. Introduction
  1.1 Motivation
  1.2 Contribution
  1.3 Organization
II. Preliminary
  2.1 Graph Pattern Matching and Isomorphism
  2.2 Subgraph Isomorphism and Related Problems
  2.3 Genetic Algorithm
    2.3.1 Structure
    2.3.2 Representation
    2.3.3 Fitness Function
    2.3.4 Crossover
    2.3.5 Mutation
    2.3.6 Hybrid Genetic Algorithm
III. Inspecting Fitness Function of Subgraph Isomorphism Problem
  3.1 Introduction
  3.2 Conventional Fitness Function
  3.3 Multi-objective Fitness Function
  3.4 Local Heuristics
  3.5 Experiments
    3.5.1 Experimental Setting
    3.5.2 Comparison of Single and Multi-objective Function
    3.5.3 Global Convexity of the Multi-objective Fitness Landscape
    3.5.4 Hybrid Genetic Algorithm
IV. Incremental Hybrid Genetic Algorithm
  4.1 Incremental Process
  4.2 Proposed Algorithm
  4.3 Design Schemes
    4.3.1 Vertex Reordering
    4.3.2 Stopping Criterion
    4.3.3 Expansion Size
  4.4 Genetic Frameworks
  4.5 Experimental Results
    4.5.1 Synthetic Data
    4.5.2 Real World Data
V. Conclusion
Bibliography
Abstract in Korean
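
The abstract describes a fitness function that adds a degree-based term to a conventional one, without giving either formula. As a rough sketch only, one could assume that the conventional term counts pattern edges whose images are not edges of the target graph, and that the degree term penalizes pattern vertices mapped onto target vertices of lower degree. Both definitions, the weighted-sum combination, and every name below are assumptions, not the thesis's definitions.

```python
def degrees(edges):
    """Return a dict mapping each vertex to its degree in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def edge_mismatches(pattern_edges, target_edges, mapping):
    """Assumed conventional term: pattern edges not mapped onto a target edge."""
    target = {frozenset(e) for e in target_edges}
    return sum(1 for u, v in pattern_edges
               if frozenset((mapping[u], mapping[v])) not in target)

def degree_violation(pattern_edges, target_edges, mapping):
    """Assumed degree term: total degree shortfall of the chosen target vertices."""
    dp, dt = degrees(pattern_edges), degrees(target_edges)
    return sum(max(0, dp.get(u, 0) - dt.get(t, 0)) for u, t in mapping.items())

def fitness(pattern_edges, target_edges, mapping, weight=1.0):
    """Weighted sum of both terms; lower is better, 0 means no violation was detected."""
    return (edge_mismatches(pattern_edges, target_edges, mapping)
            + weight * degree_violation(pattern_edges, target_edges, mapping))
```

Here `mapping` is a dict from pattern vertices to distinct target vertices. The degree term is a necessary condition only: a pattern vertex of degree d cannot be embedded at a target vertex of smaller degree, so penalizing the shortfall gives the search a gradient even between mappings with the same edge-mismatch count.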

SNU Open Repository and Archive

A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

Author: Ekhtiarzadeh Masoud
Parsa Saeed
Ramezani Mohammad
Roy Chanchal
Zakeri-Nasrabadi Morteza
Publication date: 28/06/2023

Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper presents a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focus on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance.

Comment: 49 pages, 10 figures, 6 tables

arXiv.org e-Print Archive

Entropy-scaling search of massive biological data

Author: Berger Bonnie
Daniels Noah M.
Danko David Christian
Yu Y. William
Publication venue: Elsevier BV
Publication date: 01/06/2015

Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.

Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
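
The idea that search time scales with the number of covering hyperspheres can be pictured with a two-stage range search. The sketch below is an illustration under assumptions, not the authors' implementation: the greedy covering, the Euclidean metric, and the names `build_clusters` and `range_search` are all hypothetical. It relies only on the triangle inequality, so clusters whose center is farther than `radius + cluster_radius` from the query cannot contain a hit and are skipped.

```python
import math

def build_clusters(points, cluster_radius, dist=math.dist):
    """Greedy covering: each point joins the first cluster whose center is within cluster_radius."""
    clusters = []  # list of (center, members) pairs
    for p in points:
        for center, members in clusters:
            if dist(p, center) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))  # no center is close enough, so p starts a new cluster
    return clusters

def range_search(clusters, query, radius, cluster_radius, dist=math.dist):
    """Return all points within radius of query, scanning only clusters that may contain one."""
    hits = []
    for center, members in clusters:
        # Coarse step: by the triangle inequality, any hit lies in a cluster whose
        # center is within radius + cluster_radius of the query.
        if dist(query, center) <= radius + cluster_radius:
            # Fine step: exact distance check inside the surviving cluster.
            hits.extend(m for m in members if dist(query, m) <= radius)
    return hits
```

With a true metric, this coarse-then-fine scheme returns exactly the same hits as a brute-force scan. The coarse step costs one distance computation per cluster, which is where the dependence on the number of covering spheres comes from.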

arXiv.org e-Print Archive

Elsevier - Publisher Connector

DSpace@MIT

Crossref

PubMed Central

Malware similarity and a new fuzzy hash: Compound Code Block Hash (CCBHash)

Author: López-Muñoz Francisco Javier
Onieva-González José Antonio
Pérez-Jiménez Pablo
Publication venue: Elsevier
Publication date: 01/01/2024

In the last few years, malware analysis has become increasingly important due to the rise of sophisticated cyberattacks. One of the objectives of this cybersecurity branch is to find similarities between different files or functions used by malware programmers, thus allowing malware detection, classification and even attribution in a timely manner. In this article we survey the state of the art in this area, reviewing the different techniques that can be applied to the field, with the objective of studying similarity, and therefore detecting, classifying and attributing malware samples. We have developed a fuzzy hash capable of characterizing malware by generating an easily comparable and storable signature of its functions. Since our goal is to detect these similarities in huge amounts of data within a reasonable time-frame, the size of the hash must be limited while retaining as much information as possible.

Funding for open access charge: Universidad de Málaga / CBU

Repositorio Institucional Universidad de Málaga

A novel graph-based method for targeted ligand-protein fitting

Author: Hannaford Gareth James
Publication venue: University of Bedfordshire
Publication date: 01/08/2008

A thesis submitted to the Faculty of Creative Arts, Technologies & Science, University of Bedfordshire, in partial fulfilment of the requirements for the degree of Master of Philosophy.

The determination of protein binding sites and ligand-protein fitting are key to understanding the functionality of proteins, from revealing which ligand classes can bind to identifying the optimal ligand for a given protein, as in protein/drug interactions. There is a need for novel generic computational approaches for the representation of protein-ligand interactions and the subsequent prediction of hitherto unknown interactions in proteins where the ligand binding sites are experimentally uncharacterised. The TMSite algorithms read in existing PDB structural data, isolate binding site regions, and identify conserved features in functionally related proteins (proteins that bind the same ligand). The Boundary Cubes method for surface representation was applied to the modified PDB file, allowing the creation of graphs for proteins and ligands that could be compared without loss of geometric data. A method for describing the binding site features of individual ligands that are conserved in terms of spatial relationships allowed the identification of 3D motifs, named fingerprints, which could be searched for in other protein structures. This method, combined with a modification of the pocket algorithm, allows reduced search areas for graph matching. The methods allow isolation of the binding site from a complexed protein PDB file, identification of conserved features among the binding sites of individual ligand types, and search for these features in sequence data. In terms of spatial conservation, they create a fingerprint of the binding site that can be sought in other proteins of known structure, identifying putative binding sites. The approach offers a novel and generic method for the identification of putative ligand binding sites for proteins for which there is no prior detailed structural characterisation of protein/ligand interactions. It is unique in being able to convert PDB data into graphs, ready for comparison and thus fitting of ligand to protein with consideration of chemical charge and, in the future, other chemical properties.

University of Bedfordshire Repository