Source-code queries with graph databases - With application to programming language usage and evolution
Program querying and analysis tools are of growing importance, and occur in two main variants. First, there are source-code query languages, which help software engineers to explore a system or to find code in need of refactoring as coding standards evolve; these also enable language designers to understand the practical uses of language features and idioms over a software corpus. Second, there are program analysis tools in the style of Coverity, which perform deeper program analysis, searching for bugs as well as checking adherence to coding standards such as MISRA. The former class is typically implemented on top of relational or deductive databases and makes ad-hoc trade-offs between scalability and the amount of source-code detail held, with consequent limitations on the expressiveness of queries. The latter class is more commercially driven and involves more ad-hoc queries over program representations; nonetheless, similar pressures encourage user-visible domain-specific languages to specify analyses. We argue that a graph data model and associated query language provide a unifying conceptual model and an efficient, scalable implementation even when storing full source-code detail. The model also supports overlays, allowing a query DSL to pose queries at a mixture of syntax-tree, type, control-flow-graph, and data-flow levels. We describe a prototype source-code query system built on top of Neo4j using its Cypher graph query language; experiments show it scales to multi-million-line programs while also storing full source-code detail.
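The style of query this abstract describes can be sketched in miniature. The following is a hedged Python sketch, not the paper's implementation: a toy in-memory property graph standing in for an AST, with a hand-rolled transitive-closure match in the spirit of a Cypher pattern such as `MATCH (m:Method)-[:CONTAINS*]->(c:Call {name: "strcpy"}) RETURN m.name`. All node labels, property names, and the example program are illustrative.

```python
from collections import defaultdict

nodes = {}                 # node id -> {"label": ..., "props": {...}}
edges = defaultdict(list)  # node id -> list of (edge_type, target id)

def add_node(nid, label, **props):
    nodes[nid] = {"label": label, "props": props}

def add_edge(src, etype, dst):
    edges[src].append((etype, dst))

# Tiny syntax tree: a method whose body contains a call to strcpy.
add_node(1, "Method", name="copy_buf")
add_node(2, "Block")
add_node(3, "Call", name="strcpy")
add_edge(1, "CONTAINS", 2)
add_edge(2, "CONTAINS", 3)

def reachable(nid, etype):
    """Transitive closure over one edge type (Cypher's [:CONTAINS*])."""
    out, stack = [], [nid]
    while stack:
        cur = stack.pop()
        for t, dst in edges[cur]:
            if t == etype:
                out.append(dst)
                stack.append(dst)
    return out

def methods_calling(fn_name):
    """Names of all Method nodes that transitively contain a Call to fn_name."""
    hits = []
    for nid, n in nodes.items():
        if n["label"] == "Method":
            for d in reachable(nid, "CONTAINS"):
                dn = nodes[d]
                if dn["label"] == "Call" and dn["props"].get("name") == fn_name:
                    hits.append(n["props"]["name"])
    return hits

print(methods_calling("strcpy"))  # ['copy_buf']
```

A graph database evaluates such patterns against an index rather than scanning all nodes, which is what makes the approach scale to the multi-million-line programs mentioned above.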
An efficient and scalable platform for Java source code analysis using overlaid graph representations
© 2013 IEEE. Although source code programs are commonly written as textual information, they enclose syntactic and semantic information that is usually represented as graphs. This information is used for many different purposes, such as static program analysis, advanced code search, coding guideline checking, software metrics computation, and extraction of semantic and syntactic information to create predictive models. Most of the existing systems that provide these kinds of services are designed ad hoc for the particular purpose they are aimed at. For this reason, we created ProgQuery, a platform that allows users to write their own Java program analyses in a declarative fashion, using graph representations. We modify the Java compiler to compute seven syntactic and semantic representations, and store them in a Neo4j graph database. Such representations are overlaid, meaning that syntactic and semantic nodes of the different graphs are interconnected to allow combining different kinds of information in the queries/analyses. We evaluate ProgQuery and compare it to the related systems. Our platform outperforms the other systems in analysis time, and scales better to program sizes and analysis complexity. Moreover, the queries coded show that ProgQuery is more expressive than the other approaches. The additional information stored by ProgQuery increases the database size and associated insertion time, but these increases are significantly lower than the query/analysis performance gains obtained. Funded by the Spanish Department of Science, Innovation and Universities under Project RTI2018-099235-B-I00.
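The "overlaid" idea above can be illustrated with a toy sketch: the same statement nodes carry both syntax-tree edges and control-flow edges, so a single query can mix syntactic and semantic levels. This is a hedged illustration only; the field names and labels are invented, not ProgQuery's actual schema.

```python
# Each node participates in two overlaid graphs: the AST (ast_parent)
# and the control-flow graph (cfg_next). Illustrative schema.
stmts = {
    "s1": {"kind": "Assign", "ast_parent": "m",  "cfg_next": ["s2"]},
    "s2": {"kind": "If",     "ast_parent": "m",  "cfg_next": ["s3", "s4"]},
    "s3": {"kind": "Call",   "ast_parent": "s2", "cfg_next": []},
    "s4": {"kind": "Return", "ast_parent": "s2", "cfg_next": []},
}
# A function node acting as AST root for the method.
stmts["m"] = {"kind": "Method", "ast_parent": None, "cfg_next": []}

def cfg_children_of_if(start):
    """Mixed-overlay query: CFG nodes two steps after `start` that are,
    syntactically, children of an If node."""
    hits = []
    for nxt in stmts[start]["cfg_next"]:          # control-flow step 1
        for nxt2 in stmts[nxt]["cfg_next"]:       # control-flow step 2
            if stmts[stmts[nxt2]["ast_parent"]]["kind"] == "If":  # AST step
                hits.append(nxt2)
    return hits

print(cfg_children_of_if("s1"))  # ['s3', 's4']
```

The point of the overlay is that neither graph alone can answer this query: it traverses control-flow edges and then checks a syntax-tree property on the nodes it reaches.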
Variability-aware Neo4j for Analyzing a Graphical Model of a Software Product Line
A software product line (SPL) eases the development of families of related products by managing and integrating a collection of mandatory and optional features (units of functionality). Individual products can be derived from the product line by selecting among the optional features. Companies that successfully employ SPLs report dramatic improvements in rapid product development, software quality, labour needs, support for mass customization, and time to market.
In a product line of reasonable size, it is impractical to verify every product because the number of possible feature combinations is exponential in the number of features. As a result, developers might verify a small fraction of products and limit the choices offered to consumers, thereby foregoing one of the greatest promises of product lines: mass customization.
To improve the efficiency of analyzing SPLs, (1) we analyze a model of an SPL rather than its code and (2) we analyze the SPL model itself rather than models of its products. We extract a model comprising facts (e.g., functions, variables, assignments) from an SPL's source-code artifacts. The facts from different software components are linked together into a lightweight model of the code, called a factbase. The resulting factbase is a typed graphical model that can be analyzed using the Neo4j graph database.
In this thesis, we lift the Neo4j query engine to reason over a factbase of an entire SPL. By lifting the Neo4j query engine, we enable any analysis that can be expressed in the query language to be applicable to an SPL model. The lifted analyses return variability-aware results, in which each result is annotated with a feature expression denoting the products to which the result applies.
We evaluated lifted Neo4j on five real-world open-source SPLs, with respect to ten commonly used analyses of interest. The first evaluation compares the performance of a post-processing approach versus an on-the-fly approach for computing the feature expressions that annotate the variability-aware results of lifted Neo4j. In general, the on-the-fly approach has a smaller runtime than the post-processing approach. The second evaluation assesses the overhead of analyzing a model of an SPL versus a model of a single product, which ranges from 1.88% to 456%. In the third evaluation, we compare the outputs and performance of lifted Neo4j to a related approach that employs the variability-aware V-Soufflé Datalog engine. We found that lifted Neo4j is usually more efficient than V-Soufflé when returning the same results (i.e., the end points of path results). When lifted Neo4j returns complete path results, it is generally slower than V-Soufflé, although lifted Neo4j can outperform V-Soufflé on analyses that return short fixed-length paths.
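The core idea of variability-aware results can be sketched briefly: each edge in the SPL model carries a presence condition, and a lifted query annotates each answer with the conjunction of conditions along its derivation. The sketch below is hedged: it models feature expressions as sets of required features (a simplification of the general boolean formulas the thesis uses), and the call graph and feature names are invented.

```python
calls = [
    # (caller, callee, presence condition: features that must be selected)
    ("main",    "init_net", frozenset()),               # in every product
    ("main",    "init_bt",  frozenset({"BLUETOOTH"})),
    ("init_bt", "log",      frozenset({"LOGGING"})),
]

def lifted_callees(fn):
    """All functions reachable from fn, each annotated with the features a
    product must select for the call chain to exist (a lifted reachability
    query; conjunction of conditions = set union here)."""
    results = {}
    stack = [(fn, frozenset())]
    while stack:
        cur, cond = stack.pop()
        for src, dst, pc in calls:
            if src == cur:
                combined = cond | pc
                # Keep the weakest (smallest) condition found so far.
                if dst not in results or combined < results[dst]:
                    results[dst] = combined
                    stack.append((dst, combined))
    return results

for fn, feats in sorted(lifted_callees("main").items()):
    print(fn, sorted(feats))
# init_bt ['BLUETOOTH']
# init_net []
# log ['BLUETOOTH', 'LOGGING']
```

A single run over the SPL model thus stands in for separate runs over each product: the annotation on `log` says the result holds exactly in products selecting both BLUETOOTH and LOGGING.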
Knowledge Extraction and Visualization from Textual Sources Intended for Construction Project Management
During a construction project lifecycle, an extensive corpus of unstructured or semi-structured text documents is generated. Traditional approaches for information storing and organizing are document-oriented, which is highly inconvenient for data analysis and knowledge extraction. The nature of unstructured sources impedes users' acquisition, analysis, and reuse of relevant information, leading to possible negative effects in the project management process.
This dissertation suggests a procedure for automatic extraction of relevant project concepts from unstructured text documents. Concepts are organized in the form of a key-phrase network, intended to provide users with the possibility to visualize and analyze valuable project facts with less effort. With the objective of constructing a domain-independent and language-independent key-phrase network, with minimal expert involvement for configuration, an approach to detect key phrases was examined by using measures of correlation for word pairs. A network contains key phrases automatically extracted from various types of unstructured documents, with relations based on the similarity of semantic contexts.
The representation was implemented as a graph database, enabling project participants to extract and visualize various patterns in data. The problem of noisy key phrases was reduced by introducing an entropy score for a set of co-occurring contexts and a measure of phrase neighborhood dynamics throughout the construction project lifecycle. A heuristic for the extraction of complex concepts is presented, based on an iterative procedure for the detection of adjacent key phrases belonging to the same semantic subnetwork. Possible applications, such as concept tracking through time or determination of communication patterns between project participants, are demonstrated using a key-phrase network generated for an existing document corpus from an international construction project.
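The two statistics the dissertation builds on can be sketched as follows. This is a hedged toy version: it uses pointwise mutual information as the word-pair correlation measure and Shannon entropy over a phrase's contexts as the noise filter; the corpus, and the choice of PMI specifically, are illustrative assumptions rather than the dissertation's exact formulas.

```python
import math
from collections import Counter

corpus = [
    "concrete pump delivered to site",
    "concrete pump inspection scheduled",
    "site inspection report submitted",
]

tokens = [s.split() for s in corpus]
unigrams = Counter(w for s in tokens for w in s)
bigrams = Counter(p for s in tokens for p in zip(s, s[1:]))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(w1, w2):
    """Pointwise mutual information: high when the pair co-occurs far more
    often than its words' individual frequencies would predict."""
    p_xy = bigrams[(w1, w2)] / n_bi
    p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_xy / (p_x * p_y))

def context_entropy(context_counts):
    """Shannon entropy of the contexts a phrase appears in; a phrase locked
    to one fixed context scores 0 and can be filtered as uninformative."""
    total = sum(context_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in context_counts.values())

print(pmi("concrete", "pump"))  # positive: strong association
print(context_entropy(Counter({"delivered": 1, "inspection": 1})))  # 1.0
```

In a key-phrase network, pairs with high correlation become candidate phrases (nodes), and the entropy and neighborhood-dynamics scores decide which nodes survive filtering.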
Graph database management systems: storage, management and query processing
The proliferation of graph data, generated from diverse sources, has given rise to many research efforts concerning graph analysis. Interactions in social networks, publication networks, protein networks, software code dependencies and transportation systems are all examples of graph-structured data originating from a variety of application domains and demonstrating different characteristics. In recent years, graph database management systems (GDBMS) have been introduced for the management and analysis of graph data. Motivated by the growing number of real-life applications making use of graph database systems, this thesis focuses on the effectiveness and efficiency aspects of such systems. Specifically, we study the following topics relevant to graph database systems: (i) modeling large-scale applications in GDBMS; (ii) storage and indexing issues in GDBMS; and (iii) efficient query processing in GDBMS. In this thesis, we adopt two different application scenarios to examine how graph database systems can model complex features and perform relevant queries on each of them. Motivated by the popular application of social network analytics, we selected Twitter, a microblogging platform, to conduct our detailed analysis. Addressing limitations of existing models, we propose a data model for the Twittersphere that proactively captures Twitter-specific interactions. We examine the feasibility of running analytical queries on GDBMS and offer empirical analysis of the performance of the proposed approach. Next, we consider a use case of modeling software code dependencies in a graph database system, and investigate how these systems can support capturing the evolution of a codebase over time. We study a code comprehension tool that extracts software dependencies and stores them in a graph database.
On a versioned graph built using a very large codebase, we demonstrate how existing code comprehension queries can be efficiently processed, and also show the benefit of running queries across multiple versions. Another important aspect of this thesis is the study of the storage aspects of graph systems. The throughput of many graph queries can be significantly affected by disk I/O performance; therefore, graph database systems need to focus on effective graph storage for optimising disk operations. We observe that the locality of edges plays an important role, and we address the edge-labeling problem, which aims to label both incoming and outgoing edges of a graph so as to maximize the "edge-consecutiveness" metric. By achieving a better layout and locality of edges on disk, we show that our proposed algorithms result in significantly improved disk I/O performance, leading to faster execution of neighbourhood queries. Some applications require the integrated processing of queries from the graph and textual domains within a graph database system. Aggregation of these dimensions facilitates gaining key insights in several application scenarios. For example, in a social network setting, one may want to find the closest k users in the network (graph traversal) who talk about a particular topic A (textual search). Motivated by such practical use cases, in this thesis we study the top-k social-textual ranking query, which essentially requires the efficient combination of a keyword search query with a graph traversal. We propose algorithms that leverage graph partitioning techniques, based on the premise that socially close users will be placed within the same partition, allowing more localised computations. We show that our proposed approaches achieve significantly better results compared to standard baselines and demonstrate robust behaviour under changing parameters.
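The top-k social-textual query described above can be sketched in its simplest form. This is a hedged baseline, not the thesis's partition-based algorithm: a plain breadth-first traversal over a toy friendship graph, with a substring check standing in for the text index; all user names and posts are invented.

```python
from collections import deque

friends = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice"],
    "dave":  ["bob"],
}
posts = {
    "bob":   "great concert tonight",
    "carol": "new graph database release",
    "dave":  "graph processing at scale",
}

def topk_social_textual(user, keyword, k):
    """The k users closest to `user` (BFS order = hop distance) whose posts
    mention `keyword`. A real system would prune using a text index and
    graph partitions instead of visiting every neighbour."""
    seen, hits = {user}, []
    q = deque([user])
    while q and len(hits) < k:
        cur = q.popleft()
        for nb in friends.get(cur, []):
            if nb not in seen:
                seen.add(nb)
                if keyword in posts.get(nb, ""):
                    hits.append(nb)
                    if len(hits) == k:
                        break
                q.append(nb)
    return hits

print(topk_social_textual("alice", "graph", 2))  # ['carol', 'dave']
```

The partitioning premise in the abstract improves on this baseline by keeping socially close users in one partition, so most of the traversal (and its disk I/O) stays local.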