Source-code queries with graph databases - With application to programming language usage and evolution
Program querying and analysis tools are of growing importance, and occur in two main variants. First, there are source-code query languages, which help software engineers to explore a system or to find code in need of refactoring as coding standards evolve; these also enable language designers to understand the practical uses of language features and idioms over a software corpus. Second, there are program analysis tools in the style of Coverity, which perform deeper program analysis, searching for bugs as well as checking adherence to coding standards such as MISRA. The former class is typically implemented on top of relational or deductive databases and makes ad-hoc trade-offs between scalability and the amount of source-code detail held, with consequent limitations on the expressiveness of queries. The latter class is more commercially driven and involves more ad-hoc queries over program representations; nonetheless, similar pressures encourage user-visible domain-specific languages to specify analyses. We argue that a graph data model and associated query language provide a unifying conceptual model and an efficient, scalable implementation even when storing full source-code detail. The model also supports overlays, allowing a query DSL to pose queries at a mixture of syntax-tree, type, control-flow-graph, and data-flow levels. We describe a prototype source-code query system built on top of Neo4j using its Cypher graph query language; experiments show it scales to multi-million-line programs while also storing full source-code detail.
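The style of query this abstract describes can be sketched in miniature. The following is a hedged Python sketch, not the paper's implementation: a toy in-memory property graph standing in for an AST, with a hand-rolled transitive-closure match in the spirit of a Cypher pattern such as `MATCH (m:Method)-[:CONTAINS*]->(c:Call {name: "strcpy"}) RETURN m.name`. All node labels, property names, and the example program are illustrative.

```python
from collections import defaultdict

nodes = {}                 # node id -> {"label": ..., "props": {...}}
edges = defaultdict(list)  # node id -> list of (edge_type, target id)

def add_node(nid, label, **props):
    nodes[nid] = {"label": label, "props": props}

def add_edge(src, etype, dst):
    edges[src].append((etype, dst))

# Tiny syntax tree: a method whose body contains a call to strcpy.
add_node(1, "Method", name="copy_buf")
add_node(2, "Block")
add_node(3, "Call", name="strcpy")
add_edge(1, "CONTAINS", 2)
add_edge(2, "CONTAINS", 3)

def reachable(nid, etype):
    """Transitive closure over one edge type (Cypher's [:CONTAINS*])."""
    out, stack = [], [nid]
    while stack:
        cur = stack.pop()
        for t, dst in edges[cur]:
            if t == etype:
                out.append(dst)
                stack.append(dst)
    return out

def methods_calling(fn_name):
    """Names of all Method nodes that transitively contain a Call to fn_name."""
    hits = []
    for nid, n in nodes.items():
        if n["label"] == "Method":
            for d in reachable(nid, "CONTAINS"):
                dn = nodes[d]
                if dn["label"] == "Call" and dn["props"].get("name") == fn_name:
                    hits.append(n["props"]["name"])
    return hits

print(methods_calling("strcpy"))  # ['copy_buf']
```

A graph database evaluates such patterns against an index rather than scanning all nodes, which is what makes the approach scale to the multi-million-line programs mentioned above.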
An efficient and scalable platform for Java source code analysis using overlaid graph representations
© 2013 IEEE. Although source code programs are commonly written as textual information, they enclose syntactic and semantic information that is usually represented as graphs. This information is used for many different purposes, such as static program analysis, advanced code search, coding guideline checking, software metrics computation, and extraction of semantic and syntactic information to create predictive models. Most of the existing systems that provide these kinds of services are designed ad hoc for the particular purpose they are aimed at. For this reason, we created ProgQuery, a platform that allows users to write their own Java program analyses in a declarative fashion, using graph representations. We modify the Java compiler to compute seven syntactic and semantic representations, and store them in a Neo4j graph database. Such representations are overlaid, meaning that syntactic and semantic nodes of the different graphs are interconnected to allow combining different kinds of information in the queries/analyses. We evaluate ProgQuery and compare it to the related systems. Our platform outperforms the other systems in analysis time, and scales better to program sizes and analysis complexity. Moreover, the queries coded show that ProgQuery is more expressive than the other approaches. The additional information stored by ProgQuery increases the database size and associated insertion time, but these increases are significantly lower than the query/analysis performance gains obtained. Funded by the Spanish Department of Science, Innovation and Universities under Project RTI2018-099235-B-I00.
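The "overlaid" idea above can be illustrated with a toy sketch: the same statement nodes carry both syntax-tree edges and control-flow edges, so a single query can mix syntactic and semantic levels. This is a hedged illustration only; the field names and labels are invented, not ProgQuery's actual schema.

```python
# Each node participates in two overlaid graphs: the AST (ast_parent)
# and the control-flow graph (cfg_next). Illustrative schema.
stmts = {
    "s1": {"kind": "Assign", "ast_parent": "m",  "cfg_next": ["s2"]},
    "s2": {"kind": "If",     "ast_parent": "m",  "cfg_next": ["s3", "s4"]},
    "s3": {"kind": "Call",   "ast_parent": "s2", "cfg_next": []},
    "s4": {"kind": "Return", "ast_parent": "s2", "cfg_next": []},
}
# A function node acting as AST root for the method.
stmts["m"] = {"kind": "Method", "ast_parent": None, "cfg_next": []}

def cfg_children_of_if(start):
    """Mixed-overlay query: CFG nodes two steps after `start` that are,
    syntactically, children of an If node."""
    hits = []
    for nxt in stmts[start]["cfg_next"]:          # control-flow step 1
        for nxt2 in stmts[nxt]["cfg_next"]:       # control-flow step 2
            if stmts[stmts[nxt2]["ast_parent"]]["kind"] == "If":  # AST step
                hits.append(nxt2)
    return hits

print(cfg_children_of_if("s1"))  # ['s3', 's4']
```

The point of the overlay is that neither graph alone can answer this query: it traverses control-flow edges and then checks a syntax-tree property on the nodes it reaches.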
Variability-aware Neo4j for Analyzing a Graphical Model of a Software Product Line
A software product line (SPL) eases the development of families of related products by managing and integrating a collection of mandatory and optional features (units of functionality). Individual products can be derived from the product line by selecting among the optional features. Companies that successfully employ SPLs report dramatic improvements in rapid product development, software quality, labour needs, support for mass customization, and time to market.
In a product line of reasonable size, it is impractical to verify every product because the number of possible feature combinations is exponential in the number of features. As a result, developers might verify a small fraction of products and limit the choices offered to consumers, thereby foregoing one of the greatest promises of product lines: mass customization.
To improve the efficiency of analyzing SPLs, (1) we analyze a model of an SPL rather than its code and (2) we analyze the SPL model itself rather than models of its products. We extract a model comprising facts (e.g., functions, variables, assignments) from an SPL's source-code artifacts. The facts from different software components are linked together into a lightweight model of the code, called a factbase. The resulting factbase is a typed graphical model that can be analyzed using the Neo4j graph database.
In this thesis, we lift the Neo4j query engine to reason over a factbase of an entire SPL. By lifting the Neo4j query engine, we enable any analysis that can be expressed in the query language to be applicable to an SPL model. The lifted analyses return variability-aware results, in which each result is annotated with a feature expression denoting the products to which the result applies.
We evaluated lifted Neo4j on five real-world open-source SPLs, with respect to ten commonly used analyses of interest. The first evaluation compares the performance of a post-processing approach versus an on-the-fly approach for computing the feature expressions that annotate the variability-aware results of lifted Neo4j. In general, the on-the-fly approach has a smaller runtime than the post-processing approach. The second evaluation assesses the overhead of analyzing a model of an SPL versus a model of a single product, which ranges from 1.88% to 456%. In the third evaluation, we compare the outputs and performance of lifted Neo4j to a related approach that employs the variability-aware V-Soufflé Datalog engine. We found that lifted Neo4j is usually more efficient than V-Soufflé when returning the same results (i.e., the end points of path results). When lifted Neo4j returns complete path results, it is generally slower than V-Soufflé, although lifted Neo4j can outperform V-Soufflé on analyses that return short fixed-length paths.
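The core idea of variability-aware results can be sketched briefly: each edge in the SPL model carries a presence condition, and a lifted query annotates each answer with the conjunction of conditions along its derivation. The sketch below is hedged: it models feature expressions as sets of required features (a simplification of the general boolean formulas the thesis uses), and the call graph and feature names are invented.

```python
calls = [
    # (caller, callee, presence condition: features that must be selected)
    ("main",    "init_net", frozenset()),               # in every product
    ("main",    "init_bt",  frozenset({"BLUETOOTH"})),
    ("init_bt", "log",      frozenset({"LOGGING"})),
]

def lifted_callees(fn):
    """All functions reachable from fn, each annotated with the features a
    product must select for the call chain to exist (a lifted reachability
    query; conjunction of conditions = set union here)."""
    results = {}
    stack = [(fn, frozenset())]
    while stack:
        cur, cond = stack.pop()
        for src, dst, pc in calls:
            if src == cur:
                combined = cond | pc
                # Keep the weakest (smallest) condition found so far.
                if dst not in results or combined < results[dst]:
                    results[dst] = combined
                    stack.append((dst, combined))
    return results

for fn, feats in sorted(lifted_callees("main").items()):
    print(fn, sorted(feats))
# init_bt ['BLUETOOTH']
# init_net []
# log ['BLUETOOTH', 'LOGGING']
```

A single run over the SPL model thus stands in for separate runs over each product: the annotation on `log` says the result holds exactly in products selecting both BLUETOOTH and LOGGING.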
Knowledge Extraction and Visualization from Textual Sources Intended for Construction Project Management
During a construction project lifecycle, an extensive corpus of unstructured or semi-structured text documents is generated. Traditional approaches for information storing and organizing are document-oriented, which is highly inconvenient for data analysis and knowledge extraction. The nature of unstructured sources impedes users' acquisition, analysis, and reuse of relevant information, leading to possible negative effects in the project management process.
This dissertation suggests a procedure for automatic extraction of relevant project concepts from unstructured text documents. Concepts are organized in the form of a key-phrase network, intended to provide users with the possibility to visualize and analyze valuable project facts with less effort. With the objective of constructing a domain-independent and language-independent key-phrase network, with minimal expert involvement for configuration, an approach to detect key phrases was examined by using measures of correlation for word pairs. A network contains key phrases automatically extracted from various types of unstructured documents, with relations based on the similarity of semantic contexts.
The representation was implemented as a graph database, enabling project participants to extract and visualize various patterns in data. The problem of noisy key phrases was reduced by introducing an entropy score for a set of co-occurring contexts and a measure of phrase neighborhood dynamics throughout the construction project lifecycle. A heuristic for the extraction of complex concepts is presented, based on an iterative procedure for the detection of adjacent key phrases belonging to the same semantic subnetwork. Possible applications, such as concept tracking through time or determination of communication patterns between project participants, are demonstrated using a key-phrase network generated for an existing document corpus from an international construction project.
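The two statistics the dissertation builds on can be sketched as follows. This is a hedged toy version: it uses pointwise mutual information as the word-pair correlation measure and Shannon entropy over a phrase's contexts as the noise filter; the corpus, and the choice of PMI specifically, are illustrative assumptions rather than the dissertation's exact formulas.

```python
import math
from collections import Counter

corpus = [
    "concrete pump delivered to site",
    "concrete pump inspection scheduled",
    "site inspection report submitted",
]

tokens = [s.split() for s in corpus]
unigrams = Counter(w for s in tokens for w in s)
bigrams = Counter(p for s in tokens for p in zip(s, s[1:]))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(w1, w2):
    """Pointwise mutual information: high when the pair co-occurs far more
    often than its words' individual frequencies would predict."""
    p_xy = bigrams[(w1, w2)] / n_bi
    p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_xy / (p_x * p_y))

def context_entropy(context_counts):
    """Shannon entropy of the contexts a phrase appears in; a phrase locked
    to one fixed context scores 0 and can be filtered as uninformative."""
    total = sum(context_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in context_counts.values())

print(pmi("concrete", "pump"))  # positive: strong association
print(context_entropy(Counter({"delivered": 1, "inspection": 1})))  # 1.0
```

In a key-phrase network, pairs with high correlation become candidate phrases (nodes), and the entropy and neighborhood-dynamics scores decide which nodes survive filtering.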
Graph database management systems: storage, management and query processing
The proliferation of graph data, generated from diverse sources, has given rise to many research efforts concerning graph analysis. Interactions in social networks, publication networks, protein networks, software code dependencies and transportation systems are all examples of graph-structured data originating from a variety of application domains and demonstrating different characteristics. In recent years, graph database management systems (GDBMS) have been introduced for the management and analysis of graph data. Motivated by the growing number of real-life applications making use of graph database systems, this thesis focuses on the effectiveness and efficiency aspects of such systems. Specifically, we study the following topics relevant to graph database systems: (i) modeling large-scale applications in GDBMS; (ii) storage and indexing issues in GDBMS; and (iii) efficient query processing in GDBMS. In this thesis, we adopt two different application scenarios to examine how graph database systems can model complex features and perform relevant queries on each of them. Motivated by the popular application of social network analytics, we selected Twitter, a microblogging platform, to conduct our detailed analysis. Addressing limitations of existing models, we propose a data model for the Twittersphere that proactively captures Twitter-specific interactions. We examine the feasibility of running analytical queries on GDBMS and offer empirical analysis of the performance of the proposed approach. Next, we consider a use case of modeling software code dependencies in a graph database system, and investigate how these systems can support capturing the evolution of a codebase over time. We study a code comprehension tool that extracts software dependencies and stores them in a graph database.
On a versioned graph built using a very large codebase, we demonstrate how existing code comprehension queries can be efficiently processed, and also show the benefit of running queries across multiple versions. Another important aspect of this thesis is the study of the storage aspects of graph systems. The throughput of many graph queries can be significantly affected by disk I/O performance; therefore, graph database systems need to focus on effective graph storage for optimising disk operations. We observe that the locality of edges plays an important role, and we address the edge-labeling problem, which aims to label both incoming and outgoing edges of a graph so as to maximize the "edge-consecutiveness" metric. By achieving a better layout and locality of edges on disk, we show that our proposed algorithms result in significantly improved disk I/O performance, leading to faster execution of neighbourhood queries. Some applications require the integrated processing of queries from the graph and textual domains within a graph database system. Aggregation of these dimensions facilitates gaining key insights in several application scenarios. For example, in a social network setting, one may want to find the closest k users in the network (graph traversal) who talk about a particular topic A (textual search). Motivated by such practical use cases, in this thesis we study the top-k social-textual ranking query, which essentially requires the efficient combination of a keyword search query with a graph traversal. We propose algorithms that leverage graph partitioning techniques, based on the premise that socially close users will be placed within the same partition, allowing more localised computations. We show that our proposed approaches achieve significantly better results compared to standard baselines and demonstrate robust behaviour under changing parameters.
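The top-k social-textual query described above can be sketched in its simplest form. This is a hedged baseline, not the thesis's partition-based algorithm: a plain breadth-first traversal over a toy friendship graph, with a substring check standing in for the text index; all user names and posts are invented.

```python
from collections import deque

friends = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice"],
    "dave":  ["bob"],
}
posts = {
    "bob":   "great concert tonight",
    "carol": "new graph database release",
    "dave":  "graph processing at scale",
}

def topk_social_textual(user, keyword, k):
    """The k users closest to `user` (BFS order = hop distance) whose posts
    mention `keyword`. A real system would prune using a text index and
    graph partitions instead of visiting every neighbour."""
    seen, hits = {user}, []
    q = deque([user])
    while q and len(hits) < k:
        cur = q.popleft()
        for nb in friends.get(cur, []):
            if nb not in seen:
                seen.add(nb)
                if keyword in posts.get(nb, ""):
                    hits.append(nb)
                    if len(hits) == k:
                        break
                q.append(nb)
    return hits

print(topk_social_textual("alice", "graph", 2))  # ['carol', 'dave']
```

The partitioning premise in the abstract improves on this baseline by keeping socially close users in one partition, so most of the traversal (and its disk I/O) stays local.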