8 research outputs found

    An efficient and scalable platform for Java source code analysis using overlaid graph representations

    © 2013 IEEE. Although source code programs are commonly written as textual information, they enclose syntactic and semantic information that is usually represented as graphs. This information is used for many different purposes, such as static program analysis, advanced code search, coding-guideline checking, software metrics computation, and the extraction of semantic and syntactic information to create predictive models. Most of the existing systems that provide these kinds of services are designed ad hoc for the particular purpose they are aimed at. For this reason, we created ProgQuery, a platform that allows users to write their own Java program analyses in a declarative fashion, using graph representations. We modify the Java compiler to compute seven syntactic and semantic representations and store them in a Neo4j graph database. Such representations are overlaid, meaning that syntactic and semantic nodes of the different graphs are interconnected, allowing different kinds of information to be combined in the queries/analyses. We evaluate ProgQuery and compare it to related systems. Our platform outperforms the other systems in analysis time, and it scales better with program size and analysis complexity. Moreover, the queries written show that ProgQuery is more expressive than the other approaches. The additional information stored by ProgQuery increases the database size and the associated insertion time, but these increases are significantly lower than the query/analysis performance gains obtained. Funded by the Spanish Department of Science, Innovation and Universities under Project RTI2018-099235-B-I00.
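
    The abstract does not show what a ProgQuery analysis looks like, so the following is a minimal, hypothetical sketch of the general idea only: running a declarative graph query against a Neo4j program graph through the standard Neo4j Java driver. The METHOD label, the CALLS relationship, the name property, and the connection settings are illustrative assumptions, not ProgQuery's actual schema.

        import org.neo4j.driver.AuthTokens;
        import org.neo4j.driver.Driver;
        import org.neo4j.driver.GraphDatabase;
        import org.neo4j.driver.Record;
        import org.neo4j.driver.Result;
        import org.neo4j.driver.Session;

        public class RecursionAnalysis {
            public static void main(String[] args) {
                // Connect to a local Neo4j instance that holds the program graph
                // (URI and credentials are placeholders).
                try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                        AuthTokens.basic("neo4j", "password"));
                     Session session = driver.session()) {

                    // Hypothetical overlaid-graph schema: (:METHOD {name}) nodes
                    // connected by [:CALLS] edges. The query flags directly
                    // recursive methods, i.e. methods that call themselves.
                    String cypher = "MATCH (m:METHOD)-[:CALLS]->(m) "
                                  + "RETURN m.name AS recursiveMethod";

                    Result result = session.run(cypher);
                    while (result.hasNext()) {
                        Record row = result.next();
                        System.out.println(row.get("recursiveMethod").asString());
                    }
                }
            }
        }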

    Variability-aware Neo4j for Analyzing a Graphical Model of a Software Product Line

    A software product line (SPL) eases the development of families of related products by managing and integrating a collection of mandatory and optional features (units of functionality). Individual products can be derived from the product line by selecting among the optional features. Companies that successfully employ SPLs report dramatic improvements in rapid product development, software quality, labour needs, support for mass customization, and time to market. In a product line of reasonable size, it is impractical to verify every product because the number of possible feature combinations is exponential in the number of features. As a result, developers might verify only a small fraction of products and limit the choices offered to consumers, thereby forgoing one of the greatest promises of product lines: mass customization.

To improve the efficiency of analyzing SPLs, (1) we analyze a model of an SPL rather than its code, and (2) we analyze the SPL model itself rather than models of its products. We extract a model comprising facts (e.g., functions, variables, assignments) from an SPL's source-code artifacts. The facts from different software components are linked together into a lightweight model of the code, called a factbase. The resulting factbase is a typed graphical model that can be analyzed using the Neo4j graph database. In this thesis, we lift the Neo4j query engine to reason over a factbase of an entire SPL. By lifting the Neo4j query engine, we enable any analysis that can be expressed in the query language to be applied to an SPL model. The lifted analyses return variability-aware results, in which each result is annotated with a feature expression denoting the products to which the result applies.

We evaluated lifted Neo4j on five real-world open-source SPLs, with respect to ten commonly used analyses of interest. The first evaluation compares the performance of a post-processing approach versus an on-the-fly approach for computing the feature expressions that annotate the variability-aware results of lifted Neo4j. In general, the on-the-fly approach has a smaller runtime than the post-processing approach. The second evaluation assesses the overhead of analyzing a model of an SPL versus a model of a single product, which ranges from 1.88% to 456%. In the third evaluation, we compare the outputs and performance of lifted Neo4j to related work that employs the variability-aware V-Soufflé Datalog engine. We found that lifted Neo4j is usually more efficient than V-Soufflé when returning the same results (i.e., the end points of path results). When lifted Neo4j returns complete path results, it is generally slower than V-Soufflé, although lifted Neo4j can outperform V-Soufflé on analyses that return short fixed-length paths.
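
    The abstract does not describe how feature expressions are represented inside lifted Neo4j, so the sketch below only approximates the outward behaviour of a variability-aware query. It assumes a hypothetical factbase schema in which each Function node and calls edge carries a pc (presence condition) string property, and it concatenates those conditions in plain Cypher to mimic annotated results; the function name init_device and the connection settings are also assumptions. The real lifted engine computes and simplifies the feature expressions itself.

        import org.neo4j.driver.AuthTokens;
        import org.neo4j.driver.Driver;
        import org.neo4j.driver.GraphDatabase;
        import org.neo4j.driver.Record;
        import org.neo4j.driver.Result;
        import org.neo4j.driver.Session;
        import org.neo4j.driver.Values;

        public class VariabilityAwareCallers {
            public static void main(String[] args) {
                try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                        AuthTokens.basic("neo4j", "password"));
                     Session session = driver.session()) {

                    // Hypothetical factbase: (:Function {name, pc}) nodes and
                    // [:calls {pc}] edges, where pc is a feature expression
                    // (presence condition) stored as a string.
                    String cypher =
                          "MATCH (caller:Function)-[c:calls]->(callee:Function {name: $fn}) "
                        + "RETURN caller.name AS caller, "
                        + "       caller.pc + ' AND ' + c.pc AS featureExpression";

                    Result result = session.run(cypher, Values.parameters("fn", "init_device"));
                    while (result.hasNext()) {
                        Record row = result.next();
                        // Each caller is reported together with the feature
                        // combination (products) in which the call exists.
                        System.out.printf("%s  [%s]%n",
                                row.get("caller").asString(),
                                row.get("featureExpression").asString());
                    }
                }
            }
        }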

    Knowledge Extraction and Visualization from Textual Sources Intended for Construction Project Management

    During a construction project lifecycle, an extensive corpus of unstructured or semi-structured text documents is generated. Traditional approaches to storing and organizing information are document-oriented, which is highly inconvenient for data analysis and knowledge extraction. The nature of unstructured sources impedes users' acquisition, analysis, and reuse of relevant information, leading to possible negative effects in the project management process. This dissertation presents a procedure for the automatic extraction of relevant project concepts from unstructured text documents.

Concepts are organized in the form of a key-phrase network, intended to let users visualize and analyze valuable project facts with less effort. With the objective of constructing a domain-independent and language-independent key-phrase network, with minimal expert involvement for configuration, an approach to detecting key phrases in a multilingual setting was examined, using statistical measures of correlation for word pairs. The network contains key phrases automatically extracted from various types of unstructured documents, with relations based on the similarity of their semantic contexts. The representation was implemented as a graph database, enabling project participants to extract and visualize various hidden patterns in the data. The problem of noisy key phrases was reduced by introducing an entropy score for a set of co-occurring contexts and a measure of phrase-neighborhood dynamics throughout the construction project lifecycle. A heuristic for the extraction of complex concepts is presented, based on an iterative procedure for detecting adjacent key phrases that belong to the same semantic subnetwork. Possible applications, such as tracking concepts through time or determining communication patterns between project participants, are demonstrated using a key-phrase network generated for an existing document corpus from an international construction project.
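
    The summary mentions statistical measures of correlation for word pairs without specifying which one, so the snippet below uses pointwise mutual information (PMI) purely as an illustrative choice, with made-up counts standing in for statistics gathered from project documents.

        public class WordPairScorer {

            /**
             * Pointwise mutual information for a word pair, estimated from counts:
             * PMI(a, b) = log( P(a, b) / (P(a) * P(b)) ).
             * A high score marks the pair as a candidate key phrase.
             */
            static double pmi(long pairCount, long countA, long countB, long total) {
                double pAB = (double) pairCount / total;
                double pA = (double) countA / total;
                double pB = (double) countB / total;
                return Math.log(pAB / (pA * pB));
            }

            public static void main(String[] args) {
                // Toy counts (assumed, not taken from the thesis corpus).
                long total = 100_000;        // observed word-pair occurrences
                long countReinforced = 420;  // occurrences of "reinforced"
                long countConcrete = 1_300;  // occurrences of "concrete"
                long pairCount = 310;        // co-occurrences of the pair

                double score = pmi(pairCount, countReinforced, countConcrete, total);
                System.out.printf("PMI(reinforced, concrete) = %.3f%n", score);
            }
        }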

    Graph database management systems: storage, management and query processing

    The proliferation of graph data, generated from diverse sources, has given rise to many research efforts concerning graph analysis. Interactions in social networks, publication networks, protein networks, software code dependencies, and transportation systems are all examples of graph-structured data originating from a variety of application domains and demonstrating different characteristics. In recent years, graph database management systems (GDBMS) have been introduced for the management and analysis of graph data. Motivated by the growing number of real-life applications making use of graph database systems, this thesis focuses on the effectiveness and efficiency aspects of such systems. Specifically, we study the following topics relevant to graph database systems: (i) modeling large-scale applications in GDBMS; (ii) storage and indexing issues in GDBMS; and (iii) efficient query processing in GDBMS.

In this thesis, we adopt two different application scenarios to examine how graph database systems can model complex features and perform relevant queries on each of them. Motivated by the popular application of social network analytics, we selected Twitter, a microblogging platform, to conduct our detailed analysis. Addressing limitations of existing models, we propose a data model for the Twittersphere that proactively captures Twitter-specific interactions. We examine the feasibility of running analytical queries on GDBMS and offer an empirical analysis of the performance of the proposed approach. Next, we consider a use case of modeling software code dependencies in a graph database system and investigate how these systems can support capturing the evolution of a codebase over time. We study a code comprehension tool that extracts software dependencies and stores them in a graph database. On a versioned graph built from a very large codebase, we demonstrate how existing code comprehension queries can be efficiently processed and also show the benefit of running queries across multiple versions.

Another important aspect of this thesis is the study of the storage aspects of graph systems. The throughput of many graph queries can be significantly affected by disk I/O performance; therefore, graph database systems need to focus on effective graph storage for optimising disk operations. We observe that the locality of edges plays an important role, and we address the edge-labeling problem, which aims to label both the incoming and outgoing edges of a graph while maximizing the 'edge-consecutiveness' metric. By achieving a better layout and locality of edges on disk, we show that our proposed algorithms result in significantly improved disk I/O performance, leading to faster execution of neighbourhood queries.

Some applications require the integrated processing of queries from the graph and textual domains within a graph database system. Aggregating these dimensions facilitates gaining key insights in several application scenarios. For example, in a social network setting, one may want to find the closest k users in the network (graph traversal) who talk about a particular topic A (textual search). Motivated by such practical use cases, in this thesis we study the top-k social-textual ranking query, which essentially requires the efficient combination of a keyword search query with a graph traversal. We propose algorithms that leverage graph partitioning techniques, based on the premise that socially close users will be placed within the same partition, allowing more localised computations. We show that our proposed approaches achieve significantly better results than standard baselines and demonstrate robust behaviour under changing parameters.
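
    The abstract defines the top-k social-textual ranking query but not the partition-based algorithms themselves, so the following is only a minimal in-memory baseline illustrating the query semantics: a breadth-first traversal from the query user that keeps the first k users whose text mentions a keyword, with hop count standing in for social closeness. All data structures, names, and values are illustrative.

        import java.util.*;

        public class TopKSocialTextual {

            /**
             * Illustrative baseline: breadth-first traversal from the query user,
             * keeping the first k users whose text mentions the keyword.
             */
            static List<String> topK(Map<String, List<String>> friends,
                                     Map<String, String> posts,
                                     String queryUser, String keyword, int k) {
                List<String> results = new ArrayList<>();
                Set<String> visited = new HashSet<>();
                Deque<String> queue = new ArrayDeque<>();
                visited.add(queryUser);
                queue.add(queryUser);

                while (!queue.isEmpty() && results.size() < k) {
                    String user = queue.poll();
                    String text = posts.getOrDefault(user, "");
                    if (!user.equals(queryUser) && text.toLowerCase().contains(keyword)) {
                        results.add(user);  // found a socially close user on the topic
                    }
                    for (String friend : friends.getOrDefault(user, List.of())) {
                        if (visited.add(friend)) {
                            queue.add(friend);
                        }
                    }
                }
                return results;
            }

            public static void main(String[] args) {
                Map<String, List<String>> friends = Map.of(
                    "alice", List.of("bob", "carol"),
                    "bob",   List.of("alice", "dave"),
                    "carol", List.of("alice"),
                    "dave",  List.of("bob"));
                Map<String, String> posts = Map.of(
                    "bob",   "thoughts on graph databases",
                    "carol", "holiday photos",
                    "dave",  "benchmarking graph databases");

                // Prints [bob, dave]: the two closest users mentioning "graph".
                System.out.println(topK(friends, posts, "alice", "graph", 2));
            }
        }

    In the thesis's setting, the same query semantics would be evaluated inside the graph database, with partition-aware pruning replacing the exhaustive traversal sketched above.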