Knowledge Search within a Company-WIKI
The use of Wikis for knowledge management within a company is only of value if the stored information can be found easily. The fundamental characteristic of a Wiki, its easy and informal usage, results in large amounts of steadily changing, unstructured documents. The widely used full-text search often provides results of insufficient accuracy. In this paper, we present an approach likely to improve search quality through the use of Semantic Web, Text Mining, and Case-Based Reasoning (CBR) technologies. Search results are more precise and complete because, in contrast to full-text search, the proposed knowledge-based search operates on the semantic layer.
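The gap between full-text and semantic search can be illustrated with a minimal sketch. The concept index, page names, and texts below are hypothetical stand-ins, not the paper's actual system; the point is only that expanding a query over a semantic layer of related concepts retrieves pages a literal string match misses.

```python
# Tiny synonym/concept index standing in for the semantic layer (hypothetical data).
CONCEPTS = {
    "vacation": {"vacation", "holiday", "leave"},
    "invoice": {"invoice", "bill", "receipt"},
}

# Hypothetical wiki pages mapped to their text content.
PAGES = {
    "HowToRequestLeave": "Fill in the leave form and send it to HR.",
    "TravelPolicy": "Holiday bookings must be approved in advance.",
    "Billing": "Send the bill to accounting.",
}

def fulltext_search(query):
    """Literal substring match, as in plain full-text search."""
    return [p for p, text in PAGES.items() if query in text.lower()]

def semantic_search(query):
    """Expand the query over the concept index before matching."""
    terms = CONCEPTS.get(query, {query})
    return [p for p, text in PAGES.items()
            if any(t in text.lower() for t in terms)]

print(fulltext_search("vacation"))   # []
print(semantic_search("vacation"))   # ['HowToRequestLeave', 'TravelPolicy']
```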
A Neural Model for Generating Natural Language Summaries of Program Subroutines
Source code summarization -- creating natural language descriptions of source
code behavior -- is a rapidly-growing research topic with applications to
automatic documentation generation, program comprehension, and software
maintenance. Traditional techniques relied on heuristics and templates built
manually by human experts. Recently, data-driven approaches based on neural
machine translation have largely overtaken template-based systems. But nearly
all of these techniques rely almost entirely on programs having good internal
documentation; without clear identifier names, the models fail to create good
summaries. In this paper, we present a neural model that combines words from
code with code structure from an AST. Unlike previous approaches, our model
processes each data source as a separate input, which allows the model to learn
code structure independent of the text in code. This process helps our approach
provide coherent summaries in many cases even when zero internal documentation
is provided. We evaluate our technique with a dataset we created from 2.1m Java
methods. We find improvement over two baseline techniques from SE literature
and one from NLP literature.
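The key idea of processing code text and AST structure as separate inputs can be sketched as follows. This is a toy dual-channel encoder with mean-pooled embeddings, not the authors' actual network; the vocabularies, dimensions, and token lists are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies for the two separate input channels: code tokens and
# flattened AST node types. Each channel gets its own embedding table, so
# structural information is learned independently of identifier text.
CODE_VOCAB = {"public": 0, "int": 1, "add": 2, "a": 3, "b": 4, "return": 5}
AST_VOCAB = {"MethodDecl": 0, "Param": 1, "ReturnStmt": 2, "BinaryOp": 3}

EMB_DIM = 8
code_emb = rng.normal(size=(len(CODE_VOCAB), EMB_DIM))
ast_emb = rng.normal(size=(len(AST_VOCAB), EMB_DIM))

def encode(tokens, vocab, table):
    """Mean-pool the embeddings of one input channel."""
    ids = [vocab[t] for t in tokens if t in vocab]
    return table[ids].mean(axis=0)

def joint_representation(code_tokens, ast_nodes):
    """Concatenate the two channel encodings, dual-encoder style."""
    return np.concatenate([
        encode(code_tokens, CODE_VOCAB, code_emb),
        encode(ast_nodes, AST_VOCAB, ast_emb),
    ])

vec = joint_representation(
    ["public", "int", "add", "a", "b", "return"],
    ["MethodDecl", "Param", "Param", "ReturnStmt", "BinaryOp"],
)
print(vec.shape)  # (16,)
```

Even if the code channel carries uninformative identifiers, the AST channel still contributes a structural signal, which is the property the abstract highlights.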
An Empirical Investigation for Understanding
When modernising a large monolithic application, speed, synchronisation, and interaction with other components are the major concerns for the practical implementation of the target system. As Service-Oriented Computing extends into many areas of monolithic legacy and web-oriented development, these aspects pose new challenges to existing software engineering practices. This paper presents work undertaken towards the service orientation of a monolithic legacy application, including the initial steps of service understanding, comprehension, and extraction, so that the application can take part in further migration activities towards a service-oriented architecture platform. The work also shows how several useful techniques can be applied to accomplish this result.
Enhancing Source Code Representations for Deep Learning with Static Analysis
Deep learning techniques applied to program analysis tasks such as code
classification, summarization, and bug detection have seen widespread interest.
Traditional approaches, however, treat programming source code as natural
language text, which may neglect significant structural or semantic details.
Additionally, most current methods of representing source code focus solely on
the code, without considering beneficial additional context. This paper
explores the integration of static analysis and additional context such as bug
reports and design patterns into source code representations for deep learning
models. We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and
augment it with additional context information obtained from bug reports and
design patterns, creating an enriched source code representation that
significantly enhances the performance of common software engineering tasks
such as code classification and code clone detection. Utilizing existing
open-source code data, our approach improves the representation and processing
of source code, thereby improving task performance.
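The enrichment step can be sketched as appending context channels to a code vector. The hashed bag-of-words embedding and the stand-in code vector below are illustrative assumptions, not the ASTNN pipeline itself; the point is only that bug-report text and design-pattern labels become extra dimensions alongside the code encoding.

```python
import hashlib

import numpy as np

DIM = 16

def hash_embed(text, dim=DIM):
    """Crude hashed bag-of-words embedding for auxiliary context text."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    return v / max(1.0, np.linalg.norm(v))

def enrich(code_vec, bug_report, design_pattern):
    """Append context channels (bug-report text, design-pattern label)
    to the base code representation."""
    return np.concatenate(
        [code_vec, hash_embed(bug_report), hash_embed(design_pattern)]
    )

code_vec = np.ones(DIM)  # stand-in for an ASTNN encoding of a method
rep = enrich(code_vec, "null pointer dereference in parser", "Visitor")
print(rep.shape)  # (48,)
```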
Similarity of Source Code in the Presence of Pervasive Modifications
Source code analysis to detect code cloning, code plagiarism, and code reuse suffers from the problem of pervasive code modifications, i.e. transformations that may have a global effect. We compare 30 similarity detection techniques and tools against pervasive code modifications. We evaluate the tools using two experimental scenarios for Java source code: (1) pervasive modifications created with tools for source code and bytecode obfuscation, and (2) source code normalisation through compilation and decompilation using different decompilers. Our experimental results show that highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique: its use reduced false classifications to zero for six of the tools. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.
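Why normalisation defeats pervasive modifications can be sketched with a much simpler normaliser than compilation/decompilation: mapping every identifier to a single placeholder before comparing token n-grams. This is an illustrative technique in the same spirit, not one of the 30 evaluated tools; the keyword list and similarity measure are assumptions.

```python
import re

KEYWORDS = {"int", "return", "public", "void", "if", "else", "for", "while"}

def normalise(src):
    """Rename every identifier to a single placeholder so that systematic
    renaming (one kind of pervasive modification) no longer matters."""
    toks = re.findall(r"[A-Za-z_]\w*|\S", src)
    return ["ID" if t[0].isalpha() and t not in KEYWORDS else t for t in toks]

def jaccard(a, b, n=3):
    """Jaccard similarity over token n-grams of two token lists."""
    grams = lambda t: {tuple(t[i:i + n]) for i in range(len(t) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

original = "int add(int a, int b) { return a + b; }"
renamed = "int sum(int x, int y) { return x + y; }"
print(jaccard(normalise(original), normalise(renamed)))  # 1.0
```

Without the normalisation step the two snippets share far fewer n-grams; after it, the renamed copy is indistinguishable from the original, which mirrors how compiling and decompiling canonicalises away surface-level edits.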
Sparse Attention-Based Neural Networks for Code Classification
Categorizing source code accurately and efficiently is a challenging problem
in real-world programming education platform management. In recent years,
model-based approaches utilizing abstract syntax trees (ASTs) have been widely
applied to code classification tasks. We introduce an approach named the Sparse
Attention-based neural network for Code Classification (SACC) in this paper.
The approach involves two main steps: In the first step, source code undergoes
syntax parsing and preprocessing. The generated abstract syntax tree is split
into sequences of subtrees and then encoded using a recursive neural network to
obtain a high-dimensional representation. This step simultaneously considers
both the logical structure and lexical level information contained within the
code. In the second step, the encoded sequences of subtrees are fed into a
Transformer model that incorporates sparse attention mechanisms for the purpose
of classification. This method efficiently reduces the computational cost of
the self-attention mechanisms, thus improving the training speed while
preserving effectiveness. Our work introduces a carefully designed sparse
attention pattern tailored to the unique needs of code classification tasks.
This design helps reduce the influence of redundant
information and enhances the overall performance of the model. Finally, we also
deal with problems in previous related research, which include issues like
incomplete classification labels and small dataset sizes. We annotated the
CodeNet dataset, which contains a significantly larger amount of data, with
algorithm-related labeling categories. Extensive comparative experimental results
demonstrate the effectiveness and efficiency of SACC for the code
classification tasks.
Comment: 2023 3rd International Conference on Digital Society and Intelligent Systems (DSInS 2023).
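The cost saving from sparse attention comes from masking most query-key pairs. The sliding-window-plus-global pattern below is a generic example of such a mask, not necessarily the specific pattern SACC uses; window size and global-token count are assumptions.

```python
import numpy as np

def sparse_mask(n, window=2, n_global=1):
    """Local sliding-window attention plus a few global tokens: each position
    attends to neighbours within `window`, and the first `n_global` tokens
    attend to and are attended by everything."""
    m = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        m[i, lo:hi] = True
    m[:, :n_global] = True
    m[:n_global, :] = True
    return m

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with disallowed pairs masked out."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)  # effectively -inf before softmax
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d = 8, 4
q = k = v = rng.normal(size=(n, d))
out = masked_attention(q, k, v, sparse_mask(n))
print(out.shape, sparse_mask(n).sum(), "of", n * n, "pairs attended")
```

Because only O(n·window) pairs are scored instead of O(n²), training over long subtree sequences gets cheaper while local structure is still modelled.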
Architectural Layer Recovery for Software System Understanding and Evolution
This paper presents an approach to identify software layers for the understanding and evolution of software systems implemented in any object-oriented programming language. The approach first identifies relations between the classes of a software system and then uses a link analysis algorithm (i.e. the Kleinberg algorithm) to group them into layers. Additionally, to assess the approach and the underlying techniques, the paper presents a prototype of a supporting tool and the results of a case study.
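The Kleinberg (HITS) algorithm at the core of the approach can be sketched on a toy class-dependency graph. The class names and adjacency matrix are invented for illustration; how the paper maps hub/authority scores onto layers is its own contribution, but intuitively classes that mainly *use* others score as hubs (upper layers) while classes that are mainly *used* score as authorities (lower layers).

```python
import numpy as np

# Toy class-dependency adjacency matrix: A[i, j] = 1 if class i uses class j.
classes = ["Ui", "Controller", "Service", "Dao", "Db"]
A = np.array([
    [0, 1, 0, 0, 0],  # Ui -> Controller
    [0, 0, 1, 0, 0],  # Controller -> Service
    [0, 0, 0, 1, 0],  # Service -> Dao
    [0, 0, 0, 0, 1],  # Dao -> Db
    [0, 0, 0, 0, 0],
], dtype=float)

def hits(adj, iters=50):
    """Kleinberg's HITS: hubs point at authorities; authorities are pointed at."""
    n = adj.shape[0]
    hub, auth = np.ones(n), np.ones(n)
    for _ in range(iters):
        auth = adj.T @ hub
        auth /= np.linalg.norm(auth)
        hub = adj @ auth
        hub /= np.linalg.norm(hub)
    return hub, auth

hub, auth = hits(A)
for name, h, a in zip(classes, hub, auth):
    print(f"{name:12s} hub={h:.2f} auth={a:.2f}")
```

On this chain-shaped graph, `Db` gets hub score 0 (it uses nothing) and `Ui` gets authority score 0 (nothing uses it), matching the intuition of bottom and top layers.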
Automatically Extracting Subroutine Summary Descriptions from Unstructured Comments
Summary descriptions of subroutines are short (usually one-sentence) natural
language explanations of a subroutine's behavior and purpose in a program.
These summaries are ubiquitous in documentation, and many tools such as
JavaDocs and Doxygen generate documentation built around them. And yet,
extracting summaries from unstructured source code repositories remains a
difficult research problem -- it is very difficult to generate clean structured
documentation unless the summaries are annotated by programmers. This becomes a
problem in large repositories of legacy code, since it is cost prohibitive to
retroactively annotate summaries in dozens or hundreds of old programs.
Likewise, it is a problem for creators of automatic documentation generation
algorithms, since these algorithms usually must learn from large annotated
datasets, which do not exist for many programming languages. In this paper, we
present a semi-automated approach via crowdsourcing and a fully-automated
approach for annotating summaries from unstructured code comments. We present
experiments validating the approaches, and provide recommendations and cost
estimates for automatically annotating large repositories.
Comment: 10 pages, plus references. Accepted for publication in the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, London, Ontario, Canada, February 18-21, 202
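A fully-automated annotator of this kind typically rests on simple heuristics over comment text. The sketch below (strip comment markers, drop tag and boilerplate lines, keep the first sentence) is an illustrative baseline under assumed conventions, not the paper's evaluated approach.

```python
import re

def extract_summary(comment_block):
    """Heuristic summary extraction: strip comment markers, skip boilerplate
    lines such as @param tags or license headers, take the first sentence."""
    lines = []
    for line in comment_block.splitlines():
        line = re.sub(r"^\s*(/\*+|\*+/|\*|//)\s?", "", line).strip()
        if not line or line.startswith("@") or "license" in line.lower():
            continue
        lines.append(line)
    text = " ".join(lines)
    m = re.match(r"(.+?[.!?])(\s|$)", text)
    return m.group(1) if m else text

doc = """/**
 * Computes the sum of two integers. Overflow is not checked.
 * @param a first operand
 */"""
print(extract_summary(doc))  # Computes the sum of two integers.
```

Heuristics like this produce the candidate summaries; the crowdsourced track in the paper then serves to validate or correct them where the comment structure is messier.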
An Investigation of Clustering Algorithms in the Identification of Similar Web Pages
In this paper we investigate the effect of using clustering algorithms in the reverse engineering field to identify pages that are similar either at the structural level or at the content level. To this end, we have used two instances of a general process that differ only in the measure used to compare web pages. In particular, two web pages are compared at the structural level using the Levenshtein edit distance and at the content level using Latent Semantic Indexing. The static pages of two web applications and one static web site have been used to compare the results achieved with the considered clustering algorithms at both the structural and the content level. On these applications we generally achieved comparable results. However, the investigation has also suggested some heuristics to quickly identify the best partition of web pages into clusters among the possible partitions, both at the structural and at the content level.
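The structural-level comparison can be sketched as Levenshtein distance over page markup feeding a clustering step. The greedy single-linkage grouping and the distance threshold below are illustrative simplifications, not one of the specific algorithms the study compares.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cluster(items, threshold):
    """Greedy single-linkage clustering: join an item to the first cluster
    containing a member within the normalised edit-distance threshold."""
    clusters = []
    for item in items:
        for c in clusters:
            if any(levenshtein(item, other) / max(len(item), len(other), 1)
                   < threshold for other in c):
                c.append(item)
                break
        else:
            clusters.append([item])
    return clusters

pages = [
    "<html><body><h1>News</h1><p>story</p></body></html>",
    "<html><body><h1>News</h1><p>other story</p></body></html>",
    "<html><table><tr><td>data</td></tr></table></html>",
]
print(len(cluster(pages, threshold=0.2)))  # 2
```

The two structurally similar news pages fall into one cluster while the table-based page stays separate; swapping the distance for an LSI-based measure over page text would give the content-level variant of the same process.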