1,192 research outputs found
Mira: A Framework for Static Performance Analysis
The performance model of an application can pro- vide understanding about its
runtime behavior on particular hardware. Such information can be analyzed by
developers for performance tuning. However, model building and analyzing is
frequently ignored during software development until perfor- mance problems
arise because they require significant expertise and can involve many
time-consuming application runs. In this paper, we propose a fast, accurate,
flexible and user-friendly tool, Mira, for generating performance models by
applying static program analysis, targeting scientific applications running on
supercomputers. We parse both the source code and binary to estimate
performance attributes with better accuracy than considering just source or
just binary code. Because our analysis is static, the target program does not
need to be executed on the target architecture, which enables users to perform
analysis on available machines instead of conducting expensive exper- iments on
potentially expensive resources. Moreover, statically generated models enable
performance prediction on non-existent or unavailable architectures. In
addition to flexibility, because model generation time is significantly reduced
compared to dynamic analysis approaches, our method is suitable for rapid
application performance analysis and improvement. We present several scientific
application validation results to demonstrate the current capabilities of our
approach on small benchmarks and a mini application
Lightweight Multilingual Software Analysis
Developer preferences, language capabilities and the persistence of older
languages contribute to the trend that large software codebases are often
multilingual, that is, written in more than one computer language. While
developers can leverage monolingual software development tools to build
software components, companies are faced with the problem of managing the
resultant large, multilingual codebases to address issues with security,
efficiency, and quality metrics. The key challenge is to address the opaque
nature of the language interoperability interface: one language calling
procedures in a second (which may call a third, or even back to the first),
resulting in a potentially tangled, inefficient and insecure codebase. An
architecture is proposed for lightweight static analysis of large multilingual
codebases: the MLSA architecture. Its modular and table-oriented structure
addresses the open-ended nature of multiple languages and language
interoperability APIs. We focus here as an application on the construction of
call-graphs that capture both inter-language and intra-language calls. The
algorithms for extracting multilingual call-graphs from codebases are
presented, and several examples of multilingual software engineering analysis
are discussed. The state of the implementation and testing of MLSA is
presented, and the implications for future work are discussed.Comment: 15 page
MMF3: Neural Code Summarization Based on Multi-Modal Fine-Grained Feature Fusion
Background: Code summarization automatically generates the corresponding
natural language descriptions according to the input code. Comprehensiveness of
code representation is critical to code summarization task. However, most
existing approaches typically use coarse-grained fusion methods to integrate
multi-modal features. They generally represent different modalities of a piece
of code, such as an Abstract Syntax Tree (AST) and a token sequence, as two
embeddings and then fuse the two ones at the AST/code levels. Such a coarse
integration makes it difficult to learn the correlations between fine-grained
code elements across modalities effectively. Aims: This study intends to
improve the model's prediction performance for high-quality code summarization
by accurately aligning and fully fusing semantic and syntactic structure
information of source code at node/token levels. Method: This paper proposes a
Multi-Modal Fine-grained Feature Fusion approach (MMF3) for neural code
summarization. We introduce a novel fine-grained fusion method, which allows
fine-grained fusion of multiple code modalities at the token and node levels.
Specifically, we use this method to fuse information from both token and AST
modalities and apply the fused features to code summarization. Results: We
conduct experiments on one Java and one Python datasets, and evaluate generated
summaries using four metrics. The results show that: 1) the performance of our
model outperforms the current state-of-the-art models, and 2) the ablation
experiments show that our proposed fine-grained fusion method can effectively
improve the accuracy of generated summaries. Conclusion: MMF3 can mine the
relationships between crossmodal elements and perform accurate fine-grained
element-level alignment fusion accordingly. As a result, more clues can be
provided to improve the accuracy of the generated code summaries.Comment: 12 pages, 5 figure
TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree transformation
Large-scale language models have made great progress in the field of software
engineering in recent years. They can be used for many code-related tasks such
as code clone detection, code-to-code search, and method name prediction.
However, these large-scale language models based on each code token have
several drawbacks: They are usually large in scale, heavily dependent on
labels, and require a lot of computing power and time to fine-tune new
datasets.Furthermore, code embedding should be performed on the entire code
snippet rather than encoding each code token. The main reason for this is that
encoding each code token would cause model parameter inflation, resulting in a
lot of parameters storing information that we are not very concerned about. In
this paper, we propose a novel framework, called TransformCode, that learns
about code embeddings in a contrastive learning manner. The framework uses the
Transformer encoder as an integral part of the model. We also introduce a novel
data augmentation technique called abstract syntax tree transformation: This
technique applies syntactic and semantic transformations to the original code
snippets to generate more diverse and robust anchor samples. Our proposed
framework is both flexible and adaptable: It can be easily extended to other
downstream tasks that require code representation such as code clone detection
and classification. The framework is also very efficient and scalable: It does
not require a large model or a large amount of training data, and can support
any programming language.Finally, our framework is not limited to unsupervised
learning, but can also be applied to some supervised learning tasks by
incorporating task-specific labels or objectives. To explore the effectiveness
of our framework, we conducted extensive experiments on different software
engineering tasks using different programming languages and multiple datasets
CroLSSim: Crossâlanguage software similarity detector using hybrid approach of LSAâbased ASTâMDrep features and CNNâLSTM model
Software similarity in different programming codes is a rapidly evolving field because of its numerous applications in software development, software cloning, software plagiarism, and software forensics. Currently, software researchers and developers search cross-language open-source repositories for similar applications for a variety of reasons, such as reusing programming code, analyzing different implementations, and looking for a better application. However, it is a challenging task because each programming language has a unique syntax and semantic structure. In this paper, a novel tool called Cross-Language Software Similarity (CroLSSim) is designed to detect similar software applications written in different programming codes. First, the Abstract Syntax Tree (AST) features are collected from different programming codes. These are high-quality features that can show the abstract view of each program. Then, Methods Description (MDrep) in combination with AST is used to examine the relationship among different method calls. Second, the Term Frequency Inverse Document Frequency approach is used to retrieve the local and global weights from AST-MDrep features. Third, the Latent Semantic Analysis-based features extraction and selection method is proposed to extract the semantic anchors in reduced dimensional space. Fourth, the Convolution Neural Network (CNN)-based features extraction method is proposed to mine the deep features. Finally, a hybrid deep learning model of CNN-Long-Short-Term Memory is designed to detect semantically similar software applications from these latent variables. The data set contains approximately 9.5K Java, 8.8K C#, and 7.4K C++ software applications obtained from GitHub. The proposed approach outperforms as compared with the state-of-the-art methods
Improving pattern tracking with a language-aware tree differencing algorithm
International audienceTracking code fragments of interest is important in monitoring a software project over multiple versions. Various approaches, including our previous work on Herodotos, exploit the notion of Longest Common Subsequence, as computed by readily available tools such as GNU Diff, to map corresponding code fragments. Nevertheless, the efficient code differencing algorithms are typically line-based or word-based, and thus do not report changes at the level of language constructs. Furthermore, they identify only additions and removals, but not the moving of a block of code from one part of a file to another. Code fragments of interest that fall within the added and removed regions of code have to be manually correlated across versions, which is tedious and error-prone. When studying a very large code base over a long time, the number of manual correlations can become an obstacle to the success of a study. In this paper, we investigate the effect of replacing the current line-based algorithm used by Herodotos by tree-matching, as provided by the algorithm of the differencing tool GumTree. In contrast to the line-based approach, the tree-based approach does not generate any manual correlations, but it incurs a high execution time. To address the problem, we propose a hybrid strategy that gives the best of both approaches
- âŠ