On the Effect of Semantically Enriched Context Models on Software Modularization
Many of the existing approaches for program comprehension rely on the
linguistic information found in source code, such as identifier names and
comments. Semantic clustering is one such technique for modularization of the
system that relies on the informal semantics of the program, encoded in the
vocabulary used in the source code. Treating the source code as a collection of
tokens loses the semantic information embedded within the identifiers. We try
to overcome this problem by introducing context models for source code
identifiers to obtain a semantic kernel, which can be used for both deriving
the topics that run through the system as well as their clustering. In the
first model, we abstract an identifier to its type representation and build on
this notion of context to construct a contextual vector representation of the
source code. The second notion of context is defined based on the flow of data
between identifiers to represent a module as a dependency graph where the nodes
correspond to identifiers and the edges represent the data dependencies between
pairs of identifiers. We have applied our approach to 10 medium-sized open
source Java projects, and show that by introducing contexts for identifiers,
the quality of the modularization of the software systems is improved. Both of
the context models give results that are superior to the plain vector
representation of documents. In some cases, the authoritativeness of
decompositions is improved by 67%. Furthermore, a more detailed evaluation of
our approach on jEdit, an open source editor, demonstrates that the topics
inferred by performing topic analysis on the contextual representations are more
meaningful than those inferred from the plain representation of the documents.
The proposed approach of introducing a context model for source code
identifiers paves the
way for building tools that support developers in program comprehension tasks
such as application and domain concept location, software modularization and
topic analysis.
Simplifying Deep-Learning-Based Model for Code Search
To accelerate software development, developers frequently search and reuse
existing code snippets from a large-scale codebase, e.g., GitHub. Over the
years, researchers have proposed many information retrieval (IR) based models
for code search, which match keywords in the query against code text. However,
these models fail to bridge the semantic gap between query and code. To address
this challenge, Gu et al. proposed a deep-learning-based model named DeepCS. It
jointly embeds
method code and natural language description into a shared vector space, where
methods related to a natural language query are retrieved according to their
vector similarities. However, DeepCS' working process is complicated and
time-consuming. To overcome this issue, we propose a simplified model,
CodeMatcher, that leverages IR techniques while retaining many features of
DeepCS. In general, CodeMatcher combines query keywords in their original order,
performs a fuzzy search on the name and body strings of methods, and returns the
best-matched methods with the longer sequence of used keywords. We verified its
effectiveness on a large-scale codebase with about 41k repositories.
Experimental results show that the simplified model CodeMatcher outperforms
DeepCS by 97% in terms of MRR (a widely used accuracy measure for code search),
and it is over 66 times faster than DeepCS. In addition, compared with the
state-of-the-art IR-based model CodeHow, CodeMatcher improves the MRR by 73%.
We also observed that (1) fusing the advantages of IR-based and
deep-learning-based models is promising because they naturally complement each
other; and (2) improving the quality of method naming helps code search, since
the method name plays an important role in connecting query and code.
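The MRR figure quoted above is a mean reciprocal rank: for each query, take the reciprocal of the 1-based rank of the first relevant result, then average over all queries. The following is a minimal sketch of that computation only. The function names and the example ranks are hypothetical and are not taken from the CodeMatcher evaluation.

```python
from typing import Optional, Sequence


def reciprocal_rank(first_relevant_rank: Optional[int]) -> float:
    """Return 1/rank for a 1-based rank, or 0.0 if no relevant result was returned."""
    if first_relevant_rank is None:
        return 0.0
    if first_relevant_rank < 1:
        raise ValueError("ranks are 1-based")
    return 1.0 / first_relevant_rank


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average the reciprocal ranks over all queries."""
    if not ranks:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(r) for r in ranks) / len(ranks)


# Hypothetical example: rank of the first relevant method for four queries.
# None means the search returned nothing relevant for that query.
example_ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(example_ranks))  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```

A relative improvement such as "by 97%" compares two such averages computed over the same query set.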
Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation
With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information.
Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is a special case of IR when the search takes place in a multi-lingual collection.
Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (fluent and retaining the original meaning), which helps people understand what is being written, regardless of the source language.
Putting these together, we observe that search and translation technologies are part of an important user application, calling for a better integration of search (IR) and translation (MT), since these two technologies need to work together to produce high-quality output.
In this dissertation, the main goal is to build better connections between IR and MT, for which we present solutions to two problems: Searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search processes. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs.
Finally, we propose a general architecture, in which various components of IR and MT systems can be connected together into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.
Split, Encode and Aggregate for Long Code Search
Code search with natural language plays a crucial role in reusing existing
code snippets and accelerating software development. Thanks to the
Transformer-based pretraining models, the performance of code search has been
improved significantly compared to traditional information retrieval (IR) based
models. However, due to the quadratic complexity of multi-head self-attention,
there is a limit on the input token length. For efficient training on standard
GPUs like V100, existing pretrained code models, including GraphCodeBERT,
CodeBERT, and RoBERTa (code), take the first 256 tokens by default, which makes
them unable to represent the complete information of code longer than 256
tokens. Unlike a long text paragraph, which can be regarded as a whole with
complete semantics, the semantics of long code is discontinuous, as a piece of
long code may contain different code modules. Therefore, it is unreasonable to
directly apply long-text processing methods to long code. To tackle the
long code problem, we propose SEA (Split, Encode and Aggregate for Long Code
Search), which splits long code into code blocks, encodes these blocks into
embeddings, and aggregates them to obtain a comprehensive long code
representation. With SEA, we can directly use Transformer-based pretraining
models to model long code without changing their internal structure or
re-pretraining them. Leveraging abstract syntax tree (AST) based splitting and
attention-based aggregation methods, SEA achieves significant improvements in
long code search performance. We also compare SEA with two sparse Transformer
methods. With GraphCodeBERT as the encoder, SEA achieves an overall mean
reciprocal ranking score of 0.785, which is 10.1% higher than GraphCodeBERT on
the CodeSearchNet benchmark.
Comment: 9 pages
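The split, encode, and aggregate pipeline described in this abstract can be illustrated with a toy sketch. Everything here is a stand-in chosen for brevity and is an assumption, not the SEA implementation: fixed-size windows replace the AST-based splitting, a bag-of-tokens vector replaces the pretrained encoder, and a fixed query vector replaces learned attention weights. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def split_into_blocks(tokens: Sequence[str], block_size: int) -> List[List[str]]:
    """Split tokens into fixed-size windows (stand-in for AST-based splitting)."""
    if block_size < 1:
        raise ValueError("block_size must be positive")
    return [list(tokens[i:i + block_size]) for i in range(0, len(tokens), block_size)]


def toy_encode(block: Sequence[str], dim: int = 8) -> List[float]:
    """Toy bag-of-tokens embedding (stand-in for a pretrained code encoder)."""
    vec = [0.0] * dim
    for tok in block:
        # Deterministic bucket per token; Python's hash() is salted for strings.
        vec[sum(map(ord, tok)) % dim] += 1.0
    return vec


def attention_aggregate(embeddings: Sequence[Sequence[float]],
                        query: Sequence[float]) -> List[float]:
    """Softmax-weighted sum of block embeddings, scored by dot product with query."""
    scores = [sum(e * q for e, q in zip(emb, query)) for emb in embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]


tokens = "def add ( a , b ) : return a + b".split()
blocks = split_into_blocks(tokens, block_size=4)
block_vectors = [toy_encode(block) for block in blocks]
code_vector = attention_aggregate(block_vectors, query=[1.0] * 8)
print(len(blocks), len(code_vector))  # 3 blocks, one 8-dimensional code vector
```

The point of the sketch is the shape of the pipeline: the encoder only ever sees one block at a time, so its input length limit applies per block rather than to the whole snippet.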
Does BLEU Score Work for Code Migration?
Statistical machine translation (SMT) is a fast-growing sub-field of
computational linguistics. Until now, the most popular automatic metric to
measure the quality of SMT is BiLingual Evaluation Understudy (BLEU) score.
Lately, SMT along with the BLEU metric has been applied to a Software
Engineering task named code migration. (In)Validating the use of BLEU score
could advance the research and development of SMT-based code migration tools.
Unfortunately, no study has yet confirmed or refuted the use of BLEU score
for source code. In this paper, we conduct an empirical study on BLEU score
to (in)validate its suitability for the code migration task, given its
inability to reflect the semantics of source code. In our work, we use human
judgment as the ground truth to measure the semantic correctness of the
migrated code. Our empirical study demonstrates that BLEU does not reflect
translation quality due to its weak correlation with the semantic correctness
of translated code. We provide counter-examples to show that BLEU is
ineffective in comparing the translation quality of SMT-based models. Due
to BLEU's ineffectiveness for the code migration task, we propose an alternative
metric RUBY, which considers lexical, syntactical, and semantic representations
of source code. We verified that RUBY achieves a higher correlation coefficient
with the semantic correctness of migrated code, 0.775 in comparison with 0.583
of BLEU score. We also confirmed the effectiveness of RUBY in reflecting the
changes in translation quality of SMT-based translation models. With its
advantages, RUBY can be used to evaluate SMT-based code migration models.
Comment: 12 pages, 5 figures, ICPC '19 Proceedings of the 27th International
Conference on Program Comprehension
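To see why an n-gram overlap metric can diverge from semantic correctness, consider the clipped n-gram precision that BLEU is built on. The sketch below computes only that component; it omits BLEU's brevity penalty and the geometric mean over n-gram orders. The function names and the two code fragments are hypothetical illustrations, not the counter-examples from the paper.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: Sequence[str], reference: Sequence[str], n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


# Hypothetical pair: one flipped operator changes the behavior of the code,
# yet most unigrams and bigrams still match the reference.
reference = "if ( x > 0 ) return x ;".split()
candidate = "if ( x < 0 ) return x ;".split()
print(clipped_precision(candidate, reference, 1))
print(clipped_precision(candidate, reference, 2))
```

Here 8 of 9 unigrams and 6 of 8 bigrams match, so the overlap stays high even though the condition is inverted.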