Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection
Code clones can detrimentally impact software maintenance, and manually
detecting them in very large codebases is impractical. Additionally, automated
approaches find detection of Type 3 and Type 4 (inexact) clones very
challenging. While the most recent artificial deep neural networks (for example
BERT-based artificial neural networks) seem to be highly effective in detecting
such clones, their pairwise comparison of every code pair in the target
system(s) is inefficient and scales poorly on large codebases.
We therefore introduce SSCD, a BERT-based clone detection approach that
targets high recall of Type 3 and Type 4 clones at scale (in line with our
industrial partner's requirements). It does so by computing a representative
embedding for each code fragment and finding similar fragments using a nearest
neighbour search. SSCD thus avoids the pairwise-comparison bottleneck of other
neural network approaches while also using parallel, GPU-accelerated search to
tackle scalability.
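The embed-once, then-search idea described above can be illustrated with a minimal sketch. It is not the SSCD implementation: the three-dimensional vectors stand in for BERT embeddings, the function names and the 0.9 threshold are hypothetical, and the exhaustive loop over cheap dot products stands in for an indexed or GPU-accelerated nearest-neighbour search.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def nearest_neighbours(embeddings, k=1, threshold=0.9):
    """For each fragment, return up to k other fragments whose cosine
    similarity is at least `threshold` (hypothetical cut-off)."""
    unit = {fid: normalize(v) for fid, v in embeddings.items()}
    result = {}
    for fid, v in unit.items():
        scored = []
        for other, w in unit.items():
            if other == fid:
                continue
            sim = sum(a * b for a, b in zip(v, w))
            if sim >= threshold:
                scored.append((sim, other))
        scored.sort(reverse=True)  # most similar first
        result[fid] = [other for _, other in scored[:k]]
    return result

# Toy embeddings (assumed values); each fragment is embedded exactly once.
toy = {
    "frag_a": [0.9, 0.1, 0.0],
    "frag_b": [0.8, 0.2, 0.0],
    "frag_c": [0.0, 0.1, 0.9],
}
candidates = nearest_neighbours(toy, k=1, threshold=0.9)
print(candidates)
```

In this sketch the costly model would run once per fragment rather than once per pair; only the inexpensive vector comparison touches pairs, and that step is what a dedicated nearest-neighbour index would replace.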
This paper details the approach and an empirical assessment aimed at
configuring and evaluating that approach in an industrial setting. The
configuration analysis suggests that shorter input lengths and text-only based
neural network models demonstrate better efficiency in SSCD, while only
slightly decreasing effectiveness. The evaluation results suggest that SSCD is
more effective than state-of-the-art approaches like SAGA and SourcererCC. It
is also highly efficient: in its optimal setting, SSCD effectively locates
clones in the entire 320 million LOC BigCloneBench (a standard clone detection
benchmark) in just under three hours.
Comment: 10 pages, 2 figures, 38th IEEE International Conference on Software
Maintenance and Evolutio
SourcererCC: Scaling Code Clone Detection to Big Code
Despite a decade of active research, there is a marked lack of clone
detectors that scale to very large repositories of source code, in particular
for detecting near-miss clones where significant editing activities may take
place in the cloned code. We present SourcererCC, a token-based clone detector
that targets three clone types, and exploits an index to achieve scalability to
large inter-project repositories using a standard workstation. SourcererCC uses
an optimized inverted index to quickly query the potential clones of a given
code block. Filtering heuristics based on token ordering are used to
significantly reduce the size of the index, the number of code-block
comparisons needed to detect the clones, and the number of
token comparisons needed to judge a potential clone.
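As a rough illustration of index-based candidate lookup, the sketch below builds an inverted index from tokens to code blocks and compares only blocks that share at least one token. It is a simplification under stated assumptions, not SourcererCC's algorithm: the token-ordering filters are omitted, and the function names, the overlap measure, and the 0.7 threshold are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(blocks):
    """Map each token to the set of block ids that contain it."""
    index = defaultdict(set)
    for bid, tokens in blocks.items():
        for tok in set(tokens):
            index[tok].add(bid)
    return index

def overlap(a, b):
    """Shared tokens (counted as a multiset) divided by the longer block's length."""
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b))

def find_clones(blocks, threshold=0.7):
    """Return sorted block-id pairs whose token overlap reaches the threshold."""
    index = build_index(blocks)
    pairs = set()
    for bid, tokens in blocks.items():
        # Candidates are only those blocks sharing at least one token.
        candidates = set()
        for tok in set(tokens):
            candidates |= index[tok]
        for other in candidates:
            if other > bid and overlap(tokens, blocks[other]) >= threshold:
                pairs.add((bid, other))
    return sorted(pairs)

# Toy code blocks, already tokenized on whitespace (assumed input format).
blocks = {
    "b1": "int sum = 0 ; for i in xs sum += i ;".split(),
    "b2": "int total = 0 ; for i in xs total += i ;".split(),
    "b3": "print ( hello )".split(),
}
clones = find_clones(blocks)
print(clones)
```

Here b1 and b2 differ only in a renamed variable and share 11 of 13 tokens, so they are reported as a pair, while b3 is never compared with them because the index returns no shared token.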
We evaluate the scalability, execution time, recall and precision of
SourcererCC, and compare it to four publicly available and state-of-the-art
tools. To measure recall, we use two recent benchmarks, (1) a large benchmark
of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of
thousands of fine-grained artificial clones. We find SourcererCC has both high
recall and precision, and is able to scale to a large inter-project repository
(250MLOC) using a standard workstation.
Comment: Accepted for publication at ICSE'16 (preprint, unrevised
Inference of single-cell phylogenies from lineage tracing data using Cassiopeia.
The pairing of CRISPR/Cas9-based gene editing with massively parallel single-cell readouts now enables large-scale lineage tracing. However, the rapid growth in complexity of data from these assays has outpaced our ability to accurately infer phylogenetic relationships. First, we introduce Cassiopeia, a suite of scalable maximum parsimony approaches for tree reconstruction. Second, we provide a simulation framework for evaluating algorithms and exploring lineage tracer design principles. Finally, we generate the most complex experimental lineage tracing dataset to date, 34,557 human cells continuously traced over 15 generations, and use it for benchmarking phylogenetic inference approaches. We show that Cassiopeia outperforms traditional methods by several metrics and under a wide variety of parameter regimes, and provide insight into the principles for the design of improved Cas9-enabled recorders. Together, these should broadly enable large-scale mammalian lineage tracing efforts. Cassiopeia and its benchmarking resources are publicly available at www.github.com/YosefLab/Cassiopeia
VirtFogSim: A parallel toolbox for dynamic energy-delay performance testing and optimization of 5G Mobile-Fog-Cloud virtualized platforms
It is expected that the pervasive deployment of multi-tier 5G-supported Mobile-Fog-Cloud technological computing platforms will constitute an effective means to support the real-time execution of future Internet applications by resource- and energy-limited mobile devices. Increasing interest in this emerging networking-computing technology demands the optimization and performance evaluation of several parts of the underlying infrastructures. However, field trials are challenging due to their operational costs, and in every case, the obtained results could be difficult to repeat and customize. These emerging Mobile-Fog-Cloud ecosystems still lack, indeed, customizable software tools for the performance simulation of their computing-networking building blocks. Motivated by these considerations, in this contribution, we present VirtFogSim. It is a MATLAB-supported software toolbox that allows the dynamic joint optimization and tracking of the energy and delay performance of Mobile-Fog-Cloud systems for the execution of applications described by general Directed Application Graphs (DAGs). In a nutshell, the main peculiar features of the proposed VirtFogSim toolbox are that: (i) it allows the joint dynamic energy-aware optimization of the placement of the application tasks and the allocation of the needed computing-networking resources under hard constraints on acceptable overall execution times; (ii) it allows the repeatable and customizable simulation of the resulting energy-delay performance of the overall system; (iii) it allows the dynamic tracking of the performed resource allocation under time-varying operational environments, as those typically featuring mobile applications; (iv) it is equipped with a user-friendly Graphic User Interface (GUI) that supports a number of graphic formats for data rendering; and (v) its MATLAB code is optimized for running atop multi-core parallel execution platforms.
To check both the actual optimization and scalability capabilities of the VirtFogSim toolbox, a number of experimental setups featuring different use cases and operational environments are simulated, and their performances are compared
Heterologous screening of hybridomas for the development of broad-specific monoclonal antibodies against deoxynivalenol and its analogues
Hapten heterology was introduced into the steps of hybridoma selection for the development of monoclonal antibodies (MAbs) against deoxynivalenol (DON). Firstly, a novel heterologous DON hapten was synthesised and covalently coupled to proteins (i.e. bovine serum albumin (BSA), ovalbumin and horseradish peroxidase) using the linkage of cyanuric chloride (CC). After immunisation, antisera from different DON immunogens were checked for the presence of useful antibodies. Next, both homologous and heterologous enzyme-linked immunosorbent assays were conducted to screen for hybridomas. It was found that heterologous screening could significantly reduce the proportion of false positives and appeared to be an efficient approach for selecting hybridomas of interest. This strategy resulted in two kinds of broad-selective MAbs against DON and its analogues. They were quite distinct from other reported DON-antibodies in their cross-reactivity profiles. A unique MAb 13H1 derived from DON-CC-BSA immunogen could recognise DON and its analogues in the order of HT-2 toxin > 15-acetyl-DON > DON > nivalenol, with IC50 ranging from 1.14 to 7.69 μg/ml. Another preferable MAb 10H10 generated from DON-BSA immunogen manifested relatively similar affinity to DON, 3-acetyl-DON and 15-acetyl-DON, with IC50 values of 22, 15 and 34 ng/ml, respectively. This is the first broad-specific MAb against DON and its two acetylated forms and thus it can be used for simultaneous detection of the three mycotoxins