1,807 research outputs found
Efficiently Measuring an Accurate and Generalized Clone Detection Precision using Clone Clustering
Abstract-An important measure of clone detection performance is precision. However, there has been a marked lack of research into methods of efficiently and accurately measuring the precision of a clone detection tool. Instead, tool authors simply validate a small random sample of the clones their tools detected in a subject software system. Since there could be many thousands of clones reported by the tool, such a small random sample cannot guarantee an accurate and generalized measure of the tool's precision for all the varieties of clones that can occur in any arbitrary software system. In this paper, we propose a machine-learning based approach that can cluster similar clones together, and which can be used to maximize the variety of clones examined when measuring precision, while significantly reducing the biases a specific subject system has on the generality of the precision measured. Our technique reduces the efforts in measuring precision, while doubling the variety of clones validated and reducing biases that harm the generality of the measure by up to an order of magnitude. Our case study with the NiCad clone detector and the Java class library shows that our approach is effective in efficiently measuring an accurate and generalized precision of a subject clone detection tool
Recommended from our members
Inference of single-cell phylogenies from lineage tracing data using Cassiopeia.
The pairing of CRISPR/Cas9-based gene editing with massively parallel single-cell readouts now enables large-scale lineage tracing. However, the rapid growth in complexity of data from these assays has outpaced our ability to accurately infer phylogenetic relationships. First, we introduce Cassiopeia-a suite of scalable maximum parsimony approaches for tree reconstruction. Second, we provide a simulation framework for evaluating algorithms and exploring lineage tracer design principles. Finally, we generate the most complex experimental lineage tracing dataset to date, 34,557 human cells continuously traced over 15 generations, and use it for benchmarking phylogenetic inference approaches. We show that Cassiopeia outperforms traditional methods by several metrics and under a wide variety of parameter regimes, and provide insight into the principles for the design of improved Cas9-enabled recorders. Together, these should broadly enable large-scale mammalian lineage tracing efforts. Cassiopeia and its benchmarking resources are publicly available at www.github.com/YosefLab/Cassiopeia
Change Impact Analysis of Code Clones
Copying a code fragment and reusing it with or without modifications is known to be a frequent activity in software development. This results in exact or closely similar copies of code fragments, known as code clones, to exist in the software systems. Developers leverage the code reuse opportunity by code cloning for increased productivity. However, different studies on code clones report important concerns regarding the impacts of clones on software maintenance. One of the key concerns is to maintain consistent evolution of the clone fragments as inconsistent changes to clones may introduce bugs. Challenges to the consistent evolution of clones involve the identification of all related clone fragments for change propagation when a cloned fragment is changed. The task of identifying the ripple effects (i.e., all the related components to change) is known as Change Impact Analysis (CIA). In this thesis, we evaluate the impacts of clones on software systems from new perspectives and then we propose an evolutionary coupling based technique for change impact analysis of clones. First, we empirically evaluate the comparative stability of cloned and non-cloned code using fine-grained syntactic change types. Second, we assess the impacts of clones from the perspective of coupling at the domain level. Third, we carry out a comprehensive analysis of the comparative stability of cloned and non-cloned code within a uniform framework. We compare stability metrics with the results from the original experimental settings with respect to the clone detection tools and the subject systems. Fourth, we investigate the relationships between stability and bug-proneness of clones to assess whether and how stability contribute to the bug-proneness of different types of clones. Next, in the fifth study, we analyzed the impacts of co-change coupling on the bug-proneness of different types of clones. After a comprehensive evaluation of the impacts of clones on software systems, we propose an evolutionary coupling based CIA approach to support the consistent evolution of clones. In the sixth study, we propose a solution to minimize the effects of atypical commits (extra large commits) on the accuracy of the detection of evolutionary coupling. We propose a clustering-based technique to split atypical commits into pseudo-commits of related entities. This considerably reduces the number of incorrect couplings introduced by the atypical commits. Finally, in the seventh study, we propose an evolutionary coupling based change impact analysis approach for clones. In addition to handling the atypical commits, we use the history of fine-grained syntactic changes extracted from the software repositories to detect typed evolutionary coupling of clones. Conventional approaches consider only the frequency of co-change of the entities to detect evolutionary coupling. We consider both change frequencies and the fine-grained change types in the detection of evolutionary coupling. Findings from our studies give important insights regarding the impacts of clones and our proposed typed evolutionary coupling based CIA approach has the potential to support the consistent evolution of clones for better clone management
Metric Selection and Metric Learning for Matching Tasks
A quarter of a century after the world-wide web was born, we have grown accustomed to having easy access to a wealth of data sets and open-source software. The value of these resources is restricted if they are not properly integrated and maintained. A lot of this work boils down to matching; finding existing records about entities and enriching them with information from a new data source. In the realm of code this means integrating new code snippets into a code base while avoiding duplication.
In this thesis, we address two different such matching problems. First, we leverage the diverse and mature set of string similarity measures in an iterative semisupervised learning approach to string matching. It is designed to query a user to make a sequence of decisions on specific cases of string matching. We show that we can find almost optimal solutions after only a small amount of such input. The low labelling complexity of our algorithm is due to addressing the cold start problem that is inherent to Active Learning; by ranking queries by variance before the arrival of enough supervision information, and by a self-regulating mechanism that counteracts initial biases.
Second, we address the matching of code fragments for deduplication. Programming code is not only a tool, but also a resource that itself demands maintenance. Code duplication is a frequent problem arising especially from modern development practice. There are many reasons to detect and address code duplicates, for example to keep a clean and maintainable codebase. In such more complex data structures, string similarity measures are inadequate. In their stead, we study a modern supervised Metric Learning approach to model code similarity with Neural Networks. We find that in such a model representing the elementary tokens with a pretrained word embedding is the most important ingredient. Our results show both qualitatively (by visualization) that relatedness is modelled well by the embeddings and quantitatively (by ablation) that the encoded information is useful for the downstream matching task.
As a non-technical contribution, we unify the common challenges arising in supervised learning approaches to Record Matching, Code Clone Detection and generic Metric Learning tasks. We give a novel account to string similarity measures from a psychological standpoint and point out and document one longstanding naming conflict in string similarity measures. Finally, we point out the overlap of latest research in Code Clone Detection with the field of Natural Language Processing
Towards Semantic Clone Detection, Benchmarking, and Evaluation
Developers copy and paste their code to speed up the development process. Sometimes, they copy code from other systems or look up code online to solve a complex problem. Developers reuse copied code with or without modifications. The resulting similar or identical code fragments are called code clones. Sometimes clones are unintentionally written when a developer implements the same or similar functionality. Even when the resulting code fragments are not textually similar but implement the same functionality they are still considered to be clones and are classified as semantic clones. Semantic clones are defined as code fragments that perform the exact same computation and are implemented using different syntax.
Software cloning research indicates that code clones exist in all software systems; on average, 5% to 20% of software code is cloned. Due to the potential impact of clones, whether positive or negative, it is essential to locate, track, and manage clones in the source code. Considerable research has been conducted on all types of code clones, including clone detection, analysis, management, and evaluation. Despite the great interest in code clones, there has been considerably less work conducted on semantic clones.
As described in this thesis, I advance the state-of-the-art in semantic clone research in several ways. First, I conducted an empirical study to investigate the status of code cloning in and across open-source game systems and the effectiveness of different normalization, filtering, and transformation techniques for detecting semantic clones. Second, I developed an approach to detect clones across .NET programming languages using an intermediate language. Third, I developed a technique using an intermediate language and an ontology to detect semantic clones. Fourth, I mined Stack Overflow answers to build a semantic code clone benchmark that represents real semantic code clones in four programming languages, C, C#, Java, and Python. Fifth, I defined a comprehensive taxonomy that identifies semantic clone types. Finally, I implemented an injection framework that uses the benchmark to compare and evaluate semantic code clone detectors by automatically measuring recall
Multiplex immunofluorescence to measure dynamic changes in tumor-infiltrating lymphocytes and PD-L1 in early-stage breast cancer.
BACKGROUND: The H&E stromal tumor-infiltrating lymphocyte (sTIL) score and programmed death ligand 1 (PD-L1) SP142 immunohistochemistry assay are prognostic and predictive in early-stage breast cancer, but are operator-dependent and may have insufficient precision to characterize dynamic changes in sTILs/PD-L1 in the context of clinical research. We illustrate how multiplex immunofluorescence (mIF) combined with statistical modeling can be used to precisely estimate dynamic changes in sTIL score, PD-L1 expression, and other immune variables from a single paraffin-embedded slide, thus enabling comprehensive characterization of activity of novel immunotherapy agents.
METHODS: Serial tissue was obtained from a recent clinical trial evaluating loco-regional cytokine delivery as a strategy to promote immune cell infiltration and activation in breast tumors. Pre-treatment biopsies and post-treatment tumor resections were analyzed by mIF (PerkinElmer Vectra) using an antibody panel that characterized tumor cells (cytokeratin-positive), immune cells (CD3, CD8, CD163, FoxP3), and PD-L1 expression. mIF estimates of sTIL score and PD-L1 expression were compared to the H&E/SP142 clinical assays. Hierarchical linear modeling was utilized to compare pre- and post-treatment immune cell expression, account for correlation of time-dependent measurement, variation across high-powered magnification views within each subject, and variation between subjects. Simulation methods (Monte Carlo, bootstrapping) were used to evaluate the impact of model and tissue sample size on statistical power.
RESULTS: mIF estimates of sTIL and PD-L1 expression were strongly correlated with their respective clinical assays (p \u3c .001). Hierarchical linear modeling resulted in more precise estimates of treatment-related increases in sTIL, PD-L1, and other metrics such as CD8+ tumor nest infiltration. Statistical precision was dependent on adequate tissue sampling, with at least 15 high-powered fields recommended per specimen. Compared to conventional t-testing of means, hierarchical linear modeling was associated with substantial reductions in enrollment size required (n = 25➔n = 13) to detect the observed increases in sTIL/PD-L1.
CONCLUSION: mIF is useful for quantifying treatment-related dynamic changes in sTILs/PD-L1 and is concordant with clinical assays, but with greater precision. Hierarchical linear modeling can mitigate the effects of intratumoral heterogeneity on immune cell count estimations, allowing for more efficient detection of treatment-related pharmocodynamic effects in the context of clinical trials.
TRIAL REGISTRATION: NCT02950259
Aspect of Code Cloning Towards Software Bug and Imminent Maintenance: A Perspective on Open-source and Industrial Mobile Applications
As a part of the digital era of microtechnology, mobile application (app) development is evolving with
lightning speed to enrich our lives and bring new challenges and risks. In particular, software bugs and
failures cost trillions of dollars every year, including fatalities such as a software bug in a self-driving car
that resulted in a pedestrian fatality in March 2018 and the recent Boeing-737 Max tragedies that resulted
in hundreds of deaths. Software clones (duplicated fragments of code) are also found to be one of the
crucial factors for having bugs or failures in software systems. There have been many significant studies
on software clones and their relationships to software bugs for desktop-based applications. Unfortunately,
while mobile apps have become an integral part of today’s era, there is a marked lack of such studies for
mobile apps. In order to explore this important aspect, in this thesis, first, we studied the characteristics of
software bugs in the context of mobile apps, which might not be prevalent for desktop-based apps such as
energy-related (battery drain while using apps) and compatibility-related (different behaviors of same app
in different devices) bugs/issues. Using Support Vector Machine (SVM), we classified about 3K mobile app
bug reports of different open-source development sites into four categories: crash, energy, functionality and
security bug. We then manually examined a subset of those bugs and found that over 50% of the bug-fixing
code-changes occurred in clone code. There have been a number of studies with desktop-based software
systems that clearly show the harmful impacts of code clones and their relationships to software bugs. Given
that there is a marked lack of such studies for mobile apps, in our second study, we examined 11 open-source
and industrial mobile apps written in two different languages (Java and Swift) and noticed that clone code
is more bug-prone than non-clone code and that industrial mobile apps have a higher code clone ratio than
open-source mobile apps. Furthermore, we correlated our study outcomes with those of existing desktop based studies and surveyed 23 mobile app developers to validate our findings. Along with validating our
findings from the survey, we noticed that around 95% of the developers usually copy/paste (code cloning)
code fragments from the popular Crowd-sourcing platform, Stack Overflow (SO) to their projects and that
over 75% of such developers experience bugs after such activities (the code cloning from SO). Existing studies
with desktop-based systems also showed that while SO is one of the most popular online platforms for code
reuse (and code cloning), SO code fragments are usually toxic in terms of software maintenance perspective.
Thus, in the third study of this thesis, we studied the consequences of code cloning from SO in different open source and industrial mobile apps. We observed that closed-source industrial apps even reused more SO code
fragments than open-source mobile apps and that SO code fragments were more change-prone (such as bug)
than non-SO code fragments. We also experienced that SO code fragments were related to more bugs in
industrial projects than open-source ones. Our studies show how we could efficiently and effectively manage
clone related software bugs for mobile apps by utilizing the positive sides of code cloning while overcoming
(or at least minimizing) the negative consequences of clone fragments
- …