A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges
Measuring and evaluating source code similarity is a fundamental software
engineering activity that spans a broad range of applications, including code
recommendation and the detection of duplicate code, plagiarism, malware, and
code smells. This paper presents a systematic literature review and
meta-analysis on code similarity measurement and evaluation techniques to shed
light on the existing approaches and their characteristics in different
applications. We initially found over 10000 articles by querying four digital
libraries and ended up with 136 primary studies in the field. The studies were
classified according to their methodology, programming languages, datasets,
tools, and applications. A deep investigation reveals 80 software tools,
working with eight different techniques on five application domains. Nearly 49%
of the tools work on Java programs and 37% support C and C++, while many other
programming languages remain unsupported. A noteworthy finding was the existence of
12 datasets related to source code similarity measurement and duplicate codes,
of which only eight were publicly accessible. The lack of reliable
datasets, empirical evaluations, hybrid methods, and attention to multi-paradigm
languages are the main challenges in the field. Emerging applications of code
similarity measurement concentrate on the development phase in addition to
maintenance.
Comment: 49 pages, 10 figures, 6 tables
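As a concrete illustration of the kind of technique many surveyed clone detectors build on, the following sketch computes a token-based similarity score (a Jaccard index over token sets). The tokenizer and snippets are illustrative assumptions, not taken from any surveyed tool.

```python
import re

def tokenize(source: str) -> set:
    """Split source code into a set of identifier and operator tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", source))

def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index over token sets; 1.0 means identical token vocabulary."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Two near-clone snippets that differ only in identifier names.
snippet1 = "def add(a, b): return a + b"
snippet2 = "def add(x, y): return x + y"
print(round(jaccard_similarity(snippet1, snippet2), 2))  # high, but below 1.0
```

Real tools typically add normalization (renaming identifiers, stripping literals) before comparison, which is what separates exact-clone from near-clone detection.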
DeepOnto: A Python Package for Ontology Engineering with Deep Learning
Applying deep learning techniques, particularly language models (LMs), in
ontology engineering has attracted widespread attention. However, deep learning
frameworks like PyTorch and TensorFlow are predominantly developed for Python
programming, while widely-used ontology APIs, such as the OWL API and Jena, are
primarily Java-based. To facilitate seamless integration of these frameworks
and APIs, we present DeepOnto, a Python package designed for ontology
engineering. The package encompasses a core ontology processing module founded
on the widely-recognised and reliable OWL API, encapsulating its fundamental
features in a more "Pythonic" manner and extending its capabilities to include
other essential components including reasoning, verbalisation, normalisation,
projection, and more. Building on this module, DeepOnto offers a suite of
tools, resources, and algorithms that support various ontology engineering
tasks, such as ontology alignment and completion, by harnessing deep learning
methodologies, primarily pre-trained LMs. In this paper, we also demonstrate
the practical utility of DeepOnto through two use cases: the Digital Health
Coaching in Samsung Research UK and the Bio-ML track of the Ontology Alignment
Evaluation Initiative (OAEI).
Comment: under review at the Semantic Web Journal
Eunomia: Enabling User-specified Fine-Grained Search in Symbolically Executing WebAssembly Binaries
Although existing work has proposed automated approaches to alleviate the path
explosion problem of symbolic execution, users still need to optimize symbolic
execution by carefully applying various search strategies. As existing
approaches mainly support only coarse-grained global search strategies, they
cannot efficiently traverse complex code structures.
In this paper, we propose Eunomia, a symbolic execution technique that allows
users to specify local domain knowledge to enable fine-grained search. In
Eunomia, we design an expressive DSL, Aes, that lets users precisely apply
local search strategies to different parts of the target program. To further
optimize local search strategies, we design an interval-based algorithm that
automatically isolates the variable contexts of different local search
strategies, avoiding conflicts when multiple strategies involve the same
variable. We implement Eunomia as a symbolic execution platform targeting
WebAssembly, which enables us to analyze applications written in various
languages (such as C and Go) that can be compiled to WebAssembly. To the best of
our knowledge, Eunomia is the first symbolic execution engine that supports the
full features of the WebAssembly runtime. We evaluate Eunomia with a dedicated
microbenchmark suite for symbolic execution and six real-world applications.
Our evaluation shows that Eunomia accelerates bug detection in real-world
applications by up to three orders of magnitude. According to the results of a
comprehensive user study, users can significantly improve the efficiency and
effectiveness of symbolic execution by writing a simple and intuitive Aes
script. Besides verifying six known real-world bugs, Eunomia also detected two
new zero-day bugs in a popular open-source project, Collections-C.
Comment: Accepted by the ACM SIGSOFT International Symposium on Software Testing
and Analysis (ISSTA) 202
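The idea of fine-grained, user-specified search can be sketched in miniature: a best-first worklist in which a user-supplied strategy re-prioritizes only the states inside a chosen region, while everything else follows the default order. This is an illustrative toy over integer states, not Eunomia's Aes DSL; all names and the state space are assumptions.

```python
import heapq

def explore(initial, successors, region, local_priority, limit=100):
    """Expand states best-first. States inside `region` are ranked by the
    user's local_priority; all others keep discovery order (the coarse
    global strategy)."""
    counter = 0
    heap = [(0, counter, initial)]
    visited = []
    while heap and len(visited) < limit:
        _, _, state = heapq.heappop(heap)
        visited.append(state)
        for nxt in successors(state):
            counter += 1
            prio = local_priority(nxt) if region(nxt) else counter
            heapq.heappush(heap, (prio, counter, nxt))
    return visited

# Toy "program": states are integers with two successors each. The user's
# hypothetical fine-grained strategy prefers smaller even-numbered states.
succ = lambda s: [s + 1, s + 2] if s < 6 else []
order = explore(0, succ, region=lambda s: s % 2 == 0, local_priority=lambda s: s)
print(order[:3])
```

Duplicate states in the output correspond to distinct paths reaching the same value, mirroring how symbolic execution explores paths rather than states.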
Using knowledge graphs to infer gene expression in plants
Introduction: Climate change is already affecting ecosystems around the world and forcing us to adapt to meet societal needs. The speed with which climate change is progressing necessitates a massive scaling up of the number of species with understood genotype-environment-phenotype (G×E×P) dynamics in order to increase ecosystem and agricultural resilience. An important part of predicting phenotype is understanding the complex gene regulatory networks present in organisms. Previous work has demonstrated that knowledge about one species can be applied to another using ontologically supported knowledge bases that exploit homologous structures and homologous genes. Structures that can transfer knowledge between species in this way have the potential to enable the massive scaling up that is needed through in silico experimentation.

Methods: We developed one such structure, a knowledge graph (KG), using information from Planteome and the EMBL-EBI Expression Atlas that connects gene expression, molecular interactions, functions, and pathways to homology-based gene annotations. Our preliminary analysis uses data from gene expression studies in Arabidopsis thaliana and Populus trichocarpa plants exposed to drought conditions.

Results: A graph query identified 16 pairs of homologous genes in these two taxa, some of which show opposite patterns of gene expression in response to drought. As expected, analysis of the upstream cis-regulatory region of these genes revealed that homologs with similar expression behaviour had conserved cis-regulatory regions and potential interaction with similar trans-elements, unlike homologs that changed their expression in opposite ways.

Discussion: This suggests that even though the homologous pairs share common ancestry and functional roles, predicting expression and phenotype through homology inference requires careful integration of cis- and trans-regulatory components in the curated and inferred knowledge graph.
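The core query the abstract describes, finding homologous gene pairs whose drought responses point in opposite directions, can be sketched over plain dictionaries. The gene identifiers and fold-change values below are hypothetical placeholders, not data from the study.

```python
# log2 fold-change under drought (hypothetical values, not from the paper)
arabidopsis = {"AT1G01": 2.1, "AT2G02": -1.4, "AT3G03": 0.8}
populus     = {"POTRI1": -1.9, "POTRI2": -1.1, "POTRI3": 1.2}

# homology mapping between the two taxa (hypothetical pairs)
homologs = [("AT1G01", "POTRI1"), ("AT2G02", "POTRI2"), ("AT3G03", "POTRI3")]

def opposite_responders(pairs, expr_a, expr_b):
    """Return homolog pairs whose fold-changes have opposite signs,
    i.e. genes that respond to drought in opposite directions."""
    return [(a, b) for a, b in pairs if expr_a[a] * expr_b[b] < 0]

print(opposite_responders(homologs, arabidopsis, populus))
# Only the first pair qualifies: 2.1 (up) vs. -1.9 (down)
```

In the actual KG setting the same selection would be expressed as a graph query over expression and homology edges rather than dictionary lookups.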
It's about Time: Analytical Time Periodization
This paper presents a novel approach to the problem of time periodization, which involves dividing the time span of a complex dynamic phenomenon into periods that enclose relatively stable states or distinct development trends. The challenge lies in finding a division of time that takes into account the diverse behaviours of the phenomenon's multiple components while remaining simple and easy to interpret. Despite the importance of this problem, it has not received sufficient attention in the fields of visual analytics and data science. We use a real-world example from aviation and an additional usage scenario on analysing mobility trends during the COVID-19 pandemic to develop and test an analytical workflow that combines computational and interactive visual techniques. We highlight the differences between the two cases and show how they affect the use of different techniques. Through our investigation of possible variations in the time periodization problem, we discuss the potential of our approach to be used in various applications. Our contributions include defining and investigating a previously neglected problem type, developing a practical and reproducible approach to solving problems of this type, and uncovering potential for formalization and the development of computational methods.
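The simplest computational building block for periodization is a single change-point split: choosing the cut that makes each resulting period as internally stable as possible. The sketch below, an assumption rather than the paper's interactive workflow, minimizes the combined within-period sum of squared deviations.

```python
def variance(xs):
    """Sum of squared deviations from the mean (unnormalized variance)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def best_split(series):
    """Return the index that splits `series` into two periods with the
    lowest combined within-period sum of squared deviations."""
    return min(range(1, len(series)),
               key=lambda i: variance(series[:i]) + variance(series[i:]))

# A series with a clear regime change at index 4: stable low, then stable high.
series = [1, 1, 2, 1, 9, 10, 9, 10]
print(best_split(series))  # -> 4
```

Recursive application of such a split, plus interactive inspection of candidate boundaries, is one plausible route from this primitive toward multi-period, multi-component periodization.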
IR Design for Application-Specific Natural Language: A Case Study on Traffic Data
In the realm of software applications in the transportation industry,
Domain-Specific Languages (DSLs) have enjoyed widespread adoption due to their
ease of use and various other benefits. With the ceaseless progress in computer
performance and the rapid development of large-scale models, the possibility of
programming using natural language in specific applications - referred to as
Application-Specific Natural Language (ASNL) - has emerged. ASNL exhibits
greater flexibility and freedom, which, in turn, leads to an increase in
computational complexity for parsing and a decrease in processing performance.
To tackle this issue, this paper proposes a design for an intermediate
representation (IR) that caters to ASNL and can uniformly process
transportation data into a graph data format, improving data-processing
performance. Experimental comparisons reveal that, in standard data query
operations, our proposed IR design achieves a speed improvement of over
forty times compared to direct use of standard XML-format data.
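The performance argument rests on moving from record-by-record formats to a graph form that queries can traverse directly. The sketch below shows that conversion step for road-segment records; the schema and field names are hypothetical, not the paper's IR.

```python
from collections import defaultdict

# Tabular traffic records: (from_junction, to_junction, travel_time_minutes).
# These values are illustrative placeholders.
records = [
    ("A", "B", 5), ("B", "C", 3), ("A", "C", 10),
]

def to_graph(rows):
    """Build an adjacency-list graph: junction -> list of (neighbour, weight).
    Queries then follow edges instead of re-scanning flat records."""
    graph = defaultdict(list)
    for src, dst, w in rows:
        graph[src].append((dst, w))
    return dict(graph)

graph = to_graph(records)
# A standard query over the graph form: junctions reachable from "A" in one hop.
print(sorted(dst for dst, _ in graph["A"]))  # -> ['B', 'C']
```

Against an XML baseline, the analogous query would require parsing and scanning every record per lookup, which is the cost the graph IR amortizes away.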
Derivation-Graph-Based Characterizations of Decidable Existential Rule Sets
This paper establishes alternative characterizations of very expressive
classes of existential rule sets with decidable query entailment. We consider
the notable class of greedy bounded-treewidth sets (gbts) and a new,
generalized variant, called weakly gbts (wgbts). Revisiting and building on the
notion of derivation graphs, we define (weakly) cycle-free derivation graph
sets ((w)cdgs) and employ elaborate proof-theoretic arguments to obtain that
gbts and cdgs coincide, as do wgbts and wcdgs. These novel characterizations
advance our analytic proof-theoretic understanding of existential rules and
will likely be instrumental in practice.
Comment: accepted to JELIA 202
BOLD: A Benchmark for Linked Data User Agents and a Simulation Framework for Dynamic Linked Data Environments
The paper presents the BOLD (Buildings on Linked Data) benchmark for Linked
Data agents, alongside the framework for simulating dynamic Linked Data
environments with which we built BOLD. The BOLD benchmark instantiates the
BOLD framework by providing a read-write Linked Data interface to a smart
building with simulated time, occupancy movement, and lighting sensors and
actuators. On the Linked Data representation of this environment, agents
carry out several specified tasks, such as controlling illumination. The
simulation environment provides means to check for the correct execution of the
tasks and to measure the performance of agents. We conduct measurements on
Linked Data agents based on condition-action rules.
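A condition-action rule agent of the kind the benchmark measures can be sketched with a plain dict standing in for the building's Linked Data state; the state keys and rules here are hypothetical, not BOLD's RDF interface.

```python
# Hypothetical building state (in BOLD this would be read via Linked Data).
state = {"occupied": True, "lights_on": False}

rules = [
    # (condition over state, action mutating state)
    (lambda s: s["occupied"] and not s["lights_on"],
     lambda s: s.update(lights_on=True)),    # turn lights on for occupants
    (lambda s: not s["occupied"] and s["lights_on"],
     lambda s: s.update(lights_on=False)),   # turn lights off when empty
]

def run_agent(state, rules, steps=3):
    """Each cycle, fire the first rule whose condition holds; stop when no
    rule matches (a fixpoint) or the step budget runs out."""
    for _ in range(steps):
        fired = False
        for cond, act in rules:
            if cond(state):
                act(state)
                fired = True
                break
        if not fired:
            break
    return state

print(run_agent(state, rules)["lights_on"])  # -> True
```

The benchmark's contribution is precisely what this toy omits: simulated dynamics (time, occupancy drift) and correctness checks on whether such rules achieve their task.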
A Constraint-based Recommender System via RDF Knowledge Graphs
Knowledge graphs, represented in RDF, are able to model entities and their
relations by means of ontologies. The use of knowledge graphs for information
modeling has attracted interest in recent years. In recommender systems, items
and users can be mapped and integrated into the knowledge graph, which can then
capture richer links and relationships between users and items.
Constraint-based recommender systems are based on the idea of explicitly
exploiting deep recommendation knowledge through constraints to identify
relevant recommendations. When combined with knowledge graphs, a
constraint-based recommender system gains several benefits in terms of
constraint sets. In this paper, we investigate and propose the construction of
a constraint-based recommender system via RDF knowledge graphs applied to the
vehicle purchase/sale domain. The results of our experiments show that the
proposed approach is able to efficiently identify recommendations in accordance
with user preferences.
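The core mechanism of constraint-based recommendation, filtering items through explicit user requirements rather than learned scores, can be sketched over plain dictionaries. The vehicle attributes and constraints below are hypothetical, not drawn from the paper's knowledge graph.

```python
# Hypothetical vehicle catalogue (in the paper this lives in an RDF graph).
vehicles = [
    {"id": "v1", "price": 18000, "fuel": "petrol",   "seats": 5},
    {"id": "v2", "price": 32000, "fuel": "electric", "seats": 5},
    {"id": "v3", "price": 25000, "fuel": "electric", "seats": 2},
]

def recommend(items, constraints):
    """Return items satisfying every constraint (a conjunctive query)."""
    return [it for it in items if all(c(it) for c in constraints)]

# Explicit user requirements expressed as constraints over item attributes.
user_constraints = [
    lambda v: v["price"] <= 30000,      # budget cap
    lambda v: v["fuel"] == "electric",  # fuel preference
]
print([v["id"] for v in recommend(vehicles, user_constraints)])  # -> ['v3']
```

Mapping items into an RDF knowledge graph replaces these hard-coded attribute checks with graph-pattern constraints, which is where the approach gains its richer relationship coverage.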