10,350 research outputs found
A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges
Measuring and evaluating source code similarity is a fundamental software
engineering activity that embraces a broad range of applications, including but
not limited to code recommendation, duplicate code, plagiarism, malware, and
smell detection. This paper proposes a systematic literature review and
meta-analysis on code similarity measurement and evaluation techniques to shed
light on the existing approaches and their characteristics in different
applications. We initially found over 10000 articles by querying four digital
libraries and ended up with 136 primary studies in the field. The studies were
classified according to their methodology, programming languages, datasets,
tools, and applications. A deep investigation reveals 80 software tools,
working with eight different techniques on five application domains. Nearly 49%
of the tools work on Java programs and 37% support C and C++, while there is no
support for many programming languages. A noteworthy point was the existence of
12 datasets related to source code similarity measurement and duplicate codes,
of which only eight datasets were publicly accessible. The lack of reliable
datasets, empirical evaluations, hybrid methods, and focuses on multi-paradigm
languages are the main challenges in the field. Emerging applications of code
similarity measurement concentrate on the development phase in addition to the
maintenance.Comment: 49 pages, 10 figures, 6 table
Visual Programming Paradigm for Organizations in Multi-Agent Systems
Over the past few years, due to a fast digitalization process, business activities witnessed the adoption of new technologies, such as Multi-Agent Systems, to increase the autonomy of their activities. However, the complexity of these technologies often hinders the capability of domain experts, who do not possess coding skills, to exploit them directly.
To take advantage of these individuals' expertise in their field, the idea of a user-friendly and accessible Integrated Development Environment arose. Indeed, efforts have already been made to develop a block-based visual programming language for software agents.
Although the latter project represents a huge step forward, it does not provide a solution for addressing complex, real-world use cases where interactions and coordination among single entities are crucial. To address this problem, Multi-Agent Oriented Programming introduces organization as a first-class abstraction for designing and implementing Multi-Agent Systems.
Therefore, this thesis aims to provide a solution allowing users to impose an organization on top of the agents easily. Since ease of use and intuitiveness remain the key points for this project, users will be able to define organizations through visual language and an intuitive development environment
Using machine learning to predict pathogenicity of genomic variants throughout the human genome
Geschätzt mehr als 6.000 Erkrankungen werden durch Veränderungen im Genom verursacht. Ursachen gibt es viele: Eine genomische Variante kann die Translation eines Proteins stoppen, die Genregulation stören oder das Spleißen der mRNA in eine andere Isoform begünstigen. All diese Prozesse müssen überprüft werden, um die zum beschriebenen Phänotyp passende Variante zu ermitteln. Eine Automatisierung dieses Prozesses sind Varianteneffektmodelle. Mittels maschinellem Lernen und Annotationen aus verschiedenen Quellen bewerten diese Modelle genomische Varianten hinsichtlich ihrer Pathogenität.
Die Entwicklung eines Varianteneffektmodells erfordert eine Reihe von Schritten: Annotation der Trainingsdaten, Auswahl von Features, Training verschiedener Modelle und Selektion eines Modells. Hier präsentiere ich ein allgemeines Workflow dieses Prozesses. Dieses ermöglicht es den Prozess zu konfigurieren, Modellmerkmale zu bearbeiten, und verschiedene Annotationen zu testen. Der Workflow umfasst außerdem die Optimierung von Hyperparametern, Validierung und letztlich die Anwendung des Modells durch genomweites Berechnen von Varianten-Scores.
Der Workflow wird in der Entwicklung von Combined Annotation Dependent Depletion (CADD), einem Varianteneffektmodell zur genomweiten Bewertung von SNVs und InDels, verwendet. Durch Etablierung des ersten Varianteneffektmodells für das humane Referenzgenome GRCh38 demonstriere ich die gewonnenen Möglichkeiten Annotationen aufzugreifen und neue Modelle zu trainieren. Außerdem zeige ich, wie Deep-Learning-Scores als Feature in einem CADD-Modell die Vorhersage von RNA-Spleißing verbessern. Außerdem werden Varianteneffektmodelle aufgrund eines neuen, auf Allelhäufigkeit basierten, Trainingsdatensatz entwickelt.
Diese Ergebnisse zeigen, dass der entwickelte Workflow eine skalierbare und flexible Möglichkeit ist, um Varianteneffektmodelle zu entwickeln. Alle entstandenen Scores sind unter cadd.gs.washington.edu und cadd.bihealth.org frei verfügbar.More than 6,000 diseases are estimated to be caused by genomic variants. This can happen in many possible ways: a variant may stop the translation of a protein, interfere with gene regulation, or alter splicing of the transcribed mRNA into an unwanted isoform. It is necessary to investigate all of these processes in order to evaluate which variant may be causal for the deleterious phenotype. A great help in this regard are variant effect scores. Implemented as machine learning classifiers, they integrate annotations from different resources to rank genomic variants in terms of pathogenicity.
Developing a variant effect score requires multiple steps: annotation of the training data, feature selection, model training, benchmarking, and finally deployment for the model's application. Here, I present a generalized workflow of this process. It makes it simple to configure how information is converted into model features, enabling the rapid exploration of different annotations. The workflow further implements hyperparameter optimization, model validation and ultimately deployment of a selected model via genome-wide scoring of genomic variants.
The workflow is applied to train Combined Annotation Dependent Depletion (CADD), a variant effect model that is scoring SNVs and InDels genome-wide. I show that the workflow can be quickly adapted to novel annotations by porting CADD to the genome reference GRCh38. Further, I demonstrate the integration of deep-neural network scores as features into a new CADD model, improving the annotation of RNA splicing events. Finally, I apply the workflow to train multiple variant effect models from training data that is based on variants selected by allele frequency.
In conclusion, the developed workflow presents a flexible and scalable method to train variant effect scores. All software and developed scores are freely available from cadd.gs.washington.edu and cadd.bihealth.org
Endogenous measures for contextualising large-scale social phenomena: a corpus-based method for mediated public discourse
This work presents an interdisciplinary methodology for developing endogenous measures of group membership through analysis of pervasive linguistic patterns in public discourse. Focusing on political discourse, this work critiques the conventional approach to the study of political participation, which is premised on decontextualised, exogenous measures to characterise groups. Considering the theoretical and empirical weaknesses of decontextualised approaches to large-scale social phenomena, this work suggests that contextualisation using endogenous measures might provide a complementary perspective to mitigate such weaknesses.
This work develops a sociomaterial perspective on political participation in mediated discourse as affiliatory action performed through language. While the affiliatory function of language is often performed consciously (such as statements of identity), this work is concerned with unconscious features (such as patterns in lexis and grammar). This work argues that pervasive patterns in such features that emerge through socialisation are resistant to change and manipulation, and thus might serve as endogenous measures of sociopolitical contexts, and thus of groups.
In terms of method, the work takes a corpus-based approach to the analysis of data from the Twitter messaging service whereby patterns in users’ speech are examined statistically in order to trace potential community membership. The method is applied in the US state of Michigan during the second half of 2018—6 November having been the date of midterm (i.e. non-Presidential) elections in the United States. The corpus is assembled from the original posts of 5,889 users, who are nominally geolocalised to 417 municipalities. These users are clustered according to pervasive language features. Comparing the linguistic clusters according to the municipalities they represent finds that there are regular sociodemographic differentials across clusters. This is understood as an indication of social structure, suggesting that endogenous measures derived from pervasive patterns in language may indeed offer a complementary, contextualised perspective on large-scale social phenomena
Examples of works to practice staccato technique in clarinet instrument
Klarnetin staccato tekniğini güçlendirme aşamaları eser çalışmalarıyla uygulanmıştır. Staccato
geçişlerini hızlandıracak ritim ve nüans çalışmalarına yer verilmiştir. Çalışmanın en önemli amacı
sadece staccato çalışması değil parmak-dilin eş zamanlı uyumunun hassasiyeti üzerinde de
durulmasıdır. Staccato çalışmalarını daha verimli hale getirmek için eser çalışmasının içinde etüt
çalışmasına da yer verilmiştir. Çalışmaların üzerinde titizlikle durulması staccato çalışmasının ilham
verici etkisi ile müzikal kimliğe yeni bir boyut kazandırmıştır. Sekiz özgün eser çalışmasının her
aşaması anlatılmıştır. Her aşamanın bir sonraki performans ve tekniği güçlendirmesi esas alınmıştır.
Bu çalışmada staccato tekniğinin hangi alanlarda kullanıldığı, nasıl sonuçlar elde edildiği bilgisine
yer verilmiştir. Notaların parmak ve dil uyumu ile nasıl şekilleneceği ve nasıl bir çalışma disiplini
içinde gerçekleşeceği planlanmıştır. Kamış-nota-diyafram-parmak-dil-nüans ve disiplin
kavramlarının staccato tekniğinde ayrılmaz bir bütün olduğu saptanmıştır. Araştırmada literatür
taraması yapılarak staccato ile ilgili çalışmalar taranmıştır. Tarama sonucunda klarnet tekniğin de
kullanılan staccato eser çalışmasının az olduğu tespit edilmiştir. Metot taramasında da etüt
çalışmasının daha çok olduğu saptanmıştır. Böylelikle klarnetin staccato tekniğini hızlandırma ve
güçlendirme çalışmaları sunulmuştur. Staccato etüt çalışmaları yapılırken, araya eser çalışmasının
girmesi beyni rahatlattığı ve istekliliği daha arttırdığı gözlemlenmiştir. Staccato çalışmasını yaparken
doğru bir kamış seçimi üzerinde de durulmuştur. Staccato tekniğini doğru çalışmak için doğru bir
kamışın dil hızını arttırdığı saptanmıştır. Doğru bir kamış seçimi kamıştan rahat ses çıkmasına
bağlıdır. Kamış, dil atma gücünü vermiyorsa daha doğru bir kamış seçiminin yapılması gerekliliği
vurgulanmıştır. Staccato çalışmalarında baştan sona bir eseri yorumlamak zor olabilir. Bu açıdan
çalışma, verilen müzikal nüanslara uymanın, dil atış performansını rahatlattığını ortaya koymuştur.
Gelecek nesillere edinilen bilgi ve birikimlerin aktarılması ve geliştirici olması teşvik edilmiştir.
Çıkacak eserlerin nasıl çözüleceği, staccato tekniğinin nasıl üstesinden gelinebileceği anlatılmıştır.
Staccato tekniğinin daha kısa sürede çözüme kavuşturulması amaç edinilmiştir. Parmakların
yerlerini öğrettiğimiz kadar belleğimize de çalışmaların kaydedilmesi önemlidir. Gösterilen azmin ve
sabrın sonucu olarak ortaya çıkan yapıt başarıyı daha da yukarı seviyelere çıkaracaktır
Architecture Smells vs. Concurrency Bugs: an Exploratory Study and Negative Results
Technical debt occurs in many different forms across software artifacts. One
such form is connected to software architectures where debt emerges in the form
of structural anti-patterns across architecture elements, namely, architecture
smells. As defined in the literature, ``Architecture smells are recurrent
architectural decisions that negatively impact internal system quality", thus
increasing technical debt. In this paper, we aim at exploring whether there
exist manifestations of architectural technical debt beyond decreased code or
architectural quality, namely, whether there is a relation between architecture
smells (which primarily reflect structural characteristics) and the occurrence
of concurrency bugs (which primarily manifest at runtime). We study 125
releases of 5 large data-intensive software systems to reveal that (1) several
architecture smells may in fact indicate the presence of concurrency problems
likely to manifest at runtime but (2) smells are not correlated with
concurrency in general -- rather, for specific concurrency bugs they must be
combined with an accompanying articulation of specific project characteristics
such as project distribution. As an example, a cyclic dependency could be
present in the code, but the specific execution-flow could be never executed at
runtime
Analysis of ChatGPT on Source Code
This paper explores the use of Large Language Models (LLMs) and in particular
ChatGPT in programming, source code analysis, and code generation. LLMs and
ChatGPT are built using machine learning and artificial intelligence
techniques, and they offer several benefits to developers and programmers.
While these models can save time and provide highly accurate results, they are
not yet advanced enough to replace human programmers entirely. The paper
investigates the potential applications of LLMs and ChatGPT in various areas,
such as code creation, code documentation, bug detection, refactoring, and
more. The paper also suggests that the usage of LLMs and ChatGPT is expected to
increase in the future as they offer unparalleled benefits to the programming
community.Comment: 40 pages, examples provided for each experimen
Bibliographic Control in the Digital Ecosystem
With the contributions of international experts, the book aims to explore the new boundaries of universal bibliographic control. Bibliographic control is radically changing because the bibliographic universe is radically changing: resources, agents, technologies, standards and practices. Among the main topics addressed: library cooperation networks; legal deposit; national bibliographies; new tools and standards (IFLA LRM, RDA, BIBFRAME); authority control and new alliances (Wikidata, Wikibase, Identifiers); new ways of indexing resources (artificial intelligence); institutional repositories; new book supply chain; “discoverability” in the IIIF digital ecosystem; role of thesauri and ontologies in the digital ecosystem; bibliographic control and search engines
Technologies and Applications for Big Data Value
This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications for the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to technology frameworks which are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is then arranged in two parts. The first part “Technologies and Methods” contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part “Processes and Applications” details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the European data community's nucleus to bring together businesses with leading researchers to harness the value of data to benefit society, business, science, and industry. The book is of interest to two primary audiences, first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI. Second, practitioners and industry experts engaged in data-driven systems, software design and deployment projects who are interested in employing these advanced methods to address real-world problems
Mobile Forensics – The File Format Handbook
This open access book summarizes knowledge about several file systems and file formats commonly used in mobile devices. In addition to the fundamental description of the formats, there are hints about the forensic value of possible artefacts, along with an outline of tools that can decode the relevant data. The book is organized into two distinct parts: Part I describes several different file systems that are commonly used in mobile devices. · APFS is the file system that is used in all modern Apple devices including iPhones, iPads, and even Apple Computers, like the MacBook series. · Ext4 is very common in Android devices and is the successor of the Ext2 and Ext3 file systems that were commonly used on Linux-based computers. · The Flash-Friendly File System (F2FS) is a Linux system designed explicitly for NAND Flash memory, common in removable storage devices and mobile devices, which Samsung Electronics developed in 2012. · The QNX6 file system is present in Smartphones delivered by Blackberry (e.g. devices that are using Blackberry 10) and modern vehicle infotainment systems that use QNX as their operating system. Part II describes five different file formats that are commonly used on mobile devices. · SQLite is nearly omnipresent in mobile devices with an overwhelming majority of all mobile applications storing their data in such databases. · The second leading file format in the mobile world are Property Lists, which are predominantly found on Apple devices. · Java Serialization is a popular technique for storing object states in the Java programming language. Mobile application (app) developers very often resort to this technique to make their application state persistent. · The Realm database format has emerged over recent years as a possible successor to the now ageing SQLite format and has begun to appear as part of some modern applications on mobile devices. · Protocol Buffers provide a format for taking compiled data and serializing it by turning it into bytes represented in decimal values, which is a technique commonly used in mobile devices. The aim of this book is to act as a knowledge base and reference guide for digital forensic practitioners who need knowledge about a specific file system or file format. It is also hoped to provide useful insight and knowledge for students or other aspiring professionals who want to work within the field of digital forensics. The book is written with the assumption that the reader will have some existing knowledge and understanding about computers, mobile devices, file systems and file formats
- …