Deobfuscating Name Scrambling as a Natural Language Generation Task
We are interested in data-driven approaches to Natural Language Generation, but semantic representations for human text are difficult and expensive to construct. By treating a method's implementation as weak semantics for the English terms extracted from the method's name, we can collect massive datasets, akin to having words and sensor data aligned at a scale never seen before. We applied our learned model to name scrambling, a common technique used to protect intellectual property and increase the effort needed to reverse engineer Java binary code: replacing all method and class names with random identifiers.
Using 5.6M bytecode-compiled Java methods obtained from the Debian archive, we trained a Random Forest model to predict the first term in the method name. As features, we primarily use the opcodes of the bytecodes (that is, bytecodes without any parameters). Our results indicate that we can distinguish the 15 most popular terms from the others at 78% recall, helping a programmer performing reverse engineering to halve the number of methods in a program they need to investigate further.
Sociedad Argentina de Informática e Investigación Operativa
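The term-extraction and feature steps described above can be sketched as follows. The opcode vocabulary and method name below are illustrative assumptions, and the Random Forest itself (e.g. scikit-learn's RandomForestClassifier) would then be trained on such label/vector pairs:

```python
import re
from collections import Counter

def first_term(method_name):
    """Split a camelCase/underscore Java method name and return its
    first lowercase term, e.g. 'getUserName' -> 'get'."""
    parts = re.split(r'_|(?<=[a-z0-9])(?=[A-Z])', method_name)
    return parts[0].lower()

def bag_of_opcodes(opcodes, vocabulary):
    """Count opcode occurrences (parameters already stripped) into a
    fixed-length feature vector over a known opcode vocabulary."""
    counts = Counter(opcodes)
    return [counts[op] for op in vocabulary]

# Illustrative subset of the JVM opcode vocabulary.
vocab = ["aload_0", "getfield", "areturn", "invokevirtual"]
features = bag_of_opcodes(["aload_0", "getfield", "areturn"], vocab)
label = first_term("getUserName")  # the prediction target, here "get"
```

This pairs each method's opcode histogram with the first term of its (pre-scrambling) name, which is the supervised signal the model learns from.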
Automatic multi-label categorization of Java applications using Dependency graphs
Automatic approaches for categorization of software repositories are increasingly gaining acceptance because they reduce manual effort and can produce high-quality results. Most existing approaches rely strongly on supervised machine learning (which requires a set of predefined categories to be used as training data) and have used source code, comments, API calls, and other sources to obtain information about the projects to be categorized. We consider that existing approaches have weaknesses that can have major implications for the categorization results and that have not been solved at the same time, namely the assumption of unrestricted access to source code and the use of predefined sets of categories. Therefore, we present Sally: a novel, unsupervised, multi-label automatic categorization model that obtains meaningful categories without depending on access to source code or on predefined categories, by leveraging information obtained from the projects in the categorization corpus and the dependency relations between them. We performed two experiments in which we compared Sally to the categorization strategies of two widely used websites and to MUDABlue, a categorization model proposed by Kawaguchi et al. that we consider a good baseline. Additionally, we assessed the proposed model by conducting a survey with 14 developers with a wide range of programming experience, and we developed a web application to make the proposed model available to potential users.
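As a rough illustration of the idea behind dependency-based categorization (not Sally's actual algorithm, which the abstract does not specify), multiple labels can be derived from dependency names alone, with no source code and no predefined training set. The projects, dependencies, and keyword table below are all hypothetical:

```python
from collections import Counter

# Hypothetical corpus: project -> set of dependency names
# (no access to source code is required).
dependencies = {
    "photo-album": {"image-io", "http-server", "image-filters"},
    "blog-engine": {"http-server", "markdown", "template-engine"},
}

# Hypothetical keyword -> category table, standing in for terms
# that would be mined from the categorization corpus itself.
keyword_categories = {
    "image": "graphics", "filters": "graphics",
    "http": "web", "template": "web", "markdown": "text",
}

def categorize(project):
    """Assign multiple labels to a project from its dependency names."""
    labels = Counter()
    for dep in dependencies[project]:
        for keyword, category in keyword_categories.items():
            if keyword in dep:
                labels[category] += 1
    return {category for category, count in labels.items() if count >= 1}
```

Because several labels can pass the threshold at once, the result is inherently multi-label, matching the unsupervised setting the abstract describes.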
Smart contracts categorization with topic modeling techniques
One of the main advantages of the Ethereum blockchain is the possibility of developing smart contracts in a Turing-complete environment. These general-purpose programs provide a higher level of security than traditional contracts and reduce other transaction costs associated with the bargaining practice. Developers use smart contracts to build their tokens and to set up gambling games, crowdsales, ICOs, and many other applications. Since there are several million smart contracts on the Ethereum blockchain, it is unthinkable to check every program manually to understand its functionality. At the same time, it would be of primary importance to group sets of smart contracts according to their purposes and functionalities. One way to group Ethereum's smart contracts is to use topic modeling techniques, taking advantage of the fact that many programs representing a specific topic are similar in program structure. Starting from a dataset of 130k smart contracts, we built a Latent Dirichlet Allocation (LDA) model to identify the number of topics within our sample. Computing coherence values for different numbers of topics, we found that the optimal number was 15. As expected, most programs are tokens, games, crowdfunding platforms, and ICOs.
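The coherence-driven choice of topic count can be illustrated with a minimal stdlib sketch of UMass coherence. In practice one would fit an LDA model (e.g. with gensim) for each candidate topic count and keep the count with the highest coherence, as the paper does in finding 15; the toy "contract" corpus below is illustrative:

```python
import math

def umass_coherence(topic_words, documents):
    """UMass topic coherence: sum over ordered word pairs of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents
    containing the given word(s). Higher means more coherent.
    Assumes every top word occurs in at least one document."""
    def doc_count(*words):
        return sum(1 for doc in documents if all(w in doc for w in words))
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((doc_count(wi, wj) + 1) / doc_count(wj))
    return score

# Toy corpus: each 'contract' reduced to a set of tokens (illustrative).
docs = [
    {"token", "transfer", "balance"},
    {"token", "transfer", "mint"},
    {"game", "bet", "random"},
    {"game", "bet", "win"},
]
# A topic whose top words co-occur scores higher than one whose don't.
coherent = umass_coherence(["token", "transfer"], docs)
incoherent = umass_coherence(["token", "bet"], docs)
```

Repeating this scoring over models with different topic counts and picking the maximum is the model-selection step the abstract refers to.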
Antipatterns in Software Classification Taxonomies
Empirical results in software engineering have long shown that findings are unlikely to apply to all software systems or to every domain: results need to be evaluated in specified contexts and limited to the type of systems they were extracted from. This is a known issue, and it requires the establishment of a classification of software types.
This paper makes two contributions: the first is to evaluate the quality of the current software classification landscape; the second is to perform a case study showing how to create a classification of software types using a curated set of software systems.
Our contributions show that existing, and very likely even new, classification attempts are doomed to fail for one or more issues, which we named the 'antipatterns' of software classification tasks. We collected 7 of these antipatterns that emerge from both our case study and the existing classifications. These antipatterns represent recurring issues in a classification, so we discuss practical ways to help researchers avoid these pitfalls. It becomes clear that classification attempts must also face the daunting task of formulating a taxonomy of software types, with the objective of establishing a hierarchy of categories in a classification.
Comment: Accepted for publication at the Journal of Systems and Software
Advances in Cybercrime Prediction: A Survey of Machine, Deep, Transfer, and Adaptive Learning Techniques
Cybercrime is a growing threat to organizations and individuals worldwide, with criminals using increasingly sophisticated techniques to breach security systems and steal sensitive data. In recent years, machine learning, deep learning, and transfer learning techniques have emerged as promising tools for predicting cybercrime and preventing it before it occurs. This paper aims to provide a comprehensive survey of the latest advancements in cybercrime prediction using the above-mentioned techniques, highlighting the latest research related to each approach. For this purpose, we reviewed more than 150 research articles and discussed around 50 of the most recent and relevant ones. We start the review by discussing some common methods used by cybercriminals and then focus on the latest machine learning and deep learning techniques, such as recurrent and convolutional neural networks, which have been effective in detecting anomalous behavior and identifying potential threats. We also discuss transfer learning, which allows models trained on one dataset to be adapted for use on another, and then focus on active and reinforcement learning as part of early-stage algorithmic research in cybercrime prediction. Finally, we discuss critical innovations, research gaps, and future research opportunities in cybercrime prediction. Overall, this paper presents a holistic view of cutting-edge developments in cybercrime prediction, shedding light on the strengths and limitations of each method and equipping researchers and practitioners with essential insights, publicly available datasets, and the resources necessary to develop efficient cybercrime prediction systems.
Comment: 27 pages, 6 figures, 4 tables
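The transfer-learning idea mentioned above can be sketched with a toy one-dimensional logistic regression: weights learned on a source task initialize training on a related target task instead of starting from scratch. Real systems transfer deep-network weights; the model, data, and thresholds here are purely illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, w=0.0, b=0.0, lr=0.1, epochs=2000):
    """Plain per-sample gradient descent for 1-D logistic regression.
    Passing in (w, b) learned on another dataset is the transfer step."""
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Source task (assumed plentiful labels): positive when x > 0.
source = [(-2, 0), (-1, 0), (1, 1), (2, 1)]
w, b = train(source)

# Target task with a shifted decision boundary: fine-tune from the
# source weights rather than re-initializing.
target = [(1, 0), (2, 0), (4, 1), (5, 1)]
w, b = train(target, w, b, epochs=1000)
```

The same pattern (pretrain, then fine-tune on the new distribution) is what the surveyed deep-learning work applies at much larger scale.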
Using multiclass classification algorithms to improve the text categorization tool NLoN
Natural language processing (NLP) and machine learning techniques have been widely utilized in the mining software repositories (MSR) field in recent years. Separating natural language from source code is a pre-processing step needed in both NLP and the MSR domain for better data quality. This paper presents the design and implementation of a multi-class classification approach based on the existing open-source R package Natural Language or Not (NLoN).
This article also reviews the existing literature on MSR and NLP. The review classifies the information sources and approaches of MSR in detail, and also focuses on the text representation and classification tasks of NLP. In addition, the design and implementation methods of the original paper are briefly introduced.
Regarding the research methodology, since the research goal is technology-oriented, i.e., to improve the design and implementation of existing technologies, this article adopts the design science research methodology and describes how it was applied.
This research implements an open-source Python library, NLoN-PY, hosted on GitHub; users can also install the published tool directly from PyPI.
Since NLoN achieved comparable performance on two-class classification tasks with a Lasso regression model, this study evaluated other multi-class classification algorithms, i.e., Naive Bayes, k-Nearest Neighbours, and Support Vector Machine. Using 10-fold cross-validation, the expanded classifier achieved an AUC of 0.901 for the 5-class classification task and an AUC of 0.92 for the 2-class task.
Although the design of this study did not show a significant performance improvement compared to the original design, the impact of unbalanced data distribution on performance was detected, and the category structure of the classification problem was refined in the process. These findings on multi-class classification design can provide a foundation and direction for future research.
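The natural-language-versus-code separation can be sketched with simple character-level features in the spirit of NLoN's (the actual NLoN/NLoN-PY feature set differs); a classifier such as Naive Bayes, k-NN, or an SVM would then consume these feature vectors:

```python
import re

def features(line):
    """Illustrative character-level features for one input line:
    density of code punctuation and sentence-like endings."""
    stripped = line.strip()
    specials = sum(1 for c in stripped if c in "{}();=<>[]")
    return {
        # Code lines tend to be dense in braces, semicolons, operators.
        "special_ratio": specials / max(len(stripped), 1),
        # English sentences usually end with terminal punctuation.
        "ends_like_sentence": stripped.endswith((".", "?", "!")),
        # English sentences usually start with a capital letter.
        "starts_with_capital": bool(re.match(r"[A-Z]", stripped)),
    }

code_line = features("for (int i = 0; i < n; i++) {")
text_line = features("This method returns the user name.")
```

On these two examples the code line scores a much higher special-character ratio while only the prose line ends like a sentence, which is the kind of signal the multi-class classifiers exploit.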