Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts
Most of the JavaScript code deployed in the wild has been minified, a process
in which identifier names are replaced with short, arbitrary and meaningless
names. Minified code occupies less space, but also makes the code extremely
difficult to manually inspect and understand. This paper presents Context2Name,
a deep learning-based technique that partially reverses the effect of
minification by predicting natural identifier names for minified names. The
core idea is to predict from the usage context of a variable a name that
captures the meaning of the variable. The approach combines a lightweight,
token-based static analysis with an auto-encoder neural network that summarizes
usage contexts and a recurrent neural network that predicts natural names for a
given usage context. We evaluate Context2Name with a large corpus of real-world
JavaScript code and show that it successfully predicts 47.5% of all minified
identifiers while taking only 2.9 milliseconds on average to predict a name. A
comparison with the state-of-the-art tools JSNice and JSNaughty shows that our
approach performs comparably in terms of accuracy while improving in terms of
efficiency. Moreover, Context2Name complements the state-of-the-art by
predicting 5.3% additional identifiers that are missed by both existing tools.
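The token-based usage context mentioned in the abstract above can be illustrated with a minimal Python sketch. The tokenizer regex, the window size of 2, and the function name `usage_contexts` are assumptions made for illustration; they are not taken from the paper, and the auto-encoder and recurrent network that would consume these contexts are omitted.

```python
import re

# Hypothetical tokenizer: identifiers, integer literals, or any single
# non-whitespace character. A real JavaScript lexer is more involved.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S")


def usage_contexts(source, name, window=2):
    """Collect the tokens surrounding each occurrence of `name`.

    Returns one (tokens_before, tokens_after) pair per occurrence.
    """
    tokens = TOKEN_RE.findall(source)
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == name:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            contexts.append((before, after))
    return contexts


minified = "function f(a){return a.length>0}"
contexts = usage_contexts(minified, "a")
# contexts == [(['f', '('], [')', '{']), (['{', 'return'], ['.', 'length'])]
```

Here the second context (`return a.length`) already hints that `a` is something with a length, which is the kind of signal a name predictor could learn from.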
Impact Of Content Features For Automatic Online Abuse Detection
Online communities have gained considerable importance in recent years due to
the increasing number of people connected to the Internet. Moderating user
content in online communities is mainly performed manually, and reducing the
workload through automatic methods is of great financial interest for community
maintainers. Often, the industry uses basic approaches such as bad words
filtering and regular expression matching to assist the moderators. In this
article, we consider the task of automatically determining if a message is
abusive. This task is complex since messages are written in a non-standardized
way, including spelling errors, abbreviations, community-specific codes...
First, we evaluate the system that we propose using standard features of online
messages. Then, we evaluate the impact of the addition of pre-processing
strategies, as well as original specific features developed for the community
of an online in-browser strategy game. We finally propose to analyze the
usefulness of this wide range of features using feature selection. This work
can lead to two possible applications: 1) automatically flag potentially
abusive messages to draw the moderator's attention to a narrow subset of
messages; and 2) fully automate the moderation process by deciding whether a
message is abusive without any human intervention.
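The basic industry approach mentioned in the abstract above (bad-word filtering combined with regular expression matching) might look roughly like the following Python sketch. The word list, the character substitution table, and the function names are hypothetical; the sketch shows only the baseline, not the system the authors propose.

```python
import re

# Hypothetical word list; a real deployment would use a curated lexicon.
BAD_WORDS = {"idiot", "loser"}

# Hypothetical substitution table undoing common character swaps.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "@": "a", "$": "s"})


def normalize(message):
    """Lowercase, undo simple substitutions, collapse long character runs."""
    text = message.lower().translate(LEET)
    # Collapse runs of three or more identical characters to one.
    return re.sub(r"(.)\1{2,}", r"\1", text)


def is_flagged(message):
    """Return True if any normalized word is in the bad-word list."""
    words = re.findall(r"[a-z]+", normalize(message))
    return any(word in BAD_WORDS for word in words)
```

For example, `is_flagged("You are an 1d10t")` is true after normalization, while `is_flagged("nice game, well played")` is false. The non-standardized spelling described in the abstract is exactly what makes such filters easy to bypass, which motivates the richer features the article evaluates.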
Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs
Binary code analysis allows analyzing binary code without having access to
the corresponding source code. A binary, after disassembly, is expressed in an
assembly language. This inspires us to approach binary analysis by leveraging
ideas and techniques from Natural Language Processing (NLP), a rich area
focused on processing text of various natural languages. We notice that binary
code analysis and NLP share many analogous topics, such as semantics
extraction, summarization, and classification. This work utilizes these ideas
to address two important code similarity comparison problems. (I) Given a pair
of basic blocks for different instruction set architectures (ISAs), determining
whether their semantics is similar or not; and (II) given a piece of code of
interest, determining if it is contained in another piece of assembly code for
a different ISA. The solutions to these two problems have many applications,
such as cross-architecture vulnerability discovery and code plagiarism
detection. We implement a prototype system INNEREYE and perform a comprehensive
evaluation. A comparison between our approach and existing approaches to
Problem I shows that our system outperforms them in terms of accuracy,
efficiency and scalability. And the case studies utilizing the system
demonstrate that our solution to Problem II is effective. Moreover, this
research showcases how to apply ideas and techniques from NLP to large-scale
binary code analysis. Comment: Accepted by Network and Distributed Systems Security (NDSS) Symposium
201
Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase
The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from the exact copy of original materials to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and to the different types of obfuscation found in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, the sentence pairs with the highest similarity are considered as the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including Jaccard similarity and merging threshold, are tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, sets a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically by their scale of obfuscation. By employing the online tuning approach, no distinct training dataset is required to train the system. We applied our proposed method to available datasets in English, Persian and Arabic on the text alignment task to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, our efficient approach can achieve considerable performance on the different datasets in various languages. Our online threshold tuning approach, without any training datasets, works as well as, or in some cases even better than, the training-based method.
The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31). Gharavi, E.; Veisi, H.; Rosso, P. (2020). Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase. Neural Computing and Applications. 32(14):10593-10607. https://doi.org/10.1007/s00521-019-04594-y
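The seed-finding step described in the abstract above (aggregate word vectors into a sentence representation, then pair the most similar sentences) can be sketched in Python as follows. The toy two-dimensional embeddings, the choice of averaging as the aggregation function, and all function names are assumptions for illustration; the abstract does not specify them, and the offline and online threshold tuning is not shown.

```python
import math

# Hypothetical toy embeddings; a real system would load pretrained vectors.
EMBEDDINGS = {
    "cat": [1.0, 0.0], "feline": [0.9, 0.1],
    "sat": [0.0, 1.0], "rested": [0.1, 0.9],
    "stock": [-1.0, 0.0], "fell": [0.0, -1.0],
}


def sentence_vector(sentence):
    """Average the vectors of known words (assumed aggregation function)."""
    vectors = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Word-level Jaccard similarity, one of the filtering parameters named above."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def best_seed(suspicious, sources):
    """Return (index, similarity) of the closest source sentence, or None."""
    target = sentence_vector(suspicious)
    if target is None:
        return None
    scored = []
    for i, sentence in enumerate(sources):
        vec = sentence_vector(sentence)
        if vec is not None:
            scored.append((cosine(target, vec), i))
    if not scored:
        return None
    similarity, index = max(scored)
    return index, similarity
```

With `sources = ["stock fell", "feline rested"]`, `best_seed("cat sat", sources)` selects index 1 even though the two sentences share no words (their Jaccard similarity is 0), which illustrates why embeddings help with paraphrase-style obfuscation.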
Using Botnet Technologies to Counteract Network Traffic Analysis
Botnets have been problematic for over a decade. They are used to launch malicious activities including DDoS (Distributed Denial-of-Service) attacks, spamming, identity theft, unauthorized bitcoin mining and malware distribution. A recent nationwide DDoS attack caused by the Mirai botnet on October 21, 2016, involving tens of millions of IP addresses, took down Twitter, Spotify, Reddit, The New York Times, Pinterest, PayPal and other major websites. In response to take-down campaigns by security personnel, botmasters have developed technologies to evade detection. The most widely used evasion technique is DNS fast-flux, where the botmaster frequently changes the mapping between domain names and IP addresses of the C&C server so that it will be too late or too costly to trace the C&C server locations. Domain names generated with Domain Generation Algorithms (DGAs) are used as the 'rendezvous' points between botmasters and bots. This work focuses on how to apply botnet technologies (fast-flux and DGA) to counteract network traffic analysis, thereby protecting user privacy. A better understanding of botnet technologies also helps us be proactive in defending against botnets. First, we proposed two new DGAs using hidden Markov models (HMMs) and Probabilistic Context-Free Grammars (PCFGs) which can evade current detection methods and systems. Also, we developed two HMM-based DGA detection methods that can detect botnet DGA-generated domain names with or without training sets. This helps security personnel understand the botnet phenomenon and develop proactive tools to detect botnets. Second, we developed a distributed proxy system using fast-flux to evade national censorship and surveillance. The goal is to help journalists, human rights advocates and NGOs in West Africa to have a secure and free Internet. Then we developed a covert data transport protocol to transform arbitrary messages into real DNS traffic.
We encode the message into benign-looking domain names generated by an HMM, which represents the statistical features of legitimate domain names. This can be used to evade Deep Packet Inspection (DPI) and protect user privacy in two-way communication. Both applications serve as examples of applying botnet technologies to legitimate use. Finally, we proposed a new protocol obfuscation technique that transforms an arbitrary network protocol into another (with the Network Time Protocol and a video game protocol of Minecraft as examples) in terms of packet syntax and side-channel features (inter-packet delay and packet size). This research uses botnet technologies to help normal users have secure and private communications over the Internet. From our botnet research, we conclude that network traffic is a malleable and artificial construct. Although existing patterns are easy to detect and characterize, they are also subject to modification and mimicry. This means that we can construct transducers to make any communication pattern look like any other communication pattern. This is neither bad nor good for security. It is a fact that we need to accept and use as best we can.
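The idea of scoring domain names against the statistics of legitimate names, which underlies the detection side of the abstract above, can be sketched in Python. This sketch uses a character-bigram Markov chain with add-one smoothing as a deliberately simpler stand-in for the HMMs the abstract mentions; the training list, alphabet, and function names are hypothetical.

```python
import math
from collections import Counter

# Assumed alphabet of characters allowed in a domain label.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"


def train_bigrams(names):
    """Count character bigrams; '^' marks the start of a name."""
    pair_counts = Counter()
    first_counts = Counter()
    for name in names:
        padded = "^" + name.lower()
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def avg_log_likelihood(name, model):
    """Average per-character log-probability under the bigram model."""
    pair_counts, first_counts = model
    padded = "^" + name.lower()
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        # Add-one smoothing so unseen bigrams get a small nonzero probability.
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + len(ALPHABET))
        total += math.log(p)
    return total / max(len(name), 1)


# Hypothetical tiny training set standing in for a list of legitimate names.
model = train_bigrams(["google", "facebook", "wikipedia",
                       "amazon", "youtube", "reddit"])
natural_score = avg_log_likelihood("google", model)
random_score = avg_log_likelihood("xqzjkvwp", model)
```

A name whose score falls well below that of typical legitimate names would be a candidate for closer inspection. The same statistics are what a generator would need to mimic in order to look benign, which is the malleability point the abstract concludes with.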