Classifying Web Exploits with Topic Modeling
This short empirical paper investigates how well topic modeling and database
metadata characteristics can classify web and other proof-of-concept (PoC)
exploits for publicly disclosed software vulnerabilities. Using a dataset of
over 36 thousand PoC exploits, an accuracy rate of nearly 0.9 is obtained in
the empirical experiment. Text mining and topic modeling contribute
substantially to this classification performance. In addition to these
empirical results, the paper contributes to the research tradition of
enhancing software vulnerability information with text mining, and provides a
few scholarly observations about the potential for semi-automatic
classification of exploits in existing tracking infrastructures.
Comment: Proceedings of the 2017 28th International Workshop on Database and
Expert Systems Applications (DEXA).
http://ieeexplore.ieee.org/abstract/document/8049693
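The pipeline the abstract outlines, topic-model features feeding a classifier,
can be sketched with scikit-learn. The snippet below is a minimal illustration,
not the paper's implementation: the toy exploit descriptions, the
web-versus-other label scheme, and all hyperparameters are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for PoC exploit descriptions and their labels (assumptions).
descriptions = [
    "sql injection in login form",
    "cross-site scripting in search page",
    "remote buffer overflow in ftp daemon",
    "local privilege escalation via kernel driver",
]
labels = ["web", "web", "other", "other"]

# Bag-of-words counts -> LDA topic mixtures -> logistic-regression classifier.
clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=5, random_state=0),
    LogisticRegression(max_iter=1000),
)
clf.fit(descriptions, labels)
print(clf.predict(["directory traversal in web application"]))
```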
The Python user interface of the elsA cfd software: a coupling framework for external steering layers
The Python-elsA user interface of the elsA CFD (Computational Fluid
Dynamics) software has been developed to allow users to specify simulations
with confidence, through a global context of description objects grouped
inside scripts. The software's main features are generated documentation,
context checking and completion, and helpful error management. Further
developments have used this foundation as a coupling framework: thanks to the
descriptive approach, external algorithms can be coupled with the CFD solver
in a simple and abstract way, leading to more success in complex simulations.
Along with describing the technical part of the interface, we gather the
salient points from the psychological viewpoint of user experience (UX). We
point out the differences between user interfaces and pure data management
systems such as CGNS.
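To illustrate the descriptive approach, the hypothetical sketch below shows
declarative description objects whose attributes are checked against a schema,
so context checking and error management happen at script level. All class and
attribute names are invented for illustration; they are not the actual
Python-elsA API.

```python
class Description:
    """Declarative object: users state what to simulate, not how."""
    allowed = {}  # per-subclass schema: attribute name -> set of legal values

    def set(self, name, value):
        # Context checking: unknown attributes or illegal values fail early,
        # at script level, with a helpful message.
        if name not in self.allowed:
            raise AttributeError(
                f"unknown attribute {name!r}; expected one of {sorted(self.allowed)}"
            )
        if value not in self.allowed[name]:
            raise ValueError(
                f"{name}={value!r}; legal values are {sorted(self.allowed[name])}"
            )
        setattr(self, name, value)

class Model(Description):
    # Invented schema, purely for illustration.
    allowed = {"turbulence": {"laminar", "spalart"}, "gas": {"perfect"}}

model = Model()
model.set("turbulence", "spalart")      # accepted
try:
    model.set("turbulence", "k-omega")  # rejected by context checking
except ValueError as err:
    print(err)
```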
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
Statistical language modeling techniques have successfully been applied to
large source code corpora, yielding a variety of new software development
tools, such as tools for code suggestion, improving readability, and API
migration. A major issue with these techniques is that code introduces new
vocabulary at a far higher rate than natural language, as new identifier names
proliferate. Both large vocabularies and out-of-vocabulary issues severely
affect Neural Language Models (NLMs) of source code, degrading their
performance and rendering them unable to scale.
In this paper, we address this issue by: 1) studying how various modelling
choices impact the resulting vocabulary on a large-scale corpus of 13,362
projects; 2) presenting an open vocabulary source code NLM that can scale to
such a corpus, 100 times larger than in previous work; and 3) showing that such
models outperform the state of the art on three distinct code corpora (Java, C,
Python). To our knowledge, these are the largest NLMs for code that have been
reported.
All datasets, code, and trained models used in this work are publicly
available.
Comment: 13 pages; to appear in Proceedings of ICSE 2020
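Open-vocabulary language models for source code commonly tame identifier
proliferation by segmenting tokens into subwords, for instance with byte-pair
encoding (BPE). The sketch below illustrates that general mechanism on a toy
identifier corpus; it is an assumption about the technique, not the code
released with the paper.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Learn frequent adjacent-symbol merges (classic BPE) from identifiers."""
    vocab = Counter(tuple(w) for w in words)   # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split a (possibly unseen) identifier using the learned merges."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = learn_merges(["getValue", "getName", "setValue"], num_merges=8)
print(segment("getConfig", merges))  # unseen name decomposes into known subwords
```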