Neural-Augmented Static Analysis of Android Communication
We address the problem of discovering communication links between
applications in the popular Android mobile operating system, an important
problem for security and privacy in Android. Any scalable static analysis in
this complex setting is bound to produce an excessive amount of
false-positives, rendering it impractical. To improve precision, we propose to
augment static analysis with a trained neural-network model that estimates the
probability that a communication link truly exists. We describe a
neural-network architecture that encodes abstractions of communicating objects
in two applications and estimates the probability with which a link indeed
exists. At the heart of our architecture are type-directed encoders (TDE), a
general framework for elegantly constructing encoders of a compound data type
by recursively composing encoders for its constituent types. We evaluate our
approach on a large corpus of Android applications, and demonstrate that it
achieves very high accuracy. Further, we conduct thorough interpretability
studies to understand the internals of the learned neural networks.
Comment: Appears in Proceedings of the 2018 ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE).
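The type-directed encoder idea can be sketched in miniature: the encoder for a compound value is built by recursing on the value's type structure and composing encoders for its constituent parts. The toy Python sketch below uses made-up base encoders and combinators; the paper's actual TDEs compose trained neural encoders, so treat every name here as illustrative.

```python
import math

DIM = 4  # size of every base-type encoding in this toy sketch

def encode_int(x):
    # Toy base-type encoder for ints (a real TDE would use a learned embedding).
    return [float(x), float(x % 2), float(x % 3), 1.0]

def encode_str(s):
    # Toy base-type encoder for strings: bucket characters into DIM slots.
    v = [0.0] * DIM
    for i, ch in enumerate(s):
        v[i % DIM] += ord(ch)
    n = max(len(s), 1)
    return [x / n for x in v]

def encode(value):
    """Recursively compose encoders following the value's type structure."""
    if isinstance(value, bool):
        raise TypeError("no encoder for bool in this sketch")
    if isinstance(value, int):
        return encode_int(value)
    if isinstance(value, str):
        return encode_str(value)
    if isinstance(value, tuple):   # product type: concatenate parts, then squash
        parts = []
        for v in value:
            parts.extend(encode(v))
        return [math.tanh(x) for x in parts]
    if isinstance(value, list):    # sequence type: average element encodings
        if not value:
            return [0.0] * DIM
        encs = [encode(v) for v in value]
        return [sum(col) / len(encs) for col in zip(*encs)]
    raise TypeError(f"no encoder for {type(value)!r}")

# An abstraction of a communicating object might be a pair (action, extras):
vec = encode(("SEND", [1, 2, 3]))
print(len(vec))  # a fixed-size encoding of a compound value
```

The key property is compositionality: adding an encoder for a new base type immediately extends `encode` to every compound type built from it.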
Deep Learning Software Repositories
Bridging the abstraction gap between artifacts and concepts is the essence of software engineering (SE) research problems. SE researchers regularly use machine learning to bridge this gap, but there are three fundamental issues with traditional applications of machine learning in SE research: traditional applications are too reliant on labeled data, they are too reliant on human intuition, and they are not capable of learning expressive yet efficient internal representations. Ultimately, SE research needs approaches that can automatically learn representations of massive, heterogeneous datasets in situ, apply the learned features to a particular task, and possibly transfer knowledge from task to task.
Improvements in both computational power and the amount of memory in modern computer architectures have enabled new approaches to canonical machine learning tasks. Specifically, these architectural advances have enabled machines that are capable of learning deep, compositional representations of massive data depots. The rise of deep learning has ushered in tremendous advances in several fields. Given the complexity of software repositories, we presume deep learning has the potential to usher in new analytical frameworks and methodologies for SE research and the practical applications it reaches.
This dissertation examines and enables deep learning algorithms in different SE contexts. We demonstrate that deep learners significantly outperform state-of-the-practice software language models at code suggestion on a Java corpus. Further, these deep learners for code suggestion automatically learn how to represent lexical elements. We use these representations to transmute source code into structures for detecting similar code fragments at different levels of granularity, without declaring features for how the source code is to be represented.
Then we use our learning-based framework for encoding fragments to intelligently select and adapt statements in a codebase for automated program repair. In our work on code suggestion, code clone detection, and automated program repair, everything for representing lexical elements and code fragments is mined from the source code repository. Indeed, our work aims to move SE research from the art of feature engineering to the science of automated discovery.
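The fragment-representation idea can be illustrated with a toy sketch: map each lexical element to a vector, pool the token vectors into one fixed-size fragment representation, and compare fragments by cosine similarity. In the dissertation these representations are learned by deep networks; the hashed token vectors below are merely a deterministic stand-in.

```python
import math
import re

DIM = 16  # size of each (stand-in) token embedding

def token_vector(tok):
    # Deterministic stand-in for a learned embedding of one lexical element.
    v = [0.0] * DIM
    for i, ch in enumerate(tok):
        v[(i + ord(ch)) % DIM] += 1.0
    return v

def fragment_vector(code):
    # Lex the fragment, then average its token vectors into one representation.
    toks = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    vecs = [token_vector(t) for t in toks] or [[0.0] * DIM]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

f1 = "for (int i = 0; i < n; i++) sum += a[i];"
f2 = "for (int j = 0; j < m; j++) total += b[j];"
f3 = "return username.trim().toLowerCase();"

same = cosine(fragment_vector(f1), fragment_vector(f2))
diff = cosine(fragment_vector(f1), fragment_vector(f3))
print(round(same, 3), round(diff, 3))
```

A clone detector built this way would flag fragment pairs whose similarity exceeds a tuned threshold; with learned embeddings, renamed identifiers land near each other instead of hashing apart.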
Software weaknesses detection using static-code analysis and machine learning techniques
Dissertation submitted for the degree of Master in Informatics and Computer Engineering.
The software industry plays an essential role in the modern world in almost all fields. Vulnerabilities are predominant in software systems and can negatively impact computer security. Although tools exist to detect vulnerable code, their accuracy and efficacy remain a challenging research question. To define features that identify vulnerabilities, many existing solutions require hard work from human experts. The constantly increasing number of disclosed security vulnerabilities has become an important concern in the software industry and in the field of cybersecurity, implying that current approaches to vulnerability detection demand further improvement. This has motivated researchers in the software engineering and cybersecurity communities to apply machine learning to recognize the patterns and characteristics of vulnerable code. Following this research line, this work presents a machine-learning-based vulnerability detection system that uses static-code analysis to extract dependencies in the code and build data features from them. The dataset was collected from the National Vulnerability Database (NVD) and test cases of the NIST SAMATE project, and contains Java code as the selected target programming language, with null pointer dereference and command injection vulnerabilities as the selected weaknesses. The data samples were generated from the source code of the vulnerable files by using a control flow graph (CFG) to extract features; data-flow analysis techniques were also used for feature extraction. Experimental results demonstrate that our tool can achieve significantly fewer false negatives (with a reasonable number of false positives) compared to other approaches. We further applied the tool to real software products and were able to identify vulnerabilities, despite the number of false positives.
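A minimal sketch of such a pipeline, with an invented toy CFG, feature set, and decision rule (the thesis's actual features and trained models are richer than this):

```python
# A CFG as adjacency lists: node id -> successor node ids. Each node also
# carries the kind of statement it represents. Both are invented examples.
cfg_edges = {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: []}
cfg_stmts = {0: "assign", 1: "null_check", 2: "deref", 3: "return", 4: "return"}

def extract_features(edges, stmts):
    """Derive simple numeric features from a control-flow graph."""
    n_nodes = len(edges)
    n_edges = sum(len(succs) for succs in edges.values())
    n_branches = sum(1 for succs in edges.values() if len(succs) > 1)
    # Crude proxy for "dereference not preceded by any null check":
    derefs = [u for u, kind in stmts.items() if kind == "deref"]
    checks = [u for u, kind in stmts.items() if kind == "null_check"]
    unchecked = sum(1 for d in derefs if not any(c < d for c in checks))
    return [n_nodes, n_edges, n_branches, unchecked]

def classify(features):
    # Stand-in for a trained model: flag any unchecked dereference.
    return "vulnerable" if features[3] > 0 else "likely safe"

feats = extract_features(cfg_edges, cfg_stmts)
print(feats, classify(feats))
```

In the real system the feature vectors extracted from NVD/SAMATE samples would be fed to a learned classifier rather than a hand-written rule; the point of the sketch is only the CFG-to-features step.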
Combining Static and Dynamic Analysis for Bug Detection and Program Understanding
This work proposes new combinations of static and dynamic analysis for bug detection and program understanding. There are three related but largely independent directions: (a) in the area of dynamic invariant inference, we improve the consistency of dynamically discovered invariants by taking into account second-order constraints that encode knowledge about invariants; the second-order constraints are either supplied by the programmer or vetted by the programmer (among candidate constraints suggested automatically); (b) in the area of testing dataflow (esp. map-reduce) programs, our tool, SEDGE, achieves higher testing coverage by leveraging existing input data and generalizing them using a symbolic reasoning engine (a powerful SMT solver); (c) in the area of bug detection, we identify and present the concept of residual investigation: a dynamic analysis that serves as the runtime agent of a static analysis. Residual investigation identifies with higher certainty whether an error reported by the static analysis is likely true.
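The residual-investigation idea can be sketched as a static pass that emits candidate bug reports plus a lightweight runtime monitor that confirms them. The bug pattern here, a class defining equality without hashing (a Python analogue of Java's equals-without-hashCode), and the monitor are purely illustrative:

```python
def static_reports(classes):
    # "Static analysis": flag classes that define __eq__ but not __hash__.
    # (Python sets __hash__ to None in the class dict in exactly that case.)
    return [c for c in classes
            if "__eq__" in c.__dict__ and c.__dict__.get("__hash__") is None]

confirmed = set()

def monitor_hash_use(obj, flagged):
    # "Dynamic agent": a report becomes high-confidence only if a flagged
    # class's instance is actually used as a hash-table key at run time.
    if type(obj) in flagged:
        confirmed.add(type(obj).__name__)

class Point:                      # defines equality but not hashing
    def __init__(self, x): self.x = x
    def __eq__(self, other): return self.x == other.x

class Tag:                        # defines both
    def __init__(self, s): self.s = s
    def __eq__(self, other): return self.s == other.s
    def __hash__(self): return hash(self.s)

flagged = static_reports([Point, Tag])
table = {}
for obj in (Tag("a"),):           # only Tag instances are hashed in this run
    monitor_hash_use(obj, flagged)
    table[obj] = True

print([c.__name__ for c in flagged], sorted(confirmed))
```

In this run the static report on `Point` is never confirmed, because no `Point` instance ever reaches a hash table; that is precisely the filtering residual investigation provides, demoting statically plausible reports that the execution never corroborates.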
Spreadsheet Tools for Data Analysts
Spreadsheets are a natural fit for data analysis, combining a simple data storage and presentation layer with a programming language and basic debugging tools. Because spreadsheets are accessible and flexible, they are used by both novices and experts, and they are hugely popular: more than 750 million copies of Microsoft Excel are installed worldwide. By that measure, spreadsheets are the most popular programming language on the planet and the de facto tool for data analysis.
Nevertheless, spreadsheets do not address a number of important tasks in a typical analyst's pipeline, and their design frequently complicates them. This thesis describes three key challenges for analysts using spreadsheets. 1) Data wrangling is the process of converting or mapping data from a raw form into another form suitable for use with automated tools. 2) Data cleaning is the process of locating and correcting omitted or erroneous data. 3) Formula auditing is the process of finding and correcting spreadsheet program errors. These three tasks combined are estimated to occupy more than three quarters of a data analyst's time. Furthermore, errors not caught during these steps have led to catastrophically bad decisions resulting in billions of dollars in losses. Advances in automated techniques for these tasks may result in dramatic savings in both time and money.
Three novel programming language-based techniques were created to address these key tasks. The first, automatic layout transformation using examples, is a program synthesis-based technique that lets spreadsheet users perform data wrangling tasks automatically, at scale, and without programming. The second, data debugging, is a technique for data cleaning that combines program analysis and statistical analysis to automatically find likely data errors. The third, spatio-structural program analysis, unifies positional and dependence information and finds spreadsheet errors using a kind of anomaly analysis.
Each technique was implemented as an end-user tool (FlaskRelate, CheckCell, and ExceLint, respectively) in the form of a point-and-click plugin for Microsoft Excel. Our evaluation demonstrates that these techniques substantially improve user efficiency. Finally, because these tools build on each other in a complementary fashion, data analysts can run data wrangling, cleaning, and formula auditing tasks together in a single analysis pipeline.
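The data-debugging idea behind CheckCell can be sketched as an influence analysis: rank each input cell by how much an aggregate changes when that cell is removed, and surface the most influential cells for human review. The column, aggregate, and cutoff below are invented, and the real tool's statistics are more sophisticated:

```python
def impact(values, aggregate):
    """For each index i, measure |aggregate(all) - aggregate(all without i)|."""
    base = aggregate(values)
    return [abs(base - aggregate(values[:i] + values[i + 1:]))
            for i in range(len(values))]

def flag_suspects(values, aggregate, k=1):
    """Return the indices of the k cells with the largest influence."""
    scores = impact(values, aggregate)
    ranked = sorted(range(len(values)), key=lambda i: -scores[i])
    return ranked[:k]

mean = lambda xs: sum(xs) / len(xs)
column = [9.8, 10.1, 10.0, 97.0, 9.9]   # 97.0 is a plausible typo for 9.7

print(flag_suspects(column, mean))       # index of the most influential cell
```

The outlier 97.0 dominates the mean, so it is ranked first; an analyst then decides whether it is an error or a genuine extreme value, which is why the tool flags rather than auto-corrects.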
Man-machine partial program analysis for malware detection
With the meteoric rise in popularity of the Android platform, there is an urgent need to combat the accompanying proliferation of malware. Existing work addresses the area of consumer malware detection, but cannot detect novel, sophisticated, domain-specific malware that is targeted specifically at one aspect of an organization (e.g., ground operations of the US Military). Adversaries can exploit domain knowledge to camouflage malice within the legitimate behaviors of an app and behind a domain-specific trigger, rendering traditional approaches such as signature-matching, machine learning, and dynamic monitoring ineffective. Manual code inspections are also inadequate, scaling poorly and introducing human error. Yet, there is a dire need to detect this kind of malware before it causes catastrophic loss of life and property.
This dissertation presents the Security Toolbox, our novel solution for this challenging new problem posed by DARPA's Automated Program Analysis for Cybersecurity (APAC) program. We employ a human-in-the-loop approach to amplify the natural intelligence of our analysts. Our automation detects interesting program behaviors and exposes them in an analysis Dashboard, allowing the analyst to brainstorm flaw hypotheses and ask new questions, which in turn can be answered by our automated analysis primitives. The Security Toolbox is built on top of Atlas, a novel program analysis platform made by EnSoft. Atlas uses a graph-based mathematical abstraction of software to produce a unified property multigraph, exposes a powerful API for writing analyzers using graph traversals, and provides both automated and interactive capabilities to facilitate program comprehension. The Security Toolbox is also powered by FlowMiner, a novel solution to mine fine-grained, compact data flow summaries of Java libraries. FlowMiner allows the Security Toolbox to complete a scalable and accurate partial program analysis of an application without including all of the libraries that it uses (e.g., Android).
This dissertation presents the Security Toolbox, Atlas, and FlowMiner. We provide empirical evidence of the effectiveness of the Security Toolbox for detecting novel, sophisticated, domain-specific Android malware, demonstrating that our approach outperforms other cutting-edge research tools and state-of-the-art commercial programs in both time and accuracy metrics. We also evaluate the effectiveness of Atlas as a program analysis platform and FlowMiner as a library summary tool.
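Summary-based partial program analysis can be sketched as reachability over a combined graph: the app's own data-flow edges plus precomputed summary edges for library methods, so library bodies never need to be re-analyzed. All node names and edges below are invented and are not Atlas's or FlowMiner's actual representation:

```python
from collections import deque

# App-level data-flow edges (invented): value flows from key to each successor.
app_edges = {
    "getDeviceId()": ["id"],
    "id": ["StringBuilder.append(arg)"],
    "StringBuilder.toString()": ["msg"],
    "msg": ["sendTextMessage(arg)"],
}
# Mined library summary (invented): data flows from append's argument
# to toString's return value, without analyzing StringBuilder's body.
library_summaries = {
    "StringBuilder.append(arg)": ["StringBuilder.toString()"],
}

def reaches(graphs, src, dst):
    """Breadth-first reachability over the union of several edge maps."""
    seen, work = {src}, deque([src])
    while work:
        node = work.popleft()
        if node == dst:
            return True
        for g in graphs:
            for succ in g.get(node, []):
                if succ not in seen:
                    seen.add(succ)
                    work.append(succ)
    return False

leak = reaches([app_edges, library_summaries],
               "getDeviceId()", "sendTextMessage(arg)")
print(leak)  # True: device ID flows through the library into the SMS sink
```

The summary edge is what makes the analysis "partial" yet whole-path: without it, the flow from `append` to `toString` would be invisible unless the library itself were included in the analysis.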