Search CORE

630 research outputs found

"Influence Sketching": Finding Influential Samples In Large-Scale Regressions

Author: Crable Caleb
Cruz Ben
Luan Jay
Wallace Brian
Wojnowicz Mike
Wolff Matt
Zhao Xuan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/03/2017
Field of study

There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We validate that influence sketching can reliably and successfully discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples, and find that influence sketching pointed us to new, previously unidentified pieces of malware.Comment: fixed additional typo

arXiv.org e-Print Archive

Active Learning of Points-To Specifications

Author: Ali Karim
Feng Yu
Fuchs Adam P
Heule Stefan
Kodumal John
Publication venue
Publication date: 21/05/2018
Field of study

When analyzing programs, large libraries pose significant challenges to static points-to analysis. A popular solution is to have a human analyst provide points-to specifications that summarize relevant behaviors of library code, which can substantially improve precision and handle missing code such as native code. We propose ATLAS, a tool that automatically infers points-to specifications. ATLAS synthesizes unit tests that exercise the library code, and then infers points-to specifications based on observations from these executions. ATLAS automatically infers specifications for the Java standard library, and produces better results for a client static information flow analysis on a benchmark of 46 Android apps compared to using existing handwritten specifications

arXiv.org e-Print Archive

RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion

Author: Bauer Lujo
Huang Zhuoqun
Lucas Keane
Marchant Neil G.
Ohrimenko Olga
Rubinstein Benjamin I. P.
Publication venue
Publication date: 26/10/2023
Field of study

Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where

\ell_p

-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as for source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism randomized deletion (RS-Del) applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.Comment: To be published in NeurIPS 2023. 36 pages, 7 figures, 12 tables. Includes 20 pages of appendice

arXiv.org e-Print Archive

Host-based detection and analysis of Android malware: implication for privilege exploitation

Author: Ashawa Moses
Morris Sarah
Publication venue: 'Infonomics Society'
Publication date: 30/06/2019
Field of study

The Rapid expansion of mobile Operating Systems has created a proportional development in Android malware infection targeting Android which is the most widely used mobile OS. factors such Android open source platform, low-cost influence the interest of malware writers targeting this mobile OS. Though there are a lot of anti-virus programs for malware detection designed with varying degrees of signatures for this purpose, many don’t give analysis of what the malware does. Some anti-virus engines give clearance during installations of repackaged malicious applications without detection. This paper collected 28 Android malware family samples with a total of 163 sample dataset. A general analysis of the entire sample dataset was created given credence to their individual family samples and year discovered. A general detection and classification of the Android malware corpus was performed using K-means clustering algorithm. Detection rules were written with five major functions for automatic scanning, signature enablement, quarantine and reporting the scan results. The LMD was able to scan a file size of 2048mb and report accurately whether the file is benign or malicious. The K-means clustering algorithm used was set to 5 iteration training phases and was able to classify accurately the malware corpus into benign and malicious files. The obtained result shows that some Android families exploit potential privileges on mobile devices. Information leakage from the victim’s device without consent and payload deposits are some of the results obtained. The result calls proactive measures rather than proactive in tackling malware infection on Android based mobile devices