5,256 research outputs found
Learning Fast and Slow: PROPEDEUTICA for Real-time Malware Detection
In this paper, we introduce and evaluate PROPEDEUTICA, a novel methodology
and framework for efficient and effective real-time malware detection,
leveraging the best of conventional machine learning (ML) and deep learning
(DL) algorithms. In PROPEDEUTICA, all software processes in the system start
execution subjected to a conventional ML detector for fast classification. If a
piece of software receives a borderline classification, it is subjected to
further analysis via more performance expensive and more accurate DL methods,
via our newly proposed DL algorithm DEEPMALWARE. Further, we introduce delays
to the execution of software subjected to deep learning analysis as a way to
"buy time" for DL analysis and to rate-limit the impact of possible malware in
the system. We evaluated PROPEDEUTICA with a set of 9,115 malware samples and
877 commonly used benign software samples from various categories for the
Windows OS. Our results show that the false positive rate for conventional ML
methods can reach 20%, and for modern DL methods it is usually below 6%.
However, the classification time for DL can be 100X longer than conventional ML
methods. PROPEDEUTICA improved the detection F1-score from 77.54% (conventional
ML method) to 90.25%, and reduced the detection time by 54.86%. Further, the
percentage of software subjected to DL analysis was approximately 40% on
average. Further, the application of delays in software subjected to ML reduced
the detection time by approximately 10%. Finally, we found and discussed a
discrepancy between the detection accuracy offline (analysis after all traces
are collected) and on-the-fly (analysis in tandem with trace collection). Our
insights show that conventional ML and modern DL-based malware detectors in
isolation cannot meet the needs of efficient and effective malware detection:
high accuracy, low false positive rate, and short classification time.Comment: 17 pages, 7 figure
Usability heuristics for fast crime data anonymization in resource-constrained contexts
This thesis considers the case of mobile crime-reporting systems that have emerged as an effective and efficient data collection method in low and middle-income countries. Analyzing the data, can be helpful in addressing crime. Since law enforcement agencies in resource-constrained context typically do not have the expertise to handle these tasks, a cost-effective strategy is to outsource the data analytics tasks to third-party service providers. However, because of the sensitivity of the data, it is expedient to consider the issue of privacy. More specifically, this thesis considers the issue of finding low-intensive computational solutions to protecting the data even from an "honest-but-curious" service provider, while at the same time generating datasets that can be queried efficiently and reliably. This thesis offers a three-pronged solution approach. Firstly, the creation of a mobile application to facilitate crime reporting in a usable, secure and privacy-preserving manner. The second step proposes a streaming data anonymization algorithm, which analyses reported data based on occurrence rate rather than at a preset time on a static repository. Finally, in the third step the concept of using privacy preferences in creating anonymized datasets was considered. By taking into account user preferences the efficiency of the anonymization process is improved upon, which is beneficial in enabling fast data anonymization. Results from the prototype implementation and usability tests indicate that having a usable and covet crime-reporting application encourages users to declare crime occurrences. Anonymizing streaming data contributes to faster crime resolution times, and user privacy preferences are helpful in relaxing privacy constraints, which makes for more usable data from the querying perspective. This research presents considerable evidence that the concept of a three-pronged solution to addressing the issue of anonymity during crime reporting in a resource-constrained environment is promising. This solution can further assist the law enforcement agencies to partner with third party in deriving useful crime pattern knowledge without infringing on users' privacy. In the future, this research can be extended to more than one low-income or middle-income countries
Implicit Smartphone User Authentication with Sensors and Contextual Machine Learning
Authentication of smartphone users is important because a lot of sensitive
data is stored in the smartphone and the smartphone is also used to access
various cloud data and services. However, smartphones are easily stolen or
co-opted by an attacker. Beyond the initial login, it is highly desirable to
re-authenticate end-users who are continuing to access security-critical
services and data. Hence, this paper proposes a novel authentication system for
implicit, continuous authentication of the smartphone user based on behavioral
characteristics, by leveraging the sensors already ubiquitously built into
smartphones. We propose novel context-based authentication models to
differentiate the legitimate smartphone owner versus other users. We
systematically show how to achieve high authentication accuracy with different
design alternatives in sensor and feature selection, machine learning
techniques, context detection and multiple devices. Our system can achieve
excellent authentication performance with 98.1% accuracy with negligible system
overhead and less than 2.4% battery consumption.Comment: Published on the IEEE/IFIP International Conference on Dependable
Systems and Networks (DSN) 2017. arXiv admin note: substantial text overlap
with arXiv:1703.0352
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches
state-of-the-art models while training on less than 1% of the raw data
LDP-IDS: Local Differential Privacy for Infinite Data Streams
Streaming data collection is essential to real-time data analytics in various
IoTs and mobile device-based systems, which, however, may expose end users'
privacy. Local differential privacy (LDP) is a promising solution to
privacy-preserving data collection and analysis. However, existing few LDP
studies over streams are either applicable to finite streams only or suffering
from insufficient protection. This paper investigates this problem by proposing
LDP-IDS, a novel -event LDP paradigm to provide practical privacy guarantee
for infinite streams at users end, and adapting the popular budget division
framework in centralized differential privacy (CDP). By constructing a unified
error analysi for LDP, we first develop two adatpive budget division-based LDP
methods for LDP-IDS that can enhance data utility via leveraging the
non-deterministic sparsity in streams. Beyond that, we further propose a novel
population division framework that can not only avoid the high sensitivity of
LDP noise to budget division but also require significantly less communication.
Based on the framework, we also present two adaptive population division
methods for LDP-IDS with theoretical analysis. We conduct extensive experiments
on synthetic and real-world datasets to evaluate the effectiveness and
efficiency pf our proposed frameworks and methods. Experimental results
demonstrate that, despite the effectiveness of the adaptive budget division
methods, the proposed population division framework and methods can further
achieve much higher effectiveness and efficiency.Comment: accepted to SIGMOD'2
- …