27,739 research outputs found
Inferring land use from mobile phone activity
Understanding the spatiotemporal distribution of people within a city is
crucial to many planning applications. Obtaining data to create required
knowledge, currently involves costly survey methods. At the same time
ubiquitous mobile sensors from personal GPS devices to mobile phones are
collecting massive amounts of data on urban systems. The locations,
communications, and activities of millions of people are recorded and stored by
new information technologies. This work utilizes novel dynamic data, generated
by mobile phone users, to measure spatiotemporal changes in population. In the
process, we identify the relationship between land use and dynamic population
over the course of a typical week. A machine learning classification algorithm
is used to identify clusters of locations with similar zoned uses and mobile
phone activity patterns. It is shown that the mobile phone data is capable of
delivering useful information on actual land use that supplements zoning
regulations.Comment: To be presented at ACM UrbComp201
An Automated Social Graph De-anonymization Technique
We present a generic and automated approach to re-identifying nodes in
anonymized social networks which enables novel anonymization techniques to be
quickly evaluated. It uses machine learning (decision forests) to matching
pairs of nodes in disparate anonymized sub-graphs. The technique uncovers
artefacts and invariants of any black-box anonymization scheme from a small set
of examples. Despite a high degree of automation, classification succeeds with
significant true positive rates even when small false positive rates are
sought. Our evaluation uses publicly available real world datasets to study the
performance of our approach against real-world anonymization strategies, namely
the schemes used to protect datasets of The Data for Development (D4D)
Challenge. We show that the technique is effective even when only small numbers
of samples are used for training. Further, since it detects weaknesses in the
black-box anonymization scheme it can re-identify nodes in one social network
when trained on another.Comment: 12 page
How Unique is Your .onion? An Analysis of the Fingerprintability of Tor Onion Services
Recent studies have shown that Tor onion (hidden) service websites are
particularly vulnerable to website fingerprinting attacks due to their limited
number and sensitive nature. In this work we present a multi-level feature
analysis of onion site fingerprintability, considering three state-of-the-art
website fingerprinting methods and 482 Tor onion services, making this the
largest analysis of this kind completed on onion services to date.
Prior studies typically report average performance results for a given
website fingerprinting method or countermeasure. We investigate which sites are
more or less vulnerable to fingerprinting and which features make them so. We
find that there is a high variability in the rate at which sites are classified
(and misclassified) by these attacks, implying that average performance figures
may not be informative of the risks that website fingerprinting attacks pose to
particular sites.
We analyze the features exploited by the different website fingerprinting
methods and discuss what makes onion service sites more or less easily
identifiable, both in terms of their traffic traces as well as their webpage
design. We study misclassifications to understand how onion service sites can
be redesigned to be less vulnerable to website fingerprinting attacks. Our
results also inform the design of website fingerprinting countermeasures and
their evaluation considering disparate impact across sites.Comment: Accepted by ACM CCS 201
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performances, and data
scientists spend considerable amount of time on data cleaning before model
training. However, to date, there does not exist a rigorous study on how
exactly cleaning affects ML -- ML community usually focuses on developing ML
algorithms that are robust to some particular noise types of certain
distributions, while database (DB) community has been mostly studying the
problem of data cleaning alone without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.Comment: published in ICDE 202
A cognitive based Intrusion detection system
Intrusion detection is one of the primary mechanisms to provide computer
networks with security. With an increase in attacks and growing dependence on
various fields such as medicine, commercial, and engineering to give services
over a network, securing networks have become a significant issue. The purpose
of Intrusion Detection Systems (IDS) is to make models which can recognize
regular communications from abnormal ones and take necessary actions. Among
different methods in this field, Artificial Neural Networks (ANNs) have been
widely used. However, ANN-based IDS, has two main disadvantages: 1- Low
detection precision. 2- Weak detection stability. To overcome these issues,
this paper proposes a new approach based on Deep Neural Network (DNN. The
general mechanism of our model is as follows: first, some of the data in
dataset is properly ranked, afterwards, dataset is normalized with Min-Max
normalizer to fit in the limited domain. Then dimensionality reduction is
applied to decrease the amount of both useless dimensions and computational
cost. After the preprocessing part, Mean-Shift clustering algorithm is the used
to create different subsets and reduce the complexity of dataset. Based on each
subset, two models are trained by Support Vector Machine (SVM) and deep
learning method. Between two models for each subset, the model with a higher
accuracy is chosen. This idea is inspired from philosophy of divide and
conquer. Hence, the DNN can learn each subset quickly and robustly. Finally, to
reduce the error from the previous step, an ANN model is trained to gain and
use the results in order to be able to predict the attacks. We can reach to
95.4 percent of accuracy. Possessing a simple structure and less number of
tunable parameters, the proposed model still has a grand generalization with a
high level of accuracy in compared to other methods such as SVM, Bayes network,
and STL.Comment: 18 pages, 6 figure
In-depth comparative evaluation of supervised machine learning approaches for detection of cybersecurity threats
This paper describes the process and results of analyzing CICIDS2017, a modern, labeled data set for testing intrusion detection systems. The data set is divided into several days, each pertaining to different attack classes (Dos, DDoS, infiltration, botnet, etc.). A pipeline has been created that includes nine supervised learning algorithms. The goal was binary classification of benign versus attack traffic. Cross-validated parameter optimization, using a voting mechanism that includes five classification metrics, was employed to select optimal parameters. These results were interpreted to discover whether certain parameter choices were dominant for most (or all) of the attack classes. Ultimately, every algorithm was retested with optimal parameters to obtain the final classification scores. During the review of these results, execution time, both on consumerand corporate-grade equipment, was taken into account as an additional requirement. The work detailed in this paper establishes a novel supervised machine learning performance baseline for CICIDS2017
- …