Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools
Deep Learning (DL) has seen immense success in the recent past, leading to
state-of-the-art results in various domains such as image recognition and
natural language processing. One of the reasons for this success is the
increasing size of DL models and the availability of vast amounts of training
data. To keep improving the performance of DL, increasing the scalability of
DL systems is necessary. In this survey, we perform a broad and thorough
investigation of challenges, techniques, and tools for scalable DL on
distributed infrastructures. This covers infrastructures for DL, methods for
parallel DL training, multi-tenant resource scheduling, and the management of
training and model data. Further, we analyze and compare 11 current
open-source DL frameworks and tools and investigate which of the techniques
are commonly implemented in practice. Finally, we highlight future trends in
DL systems that deserve further research. Comment: accepted at ACM Computing
Surveys, to appear.
Analytics for the Internet of Things: A Survey
The Internet of Things (IoT) envisions a world-wide, interconnected network
of smart physical entities. These physical entities generate a large amount of
data in operation and as the IoT gains momentum in terms of deployment, the
combined scale of those data seems destined to continue to grow. Increasingly,
applications for the IoT involve analytics. Data analytics is the process of
deriving knowledge from data, generating value, such as actionable insights,
from it. This article reviews work in the IoT and big data analytics from the
perspective of their utility in creating efficient, effective and innovative
applications and services for a wide spectrum of domains. We review the broad
vision for the IoT as it is shaped in various communities, examine the
application of data analytics across IoT domains, provide a categorisation of
analytic approaches and propose a layered taxonomy from IoT data to analytics.
This taxonomy provides us with insights on the appropriateness of analytical
techniques, which in turn shapes a survey of enabling technology and
infrastructure for IoT analytics. Finally, we look at some tradeoffs for
analytics in the IoT that can shape future research.
Malicious Overtones: hunting data theft in the frequency domain with one-class learning
A method for detecting electronic data theft from computer networks is
described, capable of recognizing patterns of remote exfiltration occurring
over days to weeks. Normal traffic flow data, in the form of a host's ingress
and egress bytes over time, is used to train an ensemble of one-class learners.
The detection ensemble is modular, with individual classifiers trained on
different traffic features thought to characterize malicious data transfers. We
select features that model the egress to ingress byte balance over time,
periodicity, short time-scale irregularity, and density of the traffic. The
features are most efficiently modeled in the frequency domain, which has the
added benefit that variable duration flows are transformed to a fixed-size
feature vector, and by sampling the frequency space appropriately,
long-duration flows can be tested. When trained on days' or weeks' worth of
traffic from individual hosts, our ensemble achieves a low false positive rate
(<2%) on a range of different system types. Simulated exfiltration samples with
a variety of different timing and data characteristics were generated and used
to test ensemble performance on different kinds of systems: when trained on a
client workstation's external traffic, the ensemble was generally successful at
detecting exfiltration that is not simultaneously ingress-heavy,
connection-sparse, and of short duration---a combination that is not optimal
for attackers seeking to transfer large amounts of data. Remote exfiltration is
more difficult to detect from egress-heavy systems, like web servers, with
normal traffic exhibiting timing characteristics similar to a wide range of
exfiltration types. Comment: 34 pages, 15 figures. Version submitted to ACM TOP
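The frequency-domain trick described above can be sketched in a few lines. This is a toy stand-in, not the paper's ensemble: the binned-spectrum feature, the Poisson traffic model, and the z-score detector below are illustrative assumptions.

```python
import numpy as np

def frequency_features(egress, ingress, n_bins=16):
    """Fixed-size frequency-domain summary of a traffic series.

    The FFT turns a variable-duration byte-count series into a
    spectrum; coarse binning then yields a fixed-length vector,
    so flows of any duration become comparable.
    """
    balance = np.asarray(egress, float) - np.asarray(ingress, float)
    spectrum = np.abs(np.fft.rfft(balance))
    return np.array([b.mean() for b in np.array_split(spectrum, n_bins)])

rng = np.random.default_rng(0)
# "Normal" days: roughly balanced ingress and egress byte counts.
train = np.stack([frequency_features(rng.poisson(100, 288),
                                     rng.poisson(100, 288))
                  for _ in range(60)])
mu, sigma = train.mean(axis=0), train.std(axis=0) + 1e-9

def score(features):
    # Largest per-band z-score; high values mean an unusual spectrum.
    return np.abs((features - mu) / sigma).max()

threshold = np.quantile([score(f) for f in train], 0.95)

# Periodic egress bursts: a crude stand-in for slow remote exfiltration.
t = np.arange(288)
exfil = frequency_features(
    rng.poisson(100, 288) + 500 * (np.sin(2 * np.pi * t / 24) > 0.8),
    rng.poisson(100, 288))
print(score(exfil) > threshold)  # True: the bursty spectrum stands out
```

Note how the egress-heavy bursts shift both the DC component and a periodic band of the spectrum, which is exactly the kind of structure a one-class learner trained on normal spectra can flag.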
A Survey of Network-based Intrusion Detection Data Sets
Labeled data sets are necessary to train and evaluate anomaly-based network
intrusion detection systems. This work provides a focused literature survey of
data sets for network-based intrusion detection and describes the underlying
packet- and flow-based network data in detail. The paper identifies 15
different properties to assess the suitability of individual data sets for
specific evaluation scenarios. These properties cover a wide range of criteria
and are grouped into five categories, such as data volume and recording
environment, to support a structured search. Based on these properties, a
comprehensive overview of existing data sets is given. This overview also
highlights the peculiarities of each data set. Furthermore, this work briefly
touches upon other sources for network-based data such as traffic generators
and traffic repositories. Finally, we discuss our observations and provide some
recommendations for the use and creation of network-based data sets. Comment: submitted manuscript to Computers & Security
Active Learning for Skewed Data Sets
Consider a sequential active learning problem where, at each round, an agent
selects a batch of unlabeled data points, queries their labels and updates a
binary classifier. While there exists a rich body of work on active learning in
this general form, in this paper, we focus on problems with two distinguishing
characteristics: severe class imbalance (skew) and small amounts of initial
training data. Both of these problems occur with surprising frequency in many
web applications. For instance, detecting offensive or sensitive content in
online communities (pornography, violence, and hate-speech) is receiving
enormous attention from industry as well as research communities. Such problems
have both of the characteristics we describe -- a vast majority of content is
not offensive, so the number of positive examples for such content is orders of
magnitude smaller than the number of negative examples. Furthermore, there is usually
only a small amount of initial training data available when building
machine-learned models to solve such problems. To address both these issues, we
propose a hybrid active learning algorithm (HAL) that balances exploiting the
knowledge available through the currently labeled training examples with
exploring the large amount of unlabeled data available. Through simulation
results, we show that HAL makes significantly better choices for what points to
label when compared to strong baselines like margin sampling. Classifiers
trained on the examples selected for labeling by HAL easily outperform the
baselines on target metrics (like area under the precision-recall curve) given
the same budget for labeling examples. We believe HAL offers a simple,
intuitive, and computationally tractable way to structure active learning for a
wide range of machine learning applications.
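The explore/exploit balance that such a hybrid strategy strikes can be illustrated with a minimal batch-selection sketch. The function name, the 50/50 split, and the margin criterion are assumptions for illustration; the paper's actual algorithm may differ in detail.

```python
import numpy as np

def hal_select(scores, batch_size, explore_frac=0.5, rng=None):
    """Pick a labeling batch that mixes exploitation with exploration.

    `scores` are a classifier's predicted probabilities for the
    unlabeled pool. Exploit: points nearest the decision boundary
    (margin sampling). Explore: uniformly random points, which matters
    under heavy skew, where a boundary estimated from few positives
    is unreliable.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_explore = int(batch_size * explore_frac)
    n_exploit = batch_size - n_explore
    margin = np.abs(np.asarray(scores) - 0.5)
    exploit = np.argsort(margin)[:n_exploit]          # most uncertain points
    rest = np.setdiff1d(np.arange(len(scores)), exploit)
    explore = rng.choice(rest, size=n_explore, replace=False)
    return np.concatenate([exploit, explore])

pool_scores = np.random.default_rng(1).random(1000)
batch = hal_select(pool_scores, batch_size=20)
print(len(batch), len(set(batch.tolist())))  # 20 20: distinct pool indices
```

With `explore_frac=0` this degenerates to pure margin sampling, the baseline the abstract compares against.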
Explore what-if scenarios with SONG: Social Network Write Generator
Online Social Networks (OSNs) have witnessed tremendous growth in the last few
years, becoming a platform for online users to communicate, exchange content
and even find employment. The emergence of OSNs has attracted researchers and
analysts, and much data-driven research has been conducted. However, collecting
data-sets is non-trivial, and it is often difficult for data-sets to be shared
between researchers. The main contribution of this paper is a framework
called SONG (Social Network Write Generator) to generate synthetic traces of
write activity on OSNs. We build our framework on a characterization study of
a large Twitter data-set, identifying the important factors that need to be
accounted for. We show how one can generate traces with SONG and validate it
by comparing against real data. We discuss how one can extend and use SONG to
explore different `what-if' scenarios. We build a Twitter clone using 16
machines and Cassandra, and then show by example the usefulness of SONG by
stress-testing our implementation. We hope that researchers and analysts will
use SONG in their own work that involves write activity. Comment: 11 pages.
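As a loose illustration of this kind of trace generator (the distributions, rates, and sizes below are hypothetical modeling choices, not parameters taken from the paper's Twitter characterization), one can draw per-user write rates from a heavy-tailed distribution and emit timestamped writes as per-user Poisson processes:

```python
import numpy as np

rng = np.random.default_rng(5)
n_users, horizon = 500, 3600.0  # users and one simulated hour (made-up sizes)

# Heavy-tailed per-user activity rates (writes/hour): a few very active
# users, many quiet ones -- an illustrative modeling choice.
rates = rng.pareto(1.5, n_users) + 0.1

trace = []  # (timestamp_seconds, user_id) write events
for user, rate in enumerate(rates):
    t = rng.exponential(3600.0 / rate)  # Poisson process inter-arrivals
    while t < horizon:
        trace.append((t, user))
        t += rng.exponential(3600.0 / rate)
trace.sort()  # merge all users into one chronological write trace

print(len(trace) > 0)  # True: the synthetic trace is non-empty
```

A generator like this can be replayed against a storage backend (e.g. the Cassandra-backed clone mentioned above) and rescaled to explore `what-if' load scenarios.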
Distributed traffic light control at uncoupled intersections with real-world topology by deep reinforcement learning
This work examines the implications of uncoupled intersections with local
real-world topology and sensor setup on traffic light control approaches.
Control approaches are evaluated with respect to traffic flow, fuel
consumption, and noise emission at intersections.
The real-world road network of Friedrichshafen is depicted and preprocessed,
and the present traffic-light-controlled intersections are modeled with
respect to state space and action space.
Different strategies, comprising fixed-time, gap-based, and time-based control
approaches as well as our deep reinforcement learning based control approach,
are implemented and assessed. Our novel DRL approach allows for modeling the
TLC action space with respect to phase selection as well as selection of
transition timings. We find that real-world topologies, and thus irregularly
arranged intersections, have an influence on the performance of traffic light
control approaches. This is observed even within the same intersection types
(n-arm, m-phases). Moreover, we show that these influences can be dealt with
efficiently by our deep reinforcement learning based control approach.
Comment: 32nd Conference on Neural Information Processing Systems, within the
Workshop on Machine Learning for Intelligent Transportation Systems
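The reinforcement-learning formulation can be sketched on a toy intersection. Everything below is invented for the sketch: the queue-bucket state encoding, the two-phase action space, and the reward are illustrative, and the paper uses deep RL on real-world topologies rather than this tabular setup.

```python
import numpy as np

rng = np.random.default_rng(3)
n_levels, n_phases = 6, 2        # queue-length buckets per arm, green phases
Q = np.zeros((n_levels * n_levels, n_phases))
alpha, gamma, eps = 0.1, 0.95, 0.1

def encode(q0, q1):
    return q0 * n_levels + q1    # discretized two-arm state

def env_step(q0, q1, action):
    # Toy dynamics: the green phase drains its queue, new cars arrive;
    # reward penalizes total queue length (a proxy for flow/emissions).
    if action == 0:
        q0 = max(q0 - 3, 0)
    else:
        q1 = max(q1 - 3, 0)
    q0 = min(q0 + int(rng.integers(0, 3)), n_levels - 1)
    q1 = min(q1 + int(rng.integers(0, 3)), n_levels - 1)
    return q0, q1, -(q0 + q1)

q0 = q1 = 0
for _ in range(20000):           # epsilon-greedy tabular Q-learning
    s = encode(q0, q1)
    a = int(rng.integers(n_phases)) if rng.random() < eps else int(np.argmax(Q[s]))
    q0, q1, r = env_step(q0, q1, a)
    Q[s, a] += alpha * (r + gamma * Q[encode(q0, q1)].max() - Q[s, a])

# The learned greedy policy should tend to serve the longer queue.
print(int(np.argmax(Q[encode(n_levels - 1, 0)])))
```

A DRL variant replaces the table `Q` with a neural network over richer state (per-lane sensor readings) and a larger action space covering phase selection and transition timings.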
Precipitation Nowcasting with Satellite Imagery
Precipitation nowcasting is a short-range forecast of rain/snow (up to 2
hours), often displayed on top of the geographical map by the weather service.
Modern precipitation nowcasting algorithms rely on the extrapolation of
observations by ground-based radars via optical flow techniques or neural
network models. Because it depends on these radars, typical nowcasting is
limited to the regions around their locations. We have developed a method for
precipitation nowcasting based on geostationary satellite imagery and
incorporated the resulting data into the Yandex.Weather precipitation map
(including an alerting service with push notifications for products in the
Yandex ecosystem), thus expanding its coverage and paving the way to a truly
global nowcasting service. Comment: Final KDD 2019 version.
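The extrapolation baseline mentioned above can be sketched with phase correlation as a minimal stand-in for optical flow. This is a toy on synthetic fields, not the production Yandex.Weather pipeline, and it estimates only a single global motion vector.

```python
import numpy as np

def dominant_motion(prev, curr):
    """Phase-correlation estimate of the dominant shift between two
    consecutive rain fields (a crude stand-in for optical flow)."""
    corr = np.fft.ifft2(np.fft.fft2(curr) * np.conj(np.fft.fft2(prev))).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return int(dy), int(dx)

def nowcast(curr, motion, steps=1):
    """Extrapolate the latest field by repeating the estimated motion."""
    dy, dx = motion
    return np.roll(curr, (steps * dy, steps * dx), axis=(0, 1))

rng = np.random.default_rng(4)
prev = rng.random((64, 64))                # rain intensity at t-1
curr = np.roll(prev, (3, 5), axis=(0, 1))  # same field, advected, at t
motion = dominant_motion(prev, curr)
forecast = nowcast(curr, motion)           # predicted field at t+1
print(motion)  # (3, 5): the injected shift is recovered
```

Real nowcasting systems estimate a dense motion field and add a learned component; the point here is only the extrapolation principle.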
Evaluating recommender systems for AI-driven biomedical informatics
Motivation: Many researchers with domain expertise are unable to easily apply
machine learning to their bioinformatics data due to a lack of machine learning
and/or coding expertise. Methods that have been proposed thus far to automate
machine learning mostly require programming experience as well as expert
knowledge to tune and apply the algorithms correctly. Here, we study a method
of automating biomedical data science using a web-based platform that uses AI
to recommend model choices and conduct experiments. We have two goals in mind:
first, to make it easy to construct sophisticated models of biomedical
processes; and second, to provide a fully automated AI agent that can choose
and conduct promising experiments for the user, based on the user's experiments
as well as prior knowledge. To validate this framework, we experiment with
hundreds of classification problems, comparing our method to state-of-the-art automated
approaches. Finally, we use this tool to develop predictive models of septic
shock in critical care patients.
Results: We find that matrix factorization-based recommendation systems
outperform meta-learning methods for automating machine learning. This result
mirrors the results of earlier recommender systems research in other domains.
The proposed AI is competitive with state-of-the-art automated machine learning
methods in terms of choosing optimal algorithm configurations for datasets. In
our application to prediction of septic shock, the AI-driven analysis produces
a competent machine learning model (AUROC 0.85 +/- 0.02) that performs on par
with state-of-the-art deep learning results for this task, with much less
computational effort. Comment: 17 pages, 8 figures. This version fixes link to pennai in abstract.
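The matrix-factorization recommendation idea can be sketched on a synthetic dataset-by-algorithm score matrix. The sizes, hyperparameters, and plain-SGD recipe below are illustrative assumptions, not the paper's system.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical performance matrix: rows = datasets, columns = algorithm
# configurations, entries = validation scores; only some are observed.
R = rng.random((20, 8))
observed = rng.random(R.shape) < 0.6
k, lr, reg = 3, 0.05, 0.01
U = 0.1 * rng.standard_normal((R.shape[0], k))   # latent dataset factors
V = 0.1 * rng.standard_normal((R.shape[1], k))   # latent algorithm factors

# Plain SGD matrix factorization on the observed entries only.
rows, cols = np.where(observed)
for _ in range(200):
    for i, j in zip(rows, cols):
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])

rmse = np.sqrt(np.mean((R[observed] - (U @ V.T)[observed]) ** 2))
print(rmse < 0.4)  # True: the factors reconstruct the observed scores

# Recommend the untried configuration with the best predicted score
# for the dataset with the fewest completed experiments.
dataset = int(np.argmin(observed.sum(axis=1)))
untried = np.where(~observed[dataset])[0]
best = int(untried[np.argmax((U[dataset] @ V.T)[untried])])
```

The appeal over meta-learning is that the recommender needs no hand-built dataset meta-features: collaborative structure in the partially observed score matrix is enough to rank untried configurations.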
A Survey of Intrusion Detection Systems Leveraging Host Data
This survey focuses on intrusion detection systems (IDS) that leverage
host-based data sources for detecting attacks on enterprise networks. The
host-based IDS (HIDS) literature is organized by the input data source,
presenting targeted sub-surveys of HIDS research leveraging system logs, audit
data, Windows Registry, file systems, and program analysis. While system calls
are generally included in audit data, several publicly available system call
datasets have spawned a flurry of IDS research on this topic, which merits a
separate section. Similarly, a section surveying algorithmic developments that
are applicable to HIDS but tested on network data sets is included, as this is
a large and growing area of applicable literature. To accommodate current
researchers, a supplementary section giving descriptions of publicly available
datasets is included, outlining their characteristics and shortcomings when
used for IDS evaluation. Related surveys are organized and described. All
sections are accompanied by tables concisely organizing the literature and
datasets discussed. Finally, challenges, trends, and broader observations are
discussed throughout the survey and in the conclusion, along with future
directions of IDS research.