Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools
Deep Learning (DL) has seen immense success in the recent past, leading to
state-of-the-art results in various domains such as image recognition and
natural language processing. One of the reasons for this success is the
increasing size of DL models and the availability of vast amounts of training
data. To keep improving the performance of DL, increasing the scalability of
DL systems is necessary. In this survey, we perform a broad and thorough
investigation of challenges, techniques, and tools for scalable DL on
distributed infrastructures. This covers infrastructures for DL, methods for
parallel DL training, multi-tenant resource scheduling, and the management of
training and model data. Further, we analyze and compare 11 current
open-source DL frameworks and tools and investigate which of the techniques
are commonly implemented in practice. Finally, we highlight future trends in
DL systems that deserve further research. Comment: accepted at ACM Computing
Surveys, to appear.
Analytics for the Internet of Things: A Survey
The Internet of Things (IoT) envisions a world-wide, interconnected network
of smart physical entities. These physical entities generate a large amount of
data in operation and as the IoT gains momentum in terms of deployment, the
combined scale of those data seems destined to continue to grow. Increasingly,
applications for the IoT involve analytics. Data analytics is the process of
deriving knowledge from data, generating value, such as actionable insights,
from it. This article reviews work in the IoT and big data analytics from the
perspective of their utility in creating efficient, effective and innovative
applications and services for a wide spectrum of domains. We review the broad
vision for the IoT as it is shaped in various communities, examine the
application of data analytics across IoT domains, provide a categorisation of
analytic approaches and propose a layered taxonomy from IoT data to analytics.
This taxonomy provides us with insights on the appropriateness of analytical
techniques, which in turn shapes a survey of enabling technology and
infrastructure for IoT analytics. Finally, we look at some tradeoffs for
analytics in the IoT that can shape future research.
Malicious Overtones: hunting data theft in the frequency domain with one-class learning
A method for detecting electronic data theft from computer networks is
described, capable of recognizing patterns of remote exfiltration occurring
over days to weeks. Normal traffic flow data, in the form of a host's ingress
and egress bytes over time, is used to train an ensemble of one-class learners.
The detection ensemble is modular, with individual classifiers trained on
different traffic features thought to characterize malicious data transfers. We
select features that model the egress to ingress byte balance over time,
periodicity, short time-scale irregularity, and density of the traffic. The
features are most efficiently modeled in the frequency domain, which has the
added benefit that variable duration flows are transformed to a fixed-size
feature vector, and by sampling the frequency space appropriately,
long-duration flows can be tested. When trained on days' or weeks' worth of
traffic from individual hosts, our ensemble achieves a low false positive rate
(<2%) on a range of different system types. Simulated exfiltration samples with
a variety of different timing and data characteristics were generated and used
to test ensemble performance on different kinds of systems: when trained on a
client workstation's external traffic, the ensemble was generally successful at
detecting exfiltration that is not simultaneously ingress-heavy,
connection-sparse, and of short duration---a combination that is not optimal
for attackers seeking to transfer large amounts of data. Remote exfiltration is
more difficult to detect from egress-heavy systems, like web servers, with
normal traffic exhibiting timing characteristics similar to a wide range of
exfiltration types. Comment: 34 pages, 15 figures. Version submitted to ACM TOP
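The frequency-domain trick described above can be sketched in a few lines. This is a toy stand-in, not the paper's ensemble: the binned-spectrum feature, the Poisson traffic model, and the z-score detector below are illustrative assumptions.

```python
import numpy as np

def frequency_features(egress, ingress, n_bins=16):
    """Fixed-size frequency-domain summary of a traffic series.

    The FFT turns a variable-duration byte-count series into a
    spectrum; coarse binning then yields a fixed-length vector,
    so flows of any duration become comparable.
    """
    balance = np.asarray(egress, float) - np.asarray(ingress, float)
    spectrum = np.abs(np.fft.rfft(balance))
    return np.array([b.mean() for b in np.array_split(spectrum, n_bins)])

rng = np.random.default_rng(0)
# "Normal" days: roughly balanced ingress and egress byte counts.
train = np.stack([frequency_features(rng.poisson(100, 288),
                                     rng.poisson(100, 288))
                  for _ in range(60)])
mu, sigma = train.mean(axis=0), train.std(axis=0) + 1e-9

def score(features):
    # Largest per-band z-score; high values mean an unusual spectrum.
    return np.abs((features - mu) / sigma).max()

threshold = np.quantile([score(f) for f in train], 0.95)

# Periodic egress bursts: a crude stand-in for slow remote exfiltration.
t = np.arange(288)
exfil = frequency_features(
    rng.poisson(100, 288) + 500 * (np.sin(2 * np.pi * t / 24) > 0.8),
    rng.poisson(100, 288))
print(score(exfil) > threshold)  # True: the bursty spectrum stands out
```

Note how the egress-heavy bursts shift both the DC component and a periodic band of the spectrum, which is exactly the kind of structure a one-class learner trained on normal spectra can flag.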
A Survey of Network-based Intrusion Detection Data Sets
Labeled data sets are necessary to train and evaluate anomaly-based network
intrusion detection systems. This work provides a focused literature survey of
data sets for network-based intrusion detection and describes the underlying
packet- and flow-based network data in detail. The paper identifies 15
different properties to assess the suitability of individual data sets for
specific evaluation scenarios. These properties cover a wide range of criteria
and are grouped into five categories, such as data volume and recording
environment, to support a structured search. Based on these properties, a
comprehensive overview of existing data sets is given. This overview also
highlights the peculiarities of each data set. Furthermore, this work briefly
touches upon other sources for network-based data such as traffic generators
and traffic repositories. Finally, we discuss our observations and provide some
recommendations for the use and creation of network-based data sets. Comment: submitted manuscript to Computers & Security
Active Learning for Skewed Data Sets
Consider a sequential active learning problem where, at each round, an agent
selects a batch of unlabeled data points, queries their labels and updates a
binary classifier. While there exists a rich body of work on active learning in
this general form, in this paper, we focus on problems with two distinguishing
characteristics: severe class imbalance (skew) and small amounts of initial
training data. Both of these problems occur with surprising frequency in many
web applications. For instance, detecting offensive or sensitive content in
online communities (pornography, violence, and hate-speech) is receiving
enormous attention from industry as well as research communities. Such problems
have both of the characteristics we describe -- a vast majority of content is
not offensive, so the number of positive examples for such content is orders of
magnitude smaller than the number of negative examples. Furthermore, there is usually
only a small amount of initial training data available when building
machine-learned models to solve such problems. To address both these issues, we
propose a hybrid active learning algorithm (HAL) that balances exploiting the
knowledge available through the currently labeled training examples with
exploring the large amount of unlabeled data available. Through simulation
results, we show that HAL makes significantly better choices for what points to
label when compared to strong baselines like margin sampling. Classifiers
trained on the examples selected for labeling by HAL easily outperform the
baselines on target metrics (like area under the precision-recall curve) given
the same budget for labeling examples. We believe HAL offers a simple,
intuitive, and computationally tractable way to structure active learning for a
wide range of machine learning applications.
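The explore/exploit balance that such a hybrid strategy strikes can be illustrated with a minimal batch-selection sketch. The function name, the 50/50 split, and the margin criterion are assumptions for illustration; the paper's actual algorithm may differ in detail.

```python
import numpy as np

def hal_select(scores, batch_size, explore_frac=0.5, rng=None):
    """Pick a labeling batch that mixes exploitation with exploration.

    `scores` are a classifier's predicted probabilities for the
    unlabeled pool. Exploit: points nearest the decision boundary
    (margin sampling). Explore: uniformly random points, which matters
    under heavy skew, where a boundary estimated from few positives
    is unreliable.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_explore = int(batch_size * explore_frac)
    n_exploit = batch_size - n_explore
    margin = np.abs(np.asarray(scores) - 0.5)
    exploit = np.argsort(margin)[:n_exploit]          # most uncertain points
    rest = np.setdiff1d(np.arange(len(scores)), exploit)
    explore = rng.choice(rest, size=n_explore, replace=False)
    return np.concatenate([exploit, explore])

pool_scores = np.random.default_rng(1).random(1000)
batch = hal_select(pool_scores, batch_size=20)
print(len(batch), len(set(batch.tolist())))  # 20 20: distinct pool indices
```

With `explore_frac=0` this degenerates to pure margin sampling, the baseline the abstract compares against.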
Explore what-if scenarios with SONG: Social Network Write Generator
Online Social Networks (OSNs) have witnessed tremendous growth in the last few
years, becoming a platform for online users to communicate, exchange content
and even find employment. The emergence of OSNs has attracted researchers and
analysts, and much data-driven research has been conducted. However, collecting
data-sets is non-trivial, and it is often difficult for data-sets to be shared
between researchers. The main contribution of this paper is a framework
called SONG (Social Network Write Generator) to generate synthetic traces of
write activity on OSNs. We build our framework on a characterization study of
a large Twitter data-set, identifying the important factors that need to be
accounted for. We show how one can generate traces with SONG and validate it
by comparing against real data. We discuss how one can extend and use SONG to
explore different `what-if' scenarios. We build a Twitter clone using 16
machines and Cassandra, and then show by example the usefulness of SONG by
stress-testing our implementation. We hope that researchers and analysts will
use SONG in their own work that involves write activity. Comment: 11 pages.
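As a loose illustration of this kind of trace generator (the distributions, rates, and sizes below are hypothetical modeling choices, not parameters taken from the paper's Twitter characterization), one can draw per-user write rates from a heavy-tailed distribution and emit timestamped writes as per-user Poisson processes:

```python
import numpy as np

rng = np.random.default_rng(5)
n_users, horizon = 500, 3600.0  # users and one simulated hour (made-up sizes)

# Heavy-tailed per-user activity rates (writes/hour): a few very active
# users, many quiet ones -- an illustrative modeling choice.
rates = rng.pareto(1.5, n_users) + 0.1

trace = []  # (timestamp_seconds, user_id) write events
for user, rate in enumerate(rates):
    t = rng.exponential(3600.0 / rate)  # Poisson process inter-arrivals
    while t < horizon:
        trace.append((t, user))
        t += rng.exponential(3600.0 / rate)
trace.sort()  # merge all users into one chronological write trace

print(len(trace) > 0)  # True: the synthetic trace is non-empty
```

A generator like this can be replayed against a storage backend (e.g. the Cassandra-backed clone mentioned above) and rescaled to explore `what-if' load scenarios.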
Distributed traffic light control at uncoupled intersections with real-world topology by deep reinforcement learning
This work examines the implications of uncoupled intersections with local
real-world topology and sensor setup on traffic light control approaches.
Control approaches are evaluated with respect to traffic flow, fuel
consumption, and noise emission at intersections.
The real-world road network of Friedrichshafen is depicted and preprocessed,
and the present traffic-light-controlled intersections are modeled with
respect to state space and action space.
Different strategies, comprising fixed-time, gap-based, and time-based control
approaches as well as our deep reinforcement learning based control approach,
are implemented and assessed. Our novel DRL approach allows for modeling the
TLC action space with respect to phase selection as well as selection of
transition timings. We find that real-world topologies, and thus irregularly
arranged intersections, have an influence on the performance of traffic light
control approaches. This is observed even within the same intersection types
(n-arm, m-phases). Moreover, we show that these influences can be dealt with
efficiently by our deep reinforcement learning based control approach.
Comment: 32nd Conference on Neural Information Processing Systems, within the
Workshop on Machine Learning for Intelligent Transportation Systems
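The reinforcement-learning formulation can be sketched on a toy intersection. Everything below is invented for the sketch: the queue-bucket state encoding, the two-phase action space, and the reward are illustrative, and the paper uses deep RL on real-world topologies rather than this tabular setup.

```python
import numpy as np

rng = np.random.default_rng(3)
n_levels, n_phases = 6, 2        # queue-length buckets per arm, green phases
Q = np.zeros((n_levels * n_levels, n_phases))
alpha, gamma, eps = 0.1, 0.95, 0.1

def encode(q0, q1):
    return q0 * n_levels + q1    # discretized two-arm state

def env_step(q0, q1, action):
    # Toy dynamics: the green phase drains its queue, new cars arrive;
    # reward penalizes total queue length (a proxy for flow/emissions).
    if action == 0:
        q0 = max(q0 - 3, 0)
    else:
        q1 = max(q1 - 3, 0)
    q0 = min(q0 + int(rng.integers(0, 3)), n_levels - 1)
    q1 = min(q1 + int(rng.integers(0, 3)), n_levels - 1)
    return q0, q1, -(q0 + q1)

q0 = q1 = 0
for _ in range(20000):           # epsilon-greedy tabular Q-learning
    s = encode(q0, q1)
    a = int(rng.integers(n_phases)) if rng.random() < eps else int(np.argmax(Q[s]))
    q0, q1, r = env_step(q0, q1, a)
    Q[s, a] += alpha * (r + gamma * Q[encode(q0, q1)].max() - Q[s, a])

# The learned greedy policy should tend to serve the longer queue.
print(int(np.argmax(Q[encode(n_levels - 1, 0)])))
```

A DRL variant replaces the table `Q` with a neural network over richer state (per-lane sensor readings) and a larger action space covering phase selection and transition timings.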
Precipitation Nowcasting with Satellite Imagery
Precipitation nowcasting is a short-range forecast of rain/snow (up to 2
hours), often displayed on top of the geographical map by the weather service.
Modern precipitation nowcasting algorithms rely on the extrapolation of
observations by ground-based radars via optical flow techniques or neural
network models. Because it depends on these radars, typical nowcasting is
limited to the regions around their locations. We have developed a method for
precipitation nowcasting based on geostationary satellite imagery and
incorporated the resulting data into the Yandex.Weather precipitation map
(including an alerting service with push notifications for products in the
Yandex ecosystem), thus expanding its coverage and paving the way to a truly
global nowcasting service. Comment: Final KDD 2019 version.
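The extrapolation baseline mentioned above can be sketched with phase correlation as a minimal stand-in for optical flow. This is a toy on synthetic fields, not the production Yandex.Weather pipeline, and it estimates only a single global motion vector.

```python
import numpy as np

def dominant_motion(prev, curr):
    """Phase-correlation estimate of the dominant shift between two
    consecutive rain fields (a crude stand-in for optical flow)."""
    corr = np.fft.ifft2(np.fft.fft2(curr) * np.conj(np.fft.fft2(prev))).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return int(dy), int(dx)

def nowcast(curr, motion, steps=1):
    """Extrapolate the latest field by repeating the estimated motion."""
    dy, dx = motion
    return np.roll(curr, (steps * dy, steps * dx), axis=(0, 1))

rng = np.random.default_rng(4)
prev = rng.random((64, 64))                # rain intensity at t-1
curr = np.roll(prev, (3, 5), axis=(0, 1))  # same field, advected, at t
motion = dominant_motion(prev, curr)
forecast = nowcast(curr, motion)           # predicted field at t+1
print(motion)  # (3, 5): the injected shift is recovered
```

Real nowcasting systems estimate a dense motion field and add a learned component; the point here is only the extrapolation principle.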
Evaluating recommender systems for AI-driven biomedical informatics
Motivation: Many researchers with domain expertise are unable to easily apply
machine learning to their bioinformatics data due to a lack of machine learning
and/or coding expertise. Methods that have been proposed thus far to automate
machine learning mostly require programming experience as well as expert
knowledge to tune and apply the algorithms correctly. Here, we study a method
of automating biomedical data science using a web-based platform that uses AI
to recommend model choices and conduct experiments. We have two goals in mind:
first, to make it easy to construct sophisticated models of biomedical
processes; and second, to provide a fully automated AI agent that can choose
and conduct promising experiments for the user, based on the user's experiments
as well as prior knowledge. To validate this framework, we experiment with
hundreds of classification problems, comparing our method to state-of-the-art automated
approaches. Finally, we use this tool to develop predictive models of septic
shock in critical care patients.
Results: We find that matrix factorization-based recommendation systems
outperform meta-learning methods for automating machine learning. This result
mirrors the results of earlier recommender systems research in other domains.
The proposed AI is competitive with state-of-the-art automated machine learning
methods in terms of choosing optimal algorithm configurations for datasets. In
our application to prediction of septic shock, the AI-driven analysis produces
a competent machine learning model (AUROC 0.85 +/- 0.02) that performs on par
with state-of-the-art deep learning results for this task, with much less
computational effort. Comment: 17 pages, 8 figures. This version fixes link to pennai in abstract.
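The matrix-factorization recommendation idea can be sketched on a synthetic dataset-by-algorithm score matrix. The sizes, hyperparameters, and plain-SGD recipe below are illustrative assumptions, not the paper's system.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical performance matrix: rows = datasets, columns = algorithm
# configurations, entries = validation scores; only some are observed.
R = rng.random((20, 8))
observed = rng.random(R.shape) < 0.6
k, lr, reg = 3, 0.05, 0.01
U = 0.1 * rng.standard_normal((R.shape[0], k))   # latent dataset factors
V = 0.1 * rng.standard_normal((R.shape[1], k))   # latent algorithm factors

# Plain SGD matrix factorization on the observed entries only.
rows, cols = np.where(observed)
for _ in range(200):
    for i, j in zip(rows, cols):
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])

rmse = np.sqrt(np.mean((R[observed] - (U @ V.T)[observed]) ** 2))
print(rmse < 0.4)  # True: the factors reconstruct the observed scores

# Recommend the untried configuration with the best predicted score
# for the dataset with the fewest completed experiments.
dataset = int(np.argmin(observed.sum(axis=1)))
untried = np.where(~observed[dataset])[0]
best = int(untried[np.argmax((U[dataset] @ V.T)[untried])])
```

The appeal over meta-learning is that the recommender needs no hand-built dataset meta-features: collaborative structure in the partially observed score matrix is enough to rank untried configurations.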
A Survey of Intrusion Detection Systems Leveraging Host Data
This survey focuses on intrusion detection systems (IDS) that leverage
host-based data sources for detecting attacks on enterprise networks. The
host-based IDS (HIDS) literature is organized by the input data source,
presenting targeted sub-surveys of HIDS research leveraging system logs, audit
data, Windows Registry, file systems, and program analysis. While system calls
are generally included in audit data, several publicly available system call
datasets have spawned a flurry of IDS research on this topic, which merits a
separate section. Similarly, a section surveying algorithmic developments that
are applicable to HIDS but tested on network data sets is included, as this is
a large and growing area of applicable literature. To accommodate current
researchers, a supplementary section giving descriptions of publicly available
datasets is included, outlining their characteristics and shortcomings when
used for IDS evaluation. Related surveys are organized and described. All
sections are accompanied by tables concisely organizing the literature and
datasets discussed. Finally, challenges, trends, and broader observations are
discussed throughout the survey and in the conclusion, along with future
directions of IDS research.