4 research outputs found
Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M
The Dynamic Distributed Dimensional Data Model (D4M) library implements
associative arrays in a variety of languages (Python, Julia, and Matlab/Octave)
and provides a lightweight in-memory database implementation of hypersparse
arrays that are ideal for analyzing many types of network data. D4M relies on
associative arrays, which combine properties of spreadsheets, databases,
matrices, graphs, and networks, while providing rigorous mathematical
guarantees, such as linearity. Streaming updates of D4M associative arrays put
enormous pressure on the memory hierarchy. This work describes the design and
performance optimization of an implementation of hierarchical associative
arrays that reduces memory pressure and dramatically increases the update rate
into an associative array. The parameters of a hierarchical associative array
set the number of entries each level of the hierarchy may hold before an update
is cascaded. These parameters are easily tunable to achieve optimal
performance for a variety of applications. Hierarchical arrays achieve over
40,000 updates per second in a single instance. Scaling to 34,000 instances of
hierarchical D4M associative arrays on 1,100 server nodes on the MIT SuperCloud
achieved a sustained update rate of 1,900,000,000 updates per second. This
capability allows the MIT SuperCloud to analyze extremely large streaming
network data sets.
Comment: 6 pages; 6 figures; accepted to IEEE High Performance Extreme
Computing (HPEC) Conference 2019. arXiv admin note: text overlap with
arXiv:1807.05308, arXiv:1902.0084
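The cascading scheme described in this abstract can be illustrated with a minimal Python sketch. It is not the D4M implementation: the class name `HierarchicalArray`, the `cuts` parameter, and the use of `collections.Counter` as a stand-in for an associative array are all assumptions made for illustration. The idea shown is only that new entries land in a small first level, and a level is added into the next one once its entry count exceeds its cutoff.

```python
from collections import Counter


class HierarchicalArray:
    """Toy hierarchical associative array (hypothetical, not the D4M API).

    Each level is a sparse map from (row, col) keys to numeric values.
    """

    def __init__(self, cuts):
        # cuts[i] is the largest number of stored entries level i may hold
        # before it is cascaded into level i + 1. The last level is unbounded.
        self.cuts = list(cuts)
        self.levels = [Counter() for _ in range(len(self.cuts) + 1)]

    def update(self, row, col, value=1):
        # New data always goes into the smallest (fastest) level.
        self.levels[0][(row, col)] += value
        # Cascade upward while a level holds more entries than its cutoff.
        for i, cut in enumerate(self.cuts):
            if len(self.levels[i]) <= cut:
                break
            self.levels[i + 1].update(self.levels[i])  # element-wise addition
            self.levels[i].clear()

    def total(self):
        # Because addition is linear, the full array is the sum of all levels.
        out = Counter()
        for level in self.levels:
            out.update(level)
        return out


if __name__ == "__main__":
    h = HierarchicalArray(cuts=[2, 4])
    for src, dst in [("a", "b"), ("a", "c"), ("a", "b"), ("d", "e")]:
        h.update(src, dst)
    print(h.total())
```

Most updates touch only the small first level, which is the motivation for tuning the cutoffs to the memory hierarchy. The cutoff values used here are arbitrary.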
Multi-Temporal Analysis and Scaling Relations of 100,000,000,000 Network Packets
Our society has never been more dependent on computer networks. Effective
utilization of networks requires a detailed understanding of the normal
background behaviors of network traffic. Large-scale measurements of networks
are computationally challenging. Building on prior work in interactive
supercomputing and GraphBLAS hypersparse hierarchical traffic matrices, we have
developed an efficient method for computing a wide variety of streaming network
quantities on diverse time scales. Applying these methods to 100,000,000,000
anonymized source-destination pairs collected at a network gateway reveals many
previously unobserved scaling relationships. These observations provide new
insights into normal network background traffic that could be used for anomaly
detection, AI feature engineering, and testing theoretical models of streaming
networks.
Comment: 6 pages, 6 figures, 3 tables, 49 references, accepted to IEEE HPEC
202
Fast Mapping onto Census Blocks
Pandemic measures such as social distancing and contact tracing can be
enhanced by rapidly integrating dynamic location data and demographic data.
Projecting billions of longitude and latitude locations onto hundreds of
thousands of highly irregular demographic census block polygons is
computationally challenging in both research and deployment contexts. This
paper describes two approaches labeled "simple" and "fast". The simple approach
can be implemented in any scripting language (Matlab/Octave, Python, Julia, R)
and is easily integrated and customized to a variety of research goals. This
simple approach uses a novel combination of hierarchy, sparse bounding boxes,
polygon crossing-number, vectorization, and parallel processing to achieve
100,000,000+ projections per second on 100 servers. The simple approach is
compact, does not increase data storage requirements, and is applicable to any
country or region. The fast approach exploits the thread, vector, and memory
optimizations that are possible using a low-level language (C++) and achieves
similar performance on a single server. This paper details these approaches
with the goal of enabling the broader community to quickly integrate location
and demographic data.
Comment: 8 pages, 7 figures, 55 references; accepted to IEEE HPEC 202
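Two of the ingredients named for the "simple" approach, bounding boxes and the polygon crossing-number test, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the paper's code: the function names, the flat dictionary of blocks, and the toy coordinates are hypothetical, and the hierarchy, vectorization, and parallel processing mentioned in the abstract are omitted.

```python
def bounding_box(polygon):
    """Return (xmin, ymin, xmax, ymax) for a list of (x, y) vertices."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def in_bbox(x, y, bbox):
    xmin, ymin, xmax, ymax = bbox
    return xmin <= x <= xmax and ymin <= y <= ymax


def crossing_number_inside(x, y, polygon):
    """Crossing-number test: cast a ray toward +x and count edge crossings.

    The polygon is a list of (x, y) vertices and is implicitly closed.
    An odd number of crossings means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def make_blocks(polygons):
    """Precompute a bounding box for each block: id -> (bbox, polygon)."""
    return {bid: (bounding_box(poly), poly) for bid, poly in polygons.items()}


def locate(x, y, blocks):
    """Return the id of the first block containing (x, y), or None."""
    for block_id, (bbox, polygon) in blocks.items():
        # Cheap bounding-box rejection before the per-edge polygon test.
        if in_bbox(x, y, bbox) and crossing_number_inside(x, y, polygon):
            return block_id
    return None


if __name__ == "__main__":
    blocks = make_blocks({
        "L_block": [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)],
        "square": [(3, 0), (4, 0), (4, 1), (3, 1)],
    })
    print(locate(0.5, 1.5, blocks))
    print(locate(3.5, 0.5, blocks))
```

The bounding-box check is what makes irregular polygons affordable: most candidate blocks are rejected with four comparisons, and the per-edge test runs only for the few whose boxes contain the point.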