Benchmarking SciDB Data Import on HPC Systems
SciDB is a scalable, computational database management system that uses an
array model for data storage. The array data model of SciDB makes it ideally
suited for storing and managing large amounts of imaging data. SciDB is
designed to support advanced in-database analytics, reducing the need to
extract data for analysis. It is designed to be massively parallel and can
run on commodity hardware in a high performance computing (HPC) environment. In
this paper, we present the performance of SciDB using simulated image data. The
Dynamic Distributed Dimensional Data Model (D4M) software is used to implement
the benchmark on a cluster running the MIT SuperCloud software stack. A peak
performance of 2.2M database inserts per second was achieved on a single node
of this system. We also show that SciDB and the D4M toolbox provide more
efficient ways to access random sub-volumes of massive datasets compared to the
traditional approaches of reading volumetric data from individual files. This
work describes the D4M and SciDB tools we developed and presents the initial
performance results. This performance was achieved by using parallel inserts,
in-database merging of arrays, and supercomputing techniques such as
distributed arrays and single-program-multiple-data programming.
Comment: 5 pages, 4 figures, IEEE High Performance Extreme Computing (HPEC)
2016, best paper finalist
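The insert strategy described above (parallel inserts followed by in-database merging of arrays) can be sketched in miniature. The sketch below is our own illustration in plain Python, not the D4M or SciDB API: each worker accumulates a local batch of inserts, and the batches are merged by summing values at duplicate keys, as associative arrays do.

```python
# Illustrative sketch only (not the D4M/SciDB API): parallel batched inserts
# merged into one sparse store, mimicking "parallel inserts + in-database merge".
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def build_batch(rows):
    """Each worker accumulates its inserts locally into a batch."""
    batch = Counter()
    for (i, j, v) in rows:
        batch[(i, j)] += v
    return batch

def parallel_insert(rows, n_workers=4):
    """Split rows across workers, then merge the per-worker batches,
    summing values that land on the same (row, col) key."""
    chunks = [rows[k::n_workers] for k in range(n_workers)]
    store = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for batch in pool.map(build_batch, chunks):
            store.update(batch)  # Counter.update adds values at shared keys
    return store

store = parallel_insert([(0, 1, 1.0), (2, 3, 2.0), (0, 1, 0.5)])
```

The merge step is associative, so batches can be combined in any order, which is what makes the insert path parallelizable in the first place.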
D4M 3.0: Extended Database and Language Capabilities
The D4M tool was developed to address many of today's data needs. This tool
is used by hundreds of researchers to perform complex analytics on unstructured
data. Over the past few years, the D4M toolbox has evolved to support
connectivity with a variety of new database engines, including SciDB.
D4M-Graphulo provides the ability to do graph analytics in the Apache Accumulo
database. Finally, an implementation using the Julia programming language is
also now available. In this article, we describe some of our latest additions
to the D4M toolbox and our upcoming D4M 3.0 release. We show through
benchmarking and scaling results that we can achieve fast SciDB ingest using
the D4M-SciDB connector, that Graphulo can enable graph algorithms at scales
that would otherwise be memory limited, and that the Julia implementation of
D4M achieves performance comparable to or exceeding that of the existing
MATLAB(R) implementation.
Comment: IEEE HPEC 201
Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M
The Dynamic Distributed Dimensional Data Model (D4M) library implements
associative arrays in a variety of languages (Python, Julia, and Matlab/Octave)
and provides a lightweight in-memory database implementation of hypersparse
arrays that are ideal for analyzing many types of network data. D4M relies on
associative arrays which combine properties of spreadsheets, databases,
matrices, graphs, and networks, while providing rigorous mathematical
guarantees, such as linearity. Streaming updates of D4M associative arrays put
enormous pressure on the memory hierarchy. This work describes the design and
performance optimization of an implementation of hierarchical associative
arrays that reduces memory pressure and dramatically increases the update rate
into an associative array. The parameters of hierarchical associative arrays
control the number of entries allowed in each level of the hierarchy before an
update is cascaded. These parameters are easily tuned to achieve optimal
performance for a variety of applications. Hierarchical arrays achieve over
40,000 updates per second in a single instance. Scaling to 34,000 instances of
hierarchical D4M associative arrays on 1,100 server nodes on the MIT SuperCloud
achieved a sustained update rate of 1,900,000,000 updates per second. This
capability allows the MIT SuperCloud to analyze extremely large streaming
network data sets.
Comment: 6 pages; 6 figures; accepted to IEEE High Performance Extreme
Computing (HPEC) Conference 2019. arXiv admin note: text overlap with
arXiv:1807.05308, arXiv:1902.0084
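The cascading design described above can be sketched with a toy hierarchical associative array. This is our own illustration, not the D4M implementation; the class and parameter names (e.g. `cutoffs`) are hypothetical. Updates land in a small, fast top level, which is merged down into the next level whenever it exceeds its cutoff, so most updates never touch the large bottom array:

```python
# Toy sketch (hypothetical names, not the D4M code): a hierarchical
# associative array cascades a small fast level into larger levels once
# it exceeds a per-level cutoff, reducing pressure on the big array.
class HierarchicalArray:
    def __init__(self, cutoffs=(4, 64)):
        self.cutoffs = cutoffs               # max entries per bounded level
        self.levels = [{} for _ in cutoffs]  # small/fast ... larger/slower
        self.base = {}                       # unbounded bottom level

    def insert(self, key, value):
        top = self.levels[0]
        top[key] = top.get(key, 0) + value   # cheap update into small level
        self._cascade()

    def _cascade(self):
        for i, cutoff in enumerate(self.cutoffs):
            if len(self.levels[i]) > cutoff:
                nxt = self.levels[i + 1] if i + 1 < len(self.levels) else self.base
                for k, v in self.levels[i].items():  # merge whole level down
                    nxt[k] = nxt.get(k, 0) + v
                self.levels[i] = {}

    def get(self, key):
        total = self.base.get(key, 0)
        for level in self.levels:            # a key may exist at several levels
            total += level.get(key, 0)
        return total

h = HierarchicalArray()
for k in range(10):
    h.insert(("r", k), 1)
h.insert(("r", 0), 1)   # value for ("r", 0) is now split across levels
```

Tuning the cutoffs trades update latency against the cost of the eventual cascade, which is the parameter space the abstract refers to.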
SIAM Data Mining Brings It to Annual Meeting
The Data Mining Activity Group is one of SIAM's most vibrant and dynamic activity groups. To better share our enthusiasm for data mining with the broader SIAM community, our activity group organized six minisymposia at the 2016 Annual Meeting. These minisymposia included 48 talks organized by 11 SIAM members on:

- GraphBLAS (Aydın Buluç)
- Algorithms and statistical methods for noisy network analysis (Sanjukta Bhowmick & Ben Miller)
- Inferring networks from non-network data (Rajmonda Caceres, Ivan Brugere & Tanya Y. Berger-Wolf)
- Visual analytics (Jordan Crouser)
- Mining in graph data (Jennifer Webster, Mahantesh Halappanavar & Emilie Hogan)
- Scientific computing and big data (Vijay Gadepally)

These minisymposia were well received by the broader SIAM community, and below are some of the key highlights.
Performance Evaluation of Job Scheduling and Resource Allocation in Apache Spark
Advancements in data acquisition techniques and devices are revolutionizing the way image data are collected, managed and processed. Devices such as time-lapse cameras and multispectral cameras generate large amounts of image data daily. Therefore, there is a clear need for many organizations and researchers to deal with large volumes of image data efficiently. At the same time, Big Data processing on distributed systems such as Apache Spark has gained popularity in recent years. Apache Spark is a widely used in-memory framework for distributed processing of large datasets on a cluster of inexpensive computers. This thesis proposes using Spark for distributed processing of large amounts of image data in a time-efficient manner. However, to share cluster resources efficiently, multiple image processing applications submitted to the cluster must be appropriately scheduled by Spark cluster managers to take advantage of all the compute power and storage capacity of the cluster. Spark can run on three cluster managers (Standalone, Mesos, and YARN) and provides several configuration parameters that control how resources are allocated and scheduled. Using default settings for these parameters is not enough to efficiently share cluster resources between multiple applications running concurrently. This leads to performance issues and resource underutilization, because cluster administrators and users do not know which Spark cluster manager is the right fit for their applications, or how the scheduling behaviour and parameter settings of these cluster managers affect the performance of their applications in terms of resource utilization and response times.
This thesis parallelizes a set of heterogeneous image processing applications, including Image Registration, Flower Counter and Image Clustering, and presents extensive comparisons and analyses of running these applications on a large server and on a Spark cluster using three different cluster managers for resource allocation: Standalone, Apache Mesos and Hadoop YARN. In addition, the thesis examines the two job scheduling and resource allocation modes available in Spark: static and dynamic allocation. Furthermore, it explores the various configurations available in both modes that control speculative execution of tasks, resource sizing and the number of parallel tasks per job, and explains their impact on image processing applications. The thesis aims to show that using optimal values for these parameters reduces job makespan, maximizes cluster utilization, and ensures each application is allocated a fair share of cluster resources in a timely manner.
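The scheduling and allocation behaviour discussed above is controlled through standard Spark configuration properties. As a hedged illustration (the property names are real Spark configuration keys, but the values and the application name `image_pipeline.py` are placeholders, not the thesis's tuned settings), a submission enabling dynamic allocation, speculative execution, and FAIR scheduling might look like:

```shell
# Illustrative spark-submit invocation; values are placeholders, not tuned settings.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 2 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=16 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.speculation=true \
  --conf spark.scheduler.mode=FAIR \
  image_pipeline.py
```

Note that dynamic allocation requires the external shuffle service, which is why `spark.shuffle.service.enabled` is set alongside it.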
Adaptive Asynchronous Control and Consistency in Distributed Data Exploration Systems
Advances in machine learning and streaming systems provide a backbone to transform vast arrays of raw data into valuable information. Leveraging distributed execution, analysis engines can process this information effectively within an iterative data exploration workflow to solve problems at unprecedented rates. However, with increased input dimensionality, a desire to simultaneously share and isolate information, and overlapping, dependent tasks, this process is becoming increasingly difficult to maintain. User interaction derails exploratory progress due to manual oversight of lower-level tasks such as tuning parameters, adjusting filters, and monitoring queries. We identify human-in-the-loop management of data generation and distributed analysis as an inhibiting problem that precludes efficient online, iterative data exploration and causes delays in knowledge discovery and decision making. The flexible and scalable systems implementing the exploration workflow require semi-autonomous methods, integrated as architectural support, to reduce human involvement. We thus argue that an abstraction layer providing adaptive asynchronous control and consistency management over a series of individual tasks coordinated to achieve a global objective can significantly improve data exploration effectiveness and efficiency. This thesis introduces methodologies which autonomously coordinate distributed execution at a lower level in order to synchronize multiple efforts as part of a common goal. We demonstrate the impact on data exploration through serverless simulation ensemble management and multi-model machine learning, showing improved performance and reduced resource utilization that enable a more productive semi-autonomous exploration workflow. We focus on the specific genres of molecular dynamics and personalized healthcare; however, the contributions are applicable to a wide variety of domains.
Elastic Dataflow Processing on the Cloud
Clouds have become an attractive platform for the large-scale processing of
modern applications on Big Data, especially due to the concept of elasticity,
which characterizes them: resources can be leased on demand and used for as
much time as needed, offering the ability to create virtual infrastructures
that change dynamically over time. Such applications often require processing
of complex queries that are expressed in a high-level language and are
typically transformed into data processing flows (dataflows). A logical
question that arises is whether elasticity affects dataflow execution and in
which way. It seems reasonable that the execution is faster when more resources
are used; however, the monetary cost is higher. This gives rise to the concept
of eco-elasticity, an additional kind of elasticity that comes from economics and
captures the trade-offs between the response time of the system and the amount
of money we pay for it as influenced by the use of different amounts of
resources.
In this thesis, we approach the elasticity of clouds in a unified way that
combines both the traditional notion and eco-elasticity. This unified
elasticity concept is essential for the development of auto-tuned systems in
cloud environments. First, we demonstrate that eco-elasticity exists in several
common tasks that appear in practice and that it can be discovered using a
simple yet highly scalable and efficient algorithm. Next, we present two cases of
auto-tuned algorithms that use the unified model of elasticity in order to
adapt to the query workload: 1) processing analytical queries in the form of
tree execution plans in order to maximize profit and 2) automated index
management taking into account compute and storage resources. Finally, we
describe EXAREME, a system for elastic data processing on the cloud that has
been used and extended in this work. The system offers declarative languages
that are based on SQL with user-defined functions (UDFs) extended with
parallelism primitives. EXAREME exploits both elasticities of clouds by
dynamically allocating and deallocating compute resources in order to adapt to
the query workload.
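The time/cost trade-off that eco-elasticity captures can be illustrated with a toy model. This is our own illustration, not the thesis's algorithm, and the cost formula and parameter values are hypothetical: runtime falls as more nodes are leased, while the total monetary cost of the lease grows.

```python
# Toy eco-elasticity model (hypothetical numbers, not the thesis's algorithm):
# leasing more nodes shortens execution time but raises total monetary cost.
def exec_time(n_nodes, work=3600.0, overhead=10.0):
    """Idealized runtime: perfectly parallel work plus per-node coordination."""
    return work / n_nodes + overhead * n_nodes  # seconds

def monetary_cost(n_nodes, price_per_node_second=0.001):
    """Total lease cost: every node is paid for over the whole execution."""
    return n_nodes * exec_time(n_nodes) * price_per_node_second

# Sweeping the cluster size exposes the trade-off an auto-tuner must navigate:
options = [(n, exec_time(n), monetary_cost(n)) for n in (1, 2, 4, 8, 16)]
```

An auto-tuned system in the thesis's sense would pick a point on this curve that matches the user's time/money preference rather than always maximizing one dimension.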
Second International Conference on Sustainable Futures: Environmental, Technological, Social and Economic Matters (ICSF 2021). Kryvyi Rih, Ukraine, May 19-21, 2021.