11 research outputs found

    Benchmarking SciDB Data Import on HPC Systems

    Full text link
    SciDB is a scalable, computational database management system that uses an array model for data storage. The array data model of SciDB makes it ideally suited for storing and managing large amounts of imaging data. SciDB is designed to support advanced analytics in-database, thus reducing the need for extracting data for analysis. It is designed to be massively parallel and can run on commodity hardware in a high performance computing (HPC) environment. In this paper, we present the performance of SciDB using simulated image data. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a cluster running the MIT SuperCloud software stack. A peak performance of 2.2M database inserts per second was achieved on a single node of this system. We also show that SciDB and the D4M toolbox provide more efficient ways to access random sub-volumes of massive datasets compared to the traditional approach of reading volumetric data from individual files. This work describes the D4M and SciDB tools we developed and presents the initial performance results. This performance was achieved by using parallel inserts, an in-database merging of arrays, and supercomputing techniques such as distributed arrays and single-program-multiple-data programming. Comment: 5 pages, 4 figures, IEEE High Performance Extreme Computing (HPEC) 2016, best paper finalist
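
    As a rough illustration of the parallel-insert pattern described above, the following Python sketch partitions inserts across single-program-multiple-data (SPMD) workers and merges the partial results afterwards; the insert_batch and merge helpers are hypothetical stand-ins, not the D4M or SciDB API.

    # Minimal sketch of the parallel-ingest pattern: several SPMD workers
    # insert disjoint batches in parallel, and the partial arrays are merged
    # afterwards. The dictionaries are hypothetical stand-ins for D4M
    # associative arrays; this is NOT the D4M or SciDB API.
    from multiprocessing import Pool

    def insert_batch(worker_id, n_rows=100_000):
        """Each worker builds its own partial array (its 'insert' step)."""
        partial = {}
        for i in range(n_rows):
            # Disjoint row keys per worker, mimicking distributed-array ownership.
            partial[(worker_id * n_rows + i, 0)] = 1.0
        return partial

    def merge(arrays):
        """Stand-in for the in-database merge of partial arrays."""
        merged = {}
        for a in arrays:
            for key, val in a.items():
                merged[key] = merged.get(key, 0.0) + val  # additive merge
        return merged

    if __name__ == "__main__":
        n_workers = 4
        with Pool(n_workers) as pool:
            partials = pool.starmap(insert_batch, [(w,) for w in range(n_workers)])
        total = merge(partials)
        print(f"{len(total)} entries ingested by {n_workers} workers")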

    D4M 3.0: Extended Database and Language Capabilities

    Full text link
    The D4M tool was developed to address many of today's data needs. This tool is used by hundreds of researchers to perform complex analytics on unstructured data. Over the past few years, the D4M toolbox has evolved to support connectivity with a variety of new database engines, including SciDB. D4M-Graphulo provides the ability to perform graph analytics in the Apache Accumulo database. Finally, an implementation using the Julia programming language is also now available. In this article, we describe some of our latest additions to the D4M toolbox and our upcoming D4M 3.0 release. We show through benchmarking and scaling results that we can achieve fast SciDB ingest using the D4M-SciDB connector, that Graphulo can enable graph algorithms at scales that would otherwise be memory limited, and that the Julia implementation of D4M achieves performance comparable to, or exceeding, that of the existing MATLAB(R) implementation. Comment: IEEE HPEC 2017

    Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M

    Full text link
    The Dynamic Distributed Dimensional Data Model (D4M) library implements associative arrays in a variety of languages (Python, Julia, and Matlab/Octave) and provides a lightweight in-memory database implementation of hypersparse arrays that are ideal for analyzing many types of network data. D4M relies on associative arrays, which combine properties of spreadsheets, databases, matrices, graphs, and networks, while providing rigorous mathematical guarantees, such as linearity. Streaming updates of D4M associative arrays put enormous pressure on the memory hierarchy. This work describes the design and performance optimization of an implementation of hierarchical associative arrays that reduces memory pressure and dramatically increases the update rate into an associative array. The parameters of hierarchical associative arrays control the number of entries in each level of the hierarchy before an update is cascaded, and are easily tuned to achieve optimal performance for a variety of applications. Hierarchical arrays achieve over 40,000 updates per second in a single instance. Scaling to 34,000 instances of hierarchical D4M associative arrays on 1,100 server nodes on the MIT SuperCloud achieved a sustained update rate of 1,900,000,000 updates per second. This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets. Comment: 6 pages; 6 figures; accepted to IEEE High Performance Extreme Computing (HPEC) Conference 2019. arXiv admin note: text overlap with arXiv:1807.05308, arXiv:1902.0084
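
    The cascading-update scheme described above can be sketched in a few lines. The following Python code is a minimal illustration assuming a dictionary-backed store; the class name, level capacities, and additive merge rule are illustrative assumptions, not the D4M implementation.

    # Minimal sketch of a hierarchical associative array: small, fast levels
    # absorb updates and cascade into larger levels once a capacity threshold
    # is exceeded. Capacities are illustrative, not the tuned values from
    # the paper.
    class HierarchicalAssoc:
        def __init__(self, capacities=(1_000, 100_000)):
            # levels[0] is the smallest/fastest; the last level is unbounded.
            self.capacities = capacities
            self.levels = [dict() for _ in range(len(capacities) + 1)]

        def update(self, key, value):
            level0 = self.levels[0]
            level0[key] = level0.get(key, 0) + value  # additive update (linearity)
            self._cascade(0)

        def _cascade(self, i):
            # When level i exceeds its capacity, merge it into level i+1.
            while i < len(self.capacities) and len(self.levels[i]) > self.capacities[i]:
                nxt = self.levels[i + 1]
                for key, value in self.levels[i].items():
                    nxt[key] = nxt.get(key, 0) + value
                self.levels[i] = {}
                i += 1

        def get(self, key):
            # A lookup combines contributions from every level.
            return sum(level.get(key, 0) for level in self.levels)

    h = HierarchicalAssoc(capacities=(4, 16))
    for edge in [("a", "b"), ("a", "c"), ("a", "b")] * 10:
        h.update(edge, 1)
    print(h.get(("a", "b")))  # 20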

    SIAM Data Mining Brings It to Annual Meeting

    Get PDF
    The Data Mining Activity Group is one of SIAM's most vibrant and dynamic activity groups. To better share our enthusiasm for data mining with the broader SIAM community, our activity group organized six minisymposia at the 2016 Annual Meeting. These minisymposia included 48 talks organized by 11 SIAM members on:
    - GraphBLAS (Aydın Buluç)
    - Algorithms and statistical methods for noisy network analysis (Sanjukta Bhowmick & Ben Miller)
    - Inferring networks from non-network data (Rajmonda Caceres, Ivan Brugere & Tanya Y. Berger-Wolf)
    - Visual analytics (Jordan Crouser)
    - Mining in graph data (Jennifer Webster, Mahantesh Halappanavar & Emilie Hogan)
    - Scientific computing and big data (Vijay Gadepally)
    These minisymposia were well received by the broader SIAM community, and below are some of the key highlights.

    Performance Evaluation of Job Scheduling and Resource Allocation in Apache Spark

    Get PDF
    Advancements in data acquisition techniques and devices are revolutionizing the way image data are collected, managed and processed. Devices such as time-lapse cameras and multispectral cameras generate large amounts of image data daily. Therefore, there is a clear need for many organizations and researchers to deal with large volumes of image data efficiently. At the same time, Big Data processing on distributed systems such as Apache Spark has gained popularity in recent years. Apache Spark is a widely used in-memory framework for distributed processing of large datasets on a cluster of inexpensive computers. This thesis proposes using Spark for distributed processing of large amounts of image data in a time-efficient manner. However, to share cluster resources efficiently, multiple image processing applications submitted to the cluster must be appropriately scheduled by Spark cluster managers to take advantage of all the compute power and storage capacity of the cluster. Spark can run on three cluster managers, Standalone, Mesos and YARN, and provides several configuration parameters that control how resources are allocated and scheduled. Using default settings for these parameters is not enough to share cluster resources efficiently between multiple applications running concurrently. This leads to performance issues and resource underutilization, because cluster administrators and users do not know which Spark cluster manager is the right fit for their applications, or how the scheduling behaviour and parameter settings of these cluster managers affect the performance of their applications in terms of resource utilization and response times. This thesis parallelized a set of heterogeneous image processing applications, including Image Registration, Flower Counter and Image Clustering, and presents extensive comparisons and analyses of running these applications on a large server and on a Spark cluster using three different cluster managers for resource allocation: Standalone, Apache Mesos and Hadoop YARN. In addition, the thesis examined the two job scheduling and resource allocation modes available in Spark: static and dynamic allocation. Furthermore, the thesis explored the various configurations available in both modes that control speculative execution of tasks, resource size and the number of parallel tasks per job, and explained their impact on image processing applications. The thesis aims to show that using optimal values for these parameters reduces job makespan, maximizes cluster utilization, and ensures each application is allocated a fair share of cluster resources in a timely manner.
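
    For reference, the allocation modes and parameters discussed above are set through standard Spark configuration properties; the PySpark snippet below is a sketch with illustrative values, not the tuned settings from the thesis experiments.

    # Sketch of configuring Spark's dynamic resource allocation and
    # speculative execution from PySpark. The keys are standard Spark
    # configuration properties; the values are illustrative only.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("image-processing")
        # Dynamic allocation: executors are acquired and released with load.
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "16")
        # Dynamic allocation typically also needs the external shuffle service.
        .config("spark.shuffle.service.enabled", "true")
        # Re-launch unusually slow tasks on other nodes.
        .config("spark.speculation", "true")
        # Resource size per executor and default task parallelism.
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "8g")
        .config("spark.default.parallelism", "64")
        .getOrCreate()
    )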

    Adaptive Asynchronous Control and Consistency in Distributed Data Exploration Systems

    Get PDF
    Advances in machine learning and streaming systems provide a backbone to transform vast arrays of raw data into valuable information. Leveraging distributed execution, analysis engines can process this information effectively within an iterative data exploration workflow to solve problems at unprecedented rates. However, with increased input dimensionality, a desire to simultaneously share and isolate information, as well as overlapping and dependent tasks, this process is becoming increasingly difficult to maintain. User interaction derails exploratory progress due to manual oversight of lower-level tasks such as tuning parameters, adjusting filters, and monitoring queries. We identify human-in-the-loop management of data generation and distributed analysis as an inhibiting problem that precludes efficient online, iterative data exploration and causes delays in knowledge discovery and decision making. The flexible and scalable systems implementing the exploration workflow require semi-autonomous methods integrated as architectural support to reduce human involvement. We thus argue that an abstraction layer providing adaptive asynchronous control and consistency management over a series of individual tasks coordinated to achieve a global objective can significantly improve data exploration effectiveness and efficiency. This thesis introduces methodologies which autonomously coordinate distributed execution at a lower level in order to synchronize multiple efforts as part of a common goal. We demonstrate the impact on data exploration through serverless simulation ensemble management and multi-model machine learning, showing improved performance and reduced resource utilization that enable a more productive semi-autonomous exploration workflow. We focus on the specific domains of molecular dynamics and personalized healthcare; however, the contributions are applicable to a wide variety of domains.

    Elastic Dataflow Processing on the Cloud

    Get PDF
    Clouds have become an attractive platform for the large-scale processing of modern applications on Big Data, especially due to the concept of elasticity, which characterizes them: resources can be leased on demand and used for as much time as needed, offering the ability to create virtual infrastructures that change dynamically over time. Such applications often require processing of complex queries that are expressed in a high-level language and are typically transformed into data processing flows (dataflows). A logical question that arises is whether elasticity affects dataflow execution and in which way. It seems reasonable that execution is faster when more resources are used; however, the monetary cost is higher. This gives rise to the concept of eco-elasticity, an additional kind of elasticity that comes from economics and captures the trade-offs between the response time of the system and the amount of money we pay for it as influenced by the use of different amounts of resources. In this thesis, we approach the elasticity of clouds in a unified way that combines both the traditional notion and eco-elasticity. This unified elasticity concept is essential for the development of auto-tuned systems in cloud environments. First, we demonstrate that eco-elasticity exists in several common tasks that appear in practice and that it can be discovered using a simple, yet highly scalable and efficient, algorithm. Next, we present two cases of auto-tuned algorithms that use the unified model of elasticity in order to adapt to the query workload: 1) processing analytical queries in the form of tree execution plans in order to maximize profit and 2) automated index management taking into account compute and storage resources. Finally, we describe EXAREME, a system for elastic data processing on the cloud that has been used and extended in this work. The system offers declarative languages that are based on SQL with user-defined functions (UDFs), extended with parallelism primitives. EXAREME exploits both elasticities of clouds by dynamically allocating and deallocating compute resources in order to adapt to the query workload.
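
    As a toy illustration of the time/cost trade-off behind eco-elasticity, the Python sketch below models running time with an assumed Amdahl-style speedup and picks a node count that minimizes a weighted objective; the speedup model, prices, and weights are invented for illustration and are not taken from the thesis.

    # Toy illustration of eco-elasticity: more machines shorten execution
    # time but raise monetary cost, so a weighted objective picks an
    # operating point on the trade-off curve. All numbers are assumptions.
    PRICE_PER_NODE_HOUR = 0.50  # assumed cloud price
    SERIAL_HOURS = 10.0         # assumed single-node running time

    def exec_time(n_nodes, parallel_fraction=0.95):
        """Amdahl-style running time on n_nodes (a modeling assumption)."""
        serial = (1 - parallel_fraction) * SERIAL_HOURS
        parallel = parallel_fraction * SERIAL_HOURS / n_nodes
        return serial + parallel

    def monetary_cost(n_nodes):
        # Cost grows with node-hours consumed, even as time shrinks.
        return n_nodes * exec_time(n_nodes) * PRICE_PER_NODE_HOUR

    def best_allocation(weight_time=1.0, weight_cost=1.0, max_nodes=64):
        """Pick the node count minimizing a weighted time/cost objective."""
        return min(
            range(1, max_nodes + 1),
            key=lambda n: weight_time * exec_time(n) + weight_cost * monetary_cost(n),
        )

    for n in (1, 4, 16, 64):
        print(n, round(exec_time(n), 2), round(monetary_cost(n), 2))
    print("chosen:", best_allocation(weight_time=1.0, weight_cost=0.5))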

    Second International Conference on Sustainable Futures: Environmental, Technological, Social and Economic Matters (ICSF 2021). Kryvyi Rih, Ukraine, May 19-21, 2021

    Get PDF
    Second International Conference on Sustainable Futures: Environmental, Technological, Social and Economic Matters (ICSF 2021). Kryvyi Rih, Ukraine, May 19-21, 2021.