1,421 research outputs found

    Distributed-Memory Breadth-First Search on Massive Graphs

    Full text link
    This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm as well as the recently discovered direction optimizing algorithm. We analyze the performance and scalability trade-offs in using different local data structures such as CSR and DCSC, enabling in-node multithreading, and graph decompositions such as 1D and 2D decomposition.Comment: arXiv admin note: text overlap with arXiv:1104.451

    Structure-Aware Dynamic Scheduler for Parallel Machine Learning

    Full text link
    Training large machine learning (ML) models with many variables or parameters can take a long time if one employs sequential procedures even with stochastic updates. A natural solution is to turn to distributed computing on a cluster; however, naive, unstructured parallelization of ML algorithms does not usually lead to a proportional speedup and can even result in divergence, because dependencies between model elements can attenuate the computational gains from parallelization and compromise correctness of inference. Recent efforts toward this issue have benefited from exploiting the static, a priori block structures residing in ML algorithms. In this paper, we take this path further by exploring the dynamic block structures and workloads therein present during ML program execution, which offers new opportunities for improving convergence, correctness, and load balancing in distributed ML. We propose and showcase a general-purpose scheduler, STRADS, for coordinating distributed updates in ML algorithms, which harnesses the aforementioned opportunities in a systematic way. We provide theoretical guarantees for our scheduler, and demonstrate its efficacy versus static block structures on Lasso and Matrix Factorization

    CEP-DTHP : A Complex Event Processing using the Dual-Tier Hybrid Paradigm Over the Stream Mining Process

    Get PDF
    CEP is a widely used technique for the reliability and recognition of arbitrarily complex patterns in enormous data streams with great performance in real time. Real-time detection of crucial events and rapid response to them are the key goals of sophisticated event processing.  The performance of event processing systems can be improved by parallelizing CEP evaluation procedures. Utilizing CEP in parallel while deploying a multi-core or distributed environment is one of the most popular and widely recognized tackles to accomplish the goal. This paper demonstrates the ability to use an unusual parallelization strategy to effectively process complicated events over streams of data. This method depends on a dual-tier hybrid paradigm that combines several parallelism levels. Thread-level or task-level parallelism (TLP) and Data-level parallelism (DLP) were combined in this research. Many threads or instruction sequences from a comparable application can run concurrently under the TLP paradigm. In the DLP paradigm, instruc-tions from a single stream operate on several data streams at the same time. In our suggested model, there are four major stages: data mining, pre-processing, load shedding, and optimization. The first phase is online data mining, following which the data is materialized into a publicly available solution that combines a CEP engine with a library. Next, data pre-processing encompasses the efficient adaptation of the content or format of raw data from many, perhaps diverse sources. Finally, parallelization approaches have been created to reduce CEP processing time. By providing this two-type parallelism, our proposed solution combines the benefits of DLP and TLP while addressing their constraints. The JAVA tool will be used to assess the suggested technique. The performance of the suggested technique is compared to that of other current ways for determining the efficacy and efficiency of the proposed algorithm

    An ontology enhanced parallel SVM for scalable spam filter training

    Get PDF
    This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart

    Parallel Algorithm for Reduction of Data Processing Time in Big Data

    Get PDF
    Technological advances have allowed to collect and store large volumes of data over the years. Besides, it is significant that today's applications have high performance and can analyze these large datasets effectively. Today, it remains a challenge for data mining to make its algorithms and applications equally efficient in the need of increasing data size and dimensionality [1]. To achieve this goal, many applications rely on parallelism, because it is an area that allows the reduction of cost depending on the execution time of the algorithms because it takes advantage of the characteristics of current computer architectures to run several processes concurrently [2]. This paper proposes a parallel version of the FuzzyPred algorithm based on the amount of data that can be processed within each of the processing threads, synchronously and independently

    Virtualizing super-computation on-board UAS

    Get PDF
    Unmanned aerial systems (UAS, also known as UAV, RPAS or drones) have a great potential to support a wide variety of aerial remote sensing applications. Most UAS work by acquiring data using on-board sensors for later post-processing. Some require the data gathered to be downlinked to the ground in real-time. However, depending on the volume of data and the cost of the communications, this later option is not sustainable in the long term. This paper develops the concept of virtualizing super-computation on-board UAS, as a method to ease the operation by facilitating the downlink of high-level information products instead of raw data. Exploiting recent developments in miniaturized multi-core devices is the way to speed-up on-board computation. This hardware shall satisfy size, power and weight constraints. Several technologies are appearing with promising results for high performance computing on unmanned platforms, such as the 36 cores of the TILE-Gx36 by Tilera (now EZchip) or the 64 cores of the Epiphany-IV by Adapteva. The strategy for virtualizing super-computation on-board includes the benchmarking for hardware selection, the software architecture and the communications aware design. A parallelization strategy is given for the 36-core TILE-Gx36 for a UAS in a fire mission or in similar target-detection applications. The results are obtained for payload image processing algorithms and determine in real-time the data snapshot to gather and transfer to ground according to the needs of the mission, the processing time, and consumed watts.Unmanned aerial systems (UAS, also known as UAV, RPAS or drones) have a great potential to support a wide variety of aerial remote sensing applications. Most UAS work by acquiring data using on-board sensors for later post-processing. Some require the data gathered to be downlinked to the ground in real-time. However, depending on the volume of data and the cost of the communications, this later option is not sustainable in the long term. This paper develops the concept of virtualizing super-computation on-board UAS, as a method to ease the operation by facilitating the downlink of high-level information products instead of raw data. Exploiting recent developments in miniaturized multi-core devices is the way to speed-up on-board computation. This hardware shall satisfy size, power and weight constraints. Several technologies are appearing with promising results for high performance computing on unmanned platforms, such as the 36 cores of the TILE-Gx36 by Tilera (now EZchip) or the 64 cores of the Epiphany-IV by Adapteva. The strategy for virtualizing super-computation on-board includes the benchmarking for hardware selection, the software architecture and the communications aware design. A parallelization strategy is given for the 36-core TILE-Gx36 for a UAS in a fire mission or in similar target-detection applications. The results are obtained for payload image processing algorithms and determine in real-time the data snapshot to gather and transfer to ground according to the needs of the mission, the processing time, and consumed watts.Postprint (published version

    Parallelization Strategies for Graph-Code-Based Similarity Search

    Get PDF
    The volume of multimedia assets in collections is growing exponentially, and the retrieval of information is becoming more complex. The indexing and retrieval of multimedia content is generally implemented by employing feature graphs. Feature graphs contain semantic information on multimedia assets. Machine learning can produce detailed semantic information on multimedia assets, reflected in a high volume of nodes and edges in the feature graphs. While increasing the effectiveness of the information retrieval results, the high level of detail and also the growing collections increase the processing time. Addressing this problem, Multimedia Feature Graphs (MMFGs) and Graph Codes (GCs) have been proven to be fast and effective structures for information retrieval. However, the huge volume of data requires more processing time. As Graph Code algorithms were designed to be parallelizable, different paths of parallelization can be employed to prove or evaluate the scalability options of Graph Code processing. These include horizontal and vertical scaling with the use of Graphic Processing Units (GPUs), Multicore Central Processing Units (CPUs), and distributed computing. In this paper, we show how different parallelization strategies based on Graph Codes can be combined to provide a significant improvement in efficiency. Our modeling work shows excellent scalability with a theoretical speedup of 16,711 on a top-of-the-line Nvidia H100 GPU with 16,896 cores. Our experiments with a mediocre GPU show that a speedup of 225 can be achieved and give credence to the theoretical speedup. Thus, Graph Codes provide fast and effective multimedia indexing and retrieval, even in billion-scale use cases

    Parallel Maximum Clique Algorithms with Applications to Network Analysis and Storage

    Full text link
    We propose a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from 1000 to 100 million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. Our method employs a branch and bound strategy with novel and aggressive pruning techniques. For instance, we use the core number of a vertex in combination with a good heuristic clique finder to efficiently remove the vast majority of the search space. In addition, we parallelize the exploration of the search tree. During the search, processes immediately communicate changes to upper and lower bounds on the size of maximum clique, which occasionally results in a super-linear speedup because vertices with large search spaces can be pruned by other processes. We apply the algorithm to two problems: to compute temporal strong components and to compress graphs.Comment: 11 page
    corecore