
    Parallel computing of numerical schemes and big data analytic for solving real life applications

    This paper presents several real-life applications of big data analytics using parallel computing software. The parallel computing software considered comprises Parallel Virtual Machine, MATLAB Distributed Computing Server, and Compute Unified Device Architecture, which are used to simulate the big data problems. Parallel computing overcomes the poor runtime, speedup, and efficiency of sequential computing. The mathematical models for the big data analytics are based on partial differential equations, whose discretization yields large sparse matrices and a system of linear equations. Iterative numerical schemes are used to solve the problems, and the computational process is summarized in a parallel algorithm. The parallel algorithm development is based on domain decomposition of the problems and the architecture of the different parallel computing software. The parallel performance on distributed and shared memory architectures is evaluated in terms of speedup, efficiency, effectiveness, and temporal performance.
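The four performance measures named in the abstract above can be illustrated with a minimal sketch. The definitions used here are common textbook ones and are an assumption, since the abstract does not state its formulas: speedup `T1 / Tp`, efficiency `speedup / p`, effectiveness `speedup / (p * Tp)`, and temporal performance `1 / Tp`. The function name `parallel_metrics` is hypothetical.

```python
def parallel_metrics(t_seq, t_par, p):
    """Return common parallel performance measures.

    t_seq: runtime of the sequential program (seconds)
    t_par: runtime of the parallel program on p processors (seconds)
    p:     number of processors
    The formulas are assumed textbook definitions, not taken from the paper.
    """
    speedup = t_seq / t_par                # how many times faster than sequential
    efficiency = speedup / p               # fraction of ideal linear speedup
    effectiveness = speedup / (p * t_par)  # speedup per unit of processor-time
    temporal_performance = 1.0 / t_par     # runs completed per second
    return {
        "speedup": speedup,
        "efficiency": efficiency,
        "effectiveness": effectiveness,
        "temporal_performance": temporal_performance,
    }
```

For example, a job that takes 100 s sequentially and 25 s on 8 processors has a speedup of 4 and an efficiency of 0.5 under these definitions.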

    A distributed platform for speech recognition research

    Distributed and parallel processing of big data has been applied in various applications for the past few years. Moreover, huge advancements have taken place in the usability, economic efficiency, and variety of parallel processing systems, with big data analysis and speech recognition research supported by many researchers. In this paper we investigate which parts of speech recognition research may be parallelized and computed using distributed computing platforms. Firstly, we address the case of efficiently computing n-gram statistics on MapReduce platforms to build a language model (LM). Secondly, we show how the Automated Speech Recognition (ASR) tool can work efficiently with respect to speed and fault tolerance in a distributed environment such as Sun GridEngine (SGE).
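Counting n-grams in the MapReduce style mentioned above can be sketched in a single process. This is an illustration of the map, shuffle, and reduce phases only, not the paper's implementation; all function names are hypothetical, and whitespace tokenization is an assumption.

```python
from collections import defaultdict
from itertools import chain


def map_ngrams(sentence, n):
    """Map phase: emit (n-gram, 1) for every n-gram in one sentence."""
    tokens = sentence.split()  # assumption: whitespace tokenization
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n]), 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the partial counts of one n-gram."""
    return key, sum(values)


def ngram_counts(corpus, n):
    """Run the three phases sequentially over a list of sentences."""
    mapped = chain.from_iterable(map_ngrams(s, n) for s in corpus)
    return dict(reduce_counts(k, vs) for k, vs in shuffle(mapped).items())
```

On a real MapReduce platform the map calls would run on different nodes and the framework would perform the shuffle; the per-key reduce logic stays the same.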

    MPJ Express meets YARN: towards Java HPC on Hadoop systems

    Many organizations, including academic, research, and commercial institutions, have invested heavily in setting up High Performance Computing (HPC) facilities for running computational science applications. On the other hand, the Apache Hadoop software, after emerging in 2005, has become a popular, reliable, and scalable open-source framework for processing large-scale data (Big Data). Realizing the importance and significance of Big Data, an increasing number of organizations are investing in relatively cheaper Hadoop clusters for executing their mission-critical data processing applications. An issue here is that system administrators at these sites might have to maintain two parallel facilities for running HPC and Hadoop computations. This, of course, is not ideal due to redundant maintenance work and poor economics. This paper attempts to bridge this gap by allowing HPC and Hadoop jobs to co-exist on a single hardware facility. We achieve this goal by exploiting YARN (Hadoop v2.0), which decouples the computational and resource scheduling part of the Hadoop framework from HDFS. In this context, we have developed a YARN-based reference runtime system for the MPJ Express software that allows executing parallel MPI-like Java applications on Hadoop clusters. The main contribution of this paper is to provide the Big Data community with access to MPI-like programming using MPJ Express. As an aside, this work allows parallel Java applications to perform computations on data stored in the Hadoop Distributed File System (HDFS).

    Clustering in the Big Data Era: methods for efficient approximation, distribution, and parallelization

    Data clustering is an unsupervised machine learning task whose objective is to group together similar items. As a versatile data mining tool, data clustering has numerous applications, such as object detection and localization using data from 3D laser-based sensors, finding popular routes using geolocation data, and finding similar patterns of electricity consumption using smart meters. The datasets in modern IoT-based applications are getting more and more challenging for conventional clustering schemes. Big Data is a term used to loosely describe hard-to-manage datasets. Particularly, large numbers of data points, high rates of data production, large numbers of dimensions, high skewness, and distributed data sources are aspects that challenge the classical data processing schemes, including clustering methods. This thesis contributes to efficient big data clustering for distributed and parallel computing architectures, representative of the processing environments in the edge-cloud computing continuum. The thesis also proposes approximation techniques to cope with certain challenging aspects of big data. Regarding distributed clustering, the thesis proposes MAD-C, abbreviating Multi-stage Approximate Distributed Cluster-Combining. MAD-C leverages an approximation-based data synopsis that drastically lowers the required communication bandwidth among the distributed nodes and achieves multiplicative savings in computation time, compared to a baseline that centrally gathers and clusters the data. The thesis shows MAD-C can be used to detect and localize objects using data from distributed 3D laser-based sensors with high accuracy. Furthermore, the work in the thesis shows how to utilize MAD-C to efficiently detect the objects within a restricted area for geofencing purposes. Regarding parallel clustering, the thesis proposes a family of algorithms called PARMA-CC, abbreviating Parallel Multistage Approximate Cluster Combining. Using approximation-based data synopsis, PARMA-CC algorithms achieve scalability on multi-core systems by facilitating parallel execution of threads with limited dependencies, which are resolved using fine-grained synchronization techniques. To further enhance the efficiency, PARMA-CC algorithms can be configured with respect to different data properties. Analytical and empirical evaluations show PARMA-CC algorithms achieve significantly higher scalability than the state-of-the-art methods while preserving a high accuracy. On parallel high dimensional clustering, the thesis proposes IP.LSH.DBSCAN, abbreviating Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing (LSH). IP.LSH.DBSCAN fuses the process of creating an LSH index into the process of data clustering, and it takes advantage of data parallelization and fine-grained synchronization. Analytical and empirical evaluations show IP.LSH.DBSCAN facilitates parallel density-based clustering of massive datasets using desired distance measures, resulting in several orders of magnitude lower latency than the state of the art for high dimensional data. In essence, the thesis proposes methods and algorithmic implementations targeting the problem of big data clustering and applications using distributed and parallel processing. The proposed methods (available as open source software) are extensible and can be used in combination with other methods.
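To make the idea of an LSH index concrete, here is a minimal sketch of a generic random-projection hash for Euclidean distance. It is not the thesis's IP.LSH.DBSCAN algorithm and includes no clustering or parallelism; the function names, the bucket `width`, and the number of hash functions are all illustrative assumptions.

```python
import math
import random


def make_hash(dim, width, rng):
    """Build one hash h(x) = floor((a . x + b) / width) with random a and b."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection direction
    b = rng.uniform(0.0, width)                    # random offset
    def h(x):
        dot = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((dot + b) / width)
    return h


def build_index(points, num_hashes=4, width=1.0, seed=0):
    """Place each point index in the bucket keyed by its tuple of hash values."""
    rng = random.Random(seed)
    hashes = [make_hash(len(points[0]), width, rng) for _ in range(num_hashes)]
    buckets = {}
    for idx, point in enumerate(points):
        key = tuple(h(point) for h in hashes)
        buckets.setdefault(key, []).append(idx)
    return hashes, buckets
```

Nearby points tend to share a bucket, so a density-based method can restrict its neighbor search to points in the same bucket instead of scanning the whole dataset.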

    Performance Improvement of Distributed Computing Framework and Scientific Big Data Analysis

    Analysis of Big data to gain better insights has been the focus of researchers in the recent past. Traditional desktop computers or database management systems may not be suitable for efficient and timely analysis, due to the requirement of massive parallel processing. Distributed computing frameworks are being explored as a viable solution. For example, Google proposed MapReduce, which is becoming a de facto computing architecture for Big data solutions. However, scheduling in MapReduce is coarse grained and remains a challenge. For the MapReduce scheduler configured over distributed clusters, we identify two issues: data locality disruption and random assignment of non-local map tasks. We propose a network aware scheduler to extend the existing rack awareness. The tasks are scheduled in the order of node, rack, and any other rack within the same cluster to achieve cluster level data locality. The issue of random assignment of non-local map tasks is handled by enhancing the scheduler to consider network parameters, such as delay, bandwidth, and packet loss between remote clusters. As part of Big data analysis in computational biology, we consider two major data intensive applications: indexing genome sequences and de Novo assembly. Both of these applications deal with the massive amount of data generated from DNA sequencers. We developed a scalable algorithm to construct sub-trees of a suffix tree in parallel to address the huge memory requirements of indexing the human genome. For the de Novo assembly, we propose Parallel Giraph based Assembler (PGA) to address the challenges associated with the assembly of large genomes over commodity hardware. PGA uses the de Bruijn graph to represent the data generated from sequencers. Huge memory demands and performance expectations are addressed by developing parallel algorithms based on the distributed graph-processing framework, Apache Giraph.
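The de Bruijn graph representation mentioned above can be sketched in a few lines. This is a generic single-machine construction, not PGA's Giraph implementation; the function name `de_bruijn_edges` and the choice of `k` are assumptions for illustration.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph as an adjacency list.

    Each k-mer in a read contributes one edge from its (k-1)-length
    prefix to its (k-1)-length suffix. Repeated k-mers add repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

For the single read `"ACGT"` with `k = 3`, the k-mers are `ACG` and `CGT`, giving the edges `AC -> CG` and `CG -> GT`. An assembler would then look for paths through this graph to reconstruct longer sequences.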

    Raspberry Pi Cluster for Parallel and Distributed Computing

    Parallel and distributed computing have become an essential part of ‘Big Data’ processing and analysis, especially for geophysical applications. The main goal of this project was to build a 4-node distributed computing cluster system using Raspberry Pi single-board computers for educational and research purposes. After assembling the system, a standard test was performed to check the system functionality. A Monte Carlo simulation to calculate π (pi) was used to demonstrate the advantages and drawbacks of parallelization and distribution of tasks and data within the cluster. Challenges encountered during software installation and the testing phase, and their resolutions, are also discussed.
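A Monte Carlo estimate of π splits naturally into independent tasks, which is why it is a common cluster demonstration. The sketch below is an assumption about how such a test might look, not the project's code: each worker counts random points that fall inside the unit quarter circle, and the counts are combined at the end. The names `count_hits`, `estimate_pi`, and the `map_fn` parameter are hypothetical.

```python
import random


def count_hits(task):
    """Count random points in the unit square that fall inside the quarter circle."""
    seed, samples = task
    rng = random.Random(seed)  # per-worker seed keeps the streams independent
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(samples_per_worker, workers, map_fn=map):
    """Estimate pi from `workers` independent tasks.

    map_fn defaults to the serial built-in map; a parallel map with the
    same call shape can be passed in to distribute the tasks.
    """
    tasks = [(seed, samples_per_worker) for seed in range(workers)]
    total_hits = sum(map_fn(count_hits, tasks))
    return 4.0 * total_hits / (samples_per_worker * workers)
```

Passing the `map` method of a `multiprocessing.Pool` as `map_fn` would run the tasks in separate processes on one machine; on a cluster the same decomposition would typically be expressed with a message-passing library. Only the small hit counts are communicated, so the overhead per task is low compared with the computation.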