    A Hadoop use case for engineering data

    This paper presents the VELaSSCo project (Visualization for Extremely LArge-Scale Scientific Computing). It aims to develop a platform to manipulate scientific data used by FEM (Finite Element Method) and DEM (Discrete Element Method) simulations. The project focuses on the development of a distributed, heterogeneous and high-performance platform, enabling the scientific communities to store, process and visualize huge amounts of data. The platform is compatible with current hardware capabilities, as well as future hardware

    QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

    Previous studies have reported that common dense linear algebra operations do not achieve speedup by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing, and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's). Comment: Accepted at IPDPS'10 (IEEE International Parallel & Distributed Processing Symposium 2010, Atlanta, GA, USA).
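
    The communication-avoiding idea for tall and skinny matrices can be illustrated in a few lines: factor independent row blocks locally, then factor the small stacked R factors in a single reduction step. The sketch below, in plain NumPy, shows only this numerical scheme; it is not the paper's QCG-OMPI/ScaLAPACK implementation, and the block count and matrix sizes are arbitrary assumptions.

```python
# Minimal TSQR-style sketch: local QR per row block, then one QR of the
# stacked R factors. Illustrative only; not the QCG-OMPI/ScaLAPACK code.
import numpy as np

def tsqr_r(A, num_blocks=4):
    """Return the R factor of A = QR using a one-level reduction."""
    blocks = np.array_split(A, num_blocks, axis=0)
    # Local QR on each block: the embarrassingly parallel part that would
    # run inside a single geographical site.
    local_rs = [np.linalg.qr(b, mode="r") for b in blocks]
    # One small reduction across blocks: the only inter-site exchange.
    return np.linalg.qr(np.vstack(local_rs), mode="r")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100_000, 32))        # tall and skinny
    R = tsqr_r(A)
    R_ref = np.linalg.qr(A, mode="r")
    # R is unique up to the signs of its rows, so compare magnitudes.
    print(np.allclose(np.abs(R), np.abs(R_ref)))
```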

    Scalable Teacher Forcing Network for Semi-Supervised Large Scale Data Streams

    The large-scale data stream problem refers to high-speed information flows that cannot be processed in a scalable manner on a traditional computing platform. The problem also imposes an expensive labelling cost, making the deployment of fully supervised algorithms infeasible. On the other hand, semi-supervised large-scale data streams are little explored in the literature, because most works are designed for traditional single-node computing environments and are fully supervised. This paper offers the Weakly Supervised Scalable Teacher Forcing Network (WeScatterNet) to cope with the scarcity of labelled samples and with large-scale data streams simultaneously. WeScatterNet is crafted on the distributed computing platform of Apache Spark, with a data-free model fusion strategy for model compression after the parallel computing stage. It features an open network structure to address the global and local drift problems, while integrating a data augmentation, annotation and auto-correction (DA^3) method for handling partially labelled data streams. The performance of WeScatterNet is numerically evaluated on six large-scale data stream problems with only 25% label proportions. It shows highly competitive performance even when compared with fully supervised learners trained on 100% label proportions. Comment: This paper has been accepted for publication in Information Sciences.
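
    The core difficulty the abstract addresses -- learning when only a small fraction of stream samples is labelled -- can be illustrated with a generic self-training (pseudo-labelling) loop. The sketch below shows only that generic idea on a single node with scikit-learn; it is not WeScatterNet, its DA^3 procedure, or its Spark implementation, and the model, confidence threshold and synthetic data are assumptions.

```python
# Generic pseudo-labelling sketch for a partially labelled batch.
# Not WeScatterNet; just the "learn from scarce labels" idea.
import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_partially_labelled(X, y, threshold=0.9):
    """y uses -1 for unlabelled samples."""
    labelled = y != -1
    clf = SGDClassifier(loss="log_loss", random_state=0)
    clf.fit(X[labelled], y[labelled])                 # warm-up on labelled data
    proba = clf.predict_proba(X[~labelled])
    confident = proba.max(axis=1) >= threshold        # keep confident pseudo-labels
    pseudo_y = clf.classes_[proba.argmax(axis=1)[confident]]
    X_aug = np.vstack([X[labelled], X[~labelled][confident]])
    y_aug = np.concatenate([y[labelled], pseudo_y])
    clf.fit(X_aug, y_aug)                             # retrain on the augmented set
    return clf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5))
    y_true = (X[:, 0] > 0).astype(int)
    y = np.where(rng.random(1000) < 0.25, y_true, -1)   # roughly 25% labelled
    model = fit_partially_labelled(X, y)
    print((model.predict(X) == y_true).mean())
```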

    Parallel and Context Based Search in Cloud using Multi Agent System

    Cloud computing is one of the fastest-growing technologies. It supports the large-scale infrastructure used to deliver high-performance computing. The technology also supports agents, and the integration of agents into a Multi Agent System (MAS) enables intelligent behaviour. Agents run in an environment where they communicate with each other using message passing. Each agent has its own set of behaviours and runs independently of the others; when a message arrives, each agent exhibits its own behaviour, and through these exchanges the agents coordinate. Using MAS in cloud computing helps us search for context with better performance. JADE is a platform that supports agents. This paper discusses cloud computing models and architectures, information retrieval techniques, and the use of MAS to improve the performance of big data search over a Distributed File System (DFS), which is difficult to achieve using a single agent or thread. Keywords: Cloud Computing, Distributed File System, JADE, MAS
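
    The coordination pattern described above -- independent agents that each react to incoming messages and search their own part of the data -- can be sketched with ordinary threads and queues. The code below is a plain-Python illustration of that message-passing idea, not JADE (a Java framework) and not the paper's system; the partitions, messages and search term are invented.

```python
# Message-passing "agents" searching data partitions in parallel.
# Plain-Python illustration only; not JADE and not the paper's system.
import queue
import threading

def search_agent(partition, term, inbox, result_queue):
    """Wait for a 'search' message, scan the local partition, report hits."""
    while True:
        msg = inbox.get()
        if msg == "stop":
            break
        hits = [doc for doc in partition if term in doc]
        result_queue.put(hits)

if __name__ == "__main__":
    partitions = [["alpha cloud", "beta grid"], ["cloud agent"], ["gamma"]]
    term = "cloud"
    results = queue.Queue()
    inboxes = [queue.Queue() for _ in partitions]
    agents = [threading.Thread(target=search_agent, args=(p, term, ib, results))
              for p, ib in zip(partitions, inboxes)]
    for a in agents:
        a.start()
    for ib in inboxes:
        ib.put("search")                       # ask every agent to search
    merged = [results.get() for _ in agents]   # collect one reply per agent
    for ib in inboxes:
        ib.put("stop")
    for a in agents:
        a.join()
    print(merged)
```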

    HPC-GAP: engineering a 21st-century high-performance computer algebra system

    Symbolic computation has underpinned a number of key advances in Mathematics and Computer Science. Applications are typically large and potentially highly parallel, making them good candidates for parallel execution at a variety of scales from multi-core to high-performance computing systems. However, much existing work on parallel computing is based around numeric rather than symbolic computations. In particular, symbolic computing presents specific problems in terms of varying granularity and irregular task sizes that do not match conventional approaches to parallelisation. It also presents problems in terms of the structure of the algorithms and data. This paper describes a new implementation of the free open-source GAP computational algebra system that places parallelism at the heart of the design, dealing with the key scalability and cross-platform portability problems. We provide three system layers that deal with the three most important classes of hardware: individual shared-memory multi-core nodes, mid-scale distributed clusters of (multi-core) nodes, and full-blown HPC systems comprising large-scale tightly-connected networks of multi-core nodes. This requires us to develop new cross-layer programming abstractions in the form of new domain-specific skeletons that allow us to seamlessly target different hardware levels. Our results show that, using our approach, we can achieve good scalability and speedups for two realistic exemplars, on high-performance systems comprising up to 32,000 cores, as well as on ubiquitous multi-core systems and distributed clusters. The work reported here paves the way towards full-scale exploitation of symbolic computation by high-performance computing systems, and we demonstrate the potential with two major case studies.
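
    The "skeleton" idea mentioned above is that the programmer states the computation as, say, a parallel map over independent tasks and the skeleton decides how to place it on the available hardware. The sketch below shows only that general pattern on a single shared-memory node using Python's standard library; it is not HPC-GAP's actual API, and the example task is an arbitrary stand-in for an irregular symbolic computation.

```python
# A toy data-parallel skeleton: the caller supplies a function and a list
# of tasks; the skeleton handles placement on the available cores.
# Illustrative only; not HPC-GAP's skeleton API.
from concurrent.futures import ProcessPoolExecutor

def par_map(func, tasks, workers=None):
    """Apply func to every task in parallel, returning results in order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, tasks))

def count_divisors(n):
    # Stand-in for an irregular task: cost varies strongly with the input.
    return sum(1 for d in range(1, n + 1) if n % d == 0)

if __name__ == "__main__":
    print(par_map(count_divisors, [10_000, 250_000, 999_983, 42]))
```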

    High Throughput Protein Similarity Searches in the LIBI Grid Problem Solving Environment

    Bioinformatics applications are naturally distributed, due to the distribution of the data sets, experimental data and biological databases involved. They require high computing power, owing to the large size of the data sets and the complexity of the basic computations; they may access heterogeneous data, where the heterogeneity lies in data format, access policy, distribution, etc.; and they require a secure infrastructure, because they may access private data owned by different organizations. The Problem Solving Environment (PSE) is an approach and a technology that can fulfil such bioinformatics requirements. A PSE can be used for the definition and composition of complex applications, hiding programming and configuration details from the user, who can then concentrate only on the specific problem. Moreover, Grids can be used for building geographically distributed collaborative problem solving environments, and Grid-aware PSEs can search for and use dispersed high-performance computing, networking and data resources. In this work, the PSE solution has been chosen as the integration platform for bioinformatics tools and data sources. In particular, a large-scale multiple sequence alignment experiment, supported by the LIBI PSE, is presented.
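
    A high-throughput similarity search of the kind described above boils down to splitting a set of query sequences into chunks and comparing each chunk against a database in parallel. The sketch below illustrates only that orchestration, with a deliberately crude shared-k-mer score standing in for the BLAST-like tools a PSE would actually run; it is not the LIBI workflow, and the sequences and parameters are invented.

```python
# Chunked, parallel similarity search with a toy shared-k-mer score.
# Orchestration sketch only; real pipelines would call BLAST-like tools.
from concurrent.futures import ThreadPoolExecutor

def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_hits(queries, database, k=3):
    """For each query, return the database entry sharing the most k-mers."""
    hits = {}
    for name, q in queries:
        scored = [(len(kmers(q, k) & kmers(s, k)), target) for target, s in database]
        hits[name] = max(scored)[1]
    return hits

if __name__ == "__main__":
    database = [("P1", "MKTAYIAKQR"), ("P2", "GAVLIMCFWP")]
    queries = [("Q1", "MKTAYIA"), ("Q2", "AVLIMCF"), ("Q3", "QRGAVL"), ("Q4", "IMCFWP")]
    chunks = [queries[:2], queries[2:]]                  # split the query set
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda c: best_hits(c, database), chunks))
    merged = {name: hit for part in partial for name, hit in part.items()}
    print(merged)
```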

    Cloud Storage Performance and Security Analysis with Hadoop and GridFTP

    Even though cloud servers have been around for a few years, most web hosts today have not yet converted to the cloud. If the purpose of a cloud server is to distribute and store files on the internet, FTP servers served that purpose long before the cloud and remain sufficient for distributing content. Is it therefore worth shifting from an FTP server to a cloud server? Cloud storage providers promise high durability and availability for their users, and the ability to easily scale up storage space can save users a great deal of money. However, do they provide higher performance and better security features? Hadoop is a very popular platform for cloud computing. It is free software under the Apache License, written in Java, and supports large-scale data processing in a distributed environment. Characteristics of Hadoop include partitioning of data, computing across thousands of hosts, and executing application computations in parallel. The Hadoop Distributed File System (HDFS) allows rapid data transfer at scales of up to thousands of terabytes and can continue operating even in the case of node failure. GridFTP supports high-speed data transfer over wide-area networks; it is based on FTP and features multiple data channels for parallel transfers. This report describes the technology behind HDFS and the enhancement of Hadoop's security features with Kerberos. Based on the data transfer performance and the security features of HDFS and the GridFTP server, we can decide whether to replace the GridFTP server with HDFS. According to our experimental results, we conclude that the GridFTP server provides better throughput than HDFS, and that Kerberos has minimal impact on HDFS performance. We propose a solution in which users first authenticate with HDFS and then transfer the file from the HDFS server to the client using GridFTP.
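
    The throughput comparison the report describes can be approximated by timing a put and a get of the same file through the standard hdfs dfs command-line client. The sketch below assumes a configured Hadoop client on the PATH (and, on a Kerberized cluster, a prior kinit); the paths and file size are arbitrary choices, and GridFTP itself is not exercised here.

```python
# Minimal throughput check through the `hdfs dfs` CLI: time a put and a
# get of one generated file and report MB/s. Assumes a working Hadoop
# client (and a valid Kerberos ticket via `kinit` on a secured cluster).
import os
import subprocess
import time

def timed(cmd):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    local, remote, copy = "sample.bin", "/tmp/sample.bin", "sample_copy.bin"
    size_mb = 256
    with open(local, "wb") as f:
        f.write(os.urandom(size_mb * 1024 * 1024))     # generate test data

    up = timed(["hdfs", "dfs", "-put", "-f", local, remote])
    down = timed(["hdfs", "dfs", "-get", remote, copy])
    print(f"write: {size_mb / up:.1f} MB/s, read: {size_mb / down:.1f} MB/s")
```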