
50 research outputs found

Scientific Computing Meets Big Data Technology: An Astronomy Use Case

Author: Barbary Kyle
Franklin Michael J.
Nothaft Frank Austin
Patterson David A.
Perlmutter Saul
Sparks Evan
Zahn Oliver
Zhang Zhao
Publication date: 22/12/2015

Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark, a modern big data platform, to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves performance competitive with the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging Big Data platforms such as Apache Spark are a performant alternative for many-task scientific applications.

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Hybrid cloud and cluster computing paradigms for life science applications

Author: Adam Hughes
Bingjing Zhang
C Evangelinos
Chu
E Walker
G Fox
GC Fox
Geoffrey Fox
Hui Li
J Dean
J Ekanayake
J Ekanayake
J Lange
Jaliya Ekanayake
Jong Youl Choi
Judy Qiu
JW Sammon
Saliya Ekanayake
Seung-Hee Bae
SH Bae
T Gunarathne
Tak-Lon Wu
Thilina Gunarathne
X Qiu
Yang Ruan
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

Springer - Publisher Connector

PubMed Central

Parallel programming paradigms and frameworks in big data era

Author: Dobre Ciprian M.
Xhafa Fatos
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

With Cloud Computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. We have entered the Era of Big Data. The explosion and profusion of available data in a wide range of application domains raise new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically heterogeneous in nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and standard techniques. Our society is already data-rich, but the question remains whether or not we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges towards truly fourth-generation data-intensive science. Peer Reviewed. Postprint (author's final draft).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Big Data – A State-of-the-Art

Author: Felden Carsten
Pospiech Marco
Publication venue: AIS Electronic Library (AISeL)
Publication date: 30/07/2012

The term Big Data has an increased and tautological occurrence in scientific publications. It is of interest how and whether data provisioning is able to support enterprises in the handling of, and value creation from, this emerging issue. Considering the growing number of publications and the fuzzy nature of this term, an overview is needed to avoid duplication, to gain relevant findings, and to identify potential research gaps. To address this issue, a general literature review is conducted, which extrapolates and clusters the discussed research fields and potential gaps. It becomes apparent that a large part of the research is technically driven. Moreover, no identified paper addresses the research area of functional data provisioning. This initiates further investigations to discuss whether Big Data itself negates such an intention, or whether research has missed it and improvements regarding Big Data are possible.

AIS Electronic Library (AISeL)

Performance Evaluation of LINQ to HPC and Hadoop for Big Data

Author: Sivasubramaniam Ravishankar
Publication venue: UNF Digital Commons
Publication date: 01/01/2013

There is currently considerable enthusiasm around the MapReduce paradigm, and the distributed computing paradigm, for the analysis of large volumes of data. Apache Hadoop is the most popular open source implementation of the MapReduce model, and LINQ to HPC is Microsoft's alternative to the open source Hadoop. In this thesis, the performance of LINQ to HPC and Hadoop is compared using different benchmarks. To this end, we identified four benchmarks (Grep, Word Count, Read and Write) that we have run on LINQ to HPC as well as on Hadoop. For each benchmark, we measured each system’s performance metrics (Execution Time, Average CPU utilization and Average Memory utilization) for various degrees of parallelism on clusters of different sizes. Results revealed some interesting trade-offs. For example, LINQ to HPC performed better on three out of the four benchmarks (Grep, Read and Write), whereas Hadoop performed better on the Word Count benchmark. While more extensive research has focused on Hadoop, there are not many references to similar research on the LINQ to HPC platform, which was slowly evolving during the writing of this thesis.

UNF Digital Commons

Reify Your Collection Queries for Modularity and Speed!

Author: Eichberg Michael
Giarrusso Paolo G.
Kästner Christian
Mitschke Ralf
Ostermann Klaus
Rendel Tillmann
Publication venue: Association for Computing Machinery (ACM)
Publication date: 23/10/2012

Modularity and efficiency are often contradicting requirements, such that programmers have to trade one for the other. We analyze this dilemma in the context of programs operating on collections. Performance-critical code using collections often needs to be hand-optimized, leading to non-modular, brittle, and redundant code. In principle, this dilemma could be avoided by automatic collection-specific optimizations, such as fusion of collection traversals, usage of indexing, or reordering of filters. Unfortunately, it is not obvious how to encode such optimizations in terms of ordinary collection APIs, because the program operating on the collections is not reified and hence cannot be analyzed. We propose SQuOpt, the Scala Query Optimizer, a deep embedding of the Scala collections API that allows such analyses and optimizations to be defined and executed within Scala, without relying on external tools or compiler extensions. SQuOpt provides the same "look and feel" (syntax and static typing guarantees) as the standard collections API. We evaluate SQuOpt by re-implementing several code analyses of the Findbugs tool using SQuOpt, show average speedups of 12x with a maximum of 12800x, and hence demonstrate that SQuOpt can reconcile modularity and efficiency in real-world applications. Comment: 20 pages

arXiv.org e-Print Archive

TUbiblio

Crossref