193,346 research outputs found
File System Simulation: Hierarchical Performance Measurement and Modeling
File systems are very important components in a computer system. File system simulation can help to predict the performance of new system designs. It offers the advantages of the flexibility of modeling and the cost and time savings of utilizing simulation instead of full implementation. Being able to predict end-to-end file system performance against a pre-defined workload can help system designers to make decisions that could affect their entire product line, involving several million dollars of investment. This dissertation presents detailed simulation-based performance models of the Linux ext3 file system and the PVFS parallel file system. The models are developed using Colored Petri Nets. A performance study, using the models, shows that the obtained results are close to the expected behavior of the real file system. The model shows that file system parameters have significant impact on the performance of the I/O when compared to the parameters of the disk subsystem
Scientific Computing Meets Big Data Technology: An Astronomy Use Case
Scientific analyses commonly compose multiple single-process programs into a
dataflow. An end-to-end dataflow of single-process programs is known as a
many-task application. Typically, tools from the HPC software stack are used to
parallelize these analyses. In this work, we investigate an alternate approach
that uses Apache Spark -- a modern big data platform -- to parallelize
many-task applications. We present Kira, a flexible and distributed astronomy
image processing toolkit using Apache Spark. We then use the Kira toolkit to
implement a Source Extractor application for astronomy images, called Kira SE.
With Kira SE as the use case, we study the programming flexibility, dataflow
richness, scheduling capacity and performance of Apache Spark running on the
EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an
equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon
EC2 cloud. Furthermore, we show that by leveraging software originally designed
for big data infrastructure, Kira SE achieves competitive performance to the C
implementation running on the NERSC Edison supercomputer. Our experience with
Kira indicates that emerging Big Data platforms such as Apache Spark are a
performant alternative for many-task scientific applications
Challenging Ubiquitous Inverted Files
Stand-alone ranking systems based on highly optimized inverted file structures are generally considered ātheā solution for building search engines. Observing various developments in software and hardware, we argue however that IR research faces a complex engineering problem in the quest for more flexible yet efficient retrieval systems. We propose to base the development of retrieval systems on āthe database approachā: mapping high-level declarative specifications of the retrieval process into efficient query plans. We present the Mirror DBMS as a prototype implementation of a retrieval system based on this approach
Many-Task Computing and Blue Waters
This report discusses many-task computing (MTC) generically and in the
context of the proposed Blue Waters systems, which is planned to be the largest
NSF-funded supercomputer when it begins production use in 2012. The aim of this
report is to inform the BW project about MTC, including understanding aspects
of MTC applications that can be used to characterize the domain and
understanding the implications of these aspects to middleware and policies.
Many MTC applications do not neatly fit the stereotypes of high-performance
computing (HPC) or high-throughput computing (HTC) applications. Like HTC
applications, by definition MTC applications are structured as graphs of
discrete tasks, with explicit input and output dependencies forming the graph
edges. However, MTC applications have significant features that distinguish
them from typical HTC applications. In particular, different engineering
constraints for hardware and software must be met in order to support these
applications. HTC applications have traditionally run on platforms such as
grids and clusters, through either workflow systems or parallel programming
systems. MTC applications, in contrast, will often demand a short time to
solution, may be communication intensive or data intensive, and may comprise
very short tasks. Therefore, hardware and software for MTC must be engineered
to support the additional communication and I/O and must minimize task dispatch
overheads. The hardware of large-scale HPC systems, with its high degree of
parallelism and support for intensive communication, is well suited for MTC
applications. However, HPC systems often lack a dynamic resource-provisioning
feature, are not ideal for task communication via the file system, and have an
I/O system that is not optimized for MTC-style applications. Hence, additional
software support is likely to be required to gain full benefit from the HPC
hardware
A Logical Model and Data Placement Strategies for MEMS Storage Devices
MEMS storage devices are new non-volatile secondary storages that have
outstanding advantages over magnetic disks. MEMS storage devices, however, are
much different from magnetic disks in the structure and access characteristics.
They have thousands of heads called probe tips and provide the following two
major access facilities: (1) flexibility: freely selecting a set of probe tips
for accessing data, (2) parallelism: simultaneously reading and writing data
with the set of probe tips selected. Due to these characteristics, it is
nontrivial to find data placements that fully utilize the capability of MEMS
storage devices. In this paper, we propose a simple logical model called the
Region-Sector (RS) model that abstracts major characteristics affecting data
retrieval performance, such as flexibility and parallelism, from the physical
MEMS storage model. We also suggest heuristic data placement strategies based
on the RS model and derive new data placements for relational data and
two-dimensional spatial data by using those strategies. Experimental results
show that the proposed data placements improve the data retrieval performance
by up to 4.0 times for relational data and by up to 4.8 times for
two-dimensional spatial data of approximately 320 Mbytes compared with those of
existing data placements. Further, these improvements are expected to be more
marked as the database size grows.Comment: 37 page
: A Command-line Catalogue Cross-matching tool for modern astrophysical survey data
In the current data-driven science era, it is needed that data analysis
techniques has to quickly evolve to face with data whose dimensions has
increased up to the Petabyte scale. In particular, being modern astrophysics
based on multi-wavelength data organized into large catalogues, it is crucial
that the astronomical catalog cross-matching methods, strongly dependant from
the catalogues size, must ensure efficiency, reliability and scalability.
Furthermore, multi-band data are archived and reduced in different ways, so
that the resulting catalogues may differ each other in formats, resolution,
data structure, etc, thus requiring the highest generality of cross-matching
features. We present (Command-line Catalogue Cross-match), a
multi-platform application designed to efficiently cross-match massive
catalogues from modern surveys. Conceived as a stand-alone command-line process
or a module within generic data reduction/analysis pipeline, it provides the
maximum flexibility, in terms of portability, configuration, coordinates and
cross-matching types, ensuring high performance capabilities by using a
multi-core parallel processing paradigm and a sky partitioning algorithm.Comment: 6 pages, 4 figures, proceedings of the IAU-325 symposium on
Astroinformatics, Cambridge University pres
- ā¦