21st Century Simulation: Exploiting High Performance Computing and Data Analysis
This paper identifies, defines, and analyzes the limitations imposed on Modeling and Simulation by outmoded
paradigms in computer utilization and data analysis. The authors then discuss two emerging capabilities to
overcome these limitations: High Performance Parallel Computing and Advanced Data Analysis. First, parallel
computing, in supercomputers and Linux clusters, has proven effective by providing users an advantage in
computing power. This has been characterized as a ten-year lead over the use of single-processor computers.
Second, advanced data analysis techniques are both necessitated and enabled by this leap in computing power.
JFCOM's JESPP project is one of the few simulation initiatives to effectively embrace these concepts. The
challenges facing the defense analyst today have grown to include the need to consider operations among non-combatant
populations, to focus on impacts to civilian infrastructure, to differentiate combatants from non-combatants,
and to understand non-linear, asymmetric warfare. These requirements stretch both current
computational techniques and data analysis methodologies. In this paper, documented examples and potential
solutions will be advanced. The authors discuss the paths to successful implementation based on their experience.
Reviewed technologies include parallel computing, cluster computing, grid computing, data logging, OpsResearch,
database advances, data mining, evolutionary computing, genetic algorithms, and Monte Carlo sensitivity analyses.
The modeling and simulation community has significant potential to provide more opportunities for training and
analysis. Simulations must include increasingly sophisticated environments, better emulations of foes, and more
realistic civilian populations. Overcoming the implementation challenges will produce dramatically better insights
for trainees and analysts. High Performance Parallel Computing and Advanced Data Analysis promise increased
understanding of future vulnerabilities to help avoid unneeded mission failures and unacceptable personnel losses.
The authors set forth road maps for rapid prototyping and adoption of advanced capabilities. They discuss the
beneficial impact of embracing these technologies, as well as risk mitigation required to ensure success.
Towards Loosely-Coupled Programming on Petascale Systems
We have extended the Falkon lightweight task execution framework to make
loosely coupled programming on petascale systems a practical and useful
programming model. This work studies and measures the performance factors
involved in applying this approach to enable the use of petascale systems by a
broader user community, and with greater ease. Our work enables the execution
of highly parallel computations composed of loosely coupled serial jobs with no
modifications to the respective applications. This approach allows a new, and
potentially far larger, class of applications to leverage petascale systems,
such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O
performance encountered in making this model practical, and show results using
both microbenchmarks and real applications from two domains: economic energy
modeling and molecular dynamics. Our benchmarks show that we can scale up to
160K processor-cores with high efficiency, and can achieve sustained execution
rates of thousands of tasks per second.
Comment: IEEE/ACM International Conference for High Performance Computing,
Networking, Storage and Analysis (SuperComputing/SC) 200
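The loosely coupled model described in this abstract, many independent serial jobs dispatched without modifying the applications themselves, can be illustrated with a minimal sketch. This is not Falkon's API: `serial_job` and `run_loosely_coupled` are hypothetical names, and a local thread pool stands in for the distributed executors.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_job(task_id: int) -> int:
    """Hypothetical stand-in for one unmodified serial application run."""
    return task_id * task_id


def run_loosely_coupled(task_ids, workers=4):
    """Dispatch independent tasks to a worker pool.

    Tasks share no state and never communicate with each other,
    which is what makes the workload loosely coupled.
    Results are returned in the order of the input task IDs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(serial_job, task_ids))


results = run_loosely_coupled(range(8))
print(results)
```

Because the tasks are independent, throughput in such a model is bounded mainly by dispatch rate and I/O rather than by inter-task communication, which matches the abstract's focus on I/O performance.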
The state of SQL-on-Hadoop in the cloud
Managed Hadoop in the cloud, especially SQL-on-Hadoop, has been gaining attention recently. On Platform-as-a-Service (PaaS), analytical services like Hive and Spark come preconfigured for general-purpose use and ready to run, giving companies quick entry and on-demand deployment of ready-made SQL-like solutions for their big data needs. This study evaluates cloud services from an end-user perspective, comparing providers including Microsoft Azure, Amazon Web Services, Google Cloud,
and Rackspace. The study focuses on the performance, readiness, scalability, and cost-effectiveness of the different solutions at entry/test-level cluster sizes. Results are based on over 15,000 Hive queries derived from the industry-standard TPC-H benchmark.
The study is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has been recently extended to support SQL-on-Hadoop engines.
The ALOJA Project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization.
The study benchmarks cloud providers across a diverse range of instance types, and uses input data scales from 1GB to 1TB, in order to survey the popular entry-level PaaS SQL-on-Hadoop solutions, thereby establishing a common results base upon which subsequent research can be carried out by the project. Initial results already show the main performance trends for both hardware and software configuration, as well as the pricing, similarities, and architectural differences of the evaluated PaaS solutions. Whereas some
providers focus on decoupling storage and computing resources while offering network-based elastic storage, others choose to keep the local processing model from Hadoop for high performance, at the cost of reduced flexibility. Results also show the importance of application-level tuning and how keeping up-to-date hardware and software stacks can influence performance even more than replicating the on-premises model in the cloud.
This work is partially supported by the Microsoft Azure for Research program, the European Research Council (ERC) under
the EU's Horizon 2020 programme (GA 639595), the Spanish Ministry of Education (TIN2015-65316-P), and the Generalitat
de Catalunya (2014-SGR-1051).
Peer Reviewed. Postprint (author's final draft)
A Low Cost Two-Tier Architecture Model For High Availability Clusters Application Load Balancing
This article proposes the design and implementation of a low-cost two-tier
architecture model for high availability clusters, combined with load-balancing
and shared storage technology, to achieve the desired scale of a three-tier
architecture for application load balancing, e.g. for web servers. The research
work proposes a design that physically omits Network File System (NFS) server
nodes and implements NFS server functionality within the cluster nodes, through
Red Hat Cluster Suite (RHCS) with High Availability (HA) proxy load balancing
technologies. The proposed architecture is beneficial for achieving a low-cost
implementation in terms of investment in hardware and computing solutions. The
system is intended to provide steady service even when system components such
as the network, storage, or applications fail unexpectedly.
Comment: Load balancing, high availability cluster, web server cluster
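The combination of load balancing and high availability described in this abstract can be sketched as a round-robin balancer that skips nodes marked as failed. This is an illustrative toy under stated assumptions, not the RHCS or HA proxy configuration used by the authors; the class name, method names, and node names are all hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Rotate requests across nodes, skipping any node marked down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        """Record a failed health check for a node."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """Return a known node to service."""
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        """Return the next healthy node in rotation."""
        # One full pass over the ring is enough to find a healthy node.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


lb = RoundRobinBalancer(["node1", "node2", "node3"])
before_failure = [lb.pick() for _ in range(3)]
lb.mark_down("node2")  # simulate a component failure
after_failure = [lb.pick() for _ in range(4)]
print(before_failure, after_failure)
```

Service continues as long as at least one node stays healthy, which is the "steady service" property the abstract aims for.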
Advanced Message Routing for Scalable Distributed Simulations
The Joint Forces Command (JFCOM) Experimentation Directorate (J9)'s recent Joint Urban Operations (JUO)
experiments have demonstrated the viability of Forces Modeling and Simulation in a distributed environment. The
JSAF application suite, combined with the RTI-s communications system, provides the ability to run distributed
simulations with sites located across the United States, from Norfolk, Virginia to Maui, Hawaii. Interest-aware
routers are essential for communications in large, distributed environments, and the current RTI-s framework
provides such routers connected in a straightforward tree topology. This approach is successful for small to medium
sized simulations, but faces a number of significant limitations for very large simulations over high-latency, wide
area networks. In particular, traffic is forced through a single site, drastically increasing the distances messages must
travel to sites not near the top of the tree. Aggregate bandwidth is limited to the bandwidth of the site hosting the
top router, and failures in the upper levels of the router tree can result in widespread communications losses
throughout the system.
To resolve these issues, this work extends the RTI-s software router infrastructure to accommodate more
sophisticated, general router topologies, including both the existing tree framework and a new generalization of the
fully connected mesh topologies used in the SF Express ModSAF simulations of 100K fully interacting vehicles.
The new software router objects incorporate the scalable features of the SF Express design, while optionally using
low-level RTI-s objects to perform actual site-to-site communications. The (substantial) limitations of the original
mesh router formalism have been eliminated, allowing fully dynamic operations. The mesh topology capabilities
allow aggregate bandwidth and site-to-site latencies to match actual network performance. The heavy resource load at
the root node can now be distributed across routers at the participating sites.
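The routing difference between the tree and mesh topologies discussed in this abstract can be made concrete by counting router-to-router links. The sketch below is not the RTI-s router code: the function names and the example topology (a `root` router, a `site_c` router) are assumptions for illustration, and the mesh is assumed to be fully connected.

```python
def path_to_root(node, parent):
    """List the routers from `node` up to the top of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_hops(a, b, parent):
    """Count the links a message crosses between routers a and b in a tree.

    The message climbs from a to the lowest common ancestor,
    then descends to b.
    """
    depth_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in depth_from_a:
            return depth_from_a[n] + j
    raise ValueError("routers are not in the same tree")


def mesh_hops(a, b):
    """In a fully connected mesh every pair of routers is one link apart."""
    return 0 if a == b else 1


# Hypothetical topology: child router -> parent router.
parent = {"norfolk": "root", "maui": "root", "site_c": "norfolk"}

tree_cost = tree_hops("maui", "site_c", parent)
mesh_cost = mesh_hops("maui", "site_c")
print(tree_cost, mesh_cost)
```

In the tree, traffic between sites in different subtrees always crosses the top router, which is the bottleneck and single point of failure the abstract describes; in the mesh, each pair of sites communicates directly.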
Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.
Comment: 20 pages, 11 figures, appears in CCGrid, 201
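The suspend-and-resume behaviour that checkpoint-restart provides can be illustrated with a small sketch. Note the difference from the abstract: the paper describes a non-invasive mechanism inside the cloud, whereas this toy saves state at the application level purely to show the idea. The function `run_job`, its parameters, and the checkpoint file name are hypothetical.

```python
import os
import pickle
import tempfile


def run_job(total_steps, ckpt_path, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing after every step.

    If a checkpoint file exists, the job resumes from it.
    `stop_after` simulates the job being suspended after that many steps
    in this invocation; a suspended job returns None.
    """
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return None  # suspended; progress is already on disk
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        with open(ckpt_path, "wb") as f:
            pickle.dump(state, f)
    return state["acc"]


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run_job(10, ckpt, stop_after=4)  # suspended partway through
final = run_job(10, ckpt)                # resumes from the checkpoint
print(first, final)
```

The second call does not repeat the four completed steps, which is the property that lets an over-subscribed cloud swap out a long-running job and later resume it.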
A Monitoring System for the BaBar INFN Computing Cluster
Monitoring large clusters is a challenging problem. It is necessary to
observe a large quantity of devices with a reasonably short delay between
consecutive observations. The set of monitored devices may include PCs, network
switches, tape libraries and other equipment. The monitoring activity should
not impact the performance of the system. In this paper we present PerfMC, a
monitoring system for large clusters. PerfMC is driven by an XML configuration
file, and uses the Simple Network Management Protocol (SNMP) for data
collection. SNMP is a standard protocol implemented by many networked
devices, so the tool can be used to monitor a wide range of equipment. System
administrators can display information on the status of each device by
connecting to a WEB server embedded in PerfMC. The WEB server can produce
graphs showing the value of different monitored quantities as a function of
time; it can also produce arbitrary XML pages by applying XSL Transformations
to an internal XML representation of the cluster's status. XSL Transformations
may be used to produce HTML pages which can be displayed by ordinary WEB
browsers. PerfMC aims at being relatively easy to configure and operate, and
highly efficient. It is currently being used to monitor the Italian
Reprocessing farm for the BaBar experiment, which is made of about 200 dual-CPU
Linux machines.
Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics
(CHEP03), La Jolla, Ca, USA, March 2003, 10 pages, LaTeX, 4 eps figures. PSN
MOET00