63 research outputs found
Virtual Cluster Management for Analysis of Geographically Distributed and Immovable Data
Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2015
Scenarios exist in the era of Big Data where computational analysis needs to utilize widely distributed and remote compute clusters, especially when the data sources are sensitive or extremely large, and thus cannot be moved. A large dataset in Malaysia could be ecologically sensitive, for instance, and unable to be moved outside the country boundaries. Controlling an analysis experiment in this virtual cluster setting can be difficult on multiple levels: with setup and control, with managing behavior of the virtual cluster, and with interoperability issues across the compute clusters. Further, datasets can be distributed among clusters, or even across data centers, so that it becomes critical to utilize data locality information to optimize the performance of data-intensive jobs. Finally, datasets are increasingly sensitive and tied to certain administrative boundaries, though once the data has been processed, the aggregated or statistical result can be shared across the boundaries. This dissertation addresses management and control of a widely distributed virtual cluster having sensitive or otherwise immovable data sets through a controller. The Virtual Cluster Controller (VCC) gives control back to the researcher. It creates virtual clusters across multiple cloud platforms. In recognition of sensitive data, it can establish a single network overlay over widely distributed clusters. We define a novel class of data, notably immovable data that we call "pinned data", where the data is treated as a first-class citizen instead of being moved to where needed. We draw from our earlier work with a hierarchical data processing model, Hierarchical MapReduce (HMR), to process geographically distributed data, some of which are pinned data. The applications implemented in HMR use an extended MapReduce model where computations are expressed as three functions: Map, Reduce, and GlobalReduce.
Further, by facilitating information sharing among resources, applications, and data, the overall performance is improved. Experimental results show that the overhead of VCC is minimal. HMR outperforms the traditional MapReduce model when processing a particular class of applications. The evaluations also show that information sharing between resources and application through the VCC shortens the hierarchical data processing time, as well as satisfying the constraints on the pinned data.
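To make the three-function model concrete, here is a minimal single-process sketch of a word count expressed as Map, Reduce, and GlobalReduce. It is an illustration under stated assumptions, not the HMR implementation: the function names, the `clusters` dictionary, and the word-count workload are all hypothetical. The point it shows is that Map and Reduce run next to each cluster's local (possibly pinned) records, and only per-cluster aggregates reach the GlobalReduce step.

```python
from collections import defaultdict

def map_fn(record):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    """Reduce: runs inside one cluster, next to its local data."""
    return key, sum(values)

def global_reduce_fn(key, partials):
    """GlobalReduce: combines the per-cluster aggregates for one key."""
    return key, sum(partials)

def run_local_job(records):
    """Run Map and Reduce within a single cluster; only aggregates leave it."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

def run_hierarchical_job(clusters):
    """clusters: hypothetical mapping of cluster name -> local records."""
    partials = defaultdict(list)
    for records in clusters.values():
        # Raw records never cross the cluster boundary, only local results do.
        for key, value in run_local_job(records).items():
            partials[key].append(value)
    return dict(global_reduce_fn(k, v) for k, v in partials.items())

# Two hypothetical clusters, each holding data that stays where it is.
clusters = {
    "cluster_a": ["big data big clusters"],
    "cluster_b": ["pinned data stays local", "big results move"],
}
totals = run_hierarchical_job(clusters)
```

In this sketch `totals` holds the global word counts, while each cluster only ever exposes its own reduced counts, which mirrors the idea that aggregated results may cross administrative boundaries even when the raw data may not.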
GraphMP: An Efficient Semi-External-Memory Big Graph Processing System on a Single Machine
Recent studies showed that single-machine graph processing systems can be as
competitive as cluster-based approaches on large-scale problems. While
several out-of-core graph processing systems and computation models have been
proposed, the high disk I/O overhead could significantly reduce performance in
many practical cases. In this paper, we propose GraphMP to tackle big graph
analytics on a single machine. GraphMP achieves low disk I/O overhead with
three techniques. First, we design a vertex-centric sliding window (VSW)
computation model to avoid reading and writing vertices on disk. Second, we
propose a selective scheduling method to skip loading and processing
unnecessary edge shards on disk. Third, we use a compressed edge cache
mechanism to fully utilize the available memory of a machine to reduce the
amount of disk accesses for edges. Extensive evaluations have shown that
GraphMP could outperform state-of-the-art systems such as GraphChi, X-Stream
and GridGraph by 31.6x, 54.5x and 23.1x respectively, when running popular
graph applications on a billion-vertex graph.
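The first two techniques can be sketched in a few lines. The following is a toy in-memory illustration, not GraphMP's code: it assumes edges are grouped into shards by destination-vertex interval, that each shard keeps a plain set of its source vertices, and it uses breadth-first levels as the example application. All names (`build_shards`, `bfs_levels`, `shards_loaded`) are hypothetical. Vertex values stay in an in-memory list for the whole run, and a shard is skipped when none of its source vertices changed in the previous iteration.

```python
import math

def build_shards(edges, num_vertices, shard_size):
    """Group edges by destination interval; remember each shard's source set."""
    num_shards = math.ceil(num_vertices / shard_size)
    shards = [{"edges": [], "sources": set()} for _ in range(num_shards)]
    for src, dst in edges:
        shard = shards[dst // shard_size]
        shard["edges"].append((src, dst))
        shard["sources"].add(src)
    return shards

def bfs_levels(shards, num_vertices, root):
    """Iterate over edge shards while all vertex values stay in memory."""
    level = [math.inf] * num_vertices
    level[root] = 0
    active = {root}      # vertices whose value changed in the last iteration
    shards_loaded = 0    # stands in for the number of shard reads from disk
    while active:
        next_active = set()
        new_level = list(level)
        for shard in shards:
            # Selective scheduling: skip shards with no active source vertex.
            if shard["sources"].isdisjoint(active):
                continue
            shards_loaded += 1
            for src, dst in shard["edges"]:
                if src in active and level[src] + 1 < new_level[dst]:
                    new_level[dst] = level[src] + 1
                    next_active.add(dst)
        level, active = new_level, next_active
    return level, shards_loaded

edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
shards = build_shards(edges, num_vertices=4, shard_size=2)
levels, loaded = bfs_levels(shards, num_vertices=4, root=0)
```

On this small graph the loop runs three iterations over two shards but touches only three shards in total, because shards whose sources are all inactive are never opened. A real system would read shards from disk and would need a compact way to test shard membership; the plain Python set here is only a stand-in.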
A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems
Memory disaggregation has recently been adopted in data centers to improve
resource utilization, motivated by cost and sustainability. Recent studies on
large-scale HPC facilities have also highlighted memory underutilization. A
promising and non-disruptive option for memory disaggregation is rack-scale
memory pooling, where shared memory pools supplement node-local memory. This
work outlines the prospects and requirements for adoption and clarifies several
misconceptions. We propose a quantitative method for dissecting application
requirements on the memory system from the top down in three levels, moving
from general, to multi-tier memory systems, and then to memory pooling. We
provide a multi-level profiling tool and LBench to facilitate the quantitative
approach. We evaluate a set of representative HPC workloads on an emulated
platform. Our results show that prefetching activities can significantly
influence memory traffic profiles. Interference in memory pooling has varied
impacts on applications, depending on their access ratios to memory tiers and
arithmetic intensities. Finally, in two case studies, we show the benefits of
our findings at the application and system levels, achieving 50% reduction in
remote access and 13% speedup in BFS, and reducing performance variation of
co-located workloads in interference-aware job scheduling.
Comment: Accepted to SC23 (The International Conference for High Performance
Computing, Networking, Storage, and Analysis 2023)
Argobots: A Lightweight Low-Level Threading and Tasking Framework
In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, either are too specific to applications or architectures or are not as powerful or flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with providing a rich set of controls to allow specialization by end users or high-level programming models. We describe the design, implementation, and performance characterization of Argobots and present integrations with three high-level models: OpenMP, MPI, and colocated I/O services. Evaluations show that (1) Argobots, while providing richer capabilities, is competitive with existing simpler generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency-hiding capabilities; and (4) I/O services with Argobots reduce interference with colocated applications while achieving performance competitive with that of a Pthreads approach
Survey of End-to-End Mobile Network Measurement Testbeds, Tools, and Services
Mobile (cellular) networks enable innovation, but can also stifle it and lead
to user frustration when network performance falls below expectations. As
mobile networks become the predominant method of Internet access, developer,
research, network operator, and regulatory communities have taken an increased
interest in measuring end-to-end mobile network performance to, among other
goals, minimize negative impact on application responsiveness. In this survey
we examine current approaches to end-to-end mobile network performance
measurement, diagnosis, and application prototyping. We compare available tools
and their shortcomings with respect to the needs of researchers, developers,
regulators, and the public. We intend for this survey to provide a
comprehensive view of currently active efforts and some auspicious directions
for future work in mobile network measurement and mobile application
performance evaluation.
Comment: Submitted to IEEE Communications Surveys and Tutorials. arXiv does
not format the URL references correctly. For a correctly formatted version of
this paper go to
http://www.cs.montana.edu/mwittie/publications/Goel14Survey.pd
Programming and parallelising applications for distributed infrastructures
The last decade has witnessed unprecedented changes in parallel and distributed infrastructures. Due to the diminished gains in processor performance from increasing clock frequency, manufacturers have moved from uniprocessor architectures to multicores; as a result, clusters of computers have incorporated such new CPU designs. Furthermore, the ever-growing need of scientific applications for computing and storage capabilities has motivated the appearance of grids: geographically-distributed, multi-domain infrastructures based on sharing of resources to accomplish large and complex tasks. More recently, clouds have emerged by combining virtualisation technologies, service-orientation and business models to deliver IT resources on demand over the Internet.
The size and complexity of these new infrastructures pose a challenge for programmers to exploit them. On the one hand, some of the difficulties are inherent to concurrent and distributed programming themselves, e.g. dealing with thread creation and synchronisation, messaging, data partitioning and transfer, etc. On the other hand, other issues are related to the singularities of each scenario, like the heterogeneity of Grid middleware and resources or the risk of vendor lock-in when writing an application for a particular Cloud provider.
In the face of such a challenge, programming productivity - understood as a tradeoff between programmability and performance - has become crucial for software developers. There is a strong need for high-productivity programming models and languages, which should provide simple means for writing parallel and distributed applications that can run on current infrastructures without sacrificing performance.
In that sense, this thesis contributes with Java StarSs, a programming model and runtime system for developing and parallelising Java applications on distributed infrastructures. The model has two key features: first, the user programs in a fully-sequential standard-Java fashion - no parallel construct, API call or pragma must be included in the application code; second, it is completely infrastructure-unaware, i.e. programs do not contain any details about deployment or resource management, so that the same application can run in different infrastructures with no changes. The only requirement for the user is to select the application tasks, which are the model's unit of parallelism. Tasks can be either regular Java methods or web service operations, and they can handle any data type supported by the Java language, namely files, objects, arrays and primitives. For the sake of simplicity of the model, Java StarSs shifts the burden of parallelisation from the programmer to the runtime system. The runtime is responsible for modifying the original application to make it create asynchronous tasks and synchronise data accesses from the main program. Moreover, the implicit inter-task concurrency is automatically found as the application executes, thanks to a data dependency detection mechanism that integrates all the Java data types.
This thesis provides a fairly comprehensive evaluation of Java StarSs on three different distributed scenarios: Grid, Cluster and Cloud. For each of them, a runtime system was designed and implemented to exploit their particular characteristics as well as to address their issues, while keeping the infrastructure unawareness of the programming model. The evaluation compares Java StarSs against state-of-the-art solutions, both in terms of programmability and performance, and demonstrates how the model can bring remarkable productivity to programmers of parallel distributed applications.
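The idea of finding inter-task concurrency from data accesses can be illustrated with a small, generic dependency tracker. This is a sketch only, written in Python rather than Java, and it is not the Java StarSs runtime: the `Task` and `DependencyTracker` classes, the `submit` method, and the explicit `reads`/`writes` declarations are assumptions made for the example. As tasks are submitted in program order, the tracker records who last wrote and who has read each datum, and from that derives which earlier tasks a new task must wait for.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    reads: tuple = ()
    writes: tuple = ()
    deps: set = field(default_factory=set)   # names of tasks to wait for

class DependencyTracker:
    """Derives task dependencies from declared data accesses (hypothetical)."""

    def __init__(self):
        self.last_writer = {}   # datum -> name of the task that last wrote it
        self.readers = {}       # datum -> tasks that read it since that write
        self.tasks = []

    def submit(self, name, reads=(), writes=()):
        task = Task(name, tuple(reads), tuple(writes))
        for datum in task.reads:
            if datum in self.last_writer:                   # read-after-write
                task.deps.add(self.last_writer[datum])
        for datum in task.writes:
            if datum in self.last_writer:                   # write-after-write
                task.deps.add(self.last_writer[datum])
            task.deps.update(self.readers.get(datum, ()))   # write-after-read
        # Record this task's accesses for tasks submitted later.
        for datum in task.reads:
            self.readers.setdefault(datum, set()).add(name)
        for datum in task.writes:
            self.last_writer[datum] = name
            self.readers[datum] = set()
        task.deps.discard(name)
        self.tasks.append(task)
        return task

tracker = DependencyTracker()
load = tracker.submit("load", writes=["a"])
scale = tracker.submit("scale", reads=["a"], writes=["b"])
stats = tracker.submit("stats", reads=["a"], writes=["c"])
merge = tracker.submit("merge", reads=["b", "c"], writes=["a"])
```

Here `scale` and `stats` depend only on `load`, so a scheduler could run them concurrently, while `merge` has to wait for all three earlier tasks. In the model described above the programmer writes sequential code and the runtime discovers such accesses itself; the explicit `reads` and `writes` arguments are a simplification for the sketch.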