On the Design and Analysis of Parallel and Distributed Algorithms
The arrival of multicore systems has created a new scenario in computing: parallel and distributed algorithms are rapidly replacing older sequential algorithms, bringing many new challenges with these techniques. Distributed algorithms provide distributed processing over distributed file systems and processing units, with the network modeled as a minimum-cost spanning tree. Parallel processing, on the other hand, involves choices among language platforms, data-parallel versus general parallel programming models, and GPUs. Processing units, memory elements, and storage are connected through dynamic distributed networks in the form of spanning trees. The article presents foundational algorithms, their analysis, and efficiency considerations.
Comment: 9 pages
Building Distributed Systems with Non-Volatile Main Memories and RDMA Networks
High-performance, byte-addressable non-volatile main memories (NVMMs) allow application developers to combine storage and memory into a single layer. These high-performance storage systems would be especially useful in large-scale data center environments where data is distributed and replicated across multiple servers. Unfortunately, existing approaches to providing remote storage access rest on the assumption that storage is slow, so the cost of the software and protocols is acceptable. That assumption no longer holds for fast NVMMs. As a result, taking full advantage of NVMMs' potential requires changes to system software and networking protocols. This thesis focuses on accessing remote NVMM efficiently over remote direct memory access (RDMA) networks. RDMA enables a client to directly access memory on a remote machine without involving that machine's CPU.

This thesis first presents Mojim, a system that provides replicated, reliable, and highly available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Our evaluation shows that Mojim adds little overhead to the un-replicated system and provides 0.4x to 2.7x the throughput of the un-replicated system.

This thesis then presents Orion, a distributed file system designed for NVMM and RDMA networks. Traditional distributed file systems are designed for slower hard drives, and these slower media incentivize complex optimizations (e.g., queuing, striping, and batching) around disk accesses. Orion instead combines file system functions and network operations into a single layer. It provides low-latency metadata access and outperforms existing distributed file systems by a large margin.

Finally, an NVMM application can map files backed by an NVMM file system into its address space and access them using CPU instructions. In this case, RDMA and NVMM file systems duplicate effort on permissions, naming, and address translation. We introduce two changes to the existing RDMA protocol: the file memory region (FileMR) and range-based address translation. By eliminating redundant translations, FileMR minimizes the number of translations done at the NIC, reducing the load on the NIC's translation cache and improving application performance by 1.8x to 2.0x.
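Mojim's programming model, as the abstract describes it, is plain loads and stores into mapped NVMM plus an explicit call that controls when updates propagate to replicas. A minimal sketch of that access pattern, using an ordinary memory-mapped file as a stand-in for an NVMM region (all names here are illustrative, not Mojim's actual API; `flush()` stands in for the propagation system call):

```python
import mmap
import os
import tempfile

# Back the "NVMM region" with an ordinary 4 KiB file for illustration.
path = os.path.join(tempfile.mkdtemp(), "region")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)  # stand-in for mapping replicated NVMM
    mm[0:5] = b"hello"                # updates via normal store instructions
    mm.flush()                        # stand-in for "propagate to replicas now"
    mm.close()

print(open(path, "rb").read(5))       # the flushed update is durable
```

The point of the sketch is the division of labor: the data path is raw memory access with no per-operation software cost, and only the consistency/replication decision goes through an explicit call.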
Behrooz File System (BFS)
In this thesis, the Behrooz File System (BFS) is presented, an in-memory distributed file system. BFS has a simple design that combines the best of in-memory and remote file systems. BFS stores data in the main memory of commodity servers and provides a shared, unified file system view over them, and it utilizes backend storage for persistence and availability. Unlike most existing distributed in-memory storage systems, BFS supports a general-purpose POSIX-like file interface. BFS is built by grouping multiple servers' memory together; therefore, when applications and BFS servers are co-located, BFS is highly efficient because this architecture minimizes inter-node communication. This pattern is common in distributed computing environments and data analytics applications. A set of microbenchmarks and the SPEC SFS 2014 benchmark are used to evaluate different aspects of BFS, such as throughput, reliability, and scalability. The evaluation results indicate that the simple design of BFS delivers the expected performance, while certain workloads reveal limitations of BFS in handling a large number of files. Addressing these limitations, as well as other potential improvements, is left as future work.
SharkGraph: A Time Series Distributed Graph System
Current graph systems can easily process billions of edges, but once the scale exceeds a hundred billion, performance degrades dramatically. Time-series graph data tends to be very large, so computation on time-series graphs remains challenging. This work introduces SharkGraph, a time-series graph system built on a distributed file system (DFS) that uses a novel storage structure, the Time Series Graph Data File (TGF). By iterating graph computation over a file stream, SharkGraph can execute batch graph queries, simulation, data mining, and clustering algorithms on industry graphs with more than a hundred billion edges. Well-defined experiments show that SharkGraph performs well on large-scale graph processing, supports time traversal over graphs, and can recover state at any position in the timeline. By repeating experiments reported for existing distributed systems such as GraphX, we demonstrate that SharkGraph easily handles hundreds of billions of edges, whereas GraphX runs into problems such as memory exhaustion and skewed distributions during graph traversal. Compared with other graph systems, SharkGraph uses less memory and processes the same graphs more efficiently.
Comment: 7 pages, 7 figures, 1 algorithm
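The core idea the abstract describes is streaming computation over an on-disk edge file rather than materializing the whole graph in memory. The TGF format itself is not detailed here, so the sketch below uses a plain edge-list stream as a stand-in: memory use is bounded by per-vertex state, not by edge count.

```python
import io
from collections import Counter

def out_degrees(edge_stream):
    """One pass over a stream of 'src dst' lines; never loads all edges.

    Memory grows with the number of distinct vertices seen, not the
    number of edges, which is the property that makes stream iteration
    viable at hundred-billion-edge scale.
    """
    degrees = Counter()
    for line in edge_stream:
        src, _dst = line.split()
        degrees[src] += 1
    return degrees

# Stand-in for reading a TGF file stream from a DFS.
edges = io.StringIO("a b\na c\nb c\n")
print(out_degrees(edges))
```

Real systems layer iterative computation (and, per the abstract, time traversal) on top of such passes; this only illustrates the single-pass streaming shape.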
Lazy Evaluation and Nondeterminism Make Backus' FP-systems More Practical
Backus' FP-systems are made more practical by introducing lazy evaluation and nondeterminism into them. This is done in the framework of a concrete programming language called FP*. On the one hand, this language is almost as mathematical as FP-systems are. On the other hand, it makes it possible to manage secondary memory and to develop applications such as interactive and distributed file systems. Experimental versions of a compiler and an interpreter for the FP* language have been implemented.
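FP* itself is not shown in the abstract, but the lazy-evaluation idea it adds to FP-systems can be illustrated generically: values of a sequence are produced only when demanded, which is what lets a functional language work with streams larger than memory (e.g., data on secondary storage). A sketch using Python generators as the stand-in for lazy sequences:

```python
def naturals():
    """An unbounded lazy sequence: each value exists only when demanded."""
    n = 0
    while True:
        yield n
        n += 1

def take(k, stream):
    """Force exactly the first k elements of a lazy stream."""
    return [next(stream) for _ in range(k)]

# Without laziness, 'all naturals' could not be a value at all.
print(take(5, naturals()))
```

The nondeterminism mentioned in the abstract (e.g., merging streams in arrival order) is a separate extension and is not sketched here.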
Mapping Datasets to Object Storage System
Access libraries such as ROOT and HDF5 allow users to interact with datasets
using high level abstractions, like coordinate systems and associated slicing
operations. Unfortunately, the implementations of access libraries are based on
outdated assumptions about storage systems interfaces and are generally unable
to fully benefit from modern fast storage devices. The situation is getting
worse with rapidly evolving storage devices such as non-volatile memory and
ever larger datasets. This project explores distributed dataset mapping
infrastructures that can integrate and scale out existing access libraries
using Ceph's extensible object model, avoiding re-implementation or even
modifications of these access libraries as much as possible. These programmable
storage extensions coupled with our distributed dataset mapping techniques
enable: 1) access library operations to be offloaded to storage system servers,
2) the independent evolution of access libraries and storage systems, and 3)
full leveraging of the existing load balancing, elasticity, and failure
management of distributed storage systems like Ceph. They also create more
opportunities for storage-server-local optimizations. For example,
storage servers might include local key/value stores
combined with chunk stores that require different optimizations than a local
file system. As storage servers evolve to support new storage devices like
non-volatile memory, these server-local optimizations can be implemented while
minimizing disruptions to applications. We will report progress on the means by
which distributed dataset mapping can be abstracted over particular access
libraries, including access libraries for ROOT data, and how we address some of
the challenges revolving around data partitioning and composability of access
operations.
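The offloading idea in point 1) above can be pictured as a slicing operation that executes next to the stored object, so only the selected coordinates travel back to the client. The sketch below is purely illustrative; real Ceph programmable-storage extensions are object classes written against Ceph's cls interface, and none of the names here come from that API:

```python
def read_slice_serverside(stored_chunk, start, stop):
    """Pretend this runs inside the storage server, close to the data.

    Offloading means the full chunk never crosses the network; the
    client receives only the requested coordinate range.
    """
    return stored_chunk[start:stop]

chunk = list(range(100))                      # stand-in for one dataset object
result = read_slice_serverside(chunk, 10, 15)
print(result)
```

The payoff is proportional to selectivity: a request for 5 of 100 elements moves 5% of the data, and the access library above it need not change.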
Approximation and Compression Techniques to Enhance Performance of Graphics Processing Units
A key challenge in modern computing systems is accessing data fast enough to fully utilize the computing elements on the chip. In Graphics Processing Units (GPUs), performance is often constrained by register file size, memory bandwidth, and the capacity of the main memory. One important technique for alleviating this challenge is data compression: by reducing the amount of data that needs to be communicated or stored, memory resources crucial for performance can be used efficiently.

This thesis provides a set of approximation and compression techniques for GPUs, with the goal of efficiently utilizing the computational fabric and thereby increasing performance. The thesis shows that these techniques can substantially lower the amount of information the system has to process, and they are thus important tools for meeting challenges in memory utilization.

This thesis makes contributions within three areas: controlled floating-point precision reduction, lossless and lossy memory compression, and distributed training of neural networks. In the first area, the thesis shows that through automated and controlled floating-point approximation, the register file can be used more efficiently. This is achieved through a framework that establishes a cross-layer connection between the application and the microarchitecture layer, and a novel register file organization capable of leveraging low-precision floating-point values and narrow integers for increased capacity and performance.

Within the area of compression, this thesis aims to increase the effective bandwidth of GPUs by presenting a lossless and lossy memory compression algorithm that reduces the amount of transferred data. In contrast to state-of-the-art compression techniques such as Base-Delta-Immediate and Bitplane Compression, which use intra-block bases for compression, the proposed algorithm leverages multiple global base values to reach a higher compression ratio. The algorithm includes an optional approximation step for floating-point values which offers a higher compression ratio at a given, low error rate.

Finally, within the area of distributed training of neural networks, this thesis proposes a subgraph approximation scheme for graph data which mitigates accuracy loss in a distributed setting. The scheme allows neural network models that use graphs as inputs to converge at single-machine accuracy, while minimizing synchronization overhead between the machines.
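The contrast the abstract draws is between per-block bases (as in Base-Delta-Immediate) and a small set of global base values shared across blocks. A hedged sketch of the global-base idea follows; the thesis's actual encoding, base selection, and hardware realization differ, and the thresholds here are arbitrary:

```python
def compress(values, bases, max_delta=127):
    """Encode each value as (base index, narrow delta) when some global
    base is close enough; otherwise keep it uncompressed."""
    out = []
    for v in values:
        for i, b in enumerate(bases):
            if -max_delta <= v - b <= max_delta:
                out.append(("delta", i, v - b))  # base index + small delta
                break
        else:
            out.append(("raw", v))               # no global base is close enough
    return out

def decompress(encoded, bases):
    out = []
    for entry in encoded:
        if entry[0] == "delta":
            _, i, d = entry
            out.append(bases[i] + d)
        else:
            out.append(entry[1])
    return out

values = [1000, 1005, 5002, 90000]
bases = [1000, 5000]                              # shared "global" bases
encoded = compress(values, bases)
assert decompress(encoded, bases) == values       # lossless round trip
```

Because the bases are shared globally rather than stored per block, more of each compressed block is deltas, which is where the higher compression ratio claimed in the abstract comes from; the optional lossy step for floats (not sketched) would additionally tolerate small reconstruction error.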