5,492 research outputs found
Process-Oriented Parallel Programming with an Application to Data-Intensive Computing
We introduce process-oriented programming as a natural extension of
object-oriented programming for parallel computing. It is based on the
observation that every class of an object-oriented language can be instantiated
as a process, accessible via a remote pointer. The introduction of process
pointers requires no syntax extension, identifies processes with programming
objects, and enables processes to exchange information simply by executing
remote methods. Process-oriented programming is a high-level language
alternative to multithreading, MPI and many other languages, environments and
tools currently used for parallel computations. It implements natural
object-based parallelism using only minimal syntax extension of existing
languages, such as C++ and Python, and has therefore the potential to lead to
widespread adoption of parallel programming. We implemented a prototype system
for running processes using C++ with MPI and used it to compute a large
three-dimensional Fourier transform on a computer cluster built of commodity
hardware components. Three-dimensional Fourier transform is a prototype of a
data-intensive application with a complex data-access pattern. The
process-oriented code is only a few hundred lines long, and attains very high
data throughput by achieving massive parallelism and maximizing hardware
utilization.Comment: 20 pages, 1 figur
CloudJet4BigData: Streamlining Big Data via an Accelerated Socket Interface
Big data needs to feed users with fresh processing results and cloud platforms can be used to speed up big data applications. This paper describes a new data communication protocol (CloudJet) for long distance and large volume big data accessing operations to alleviate the large latencies encountered in sharing big data resources in the clouds. It encapsulates a dynamic multi-stream/multi-path engine at the socket level, which conforms to Portable Operating System Interface (POSIX) and thereby can accelerate any POSIX-compatible applications across IP based networks. It was demonstrated that CloudJet accelerates typical big data applications such as very large database (VLDB), data mining, media streaming and office applications by up to tenfold in real-world tests
LogBase: A Scalable Log-structured Database System in the Cloud
Numerous applications such as financial transactions (e.g., stock trading)
are write-heavy in nature. The shift from reads to writes in web applications
has also been accelerating in recent years. Write-ahead-logging is a common
approach for providing recovery capability while improving performance in most
storage systems. However, the separation of log and application data incurs
write overheads observed in write-heavy environments and hence adversely
affects the write throughput and recovery time in the system. In this paper, we
introduce LogBase - a scalable log-structured database system that adopts
log-only storage for removing the write bottleneck and supporting fast system
recovery. LogBase is designed to be dynamically deployed on commodity clusters
to take advantage of elastic scaling property of cloud environments. LogBase
provides in-memory multiversion indexes for supporting efficient access to data
maintained in the log. LogBase also supports transactions that bundle read and
write operations spanning across multiple records. We implemented the proposed
system and compared it with HBase and a disk-based log-structured
record-oriented system modeled after RAMCloud. The experimental results show
that LogBase is able to provide sustained write throughput, efficient data
access out of the cache, and effective system recovery.Comment: VLDB201
Performance comparison of clustered and replicated information retrieval systems
The amount of information available over the Internet is increasing daily as well as the importance and magnitude of Web search engines. Systems based on a single centralised index present several problems (such as lack of scalability), which lead to the use of distributed information retrieval systems to effectively search for and locate the required information. A distributed retrieval system can be clustered and/or replicated. In this paper, using simulations, we present a detailed performance analysis, both in terms of throughput and response time, of a clustered system compared to a replicated system. In addition, we consider the effect of changes in the query topics over time. We show that the performance obtained for a clustered system does not improve the performance obtained by the best replicated system. Indeed, the main advantage of a clustered system is the reduction of network traffic. However, the use of a switched network eliminates the bottleneck in the network, markedly improving the performance of the replicated systems. Moreover, we illustrate the negative performance effect of the changes over time in the query topics when a distributed clustered system is used. On the contrary, the performance of a distributed replicated system is query independent
- …