4,213 research outputs found
A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. Finally, we hope that the proposed taxonomy and
mapping also helps to provide an easy way for new practitioners to understand
this complex area of research.Comment: 46 pages, 16 figures, Technical Repor
LogBase: A Scalable Log-structured Database System in the Cloud
Numerous applications such as financial transactions (e.g., stock trading)
are write-heavy in nature. The shift from reads to writes in web applications
has also been accelerating in recent years. Write-ahead-logging is a common
approach for providing recovery capability while improving performance in most
storage systems. However, the separation of log and application data incurs
write overheads observed in write-heavy environments and hence adversely
affects the write throughput and recovery time in the system. In this paper, we
introduce LogBase - a scalable log-structured database system that adopts
log-only storage for removing the write bottleneck and supporting fast system
recovery. LogBase is designed to be dynamically deployed on commodity clusters
to take advantage of elastic scaling property of cloud environments. LogBase
provides in-memory multiversion indexes for supporting efficient access to data
maintained in the log. LogBase also supports transactions that bundle read and
write operations spanning across multiple records. We implemented the proposed
system and compared it with HBase and a disk-based log-structured
record-oriented system modeled after RAMCloud. The experimental results show
that LogBase is able to provide sustained write throughput, efficient data
access out of the cache, and effective system recovery.Comment: VLDB201
Algorithms for Extracting Frequent Episodes in the Process of Temporal Data Mining
An important aspect in the data mining process is the discovery of patterns having a great influence on the studied problem. The purpose of this paper is to study the frequent episodes data mining through the use of parallel pattern discovery algorithms. Parallel pattern discovery algorithms offer better performance and scalability, so they are of a great interest for the data mining research community. In the following, there will be highlighted some parallel and distributed frequent pattern mining algorithms on various platforms and it will also be presented a comparative study of their main features. The study takes into account the new possibilities that arise along with the emerging novel Compute Unified Device Architecture from the latest generation of graphics processing units. Based on their high performance, low cost and the increasing number of features offered, GPU processors are viable solutions for an optimal implementation of frequent pattern mining algorithmsFrequent Pattern Mining, Parallel Computing, Dynamic Load Balancing, Temporal Data Mining, CUDA, GPU, Fermi, Thread
On the use of NAND flash memory in high-performance relational databases
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.Includes bibliographical references (p. 47-49).High-density NAND flash storage has become relatively inexpensive due to the popularity of various consumer electronics. Recently, several manufacturers have released IDE-compatible NAND flash-based drives in sizes up to 64 GB at reasonable (sub-$1000) prices. Because flash is significantly more durable than mechanical hard drives and requires considerably less energy, there is some speculation that large data centers will adopt these devices. As database workloads make up a substantial fraction of the processing done by data centers, it is interesting to ask how switching to flash-based storage will affect the performance of database systems. We evaluate this question using IDE-based flash drives from two major manufacturers. We measure their read and write performance and find that flash has excellent random read performance, acceptable sequential read performance, and quite poor write performance compared to conventional IDE disks. We then consider how standard database algorithms are affected by these performance characteristics and find that the fast random read capability dramatically improves the performance of secondary indexes and index-based join algorithms. We next investigate using logstructured filesystems to mitigate the poor write performance of flash and find an 8.2x improvement in random write performance, but at the cost of a 3.7x decrease in random read performance. Finally, we study techniques for exploiting the inherent parallelism of multiple-chip flash devices, and we find that adaptive coding strategies can yield a 2x performance improvement over static ones. We conclude that in many cases flash disk performance is still worse than on traditional drives and that current flash technology may not yet be mature enough for widespread database adoption if performance is a dominant factor. Finally, we briefly speculate how this landscape may change based on expected performance of next-generation flash memories.by Daniel Myers.S.M
- …