The End of Slow Networks: It's Time for a Redesign
Next generation high-performance RDMA-capable networks will require a
fundamental rethinking of the design and architecture of modern distributed
DBMSs. These systems are commonly designed and optimized under the assumption
that the network is the bottleneck: the network is slow and "thin", and thus
needs to be avoided as much as possible. Yet this assumption no longer holds
true. With InfiniBand FDR 4x, the bandwidth available to transfer data across
the network is in the same ballpark as the bandwidth of one memory channel, and
it increases even further with the most recent EDR standard. Moreover, with
continuing advances in RDMA, latency is improving at a similar pace. In this
paper, we first argue that the "old" distributed database design is not capable
of taking full advantage of the network. Second, we propose architectural
redesigns for OLTP, OLAP and advanced analytical frameworks to take better
advantage of the improved bandwidth, latency and RDMA capabilities. Finally,
for each of the workload categories, we show that remarkable performance
improvements can be achieved.
Opportunistic linked data querying through approximate membership metadata
Between URI dereferencing and the SPARQL protocol lies a largely unexplored axis of possible interfaces to Linked Data, each with its own combination of trade-offs. One of these interfaces is Triple Pattern Fragments, which allows clients to execute SPARQL queries against low-cost servers, at the cost of higher bandwidth. Increasing a client's efficiency means lowering the number of requests, which can be achieved, among other ways, through additional metadata in responses. We noted that typical SPARQL query evaluations against Triple Pattern Fragments involve a significant share of membership subqueries, which check for the presence of a specific triple rather than a variable pattern. This paper studies the impact of providing approximate membership functions, i.e., Bloom filters and Golomb-coded sets, as extra metadata. In addition to reducing HTTP requests, such functions make it possible to achieve full result recall earlier when lower precision is temporarily allowed. Half of the tested queries from a WatDiv benchmark test set could be executed with up to a third fewer HTTP requests with only marginally higher server cost. Query times, however, did not improve, likely due to slower metadata generation and transfer. This indicates that approximate membership functions can partly improve the client-side query process with minimal impact on the server and its interface.
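To make the idea of an approximate membership function concrete, the following is a minimal Bloom filter sketch in Python. It is an illustration only, not the implementation evaluated in the paper: the class name `BloomFilter`, the SHA-256 based double hashing, the 1% target false-positive rate, and the string form of the triples are all assumptions made for this example. Golomb-coded sets are not shown.

```python
import hashlib
import math


class BloomFilter:
    """Approximate set membership: no false negatives, tunable false positives."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical triples, serialized as plain strings for the example.
triples = [f"<s{i}> <p> <o{i}>" for i in range(100)]
bloom = BloomFilter(expected_items=len(triples))
for triple in triples:
    bloom.add(triple)

# A client holding the filter can skip a request when the answer is "absent".
needs_request = "<s5> <p> <o5>" in bloom
```

A negative answer from the filter is always correct, so a client could drop the corresponding membership request; a positive answer may be a false positive and would still need confirmation if full precision is required.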
An Empirical Evaluation of Columnar Storage Formats
Columnar storage is one of the core components of a modern data analytics
system. Although many database management systems (DBMSs) have proprietary
storage formats, most provide extensive support for open-source storage formats
such as Parquet and ORC to facilitate cross-platform data sharing. But these
formats were developed over a decade ago, in the early 2010s, for the Hadoop
ecosystem. Since then, both the hardware and workload landscapes have changed
significantly.
In this paper, we revisit the most widely adopted open-source columnar
storage formats (Parquet and ORC) with a deep dive into their internals. We
designed a benchmark to stress-test the formats' performance and space
efficiency under different workload configurations. From our comprehensive
evaluation of Parquet and ORC, we identify design decisions advantageous with
modern hardware and real-world data distributions. These include using
dictionary encoding by default, favoring decoding speed over compression ratio
for integer encoding algorithms, making block compression optional, and
embedding finer-grained auxiliary data structures. Our analysis identifies
important considerations that may guide future formats to better fit modern
technology trends.
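As a rough illustration of the dictionary encoding mentioned in this abstract, the sketch below replaces each column value with a small integer code that points into a list of distinct values. It is a simplified, assumed example in plain Python and does not reproduce the on-disk layout of Parquet or ORC; the function names `dictionary_encode` and `dictionary_decode` are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): distinct values in first-seen order, plus one code per row."""
    dictionary = []   # code -> value
    index = {}        # value -> code
    codes = []
    for value in values:
        code = index.get(value)
        if code is None:
            # First occurrence: assign the next free code.
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column by looking up each code."""
    return [dictionary[code] for code in codes]


column = ["DE", "US", "DE", "FR", "US", "DE"]
dictionary, codes = dictionary_encode(column)
restored = dictionary_decode(dictionary, codes)
```

When a column has few distinct values, the integer codes can be stored far more compactly than the original values, and decoding is a plain array lookup, which fits the abstract's emphasis on decoding speed.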
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective.
The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.
Towards Data Optimization in Storages and Networks
Title from PDF of title page, viewed on August 7, 2015
Dissertation advisors: Sejun Song and Baek-Young Choi
Vita
Includes bibliographic references (pages 132-140)
Thesis (Ph.D.)--School of Computing and Engineering. University of Missouri--Kansas City, 2015
We are encountering an explosion of data volume, as a study estimates that data
will amount to 40 zettabytes by the end of 2020. This data explosion places a significant
burden not only on data storage space but also on access latency, manageability, and processing
and network bandwidth. However, large portions of the huge data volume contain
massive redundancies that are created by users, applications, systems, and communication
models. Deduplication is a technique to reduce data volume by removing redundancies.
Reliability can even be improved when data is replicated after deduplication.
Many deduplication studies such as storage data deduplication and network redundancy
elimination have been proposed to reduce storage consumption and network
bandwidth consumption. However, existing solutions are not efficient enough to optimize
the data delivery path from clients to servers through the network. Hence we propose a holistic
deduplication framework to optimize data along its path. Our deduplication framework
consists of three components: data sources or clients, networks, and servers. The
client component removes local redundancies in clients, the network component removes
redundant transfers coming from different clients, and the server component removes redundancies
coming from different networks.
We designed and developed components for the proposed deduplication framework.
For the server component, we developed the Hybrid Email Deduplication System
that achieves a trade-off between space savings and overhead for email systems. For the client
component, we developed the Structure Aware File and Email Deduplication for Cloud-based
Storage Systems, which is very fast and achieves good space savings by using
structure-based granularity. For the network component, we developed a system called
Software-defined Deduplication as a Network and Storage service, which performs in-network deduplication
and chains storage data deduplication and network redundancy elimination
functions by using Software-Defined Networking to achieve both storage space and network
bandwidth savings with low processing time and memory footprint. We also discuss mobile
deduplication for image and video files in mobile devices. Through system implementations
and experiments, we show that the proposed framework effectively and efficiently
optimizes data volume in a holistic manner encompassing the entire data path of clients,
networks and storage servers.
Introduction -- Deduplication technology -- Existing deduplication approaches -- HEDS: Hybrid Email Deduplication System -- SAFE: Structure-aware File and Email Deduplication for cloud-based storage systems -- SoftDance: Software-defined Deduplication as a Network and Storage Service -- Mobile de-duplication -- Conclusion
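The basic deduplication idea described in this abstract, removing redundant data and keeping one copy, can be sketched with fixed-size chunks and content fingerprints. This is a generic, assumed illustration and not the HEDS, SAFE, or SoftDance design from the thesis; the class name `ChunkStore`, the fixed chunk size, and the use of SHA-256 as the fingerprint are choices made only for this example.

```python
import hashlib


class ChunkStore:
    """Toy deduplicating store: each distinct chunk is kept exactly once."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> chunk bytes

    def put(self, data: bytes) -> list:
        """Store data and return its recipe (ordered list of chunk fingerprints)."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Only the first copy of a chunk is stored; repeats add nothing.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Reassemble the original data from its recipe."""
        return b"".join(self.chunks[fingerprint] for fingerprint in recipe)

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


store = ChunkStore(chunk_size=4)
first = store.put(b"AAAABBBBAAAA")   # the chunk "AAAA" repeats within the data
second = store.put(b"BBBBCCCC")      # the chunk "BBBB" was already stored
```

In this toy run, 20 bytes of input occupy 12 bytes in the store, because the repeated chunks are kept once and referenced by fingerprint in each recipe.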