The End of Slow Networks: It's Time for a Redesign
Next generation high-performance RDMA-capable networks will require a
fundamental rethinking of the design and architecture of modern distributed
DBMSs. These systems are commonly designed and optimized under the assumption
that the network is the bottleneck: the network is slow and "thin", and thus
needs to be avoided as much as possible. Yet this assumption no longer holds
true. With InfiniBand FDR 4x, the bandwidth available to transfer data across
the network is in the same ballpark as the bandwidth of one memory channel, and it
increases even further with the most recent EDR standard. Moreover, with the
increasing advances of RDMA, the latency improves similarly fast. In this
paper, we first argue that the "old" distributed database design is not capable
of taking full advantage of the network. Second, we propose architectural
redesigns for OLTP, OLAP and advanced analytical frameworks to take better
advantage of the improved bandwidth, latency and RDMA capabilities. Finally,
for each of the workload categories, we show that remarkable performance
improvements can be achieved.
A shared-disk parallel cluster file system
Dissertation presented to obtain the degree of Doctor in Informatics from the Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia.
Today, clusters are the de facto cost-effective platform both for high performance
computing (HPC) and for IT environments. HPC and IT are quite different environments,
and the differences include, among others, their choices of file systems and storage: HPC favours parallel file systems geared towards maximum I/O bandwidth, but which are not fully POSIX-compliant and were devised to run on top of (fault-prone) partitioned storage; conversely, IT data centres favour both external disk arrays (to provide highly available storage) and POSIX-compliant file systems (either general purpose or shared-disk cluster file systems, CFSs).
These specialised file systems do perform very well in their target environments provided that applications do not require some lateral features, e.g., no file locking on parallel file systems, and no high performance writes over cluster-wide shared files on CFSs. In brief, we can say
that none of the above approaches solves the problem of providing high levels of reliability and performance to both worlds.
Our pCFS proposal makes a contribution to change this situation: the rationale is to take advantage of the best of both – the reliability of cluster file systems and the high performance of parallel file systems. We don’t claim to provide the absolute best of each, but we aim at full POSIX compliance, a rich feature set, and levels of reliability and performance good enough
for broad usage – e.g., traditional as well as HPC applications, support of clustered DBMS engines that may run over regular files, and video streaming. pCFS’ main ideas include:
· Cooperative caching, a technique that has been used in file systems for distributed disks but, as far as we know, was never used either in SAN based cluster file systems or in parallel file systems. As a result, pCFS may use all infrastructures (LAN and SAN) to move data.
· Fine-grain locking, whereby processes running across distinct nodes may define nonoverlapping byte-range regions in a file (instead of the whole file) and access them in parallel, reading and writing over those regions at the infrastructure’s full speed (provided that no major metadata changes are required).
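The fine-grain locking idea above can be illustrated with a minimal sketch. This is not pCFS code: the `ByteRangeLockTable` class, its method names, and the node names are hypothetical, and the table lives in one process rather than being coordinated across a cluster. It only shows the core rule, namely that writers holding non-overlapping byte ranges of the same file do not block each other.

```python
from dataclasses import dataclass, field


@dataclass
class ByteRangeLockTable:
    """Toy in-memory lock table for a single file (hypothetical, not pCFS)."""

    # Each entry is (owner, start, end), with `end` exclusive.
    held: list = field(default_factory=list)

    def try_lock(self, owner: str, start: int, length: int) -> bool:
        """Grant an exclusive lock on [start, start + length) unless a
        different owner already holds an overlapping range."""
        end = start + length
        for other, s, e in self.held:
            if other != owner and start < e and s < end:
                return False
        self.held.append((owner, start, end))
        return True

    def unlock(self, owner: str, start: int, length: int) -> None:
        """Release a range previously granted to `owner`."""
        self.held.remove((owner, start, start + length))


table = ByteRangeLockTable()
a = table.try_lock("node-a", 0, 4096)     # first 4 KB record
b = table.try_lock("node-b", 4096, 4096)  # adjacent record, no overlap
c = table.try_lock("node-c", 2048, 4096)  # straddles node-a's range
print(a, b, c)
```

In this sketch the first two requests are granted and the third is refused, because it overlaps the range held by `node-a`. A whole-file lock would instead have serialized all three writers.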
A prototype was built on top of GFS (a Red Hat shared disk CFS): GFS’ kernel code was
slightly modified, and two kernel modules and a user-level daemon were added. In the
prototype, fine grain locking is fully implemented and a cluster-wide coherent cache is maintained through data (page fragments) movement over the LAN.
Our benchmarks for non-overlapping writers over a single file shared among processes
running on different nodes show that pCFS’ bandwidth is 2 times greater than NFS’ while
being comparable to that of the Parallel Virtual File System (PVFS), both requiring about 10 times more CPU. And pCFS’ bandwidth also surpasses GFS’ (600 times for small record sizes, e.g., 4 KB, decreasing down to 2 times for large record sizes, e.g., 4 MB), at about the same CPU usage.
Lusitania, Companhia de Seguros S.A, Programa IBM Shared University Research (SUR)
Scalable and Highly Available Database Systems in the Cloud
Cloud computing allows users to tap into a massive pool of shared computing
resources such as servers, storage, and network. These resources are provided as a
service to the users allowing them to “plug into the cloud” similar to a utility grid.
The promise of the cloud is to free users from the tedious and often complex task of
managing and provisioning computing resources to run applications. At the same
time, the cloud brings several additional benefits including: a pay-as-you-go cost
model, easier deployment of applications, elastic scalability, high availability, and
a more robust and secure infrastructure.
One important class of applications that users are increasingly deploying in
the cloud is database management systems. Database management systems differ
from other types of applications in that they manage large amounts of state that
is frequently updated, and that must be kept consistent at all scales and in the
presence of failure. This makes it difficult to provide scalability and high availability
for database systems in the cloud. In this thesis, we show how we can exploit
cloud technologies and relational database systems to provide a highly available
and scalable database service in the cloud.
The first part of the thesis presents RemusDB, a reliable, cost-effective high
availability solution that is implemented as a service provided by the virtualization
platform. RemusDB can make any database system highly available with little or
no code modifications by exploiting the capabilities of virtualization. In the second
part of the thesis, we present two systems that aim to provide elastic scalability
for database systems in the cloud using two very different approaches. The three
systems presented in this thesis bring us closer to the goal of building a scalable
and reliable transactional database service in the cloud.
Analytical considerations for transactional cache protocols
Since the early nineties, transactional cache protocols have been intensively
studied in the context of client-server database systems. Research has
developed a variety of protocols and compared different aspects of their
quality using simulation systems and applying semi-standardized benchmarks.
Unfortunately, none of the related publications substantiated their
experimental findings by thorough analytical considerations. We try to close
this gap at least partially by presenting comprehensive and highly accurate
analytical formulas for quality aspects of two important transactional cache
protocols. We consider the non-adaptive variants of the "Callback Read
Protocol" (CBR) and the "Optimistic Concurrency Control Protocol" (OCC). The
paper studies their cache filling size and the number of messages they produce
for the so-called UNIFORM workload. In many cases the cache filling size may
differ considerably from a given maximum cache size, a phenomenon which has
been overlooked by former publications. Moreover, for OCC we also give a highly
accurate formula which forecasts the transaction abort rate. All formulas are
compared against corresponding simulation results in order to validate their
correctness.
Theory and Practice of Transactional Method Caching
Nowadays, tiered architectures are widely accepted for constructing large
scale information systems. In this context application servers often form the
bottleneck for a system's efficiency. An application server exposes an
object-oriented interface consisting of a set of methods which are accessed by
potentially remote clients. The idea of method caching is to store results of
read-only method invocations with respect to the application server's interface
on the client side. If the client invokes the same method with the same
arguments again, the corresponding result can be taken from the cache without
contacting the server. It has been shown that this approach can considerably
improve a real world system's efficiency.
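The basic (non-transactional) idea of method caching described above can be sketched as follows. This is only an illustration under stated assumptions: `CatalogServer`, `MethodCache`, and the `price` method are hypothetical names, the "server" is a local object rather than a remote one, arguments are assumed to be hashable, and no invalidation or consistency protocol is modelled.

```python
class CatalogServer:
    """Stand-in for a remote application server (hypothetical)."""

    def __init__(self):
        self.calls = 0  # counts how often the server is actually contacted

    def price(self, item: str) -> int:
        """A read-only method: it does not change server state."""
        self.calls += 1
        return {"book": 12, "pen": 2}[item]


class MethodCache:
    """Client-side cache keyed by (method name, arguments)."""

    def __init__(self, server):
        self._server = server
        self._results = {}
        self.hits = 0

    def call(self, method: str, *args):
        key = (method, args)
        if key in self._results:
            # Same method, same arguments: answer without contacting the server.
            self.hits += 1
            return self._results[key]
        result = getattr(self._server, method)(*args)
        self._results[key] = result
        return result


server = CatalogServer()
cache = MethodCache(server)
first = cache.call("price", "book")   # forwarded to the server
second = cache.call("price", "book")  # served from the client-side cache
other = cache.call("price", "pen")    # different arguments, forwarded again
```

After these three calls the server has been contacted twice and the cache has answered once. The sketch deliberately leaves cached results in place forever; keeping them consistent with server-side updates is a separate problem.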
This paper extends the concept of method caching by addressing the case where
clients wrap related method invocations in ACID transactions. Demarcating
sequences of method calls in this way is supported by many important
application server standards. In this context the paper presents an
architecture, a theory and an efficient protocol for maintaining full
transactional consistency and in particular serializability when using a method
cache on the client side. In order to create a protocol for scheduling cached
method results, the paper extends a classical transaction formalism. Based on
this extension, a recovery protocol and an optimistic serializability protocol
are derived. The latter one differs from traditional transactional cache
protocols in many essential ways. An efficiency experiment validates the
approach: Using the cache a system's performance and scalability are
considerably improved.
Scalable hosting of web applications
Modern Web sites have evolved from simple monolithic systems to complex multi-tiered systems. In contrast to traditional Web sites, these sites do not simply deliver pre-written content but dynamically generate content using (one or more) multi-tiered Web applications. In this thesis, we addressed the question: How to host multi-tiered Web applications in a scalable manner? Scaling up a Web application requires scaling its individual tiers. To this end, various research works have proposed techniques that employ replication or caching solutions at different tiers. However, most of these techniques aim to optimize the performance of individual tiers and not the entire application. A key observation made in our research is that there exists no elixir technique that performs best for all Web applications. Effective hosting of a Web application requires careful selection and deployment of several techniques at different tiers. To this end, we present several caching and replication strategies, such as GlobeCBC, GlobeDB and GlobeTP, to improve the scalability of different tiers of a Web application. While these techniques and systems improve the performance of the individual tiers (and eventually the application), an application's administrator is not only interested in the performance of its individual tiers but also in its end-to-end performance. To this end, we propose a resource provisioning approach that allows us to choose the best resource configuration for hosting a Web application such that its end-to-end response time can be optimized with minimum usage of resources. The proposed approach is based on an analytical model for multi-tier systems, which allows us to derive expressions for estimating the mean end-to-end response time and its variance.
Steen, M.R. van [Promotor]; Pierre, G.E.O. [Copromotor]
Practical database replication
Doctoral thesis in Informatics
Software-based replication is a cost-effective approach for fault-tolerance when combined with
commodity hardware. In particular, shared-nothing database clusters built upon commodity machines
and synchronized through eager software-based replication protocols have been driven by
the distributed systems community in the last decade.
The efforts on eager database replication, however, stem from the late 1970s with initial
proposals designed by the database community. From that time, we have the distributed locking
and atomic commitment protocols. Briefly speaking, before updating a data item, all copies
are locked through a distributed lock, and upon commit, an atomic commitment protocol is
responsible for guaranteeing that the transaction’s changes are written to a non-volatile storage
at all replicas before committing it. Both these processes contributed to a poor performance.
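The classical scheme described above (lock every copy, then run an atomic commitment protocol) can be sketched as follows. Two-phase commit is used here as the atomic commitment protocol. This is a single-process illustration, not the thesis's implementation: the `Replica` class, the `replicated_write` function, and the in-memory `log` list standing in for non-volatile storage are all assumptions made for the example.

```python
class Replica:
    """One copy of the database (hypothetical, in-memory)."""

    def __init__(self, name: str, can_prepare: bool = True):
        self.name = name
        self.can_prepare = can_prepare  # False simulates a "no" vote
        self.locked = set()
        self.log = []    # stands in for non-volatile storage
        self.data = {}

    def lock(self, key) -> bool:
        if key in self.locked:
            return False
        self.locked.add(key)
        return True

    def prepare(self, key, value) -> bool:
        """Phase 1: force the change to the log, then vote."""
        if not self.can_prepare:
            return False
        self.log.append(("prepared", key, value))
        return True

    def finish(self, key, value, commit: bool) -> None:
        """Phase 2: apply or discard the change, then release the lock."""
        if commit:
            self.data[key] = value
        self.log.append(("commit" if commit else "abort", key))
        self.locked.discard(key)


def replicated_write(replicas, key, value) -> bool:
    # Distributed locking: every copy must be locked before the update.
    acquired = []
    for r in replicas:
        if not r.lock(key):
            for a in acquired:
                a.locked.discard(key)
            return False
        acquired.append(r)
    # Atomic commitment: commit only if every replica voted yes.
    commit = all([r.prepare(key, value) for r in replicas])
    for r in replicas:
        r.finish(key, value, commit)
    return commit


healthy = [Replica("r1"), Replica("r2"), Replica("r3")]
ok = replicated_write(healthy, "x", 1)

degraded = [Replica("r1"), Replica("r2", can_prepare=False)]
failed = replicated_write(degraded, "y", 2)
```

Even in this toy form, every write touches every replica three times (lock, prepare, finish), which gives a sense of why both steps are costly in a real distributed setting.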
The distributed systems community improved these processes by reducing the number of interactions
among replicas through the use of group communication and by relaxing the durability
requirements imposed by the atomic commitment protocol. The approach requires at most two
interactions among replicas and disseminates updates without necessarily applying them before
committing a transaction. This relies on a high number of machines to reduce the likelihood of
failures and ensure data resilience. Clearly, the availability of commodity machines and their
increasing processing power makes this feasible.
Proving the feasibility of this approach requires us to build several prototypes and evaluate
them with different workloads and scenarios. Although simulation environments are a good starting
point, mainly those that allow us to combine real (e.g., replication protocols, group communication)
and simulated-code (e.g., database, network), full-fledged implementations should be
developed and tested. Unfortunately, database vendors usually do not provide native support for
the development of third-party replication protocols, thus forcing protocol developers to either
change the database engines, when the source code is available, or construct in the middleware
server wrappers that intercept client requests otherwise. The former solution is hard to maintain
as new database releases are constantly being produced, whereas the latter represents a strenuous
development effort as it requires us to rebuild several database features at the middleware.
Unfortunately, the group-based replication protocols, optimistic or conservative, that had
been proposed so far have drawbacks that present a major hurdle to their practicability. The
optimistic protocols make it difficult to commit transactions in the presence of hot-spots, whereas
the conservative protocols have a poor performance due to concurrency issues.
In this thesis, we propose using a generic architecture and programming interface, titled
GAPI, to facilitate the development of different replication strategies. The idea consists of providing key extensions to multiple DBMSs (Database Management Systems), thus enabling a
replication strategy to be developed once and tested on several databases that have such extensions,
i.e., those that are replication-friendly. To tackle the aforementioned problems in group-based
replication protocols, we propose using a novel protocol, titled AKARA. AKARA guarantees
fairness, and thus all transactions have a chance to commit, and ensures great performance
while exploiting parallelism as provided by local database engines. Finally, we outline a simple
but comprehensive set of components to build group-based replication protocols and discuss key
points in its design and implementation.
Hyperscale Data Processing With Network-Centric Designs
Today’s largest data processing workloads are hosted in cloud data centers. Due to unprecedented data growth and the end of Moore’s Law, these workloads have ballooned to the hyperscale level, encompassing billions to trillions of data items and hundreds to thousands of machines per query. Enabling and expanding with these workloads are highly scalable data center networks that connect up to hundreds of thousands of networked servers. These massive scales fundamentally challenge the designs of both data processing systems and data center networks, and the classic layered designs are no longer sustainable.
Rather than optimize these massive layers in silos, we build systems across them with principled network-centric designs. In current networks, we redesign data processing systems with network-awareness to minimize the cost of moving data in the network. In future networks, we propose new interfaces and services that the cloud infrastructure offers to applications and codesign data processing systems to achieve optimal query processing performance. To transform the network to future designs, we facilitate network innovation at scale.
This dissertation presents a line of systems work that covers all three directions. It first discusses GraphRex, a network-aware system that combines classic database and systems techniques to push the performance of massive graph queries in current data centers. It then introduces data processing in disaggregated data centers, a promising new cloud proposal. It details TELEPORT, a compute pushdown feature that eliminates data processing performance bottlenecks in disaggregated data centers, and Redy, which provides high-performance caches using remote disaggregated memory. Finally, it presents MimicNet, a fine-grained simulation framework that evaluates network proposals at datacenter scale with machine learning approximation. These systems demonstrate that our ideas in network-centric designs achieve orders of magnitude higher efficiency compared to the state of the art at hyperscale.