Low-latency, query-driven analytics over voluminous multidimensional, spatiotemporal datasets
2017 Summer. Ubiquitous data collection from sources such as remote sensing equipment, networked observational devices, location-based services, and sales tracking has led to the accumulation of voluminous datasets; IDC projects that by 2020 we will generate 40 zettabytes of data per year, while Gartner and ABI estimate that 20-35 billion new devices will be connected to the Internet in the same time frame. The storage and processing requirements of these datasets far exceed the capabilities of modern computing hardware, which has led to the development of distributed storage frameworks that can scale out by assimilating more computing resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of study: extracting knowledge, insights, and relationships from the underlying datasets. The basic building block of this knowledge discovery process is the analytic query, encompassing both query instrumentation and evaluation. This dissertation centers on query-driven exploratory and predictive analytics over voluminous, multidimensional datasets. Both types of analysis represent a higher-level abstraction over classical query models; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset (including time-series and geospatial aspects) and makes that information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted.
This requires specialized data structures and partitioning algorithms, along with adaptive reductions in the search space and management of the inherent trade-off between timeliness and accuracy. The algorithms presented in this dissertation were evaluated empirically on real-world geospatial time-series datasets in a production environment, and are broadly applicable across other storage frameworks.
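The abstract does not say how its statistical synopses are maintained; a common approach for streaming data of this kind is an online (single-pass) summary per spatial partition. The sketch below is purely illustrative, not the dissertation's actual framework: the class names, the geohash-prefix partitioning, and the use of Welford's online algorithm are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class FeatureSynopsis:
    """Running summary of one feature dimension (Welford's online algorithm)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


class PartitionSynopsis:
    """Per-partition synopses, keyed by (geohash prefix, feature name)."""

    def __init__(self) -> None:
        self.stats: dict[tuple[str, str], FeatureSynopsis] = {}

    def observe(self, geohash: str, feature: str, value: float,
                prefix_len: int = 4) -> None:
        # Truncating the geohash groups spatially nearby readings together.
        key = (geohash[:prefix_len], feature)
        self.stats.setdefault(key, FeatureSynopsis()).update(value)
```

Because each update is O(1) and needs no access to past observations, such a synopsis can answer mean/variance queries without indexing every discrete value, which is the trade-off the abstract alludes to.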
Peer-to-Peer Networks and Computation: Current Trends and Future Perspectives
This paper examines the state of the art in P2P networks and computation. It attempts to identify the challenges confronting the community of P2P researchers and developers, which need to be addressed before the potential of P2P-based systems can be realized beyond content distribution and file-sharing applications to build real-world, intelligent, and commercial software systems. Future perspectives and some thoughts on the evolution of P2P-based systems are also provided.
Clouder: a flexible large scale decentralized object store
Programa Doutoral em Informática MAP-i
Large scale data stores were initially introduced to support a few specific extreme-scale applications, such as social networks. Their scalability and availability requirements often justified sacrificing richer data and processing models, and even elementary data consistency. In strong contrast with traditional relational databases (RDBMS), large scale data stores present very simple data models and APIs, lacking most of the established relational data management operations, and relax consistency guarantees, providing only eventual consistency.
With a number of alternatives now available and mature, there is an increasing willingness to use them in a wider and more diverse spectrum of applications, by skewing the current trade-off towards the needs of common business users and easing migration from current RDBMS. This is particularly so when they are used in the context of a Cloud solution such as a Platform as a Service (PaaS).
This thesis aims at reducing the gap between traditional RDBMS and large scale data stores by seeking mechanisms that provide additional consistency guarantees and higher-level data processing primitives in large scale data stores. The devised mechanisms should not hinder the scalability and dependability of large scale data stores. Regarding higher-level data processing primitives, this thesis explores two complementary approaches: extending data stores with additional operations, such as general multi-item operations, and coupling data stores with existing processing facilities without hindering scalability.
We address these challenges with a new architecture for large scale data stores, efficient multi-item access for large scale data stores, and SQL processing atop large scale data stores. The novel architecture makes it possible to find the right trade-offs among flexible usage, efficiency, and fault tolerance. To support multi-item access efficiently, we extend first-generation large scale data stores' data models with tags and a multi-tuple data placement strategy, which allow large sets of related data to be stored and retrieved in a single operation. For efficient SQL support atop scalable data stores, we devise design modifications to existing relational SQL query engines that allow them to be distributed.
We demonstrate our approaches with running prototypes and an extensive experimental evaluation using representative workloads based on real applications.
Resource discovery for distributed computing systems: A comprehensive survey
Large-scale distributed computing environments provide a vast amount of heterogeneous computing resources from different sources for resource sharing and distributed computing. Discovering appropriate resources in such environments is a challenge that involves several different subjects. In this paper, we provide an investigation of the current state of resource discovery protocols, mechanisms, and platforms for large-scale distributed environments, focusing on design aspects. We classify all related aspects, general steps, and requirements for constructing a novel resource discovery solution into three categories: structures, methods, and issues. Accordingly, we review the literature, analyzing various aspects of each category.
Enabling autoscaling for in-memory storage in cluster computing framework
2019 Spring. IoT-enabled devices and observational instruments continuously generate voluminous data. A large portion of these datasets is delivered with associated geospatial locations. The increased volumes of geospatial data, alongside emerging geospatial services, pose computational challenges for large-scale geospatial analytics. We have designed and implemented STRETCH, an in-memory distributed geospatial storage system that preserves spatial proximity and enables proactive autoscaling for frequently accessed data. STRETCH stores data with a delayed data dispersion scheme that incrementally adds data nodes to the storage system. We have devised an autoscaling feature that proactively repartitions data to alleviate computational hotspots before they occur. We compared the performance of STRETCH with Apache Ignite, and the results show that STRETCH provides up to 3 times the throughput when the system encounters hotspots. STRETCH is built on Apache Spark and Ignite and interacts with them at runtime.
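The abstract does not detail how STRETCH decides that a hotspot is imminent; one plausible way to act "before hotspots occur" is to smooth per-partition access rates and split partitions whose projected load would cross a threshold. The class below is a hypothetical sketch of that idea (the EWMA smoothing and the trend-projection heuristic are assumptions, not STRETCH's algorithm).

```python
class HotspotMonitor:
    """Tracks an exponentially weighted access rate per partition and flags
    partitions whose projected load would exceed a threshold."""

    def __init__(self, alpha: float = 0.3, threshold: float = 1000.0) -> None:
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # accesses/window considered a hotspot
        self.rate: dict[str, float] = {}

    def record_window(self, partition: str, accesses: int) -> None:
        prev = self.rate.get(partition, 0.0)
        self.rate[partition] = self.alpha * accesses + (1 - self.alpha) * prev

    def partitions_to_split(self, horizon: int = 3) -> list[str]:
        # Crude projection: assume the smoothed rate keeps growing at the
        # EWMA step rate; split proactively if it would cross the threshold
        # within `horizon` future windows.
        return [p for p, r in self.rate.items()
                if r * (1 + self.alpha) ** horizon >= self.threshold]
```

Flagged partitions would then be handed to the repartitioner while spare capacity still exists, which is the proactive behavior the abstract credits for the throughput gain under hotspots.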
HARD: Hybrid Adaptive Resource Discovery for Jungle Computing
In recent years, Jungle Computing has emerged as a distributed computing paradigm based on the simultaneous combination of various hierarchical and distributed computing environments composed of large numbers of heterogeneous resources. In such an environment, the resources and the underlying computation and communication infrastructures are highly hierarchical and heterogeneous, which makes it difficult to find the proper resources precisely enough to run a particular job efficiently. This paper proposes Hybrid Adaptive Resource Discovery (HARD), a novel, efficient, and highly scalable resource discovery approach built upon a virtual hierarchical overlay based on self-organization and self-adaptation of processing resources, in which computing resources are organized into distributed hierarchies according to a proposed hierarchical multi-layered resource description model. The approach supports distributed query processing within and across hierarchical layers by deploying various distributed resource discovery services and functionalities, implemented using algorithms and mechanisms adapted to each level of the hierarchy. It addresses the requirements for resource discovery in Jungle Computing environments: deep hierarchy, high heterogeneity, scalability, and dynamicity. Simulation results show significant scalability and efficiency of the proposed approach over highly heterogeneous, hierarchical, and dynamic computing environments.
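The core routing idea of a hierarchical discovery overlay, resolve a query as low in the hierarchy as possible and escalate to the parent layer only on a miss, can be sketched in a few lines. This is a generic illustration of hierarchical discovery, not HARD's actual protocol; the class and method names are invented.

```python
class DiscoveryNode:
    """A node in a hierarchical discovery overlay: queries are answered from
    the local subtree first and escalate to the parent layer otherwise."""

    def __init__(self, name, resources, parent=None):
        self.name = name
        self.resources = resources  # e.g. {"gpu": 4, "cores": 64}
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def discover(self, kind, amount):
        # Check this node and its subtree, then escalate upward on a miss.
        found = self._search_down(kind, amount)
        if found is None and self.parent is not None:
            return self.parent.discover(kind, amount)
        return found

    def _search_down(self, kind, amount):
        if self.resources.get(kind, 0) >= amount:
            return self.name
        for child in self.children:
            hit = child._search_down(kind, amount)
            if hit is not None:
                return hit
        return None
```

Escalating only on a miss keeps most queries local, which is what makes this style of discovery scale with the depth of the hierarchy rather than with the total number of resources.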
A Paradigm for Scalable, Transactional, and Efficient Spatial Indexes
With large volumes of geo-tagged data collected in various applications, spatial query processing becomes essential. Query engines depend on efficient indexes to expedite processing. There are three main challenges: scaling out to accommodate large volumes of spatial data, supporting transactional primitives for strong consistency guarantees, and adapting to highly dynamic workloads. This thesis proposes a paradigm for scalable, transactional, and efficient spatial indexes that significantly reduces the development effort of designing and comparing multiple spatial indexes.
The thesis first introduces a distributed and transactional key-value store called DTranx to persist the spatial indexes. DTranx follows the SEDA architecture to exploit high concurrency in multi-core environments, and it adopts a hybrid of optimistic concurrency control and two-phase commit protocols to narrow the critical sections of distributed locking during transaction commits. Moreover, DTranx integrates a persistent-memory-based write-ahead log to reduce durability overhead, along with a garbage collection mechanism that does not affect normal transactions. To maintain high throughput for search workloads while databases are constantly updated, snapshot transactions are introduced.
Then, a paradigm is presented with a set of intuitive APIs and a Mempool runtime to reduce development efforts. Mempool transparently synchronizes local states of data structures with DTranx and handles two critical tasks: address translation and transparent server synchronization, of which the latter includes transaction construction and data synchronization. Furthermore, a dynamic partitioning strategy is integrated into DTranx to generate partitioning and replication plans that reduce inter-server communication and balance resource usage.
Lastly, the single-threaded BTree and RTree data structures were converted into distributed versions within two weeks. The BTree and RTree achieve throughputs of 253.07 kops/sec and 77.83 kops/sec, respectively, for pure search operations in a 25-server cluster.
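The optimistic-concurrency idea the abstract credits with narrowing critical sections can be shown in miniature: reads run without locks and merely record versions, and only commit-time validation plus apply happens under a lock. This single-node sketch (invented class names, one global lock in place of DTranx's distributed locking and two-phase commit) illustrates the principle only.

```python
import threading


class Txn:
    def __init__(self):
        self.read_set = {}   # key -> version observed at read time
        self.write_set = {}  # key -> new value


class VersionedStore:
    """Optimistic concurrency control sketch: unsynchronized versioned reads,
    with a short critical section at commit for validation and apply."""

    def __init__(self):
        self.data = {}  # key -> (value, version)
        self.lock = threading.Lock()

    def read(self, txn: Txn, key):
        value, version = self.data.get(key, (None, 0))
        txn.read_set[key] = version
        return value

    def commit(self, txn: Txn) -> bool:
        with self.lock:  # critical section covers validation + apply only
            for key, seen_version in txn.read_set.items():
                if self.data.get(key, (None, 0))[1] != seen_version:
                    return False  # conflict: abort; caller may retry
            for key, value in txn.write_set.items():
                _, version = self.data.get(key, (None, 0))
                self.data[key] = (value, version + 1)
            return True
```

Keeping locks out of the read path is what preserves search throughput under concurrent updates; snapshot transactions, as in the thesis, push this further by letting long searches avoid validation entirely.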
A DHT-Based Discovery Service for the Internet of Things
Current trends towards the Future Internet are envisaging the conception of novel services endowed with context-aware and autonomic capabilities to improve end users' quality of life. The Internet of Things paradigm is expected to contribute towards this ambitious vision by proposing models and mechanisms enabling the creation of networks of "smart things" on a large scale. It is widely recognized that efficient mechanisms for discovering available resources and capabilities are required to realize such a vision. The contribution of this work is a novel discovery service for the Internet of Things. The proposed solution adopts a peer-to-peer approach to guarantee scalability, robustness, and easy maintenance of the overall system. While most existing peer-to-peer discovery services proposed for the IoT support solely exact-match queries on a single attribute (i.e., the object identifier), our solution can handle multiattribute and range queries. We defined a layered approach distinguishing three main aspects: multiattribute indexing, range query support, and peer-to-peer routing. We chose to adopt an over-DHT indexing scheme to guarantee ease of design and implementation. We report on the implementation of a proof of concept in a dangerous-goods monitoring scenario and, finally, discuss test results for structural properties and query performance evaluation.
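An over-DHT index of the kind the abstract describes can be sketched by hashing each (attribute, value) pair to a DHT key whose stored value is the set of matching object IDs; multiattribute queries intersect per-attribute results, and numeric range queries enumerate discretized buckets. The sketch below uses a plain dictionary as a stand-in for a real DHT, and the bucketing scheme is an assumption, not the paper's design.

```python
import hashlib
from collections import defaultdict


class OverDHTIndex:
    """Over-DHT indexing sketch: (attribute, value) pairs map to DHT keys
    holding sets of object IDs."""

    def __init__(self):
        self.dht = defaultdict(set)  # stand-in for DHT put/get by key

    @staticmethod
    def _key(attr, value) -> str:
        return hashlib.sha1(f"{attr}={value}".encode()).hexdigest()

    def publish(self, thing_id, attrs: dict) -> None:
        for attr, value in attrs.items():
            self.dht[self._key(attr, value)].add(thing_id)

    def query(self, **attrs) -> set:
        # Multiattribute query: intersect the per-attribute result sets.
        sets = [self.dht[self._key(a, v)] for a, v in attrs.items()]
        return set.intersection(*sets) if sets else set()

    def range_query(self, attr, lo: int, hi: int) -> set:
        # Unit-wide buckets for illustration; a real scheme would use a
        # locality-preserving key space to avoid enumerating every bucket.
        out = set()
        for v in range(lo, hi + 1):
            out |= self.dht[self._key(attr, v)]
        return out
```

The layering matches the paper's decomposition: indexing decides which DHT keys exist, range support decides how value spaces are discretized, and routing is delegated entirely to the underlying DHT.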