A Stacked Multi-Layered Architecture Model for Big Data
Until recently, analytical data was the province of the Data Warehouse, but the need to analyze new types of unstructured data, both repetitive and non-repetitive, gave rise to Big Data. Although the subject has been widely studied, no reference architecture is available for Big Data systems that handle large volumes of raw, aggregated, and non-aggregated data; nor are there complete proposals for managing the data lifecycle, a standardized terminology for the area, or a methodology supporting the design and development of such an architecture. The existing architectures are small-scale, industrial, and product-oriented, limited in scope to the solution of a company or group of companies, focused on technology but omitting the functional point of view. This paper explores the requirements for an architectural model that supports the analysis and management of structured and unstructured data, both repetitive and non-repetitive; reviews several industrial and technological architectural proposals; and finally proposes a logical, stacked multi-layered architecture model that aims to satisfy requirements covering both the Data Warehouse and Big Data.
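The stacked, layered dataflow the abstract describes (raw ingestion feeding storage and aggregation layers) can be sketched minimally; the layer names and functions below are illustrative assumptions, not the paper's actual model.

```python
from collections import Counter

# Illustrative stacked layers: each layer consumes the previous layer's output.
def ingest(records):
    # normalize incoming raw records
    return [r.strip().lower() for r in records]

def raw_store(records):
    # landing zone: keep the (normalized) raw data for later reprocessing
    return list(records)

def aggregate(records):
    # aggregation layer: summarize for analytical queries
    return Counter(records)

def run(data, layers):
    for layer in layers:
        data = layer(data)
    return data

summary = run(["Sensor-A ", "sensor-b", "sensor-a"], [ingest, raw_store, aggregate])
```

Each layer only sees its predecessor's output, which is what makes the stack independently replaceable.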
Communication Steps for Parallel Query Processing
We consider the problem of computing a relational query q on a large input
database of size n, using a large number p of servers. The computation is
performed in rounds, and each server can receive only O(n/p^{1-ε})
bits of data, where ε ∈ [0,1] is a parameter that controls
replication. We examine how many global communication steps are needed to
compute q. We establish both lower and upper bounds, in two settings. For a
single round of communication, we give lower bounds in the strongest possible
model, where arbitrary bits may be exchanged; we show that any algorithm
requires ε ≥ 1 − 1/τ*, where τ* is the fractional vertex
cover of the hypergraph of q. We also give an algorithm that matches the
lower bound for a specific class of databases. For multiple rounds of
communication, we present lower bounds in a model where routing decisions for a
tuple are tuple-based. We show that for the class of tree-like queries there
exists a tradeoff between the number of rounds and the space exponent
ε. The lower bounds for multiple rounds are the first of their
kind. Our results also imply that transitive closure cannot be computed in O(1)
rounds of communication
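The single-round lower bound ε ≥ 1 − 1/τ* (τ* being the fractional vertex cover of the query's hypergraph) can be illustrated numerically; the triangle query is a standard worked instance.

```python
# Illustration of the one-round space-exponent lower bound eps >= 1 - 1/tau*,
# where tau* is the fractional vertex cover of the query's hypergraph.
def space_exponent_lower_bound(tau_star):
    return 1.0 - 1.0 / tau_star

# Triangle query R(x,y), S(y,z), T(z,x): the optimal fractional vertex cover
# assigns weight 1/2 to each of x, y, z, so tau* = 3/2 and eps >= 1/3.
eps_triangle = space_exponent_lower_bound(1.5)
```

In other words, any one-round algorithm for the triangle query needs replication p^{1/3} per server, however the bits are encoded.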
Securing cloud-based data analytics: A practical approach
The ubiquitous nature of computers is driving a massive increase in the amount of data generated by humans and machines. The shift to cloud technologies is a paradigm change that offers considerable financial and administrative gains in the effort to analyze these data. However, governmental and business institutions wanting to tap into these gains are concerned with security issues. The cloud presents new vulnerabilities and is dominated by new kinds of applications, which calls for new security solutions. For analyzing massive amounts of data, tools like MapReduce, Apache Storm, Dryad, and higher-level scripting languages like Pig Latin and DryadLINQ have significantly simplified the corresponding tasks for software developers. The equally important aspects of securing the computations performed by these tools and ensuring the confidentiality of data, however, have seen very little support emerge for programmers.
In this dissertation, we present solutions to (a) secure computations being run in the cloud by leveraging BFT replication coupled with fault isolation, and (b) secure data from being leaked by computing directly on encrypted data. For securing computations (a), we leverage a combination of variable-degree clustering, approximated and offline output comparison, smart deployment, and separation of duty to achieve a parameterized tradeoff between fault tolerance and overhead in practice. We demonstrate the low overhead achieved with our solution when securing data-flow computations expressed in Apache Pig and Hadoop: our evaluation shows assured computation with less than 10 percent latency overhead. For securing data (b), we present novel data-flow analyses and program transformations for Pig Latin and Apache Storm that automatically enable the execution of the corresponding scripts on encrypted data. We avoid fully homomorphic encryption because of its prohibitively high cost; instead, in some cases, we rely on a minimal set of operations performed by the client. We present the algorithms used for this translation and empirically demonstrate the practical performance of our approach, as well as the reduction in the effort programmers must invest to preserve data confidentiality.
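One common building block behind computing directly on encrypted data is deterministic keyed encryption, which preserves equality and therefore lets a server evaluate equality filters without ever seeing the key. The sketch below uses an HMAC as a toy stand-in for a real deterministic cipher; it illustrates the general idea, not the dissertation's actual transformations for Pig Latin or Storm.

```python
import hmac
import hashlib

def det_encrypt(key: bytes, value: str) -> str:
    # Deterministic keyed digest: equal plaintexts map to equal ciphertexts,
    # so equality predicates can be evaluated server-side without the key.
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

# Client side: encrypt the records and the filter constant.
key = b"client-secret"
records = [det_encrypt(key, v) for v in ["alice", "bob", "alice"]]
needle = det_encrypt(key, "alice")

# Server side: the filter runs directly on ciphertexts.
matches = [r for r in records if r == needle]
```

The obvious trade-off: deterministic encryption leaks equality patterns to the server, which is precisely the compromise such systems accept for filterable fields.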
Stubby: A Transformation-based Optimizer for MapReduce Workflows
There is a growing trend of performing analysis on large datasets using
workflows composed of MapReduce jobs connected through producer-consumer
relationships based on data. This trend has spurred the development of a number
of interfaces--ranging from program-based to query-based interfaces--for
generating MapReduce workflows. Studies have shown that the performance gap
between optimized and unoptimized workflows can be quite large. However,
automatic cost-based optimization of MapReduce workflows remains a challenge
due to the multitude of interfaces, large size of the execution plan space, and
the frequent unavailability of all types of information needed for
optimization. We introduce a comprehensive plan space for MapReduce workflows
generated by popular workflow generators. We then propose Stubby, a cost-based
optimizer that searches selectively through the subspace of the full plan space
that can be enumerated correctly and costed based on the information available
in any given setting. Stubby enumerates the plan space based on plan-to-plan
transformations and an efficient search algorithm. Stubby is designed to be
extensible to new interfaces and new types of optimizations, which is a
desirable feature given how rapidly MapReduce systems are evolving. Stubby's
efficiency and effectiveness have been evaluated using representative workflows
from many domains.
Comment: VLDB201
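Transformation-based plan search of the kind described here can be sketched as a greedy loop that repeatedly applies plan-to-plan transformations and keeps any cheaper plan. The merge transformation and job-count cost function below are illustrative assumptions, not Stubby's actual enumeration or cost model.

```python
def optimize(plan, transformations, cost):
    # Greedy transformation-based search: apply transformations until no
    # candidate plan improves on the current best cost.
    best, best_cost = plan, cost(plan)
    improved = True
    while improved:
        improved = False
        for transform in transformations:
            for candidate in transform(best):
                c = cost(candidate)
                if c < best_cost:
                    best, best_cost, improved = candidate, c, True
    return best

def merge_adjacent(plan):
    # Toy "vertical packing" transformation: fuse two producer-consumer
    # MapReduce jobs into a single job.
    for i in range(len(plan) - 1):
        yield plan[:i] + (plan[i] + plan[i + 1],) + plan[i + 2:]

workflow = (("extract",), ("join",), ("aggregate",))
optimized = optimize(workflow, [merge_adjacent], cost=len)  # cost: job count
```

A real optimizer would replace the exhaustive greedy loop with a selective search over the enumerable subspace, as the abstract describes.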
Scalable and Declarative Information Extraction in a Parallel Data Analytics System
Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system is still a major challenge. This work presents a query-based IE system for a parallel data analysis platform, which is configurable for specific application domains and scales to terabyte-sized text collections. First, configurable operators are defined for basic IE and Web Analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) in dataflows. Subsequently, we survey the state of the art in optimizing non-relational dataflows and show that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible, logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable a comprehensive optimization of dataflows. SOFA is able to logically optimize arbitrary dataflows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated: we systematically evaluate the scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets
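Property-driven logical optimization of UDF dataflows can be illustrated with one classic rewrite: using per-operator annotations (kind and estimated selectivity) to move commuting filters ahead of expensive transforms. The `Op` record and the blanket assumption that all filters commute are illustrative simplifications, not SOFA's actual property model.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    kind: str                 # "filter" or "transform" (manual UDF annotation)
    selectivity: float = 1.0  # estimated fraction of records kept

def reorder(flow):
    # Push the most selective filters to the front. Assumes every filter
    # commutes with every other operator -- a property a real optimizer
    # would have to verify from the operator annotations.
    filters = sorted((op for op in flow if op.kind == "filter"),
                     key=lambda op: op.selectivity)
    others = [op for op in flow if op.kind != "filter"]
    return filters + others

flow = [Op("parse", "transform"),
        Op("lang=en", "filter", 0.4),
        Op("ner", "transform"),
        Op("has_entity", "filter", 0.1)]
optimized = [op.name for op in reorder(flow)]
```

Running the 10%-selective filter first means the expensive NER transform touches an order of magnitude fewer records.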
Elastic Dataflow Processing on the Cloud
Clouds have become an attractive platform for the large-scale processing of
modern applications on Big Data, especially due to the concept of elasticity,
which characterizes them: resources can be leased on demand and used for as
much time as needed, offering the ability to create virtual infrastructures
that change dynamically over time. Such applications often require processing
of complex queries that are expressed in a high-level language and are
typically transformed into data processing flows (dataflows). A logical
question that arises is whether elasticity affects dataflow execution and in
which way. It seems reasonable that the execution is faster when more resources
are used, but the monetary cost is higher. This gives rise to the concept of
eco-elasticity, an additional kind of elasticity that comes from economics and
captures the trade-offs between the response time of the system and the amount
of money we pay for it as influenced by the use of different amounts of
resources.
In this thesis, we approach the elasticity of clouds in a unified way that
combines both the traditional notion and eco-elasticity. This unified
elasticity concept is essential for the development of auto-tuned systems in
cloud environments. First, we demonstrate that eco-elasticity exists in several
common tasks that appear in practice and that it can be discovered using a
simple, yet highly scalable and efficient algorithm. Next, we present two cases
of
auto-tuned algorithms that use the unified model of elasticity in order to
adapt to the query workload: 1) processing analytical queries in the form of
tree execution plans in order to maximize profit and 2) automated index
management taking into account compute and storage resources. Finally, we
describe EXAREME, a system for elastic data processing on the cloud that has
been used and extended in this work. The system offers declarative languages
that are based on SQL with user-defined functions (UDFs) extended with
parallelism primitives. EXAREME exploits both elasticities of clouds by
dynamically allocating and deallocating compute resources in order to adapt to
the query workload.
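The time/money trade-off behind eco-elasticity can be sketched with a toy cost model: adding VMs shortens the runtime but raises the bill, so an auto-tuner picks the cheapest allocation that still meets a deadline. The linear-speedup-plus-startup model and all parameter values below are illustrative assumptions, not EXAREME's cost model.

```python
def cheapest_allocation(work_hours, startup_hours, price_per_vm_hour,
                        deadline_hours, max_vms):
    # Toy model: perfectly parallel work plus a fixed startup overhead.
    # More VMs -> shorter runtime but a larger bill, so return the
    # smallest (cheapest) allocation whose runtime meets the deadline.
    for n in range(1, max_vms + 1):
        runtime = work_hours / n + startup_hours
        if runtime <= deadline_hours:
            cost = n * price_per_vm_hour * runtime
            return n, runtime, cost
    return None  # deadline unreachable with max_vms machines

choice = cheapest_allocation(work_hours=10.0, startup_hours=0.1,
                             price_per_vm_hour=1.0, deadline_hours=2.0,
                             max_vms=20)
```

Here five VMs miss the 2-hour deadline (10/5 + 0.1 = 2.1 h), so six is the cheapest feasible choice — the kind of decision an eco-elastic scheduler automates.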
Clouder : a flexible large scale decentralized object store
Programa Doutoral em Informática MAP-i
Large scale data stores were initially introduced to support a few concrete extreme
scale applications such as social networks. Their scalability and availability
requirements often justify sacrificing richer data and processing models, and even
elementary data consistency. In strong contrast with traditional relational databases
(RDBMS), large scale data stores present very simple data models and APIs, lacking
most of the established relational data management operations; and relax consistency
guarantees, providing eventual consistency.
With a number of alternatives now available and mature, there is an increasing
willingness to use them in a wider and more diverse spectrum of applications, by
skewing the current trade-off towards the needs of common business users, and easing
the migration from current RDBMS. This is particularly so when used in the context
of a Cloud solution such as in a Platform as a Service (PaaS).
This thesis aims at reducing the gap between traditional RDBMS and large scale
data stores, by seeking mechanisms to provide additional consistency guarantees and
higher level data processing primitives in large scale data stores. The devised mechanisms
should not hinder the scalability and dependability of large scale data stores.
Regarding higher-level data processing primitives, this thesis explores two complementary
approaches: by extending data stores with additional operations such as general
multi-item operations; and by coupling data stores with other existent processing
facilities without hindering scalability.
We address these challenges with a new architecture for large scale data stores,
efficient multi-item access for large scale data stores, and SQL processing atop
large scale data stores. The novel architecture makes it possible to find the
right trade-offs among flexible usage, efficiency, and fault-tolerance. To
support multi-item access efficiently, we extend first-generation large scale
data stores' data models with tags and a multi-tuple data placement strategy
that allow large sets of related data to be stored and retrieved efficiently at
once. For efficient SQL support atop scalable data stores, we devise design
modifications to existing relational SQL query engines that allow them to be
distributed. We demonstrate our approaches with running prototypes and extensive
experimental evaluation using appropriate workloads.
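Tag-based data placement of the kind described above can be sketched minimally: items sharing a tag are routed to the same node, so a whole set of related items can be fetched in a single request. The hashing scheme and `TaggedStore` API below are illustrative assumptions, not the thesis's actual design.

```python
import hashlib

def node_for(tag: str, n_nodes: int) -> int:
    # Hash the tag, not the key: this co-locates all items sharing a tag.
    return int(hashlib.sha1(tag.encode()).hexdigest(), 16) % n_nodes

class TaggedStore:
    def __init__(self, n_nodes: int):
        self.nodes = [{} for _ in range(n_nodes)]

    def put(self, tag: str, key: str, value):
        self.nodes[node_for(tag, len(self.nodes))][(tag, key)] = value

    def multi_get(self, tag: str) -> dict:
        # One request to one node retrieves the whole related set.
        node = self.nodes[node_for(tag, len(self.nodes))]
        return {k: v for (t, k), v in node.items() if t == tag}

store = TaggedStore(n_nodes=4)
store.put("order:42", "item1", "book")
store.put("order:42", "item2", "pen")
store.put("order:7", "item1", "lamp")
related = store.multi_get("order:42")
```

The design choice is the usual one: co-location trades balanced distribution of individual keys for single-node retrieval of related sets.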
E3: Emotions, Engagement, and Educational Digital Games
The use of educational digital games as a method of instruction for science, technology, engineering, and mathematics has increased in the past decade. While these games provide successfully implemented interactive and fun interfaces, they are not designed to respond to or remedy students' negative affect towards the game dynamics or their educational content. Therefore, this exploratory study investigated the frequent patterns of student emotional and behavioral response to educational digital games.
To unveil the sequential occurrence of these affective states, students were assigned to play the game for nine class sessions. During these sessions, their affective and behavioral response was recorded to uncover possible underlying patterns of affect (particularly confusion, frustration, and boredom) and behavior (disengagement). In addition, these affect and behavior frequency pattern data were combined with students’ gameplay data in order to identify patterns of emotions that led to a better performance in the game.
The results provide information on possible affect and behavior patterns that could be used in further research on affect and behavior detection in such open-ended digital game environments. In particular, the findings show that students experience a considerable amount of confusion, frustration, and boredom. Another finding highlights the need for remediation via embedded help, as students often sought peer help during gameplay. However, possibly because of the low quality of the help received, students seemed to become frustrated with or disengaged from the environment. Finally, the findings suggest the importance of the decay rate of confusion: students' gameplay performance was associated with how long they remained confused or frustrated. Overall, these findings show interesting patterns among students who experience relatively negative emotions during gameplay.