    ARIADNE: A NOVEL HIGH AVAILABILITY CLOUD DATA STORE WITH TRANSACTIONAL GUARANTEES

    Modern cloud data storage services offer powerful capabilities for datasets that can be indexed by a single key -- key-value stores -- and for datasets characterized by multiple attributes (such as Google's BigTable). These data stores incur non-trivial overheads, however, when graph data must be maintained, because keys related by graph edges are managed on physically different host machines. We propose a new distributed data-storage paradigm, the key-key-value store, which extends the key-value model and significantly reduces these overheads by storing related keys in the same place. We provide a high-level description of our proposed system for storing large-scale, highly interconnected graph data -- such as social networks -- as well as an analysis of our key-key-value system in relation to existing work. In this thesis, we show how our novel data organization paradigm facilitates improved levels of QoS in large graph data stores. Furthermore, we have built a system based on our key-key-value design -- Ariadne -- that is decentralized, scalable, lightweight, relational, and transactional; it is unique among current systems in providing all of these qualities at once. We tested this system in the cloud under a strenuous concurrent workload and compared it against the state-of-the-art MySQL Cluster database system. Results show great promise for scalability, and performance in Ariadne is more consistent across workload types than in MySQL Cluster.
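
    To make the paradigm concrete, here is a minimal Python sketch of a key-key-value interface under one plausible placement policy: edge values are keyed by a pair of keys, and each edge is stored on the partition that owns its source key, so a vertex's whole adjacency list is served from a single host. The class and method names are illustrative assumptions, not the thesis's actual API.

        import hashlib

        class KKVStore:
            """Toy key-key-value store: edge values are keyed by a pair of
            keys, and every edge lives on the partition that owns its
            source key, so a vertex's adjacency list is never scattered
            across hosts."""

            def __init__(self, num_partitions=4):
                # Each partition stands in for a physical host machine.
                self.partitions = [{} for _ in range(num_partitions)]

            def _partition_for(self, key):
                # Placement depends on the source key alone.
                digest = hashlib.md5(key.encode()).hexdigest()
                return self.partitions[int(digest, 16) % len(self.partitions)]

            def put(self, key1, key2, value):
                # The edge (key1 -> key2) is stored wherever key1 lives.
                self._partition_for(key1).setdefault(key1, {})[key2] = value

            def get(self, key1, key2):
                return self._partition_for(key1).get(key1, {}).get(key2)

            def neighbors(self, key1):
                # A single-partition read: no cross-host fan-out to
                # collect a vertex's edges.
                return self._partition_for(key1).get(key1, {})

        store = KKVStore()
        store.put("alice", "bob", {"type": "follows"})
        store.put("alice", "carol", {"type": "follows"})
        print(store.neighbors("alice"))  # both edges come from one partition

    Co-locating edges with their source key turns neighborhood queries into single-partition reads, which is exactly the cross-host overhead the key-key-value model targets.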

    Archiving the Relaxed Consistency Web

    The historical, cultural, and intellectual importance of archiving the web has been widely recognized. Today, all countries with high Internet penetration rates have established high-profile archiving initiatives to crawl and archive fast-disappearing web content for long-term use. As web technologies evolve, established web archiving techniques face challenges. This paper focuses on the potential impact of relaxed-consistency web design on crawler-driven web archiving. Relaxed-consistency websites may disseminate, albeit ephemerally, inaccurate and even contradictory information. If captured and preserved in web archives as historical records, such information will degrade the overall archival quality. To assess the extent of such quality degradation, we build a simplified feed-following application and simulate its operation with synthetic workloads. The results indicate that a non-trivial portion of a relaxed-consistency web archive may contain observable inconsistency, and that the inconsistency window may extend significantly longer than that observed at the data store. We discuss the nature of such quality degradation and propose a few possible remedies.
    Comment: 10 pages, 6 figures, CIKM 201
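
    The gap between the store's inconsistency window and the archive's can be illustrated with a toy Monte Carlo model (a sketch of the general effect, not the paper's actual simulator): a post propagates to a follower's feed after a short replication lag, while a crawler revisits the feed page on a much coarser schedule, so a stale capture stays wrong in the archive until the next crawl. All parameter values below are illustrative assumptions.

        import random

        random.seed(7)

        STORE_LAG_MAX = 2.0      # replication lag at the live store (seconds)
        CRAWL_INTERVAL = 3600.0  # the crawler revisits each page once per hour
        TRIALS = 100_000

        store_windows = []       # how long the live feed stays inconsistent
        archive_windows = []     # how long a stale capture stays inconsistent

        for _ in range(TRIALS):
            t_post = 0.0                                  # author publishes
            t_visible = random.uniform(0, STORE_LAG_MAX)  # post reaches feed

            # At the live store the feed disagrees with the author page
            # only until replication catches up.
            store_windows.append(t_visible - t_post)

            # The crawler captures the feed at a random phase of its schedule.
            t_crawl = random.uniform(0, CRAWL_INTERVAL)
            if t_crawl < t_visible:
                # A stale capture remains wrong in the archive until the
                # next crawl replaces it, stretching the observable window.
                archive_windows.append(t_crawl + CRAWL_INTERVAL - t_post)

        mean = lambda xs: sum(xs) / len(xs)
        print(f"stale captures: {len(archive_windows) / TRIALS:.3%}")
        print(f"mean live inconsistency window: {mean(store_windows):.2f} s")
        if archive_windows:
            print(f"mean archived window, given stale: {mean(archive_windows):.0f} s")

    Even though stale captures are rare in this model, each one is frozen into the record for the full recrawl interval, which is the sense in which an archive's inconsistency window can dwarf the store's.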

    Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization

    Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working set of in-use data in accessible storage; unused data can be evicted to a highly protected store. However, many of today's big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads in order to improve performance or scalability. We present Pyramid, a limited-exposure data management system that builds upon count featurization to enhance data protection. As such, Pyramid uniquely introduces both the idea of, and a proof of concept for, leveraging training set minimization methods to instill rigor and selectivity into big data management. We integrated Pyramid into Spark Velox, a framework for ML-based targeting and personalization. We evaluate it on three applications and show that Pyramid approaches the performance of state-of-the-art models while training on less than 1% of the raw data.
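
    As background, count featurization (the training set minimization method Pyramid builds on) replaces a high-cardinality categorical value with compact per-label counts, so a model can be trained from small count tables rather than the raw records. The Python sketch below shows only that core idea, with illustrative class and parameter names; Pyramid itself layers additional protections around the count tables, which are omitted here.

        from collections import defaultdict

        class CountFeaturizer:
            """Toy count featurization: a high-cardinality categorical
            value is represented by smoothed per-label counts, so models
            train on compact count tables instead of raw records."""

            def __init__(self, n_labels=2, smoothing=1.0):
                self.n_labels = n_labels
                self.smoothing = smoothing
                # counts[value][label]: co-occurrences seen in the stream.
                self.counts = defaultdict(lambda: defaultdict(int))
                self.totals = defaultdict(int)

            def update(self, value, label):
                self.counts[value][label] += 1
                self.totals[value] += 1

            def featurize(self, value):
                # Smoothed conditional label rates replace the raw value;
                # unseen values fall back to the uniform prior.
                denom = self.totals[value] + self.smoothing * self.n_labels
                return [(self.counts[value][lbl] + self.smoothing) / denom
                        for lbl in range(self.n_labels)]

        cf = CountFeaturizer()
        for user, clicked in [("u1", 1), ("u1", 0), ("u1", 1), ("u2", 0)]:
            cf.update(user, clicked)

        print(cf.featurize("u1"))  # [0.4, 0.6]: smoothed label rates for u1
        print(cf.featurize("u3"))  # [0.5, 0.5]: unseen value, prior only

    Once the count tables are built, the raw records behind them are no longer needed for training and can be evicted to protected storage, which is the selectivity the abstract describes.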