1,028 research outputs found
ElfStore: A Resilient Data Storage Service for Federated Edge and Fog Resources
Edge and fog computing have grown popular as IoT deployments become
wide-spread. While application composition and scheduling on such resources are
being explored, there exists a gap in a distributed data storage service on the
edge and fog layer, instead depending solely on the cloud for data persistence.
Such a service should reliably store and manage data on fog and edge devices,
even in the presence of failures, and offer transparent discovery and access to
data for use by edge computing applications. Here, we present Elfstore, a
first-of-its-kind edge-local federated store for streams of data blocks. It
uses reliable fog devices as a super-peer overlay to monitor the edge
resources, offers federated metadata indexing using Bloom filters, locates data
within 2-hops, and maintains approximate global statistics about the
reliability and storage capacity of edges. Edges host the actual data blocks,
and we use a unique differential replication scheme to select edges on which to
replicate blocks, to guarantee a minimum reliability and to balance storage
utilization. Our experiments on two IoT virtual deployments with 20 and 272
devices show that ElfStore has low overheads, is bound only by the network
bandwidth, has scalable performance, and offers tunable resilience.Comment: 24 pages, 14 figures, To appear in IEEE International Conference on
Web Services (ICWS), Milan, Italy, 201
HEC: Collaborative Research: SAM^2 Toolkit: Scalable and Adaptive Metadata Management for High-End Computing
The increasing demand for Exa-byte-scale storage capacity by high end computing applications requires a higher level of scalability and dependability than that provided by current file and storage systems. The proposal deals with file systems research for metadata management of scalable cluster-based parallel and distributed file storage systems in the HEC environment. It aims to develop a scalable and adaptive metadata management (SAM2) toolkit to extend features of and fully leverage the peak performance promised by state-of-the-art cluster-based parallel and distributed file storage systems used by the high performance computing community. There is a large body of research on data movement and management scaling, however, the need to scale up the attributes of cluster-based file systems and I/O, that is, metadata, has been underestimated. An understanding of the characteristics of metadata traffic, and an application of proper load-balancing, caching, prefetching and grouping mechanisms to perform metadata management correspondingly, will lead to a high scalability. It is anticipated that by appropriately plugging the scalable and adaptive metadata management components into the state-of-the-art cluster-based parallel and distributed file storage systems one could potentially increase the performance of applications and file systems, and help translate the promise and potential of high peak performance of such systems to real application performance improvements.
The project involves the following components:
1. Develop multi-variable forecasting models to analyze and predict file metadata access patterns. 2. Develop scalable and adaptive file name mapping schemes using the duplicative Bloom filter array technique to enforce load balance and increase scalability 3. Develop decentralized, locality-aware metadata grouping schemes to facilitate the bulk metadata operations such as prefetching. 4. Develop an adaptive cache coherence protocol using a distributed shared object model for client-side and server-side metadata caching. 5. Prototype the SAM2 components into the state-of-the-art parallel virtual file system PVFS2 and a distributed storage data caching system, set up an experimental framework for a DOE CMS Tier 2 site at University of Nebraska-Lincoln and conduct benchmark, evaluation and validation studies
GraphMP: An Efficient Semi-External-Memory Big Graph Processing System on a Single Machine
Recent studies showed that single-machine graph processing systems can be as
highly competitive as cluster-based approaches on large-scale problems. While
several out-of-core graph processing systems and computation models have been
proposed, the high disk I/O overhead could significantly reduce performance in
many practical cases. In this paper, we propose GraphMP to tackle big graph
analytics on a single machine. GraphMP achieves low disk I/O overhead with
three techniques. First, we design a vertex-centric sliding window (VSW)
computation model to avoid reading and writing vertices on disk. Second, we
propose a selective scheduling method to skip loading and processing
unnecessary edge shards on disk. Third, we use a compressed edge cache
mechanism to fully utilize the available memory of a machine to reduce the
amount of disk accesses for edges. Extensive evaluations have shown that
GraphMP could outperform state-of-the-art systems such as GraphChi, X-Stream
and GridGraph by 31.6x, 54.5x and 23.1x respectively, when running popular
graph applications on a billion-vertex graph
Low-latency, query-driven analytics over voluminous multidimensional, spatiotemporal datasets
2017 Summer.Includes bibliographical references.Ubiquitous data collection from sources such as remote sensing equipment, networked observational devices, location-based services, and sales tracking has led to the accumulation of voluminous datasets; IDC projects that by 2020 we will generate 40 zettabytes of data per year, while Gartner and ABI estimate 20-35 billion new devices will be connected to the Internet in the same time frame. The storage and processing requirements of these datasets far exceed the capabilities of modern computing hardware, which has led to the development of distributed storage frameworks that can scale out by assimilating more computing resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of study: extracting knowledge, insights, and relationships from the underlying datasets. The basic building block of this knowledge discovery process is analytic queries, encompassing both query instrumentation and evaluation. This dissertation is centered around query-driven exploratory and predictive analytics over voluminous, multidimensional datasets. Both of these types of analysis represent a higher-level abstraction over classical query models; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset (including time series and geospatial aspects), and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. This requires specialized data structures and partitioning algorithms, along with adaptive reductions in the search space and management of the inherent trade-off between timeliness and accuracy. The algorithms presented in this dissertation were evaluated empirically on real-world geospatial time-series datasets in a production environment, and are broadly applicable across other storage frameworks
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search as well as domains experts that participated in the CHORUS Think-thanks and workshops, this document reports on the state of the art related to multimedia content search from, a technical, and socio-economic perspective.
The technical perspective includes an up to date view on content based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark inititiatives to measure the performance of multimedia search engines.
From a socio-economic perspective we inventorize the impact and legal consequences of these technical advances and point out future directions of research
Digital provenance - models, systems, and applications
Data provenance refers to the history of creation and manipulation of a data object and is being widely used in various application domains including scientific experiments, grid computing, file and storage system, streaming data etc. However, existing provenance systems operate at a single layer of abstraction (workflow/process/OS) at which they record and store provenance whereas the provenance captured from different layers provide the highest benefit when integrated through a unified provenance framework. To build such a framework, a comprehensive provenance model able to represent the provenance of data objects with various semantics and granularity is the first step. In this thesis, we propose a such a comprehensive provenance model and present an abstract schema of the model. ^ We further explore the secure provenance solutions for distributed systems, namely streaming data, wireless sensor networks (WSNs) and virtualized environments. We design a customizable file provenance system with an application to the provenance infrastructure for virtualized environments. The system supports automatic collection and management of file provenance metadata, characterized by our provenance model. Based on the proposed provenance framework, we devise a mechanism for detecting data exfiltration attack in a file system. We then move to the direction of secure provenance communication in streaming environment and propose two secure provenance schemes focusing on WSNs. The basic provenance scheme is extended in order to detect packet dropping adversaries on the data flow path over a period of time. We also consider the issue of attack recovery and present an extensive incident response and prevention system specifically designed for WSNs
- …