BF-Tree: Approximate Tree Indexing
The increasing volume of time-generated data and the shift in storage technologies suggest that we might need to reconsider indexing. Several workloads, like social and service monitoring, often include attributes with implicit clustering because of their time-dependent nature. In addition, solid-state disks (SSDs), using flash or other low-level technologies, have emerged as viable competitors to hard disk drives (HDDs). Capacity and access times of storage devices create a trade-off between SSDs and HDDs: slow random accesses in HDDs have been replaced by efficient random accesses in SSDs, but SSD capacity is one or more orders of magnitude more expensive than that of HDDs. Indexing, however, is designed assuming HDDs as secondary storage, thus minimizing random accesses at the expense of capacity. Indexing data using SSDs as secondary storage requires treating capacity as a scarce resource. To this end, we introduce approximate tree indexing, which employs probabilistic data structures (Bloom filters) to trade accuracy for size and produce smaller, yet powerful, tree indexes, which we name Bloom filter trees (BF-Trees). BF-Trees exploit pre-existing data ordering or partitioning to offer competitive search performance. We demonstrate, both by an analytical study and by experimental results, that by using workload knowledge and reducing indexing accuracy to some extent, we can save substantially on capacity when indexing ordered or partitioned attributes. In particular, in experiments with a synthetic workload, approximate indexing offers a 2.22x-48x smaller index footprint with competitive response times, and in experiments with TPC-H and a real-life monitoring dataset from an energy company, it offers a 1.6x-4x smaller index footprint with competitive search times as well.
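As a rough illustration of the idea, the sketch below indexes pre-sorted partitions with one Bloom filter each: a lookup locates the partition by its key range and probes that partition's filter before touching any data. This is a minimal Python sketch, not the paper's implementation; the class names, sizing parameters, and hashing via hashlib are assumptions.

```python
import hashlib
from bisect import bisect_right

class BloomFilter:
    """Plain Bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

class BFTree:
    """One Bloom filter per pre-sorted partition, replacing dense leaf entries."""
    def __init__(self, partitions, m=8192, k=4):
        # partitions: list of (min_key, keys_in_partition), sorted by min_key
        self.mins = [lo for lo, _ in partitions]
        self.filters = []
        for _, keys in partitions:
            bf = BloomFilter(m, k)
            for key in keys:
                bf.add(key)
            self.filters.append(bf)

    def lookup(self, key):
        """Return the index of the partition to scan, or None for a definite miss."""
        i = bisect_right(self.mins, key) - 1
        if i >= 0 and self.filters[i].might_contain(key):
            return i
        return None
```

A false positive costs only one unnecessary partition scan, never a missed result, while the per-key footprint shrinks from a full tree entry to a few filter bits.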
Power and Memory Efficient Hashing Schemes for Some Network Applications
Hash tables (HTs) are used to implement various lookup schemes and need to be efficient in terms of speed, space utilization, and power consumption. For IP lookup, hashing schemes are attractive due to their deterministic O(1) lookup performance and low power consumption, in contrast to TCAM- and trie-based approaches. As the size of IP lookup tables grows exponentially, scalable lookup performance is highly desirable. For next-generation high-speed routers, this is a vital requirement, since IP lookup remains in the critical data path and demands a predictable throughput. However, recently proposed hash schemes, like the Bloomier filter HT and the Fast HT (FHT), suffer from a number of flaws, including setup failures, update overheads, duplicate keys, and pointer overheads. In this dissertation, four novel hashing schemes and their architectures are proposed to address these concerns, using pipelined Bloom filters and a Fingerprint filter designed for memory-efficient approximate matching. For IP lookups, two new hash schemes, a Hierarchically Indexed Hash Table (HIHT) and a Fingerprint-based Hash Table (FPHT), are introduced to assure a perfect match without pointer overhead. Further, two hash mechanisms are proposed to provide memory- and power-efficient lookup for packet processing applications.

Among the four proposed schemes, the HIHT and the FPHT are evaluated for their performance and compared with TCAM- and trie-based IP lookup schemes. Various sizes of IP lookup tables are considered to demonstrate scalability in terms of speed, memory use, and power consumption. While an FPHT uses less memory than an HIHT, an FPHT-based IP lookup scheme reduces power consumption by a factor of 51 compared to a TCAM-based scheme and requires 1.8 times the memory of a trie-based scheme. The dissertation also proposes a multi-tiered packet classifier that saves up to 3.2 times the power of the existing parallel packet classifier.

Intrinsic hashing schemes lack high throughput, unlike partitioned Ternary Content Addressable Memory (TCAM)-based schemes, which are capable of parallel lookups despite large power consumption. To address this, a hybrid CAM (HCAM) architecture is introduced. Simulation results indicate that HCAM achieves the same throughput as contemporary schemes while using 2.8 times less memory and 3.6 times less power.
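To make the fingerprint idea concrete, the sketch below stores a short hash of each key in its bucket instead of the key itself, so memory scales with the fingerprint width rather than the key length, at the cost of occasional false matches. This is a generic software illustration of the approximate-match principle; the dissertation's FPHT is a hardware architecture, and all names and parameters here are hypothetical.

```python
import hashlib

class FingerprintTable:
    """Toy fingerprint filter: each bucket keeps short fingerprints rather
    than full keys. Two keys can share a fingerprint, so a hit is only an
    approximate match; a miss is definitive."""
    def __init__(self, buckets=1024, fp_bits=16):
        self.buckets = buckets
        self.mask = (1 << fp_bits) - 1
        self.table = [set() for _ in range(buckets)]

    def _slot_fp(self, key):
        h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
        return h % self.buckets, (h >> 32) & self.mask

    def insert(self, key):
        slot, fp = self._slot_fp(key)
        self.table[slot].add(fp)

    def might_contain(self, key):
        slot, fp = self._slot_fp(key)
        return fp in self.table[slot]
```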
Algorithms and Architectures for Network Search Processors
The continuous growth in the Internet’s size, the amount of data traffic, and the complexity of processing this traffic gives rise to new challenges in building high-performance network devices. One of the most fundamental tasks performed by these devices is searching the network data for predefined keys. Address lookup, packet classification, and deep packet inspection are some of the operations which involve table lookups and searching. These operations are typically part of the packet forwarding mechanism and can create a performance bottleneck. Therefore, fast and resource-efficient algorithms are required. One of the most commonly used techniques for such searching operations is the Ternary Content Addressable Memory (TCAM). While TCAM can offer very fast search speeds, it is costly and consumes a large amount of power. Hence, designing cost-effective, power-efficient, and high-speed search techniques has received a great deal of attention in the research and industrial communities. In this thesis, we propose a generic search technique based on Bloom filters. A Bloom filter is a randomized data structure used to represent a set of bit-strings compactly and support set membership queries. We demonstrate techniques to convert the search process into table lookups. The resulting table data structures are kept in off-chip memory, and their Bloom filter representations are kept in on-chip memory. An item needs to be looked up in the off-chip table only when it is found in the on-chip Bloom filters. By filtering the off-chip memory accesses in this fashion, the search operations can be significantly accelerated. Our approach involves a unique combination of algorithmic and architectural techniques that outperform some of the current techniques in terms of cost-effectiveness, speed, and power-efficiency.
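The filtering pattern described above fits in a few lines, reusing the BloomFilter class from the BF-Tree sketch earlier in this listing; the function name and the dict standing in for off-chip memory are illustrative assumptions, not the thesis's architecture.

```python
def guarded_lookup(key, bloom, offchip_table):
    """Probe the slow off-chip table only on an on-chip Bloom filter hit.
    A negative filter answer is definitive, so most absent keys cost no
    off-chip access; rare false positives still probe the table."""
    if not bloom.might_contain(key):   # on-chip check, no memory access
        return None                    # definite miss
    return offchip_table.get(key)      # a dict stands in for off-chip memory
```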
Distributed spatial query processing and optimization
Applications exist today that require the management of distributed spatial data. Since spatial data is more complex than non-spatial data, performing queries on it requires more local processing (i.e., CPU and I/O) time. Also, due to geographical distribution, data transmission costs must be considered. To reduce these costs, one can employ a distributed spatial semijoin, as it eliminates unnecessary objects before their transmission to other sites and the query site.

Most existing works propose different representations of the distributed spatial semijoin between two sites only, with very few exploring its use for processing a query involving more than two sites. In this thesis, we propose both new approaches for representing the spatial semijoin in a distributed setting and their use for processing a distributed query involving any number of sites. Two strategies are proposed for compactly representing the spatial semijoin that reduce both the data transmission and local processing (CPU+I/O) costs when applied in a distributed spatial query. A Global Encompassing Minimum Bounding Rectangle (GEMBR) is utilized, which is partitioned, mapped, and applied in two different ways to approximate the objects in a spatial joining attribute: the first uses partition indices, while the second uses a bit array representation. Each spatial semijoin is then applied in a multi-site distributed spatial query processing strategy. In addition, the two-site spatial semijoin is extended to handle multiple sites so that we have a benchmark strategy for comparison purposes.

We have tested the query processing algorithms on four sites that are part of an actual working distributed system. The algorithms are compared with respect to data transmission cost, CPU time, I/O time, and false-positive results, and in many cases they are superior at optimizing these criteria. The bit array representation, called the Bloom Filter Based Spatial Semijoin (BFSJ), is evaluated with respect to different filter factors, and the optimized algorithms are found to perform significantly better than the Distributed Naïve Spatial Semijoin strategy on synthetic data. Also, the Partition and Mapping Based Spatial Semijoin (PMSJ) is 1.38 times faster than BFSJ with respect to processing cost, while BFSJ has a transmission cost gain of 1.12 over PMSJ. Both algorithms are 18 times faster and have six times less transmission cost than the Distributed Naïve Spatial Semijoin (NSPJ). Finally, it is also observed that the false-positive percentage increases with the number of hash functions and the filter factor.
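A rough sketch of the bit-array (BFSJ) idea described above, under assumed details: the GEMBR is cut into an nx-by-ny grid, the sending site hashes the grid cells overlapped by its objects' MBRs into a bit array with k hash functions, and the receiving site keeps only objects whose cells appear in that array. The grid resolution, hashing scheme, and parameter names are illustrative, not the thesis's exact design.

```python
import hashlib

def cells(mbr, gembr, nx, ny):
    """GEMBR grid cells overlapped by an object's MBR;
    boxes are (xmin, ymin, xmax, ymax)."""
    gx0, gy0, gx1, gy1 = gembr
    cw, ch = (gx1 - gx0) / nx, (gy1 - gy0) / ny
    x0, x1 = int((mbr[0] - gx0) / cw), min(int((mbr[2] - gx0) / cw), nx - 1)
    y0, y1 = int((mbr[1] - gy0) / ch), min(int((mbr[3] - gy0) / ch), ny - 1)
    return [(x, y) for x in range(max(x0, 0), x1 + 1)
                   for y in range(max(y0, 0), y1 + 1)]

def bloom_bits(cell, m, k):
    """k hashed bit positions for one grid cell."""
    for i in range(k):
        d = hashlib.sha256(f"{i}:{cell}".encode()).digest()
        yield int.from_bytes(d[:8], "big") % m

def encode_site(mbrs, gembr, nx, ny, m=4096, k=3):
    """Compact bit array shipped to the other site in place of the objects."""
    bits = bytearray(m // 8)
    for mbr in mbrs:
        for cell in cells(mbr, gembr, nx, ny):
            for p in bloom_bits(cell, m, k):
                bits[p // 8] |= 1 << (p % 8)
    return bits

def semijoin(local_mbrs, remote_bits, gembr, nx, ny, m=4096, k=3):
    """Keep only local objects that may share a grid cell with the remote
    site's objects; false positives are possible, missed joins are not."""
    def present(cell):
        return all(remote_bits[p // 8] & (1 << (p % 8))
                   for p in bloom_bits(cell, m, k))
    return [mbr for mbr in local_mbrs
            if any(present(c) for c in cells(mbr, gembr, nx, ny))]
```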
Data Structures and Algorithms for Scalable NDN Forwarding
Named Data Networking (NDN) is a recently proposed general-purpose network architecture that aims to address the limitations of the Internet Protocol (IP), while maintaining its strengths. NDN takes an information-centric approach, focusing on named data rather than computer addresses. In NDN, the content is identified by its name, and each NDN packet has a name that specifies the content it is fetching or delivering. Since there are no source and destination addresses in an NDN packet, it is forwarded based on a lookup of its name in the forwarding plane, which consists of the Forwarding Information Base (FIB), Pending Interest Table (PIT), and Content Store (CS). In addition, as an in-network caching element, a scalable Repository (Repo) design is needed to provide large-scale long-term content storage in NDN networks.
Scalable NDN forwarding is a challenge. Compared to the well-understood approaches to IP forwarding, NDN forwarding performs lookups on packet names, which have variable and unbounded lengths, increasing the lookup complexity. The lookup tables are larger than in IP, requiring more memory space. Moreover, NDN forwarding has a read-write data plane, requiring per-packet updates at line rates. Designing and evaluating a scalable NDN forwarding node architecture is a major effort within the overall NDN research agenda.
The goal of this dissertation is to demonstrate that scalable NDN forwarding is feasible with the proposed data structures and algorithms. First, we propose a FIB lookup design based on binary search of hash tables that provides a reliable longest name prefix lookup performance baseline for future NDN research. We have demonstrated 10 Gbps forwarding throughput with 256-byte packets and one billion synthetic forwarding rules, each containing up to seven name components. Second, we explore data structures and algorithms to optimize the FIB design based on the specific characteristics of real-world forwarding datasets. Third, we propose a fingerprint-only PIT design that reduces the memory requirements in core routers. Lastly, we discuss the Content Store design issues and demonstrate that the NDN Repo implementation can leverage many existing databases and storage systems to improve performance.
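As a hedged illustration of the lookup structure named in the first contribution: the classic binary search of hash tables keeps one table per prefix length (here, per number of name components), with marker entries that steer the search toward longer matches and remember the best real match seen so far. The sketch below is a simplified in-memory Python model under assumed names; the actual design engineers hashing, memory layout, and per-packet throughput far beyond this.

```python
def build_fib(rules, max_len=7):
    """One hash table (dict) per prefix length, in name components.
    rules: list of (parts, next_hop), with parts a tuple of components.
    Real entries go at their own length; marker entries are placed at the
    lengths a binary search probes on the way, each carrying the best real
    match for its prefix so the search never loses a shorter match."""
    tables = [dict() for _ in range(max_len + 1)]
    for parts, hop in rules:                       # pass 1: real entries
        tables[len(parts)][parts] = ("entry", hop)

    def best_match(prefix):                        # longest real match (offline)
        for l in range(len(prefix), 0, -1):
            e = tables[l].get(prefix[:l])
            if e and e[0] == "entry":
                return e[1]
        return None

    for parts, _ in rules:                         # pass 2: markers on the path
        lo, hi = 1, max_len
        while lo <= hi:
            m = (lo + hi) // 2
            if m < len(parts):
                tables[m].setdefault(parts[:m], ("marker", best_match(parts[:m])))
                lo = m + 1
            elif m > len(parts):
                hi = m - 1
            else:
                break
    return tables

def lookup(tables, parts, max_len=7):
    """Longest name prefix match by binary search over prefix lengths."""
    best, lo, hi = None, 1, max_len
    while lo <= hi:
        m = (lo + hi) // 2
        e = tables[m].get(parts[:m]) if m <= len(parts) else None
        if e is None:
            hi = m - 1          # nothing at this length: only shorter can match
        else:
            if e[1] is not None:
                best = e[1]     # real entry, or marker carrying its best match
            lo = m + 1          # a longer match may still exist
    return best
```

With L distinct component counts, a lookup probes O(log L) hash tables instead of trying every length.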
Privacy preserving linkage and sharing of sensitive data
Sensitive data, such as personal and business information, is collected by many service providers nowadays. This data is considered a rich source of information for research purposes that could benefit individuals, researchers, and service providers. However, because of the sensitivity of such data, privacy concerns, legislation, and conflicts of interest, data holders are reluctant to share their data with others. Data holders typically filter out or obliterate privacy-related sensitive information from their data before sharing it, which limits the utility of this data and affects the accuracy of research. Such practice protects individuals' privacy; however, it prevents researchers from linking records belonging to the same individual across different sources. This is commonly referred to as the record linkage problem by the healthcare industry. In this dissertation, our main focus is on designing and implementing efficient privacy-preserving methods that will encourage sensitive information sources to share their data with researchers without compromising the privacy of the clients or affecting the quality of the research data. The proposed solution should be scalable and efficient for real-world deployments and provide good privacy assurance. While this problem has been investigated before, most of the proposed solutions were either considered partial solutions, not accurate, or impractical, and therefore subject to further improvements. We have identified several issues and limitations in the state-of-the-art solutions and provided a number of contributions that improve upon existing solutions.

Our first contribution is the design of a privacy-preserving record linkage protocol using a semi-trusted third party. The protocol allows a set of data publishers (data holders) who compete with each other to share sensitive information with subscribers (researchers) while preserving the privacy of their clients and without sharing encryption keys. Our second contribution is the design and implementation of a probabilistic privacy-preserving record linkage protocol that accommodates discrepancies and errors in the data, such as typos. This work builds upon the previous work by linking records that are similar, where the similarity range is formally defined. Our third contribution is a protocol that performs information integration and sharing without third-party services. We use garbled-circuit secure computation to design and build a system that performs record linkage between two parties without sharing their data. Our design uses Bloom filters as inputs to the garbled circuits and performs probabilistic record linkage using the Dice coefficient similarity measure. As garbled circuits are known for their expensive computations, we propose new approaches that reduce the computation overhead needed to achieve a given level of privacy. We built a scalable record linkage system using garbled circuits that can be deployed in a distributed computation environment like the cloud, and evaluated its security and performance. One of the performance issues in linking large datasets is the amount of secure computation needed to compare every pair of records across the linked datasets to find all possible record matches. To reduce the amount of computation, a method known as blocking is used to filter out as many as possible of the record pairs that will not match, limiting the comparison to a subset of the record pairs (called candidate pairs) that possibly match.

Most of the current blocking methods either require the parties to share blocking keys (called block identifiers), extracted from the domain of some record attributes (termed blocking variables), or to share reference data points to group their records around these points using some similarity measures. Though these methods reduce the computation substantially, they leak too much information about the records within each block. Toward this end, we propose a novel privacy-preserving approximate blocking scheme that allows parties to generate the list of candidate pairs with high accuracy, while protecting the privacy of the records in each block. Our scheme is configurable, so that the desired levels of performance and accuracy can be achieved according to the required level of privacy. We analyzed the accuracy and privacy of our scheme, implemented a prototype, and experimentally evaluated its accuracy and performance against different levels of privacy.
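To illustrate the similarity measure used in the third contribution: in Bloom-filter record linkage, fields are commonly decomposed into character q-grams, hashed into a bit array, and compared by the Dice coefficient of the set bits. The sketch below performs that comparison in the clear; in the actual system the comparison runs inside garbled circuits, and the parameters (q, m, k) here are assumptions.

```python
import hashlib

def qgrams(s, q=2):
    """Character q-grams of a field value, e.g. 'john' -> {'jo','oh','hn'}."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def encode(s, m=1000, k=20, q=2):
    """Encode a string's q-grams into an m-bit Bloom filter (as a 0/1 list)."""
    bits = [0] * m
    for g in qgrams(s, q):
        for i in range(k):
            h = hashlib.sha256(f"{i}:{g}".encode()).digest()
            bits[int.from_bytes(h[:8], "big") % m] = 1
    return bits

def dice(a, b):
    """Dice coefficient over set bits: 2|A & B| / (|A| + |B|)."""
    inter = sum(x & y for x, y in zip(a, b))
    return 2 * inter / (sum(a) + sum(b))

# e.g. dice(encode("jonathan"), encode("jonathon")) is close to 1, so the
# pair would be treated as a probable match above a chosen threshold.
```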
Low-latency, query-driven analytics over voluminous multidimensional, spatiotemporal datasets
Ubiquitous data collection from sources such as remote sensing equipment, networked observational devices, location-based services, and sales tracking has led to the accumulation of voluminous datasets; IDC projects that by 2020 we will generate 40 zettabytes of data per year, while Gartner and ABI estimate 20-35 billion new devices will be connected to the Internet in the same time frame. The storage and processing requirements of these datasets far exceed the capabilities of modern computing hardware, which has led to the development of distributed storage frameworks that can scale out by assimilating more computing resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of study: extracting knowledge, insights, and relationships from the underlying datasets. The basic building block of this knowledge discovery process is analytic queries, encompassing both query instrumentation and evaluation. This dissertation is centered around query-driven exploratory and predictive analytics over voluminous, multidimensional datasets. Both of these types of analysis represent a higher-level abstraction over classical query models; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset (including time series and geospatial aspects), and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. This requires specialized data structures and partitioning algorithms, along with adaptive reductions in the search space and management of the inherent trade-off between timeliness and accuracy. The algorithms presented in this dissertation were evaluated empirically on real-world geospatial time-series datasets in a production environment, and are broadly applicable across other storage frameworks.