60 research outputs found
ULTRA-FAST AND MEMORY-EFFICIENT LOOKUPS FOR CLOUD, NETWORKED SYSTEMS, AND MASSIVE DATA MANAGEMENT
Systems that process big data (e.g., high-traffic networks and large-scale storage) prefer data structures and algorithms with small memory footprints and fast processing speeds. Despite continuing hardware improvements, efficient and fast algorithms remain essential to system design. This dissertation is organized around a novel algorithm called Othello Hashing. Othello Hashing supports ultra-fast and memory-efficient key-value lookups, and it fits the requirements of the core algorithms of many large-scale systems and big data applications. Using Othello Hashing, combined with domain expertise in cloud computing, computer networks, big data, and bioinformatics, I developed the following applications, which resolve several major challenges in these areas.
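At its core, an Othello-style structure stores each value as the XOR of two array cells selected by two hash functions, so a query costs two memory reads and one XOR. The toy reconstruction below is a sketch under assumptions (the array sizing, seeding, and retry loop are illustrative, not the dissertation's implementation):

```python
import hashlib
from collections import defaultdict

def _h(seed, key, m):
    """Deterministic hash of a string key into [0, m)."""
    d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "little") % m

class Othello:
    """Toy Othello-style lookup: value(k) = A[h1(k)] XOR B[h2(k)]."""

    def __init__(self, kv):
        n = len(kv)
        self.ma = self.mb = max(4, int(1.34 * n) + 1)  # assumed sizing
        for seed in range(0, 200, 2):  # retry until the key graph is satisfiable
            if self._build(kv, seed):
                self.seed = seed
                return
        raise RuntimeError("construction failed; grow the arrays")

    def _build(self, kv, seed):
        # Each key is an edge between a side-A cell and a side-B cell,
        # labeled with the key's value.
        adj = defaultdict(list)
        for k, v in kv.items():
            u = ("a", _h(seed, k, self.ma))
            w = ("b", _h(seed + 1, k, self.mb))
            adj[u].append((w, v))
            adj[w].append((u, v))
        a, b = [0] * self.ma, [0] * self.mb
        val = lambda node: a[node[1]] if node[0] == "a" else b[node[1]]
        seen = set()
        for start in list(adj):
            if start in seen:
                continue
            seen.add(start)
            stack = [start]
            while stack:  # fix one endpoint of each edge, derive the other by XOR
                node = stack.pop()
                for nxt, v in adj[node]:
                    if nxt in seen:
                        if val(node) ^ val(nxt) != v:
                            return False  # inconsistent cycle: retry with a new seed
                        continue
                    if nxt[0] == "a":
                        a[nxt[1]] = val(node) ^ v
                    else:
                        b[nxt[1]] = val(node) ^ v
                    seen.add(nxt)
                    stack.append(nxt)
        self.a, self.b = a, b
        return True

    def query(self, key):
        """Two memory reads and one XOR."""
        return self.a[_h(self.seed, key, self.ma)] ^ \
               self.b[_h(self.seed + 1, key, self.mb)]
```

Note that querying a key that was never inserted returns an arbitrary value; systems built on such structures must either tolerate or filter these false results.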
Concise: Forwarding Information Base. A Forwarding Information Base (FIB) is a data structure used by the data plane of a forwarding device to determine the proper forwarding actions for packets. The polymorphic property of Othello Hashing enables the separation of its query and control functionalities, which is a perfect match for programmable networks such as Software-Defined Networks. Using Othello Hashing, we built a fast and scalable FIB named Concise. Extensive evaluation results on three different platforms show that Concise outperforms other FIB designs.
SDLB: Cloud Load Balancer. In a cloud network, a layer-4 load balancer is a device that acts as a reverse proxy and distributes network or application traffic across a number of servers. We built a software load balancer named SDLB using Othello Hashing techniques. SDLB accomplishes two functionalities with a single Othello query: finding the designated server for packets of ongoing sessions, and distributing new or session-free packets.
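The single-lookup dispatch idea can be illustrated with a plain dictionary standing in for the Othello structure; the encoding (0 = no pinned session) and all names below are illustrative assumptions, not SDLB's actual design:

```python
import hashlib

def _h(x, m):
    """Deterministic hash of a string into [0, m)."""
    return int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(),
                          "little") % m

class L4Balancer:
    """Sketch of one-lookup dispatch: value 0 means no pinned session,
    so the 5-tuple is hashed across live servers; value i > 0 means the
    packet belongs to the ongoing session pinned to server i - 1."""

    def __init__(self, servers):
        self.servers = servers
        self.table = {}  # session 5-tuple -> server index + 1

    def dispatch(self, five_tuple):
        v = self.table.get(five_tuple, 0)  # the single lookup
        if v:
            return self.servers[v - 1]     # ongoing session: pinned server
        i = _h(five_tuple, len(self.servers))
        self.table[five_tuple] = i + 1     # pin the new session
        return self.servers[i]
```

The point of the encoding is that both cases resolve from one table read, so the fast path never needs a second structure.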
MetaOthello: Taxonomic Classification of Metagenomic Sequences. Metagenomic read classification is a critical step in the identification and quantification of microbial species sampled by high-throughput sequencing. Due to the growing popularity of metagenomic data in both basic science and clinical applications, as well as the increasing volume of data being generated, efficient and accurate algorithms are in high demand. We built a system, MetaOthello, to support efficient taxonomic classification of metagenomic sequences using their k-mer signatures.
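k-mer-based classification reduces to two steps: index reference k-mers with taxon labels, then let a read's k-mers vote. The sketch below is a deliberately naive stand-in (a plain dict instead of a succinct structure, a toy k, and no handling of k-mers shared across taxa):

```python
from collections import Counter

K = 4  # toy k; real classifiers use k around 20-31

def kmers(seq, k=K):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(references):
    """references: taxon -> genome string.  Maps each k-mer to a taxon
    (first-wins here; a real index must resolve shared k-mers, e.g. via LCA)."""
    index = {}
    for taxon, genome in references.items():
        for km in kmers(genome):
            index.setdefault(km, taxon)
    return index

def classify(read, index):
    """Majority vote over the read's k-mers; None if nothing matches."""
    votes = Counter(index[km] for km in kmers(read) if km in index)
    return votes.most_common(1)[0][0] if votes else None
```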
SeqOthello: RNA-seq Sequence Search Engine. Advances in the study of functional genomics have produced a vast supply of RNA-seq datasets. However, quickly querying and extracting information from sequencing resources remains a challenging problem and has been the bottleneck for the broader dissemination of sequencing efforts. The challenge resides in both the sheer volume of the data and its unstructured representation. Using Othello Hashing techniques, we built the SeqOthello sequence search engine. SeqOthello is a reference-free, alignment-free, and parameter-free sequence search system that supports arbitrary sequence queries against large collections of RNA-seq experiments, enabling large-scale integrative studies using sequence-level data.
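The query model can be pictured as a map from k-mers to the set of experiments containing them; a real engine compresses these experiment sets behind a succinct mapping, but a dict conveys the idea. The threshold `theta` and all names are illustrative assumptions:

```python
def build_index(experiments, k=5):
    """experiments: name -> iterable of reads.
    Maps each k-mer to the set of experiment names containing it
    (a toy stand-in for a compressed k-mer -> experiment-bitmap index)."""
    index = {}
    for name, reads in experiments.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                index.setdefault(read[i:i + k], set()).add(name)
    return index

def query(seq, index, k=5, theta=0.5):
    """Return experiments containing at least a fraction theta of the
    query sequence's k-mers."""
    qk = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hits = {}
    for km in qk:
        for name in index.get(km, ()):
            hits[name] = hits.get(name, 0) + 1
    return {n for n, c in hits.items() if c / len(qk) >= theta}
```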
Design of Overlay Networks for Internet Multicast - Doctoral Dissertation, August 2002
Multicast is an efficient transmission scheme for supporting group communication in networks. Contrasted with unicast, where multiple point-to-point connections must be used to support communications among a group of users, multicast is more efficient because each data packet is replicated in the network at the branching points leading to distinct destinations, thus reducing the transmission load on the data sources and the traffic load on the network links. To implement multicast, networks need to incorporate new routing and forwarding mechanisms beyond the existing unicast ones, and these are not adequately supported in current networks. The IP multicast solution has serious scaling and deployment limitations, and cannot be easily extended to provide more enhanced data services. Furthermore, and perhaps most importantly, IP multicast has ignored the economic nature of the problem, lacking incentives for service providers to deploy the service in wide-area networks. Overlay multicast holds promise for the realization of large-scale Internet multicast services. An overlay network is a virtual topology constructed on top of the Internet infrastructure. The concept of overlay networks enables multicast to be deployed as a service network rather than a network primitive, allowing deployment over heterogeneous networks without the need for universal network support. This dissertation addresses the network design aspects of overlay networks to provide scalable multicast services in the Internet. The resources and the network cost in the context of overlay networks are different from those in conventional networks, presenting new challenges and new problems to solve. Our design goals are the maximization of network utility and improved service quality.
As the overall network design problem is extremely complex, we divide the problem into three components: the efficient management of session traffic (multicast routing), the provisioning of overlay network resources (bandwidth dimensioning) and overlay topology optimization (service placement). The combined solution provides a comprehensive procedure for planning and managing an overlay multicast network. We also consider a complementary form of overlay multicast called application-level multicast (ALMI). ALMI allows end systems to directly create an overlay multicast session among themselves. This gives applications the flexibility to communicate without relying on service providers. The tradeoff is that users do not have direct control over the topology and data paths taken by the session flows, and will typically get lower quality of service due to the best-effort nature of the Internet environment. ALMI is therefore suitable for sessions of small size or sessions where all members are well connected to the network. Furthermore, the ALMI framework allows us to experiment with application-specific components, such as data reliability, in order to identify a useful set of communication semantics for enhanced data services.
Practical Data Structures
In this thesis, we implement several data structures for ordered and unordered dictionaries and we benchmark their performance in main memory on synthetic and practical workloads. Our survey includes both well-known data structures (B-trees, red-black trees, splay trees and hashing) and more exotic approaches (k-splay trees and k-forests). Department of Applied Mathematics, Faculty of Mathematics and Physics.
Algorithms for Large-Scale Internet Measurements
As the Internet has grown in size and importance to society, it has become
increasingly difficult to generate global metrics of interest that can be used to verify
proposed algorithms or monitor performance. This dissertation tackles the problem
by proposing several novel algorithms designed to perform Internet-wide measurements
using existing or inexpensive resources.
We initially address distance estimation in the Internet, which is used by many
distributed applications. We propose a new end-to-end measurement framework
called Turbo King (T-King) that uses the existing DNS infrastructure and, when
compared to its predecessor King, obtains delay samples without bias in the presence
of distant authoritative servers and forwarders, consumes half the bandwidth, and
reduces the impact on caches at remote servers by several orders of magnitude.
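The King estimate that Turbo King refines boils down to a subtraction: measure the probe's direct RTT to nameserver A, then the RTT of a recursive query that A forwards to nameserver B, and difference the two. The numbers below are illustrative only:

```python
def king_estimate(rtt_probe_to_nsA, rtt_recursive_via_nsA):
    """King-style latency estimate between two DNS nameservers.
    The recursive lookup traverses probe -> nsA -> nsB -> nsA -> probe,
    so subtracting the direct probe -> nsA RTT leaves the nsA <-> nsB RTT."""
    return rtt_recursive_via_nsA - rtt_probe_to_nsA

king_estimate(30.0, 110.0)  # 80.0 ms estimated between nsA and nsB
```

The biases Turbo King corrects arise when this clean picture breaks, e.g. when forwarders or distant authoritative servers sit on the recursive path.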
Motivated by recent interest in the literature and our need to find remote DNS
nameservers, we next address Internet-wide service discovery by developing IRLscanner,
whose main design objectives have been to maximize politeness at remote networks,
allow scanning rates that achieve coverage of the Internet in minutes/hours
(rather than weeks/months), and significantly reduce administrator complaints. Using
IRLscanner and 24-hour scan durations, we perform 20 Internet-wide experiments
using 6 different protocols (i.e., DNS, HTTP, SMTP, EPMAP, ICMP and UDP
ECHO). We analyze the feedback generated and suggest novel approaches for reducing
the amount of blowback during similar studies, which should enable researchers
to collect valuable experimental data in the future with significantly fewer hurdles.
We finally turn our attention to Intrusion Detection Systems (IDS), which are
often tasked with detecting scans and preventing them; however, it is currently unknown
how likely an IDS is to detect a given Internet-wide scan pattern and whether
there exist sufficiently fast stealth techniques that can remain virtually undetectable
at large-scale. To address these questions, we propose a novel model for the window-expiration
rules of popular IDS tools (i.e., Snort and Bro), derive the probability that
existing scan patterns (i.e., uniform and sequential) are detected by each of these
tools, and prove the existence of stealth-optimal patterns.
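A minimal window-expiration model, with purely illustrative parameters, shows why slow scans evade detection: probes spaced wider than the window never accumulate toward the threshold.

```python
from collections import defaultdict, deque

class WindowIDS:
    """Toy model of window-expiration scan detection (Snort-style):
    flag a source once it touches more than `threshold` distinct targets
    within `window` seconds."""

    def __init__(self, window=60.0, threshold=5):
        self.window, self.threshold = window, threshold
        self.events = defaultdict(deque)  # source -> deque of (time, target)

    def observe(self, t, src, dst):
        q = self.events[src]
        q.append((t, dst))
        while q and q[0][0] <= t - self.window:  # expire old events
            q.popleft()
        return len({d for _, d in q}) > self.threshold  # True = scan alarm
```

With these parameters, a one-probe-per-second scan alarms within seconds, while probes spaced 61 s apart reach arbitrarily many targets without ever tripping the counter.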
Rank-aware, Approximate Query Processing on the Semantic Web
Search over the Semantic Web corpus frequently leads to queries having large result sets. In order to discover relevant data elements, users must therefore rely on ranking techniques to sort results according to their relevance. At the same time, applications often deal with information needs that do not require complete and exact results. In this thesis, we address the problem of processing queries over Web data in an approximate and rank-aware fashion.
Scalable adaptive group communication on bi-directional shared prefix trees
Efficient group communication within the Internet has been implemented by
multicast. Unfortunately, its global deployment never happened. Nevertheless,
emerging and increasingly popular applications, like IPTV or
large-scale social video chats, require an economical data distribution
throughout the Internet. To overcome the limitations of multicast deployment,
we introduce and analyze BIDIR-SAM, the first structured overlay multicast
scheme based on bi-directional shared prefix trees. BIDIR-SAM admits
predictable costs growing logarithmically with increasing group size. We also
present a broadcast approach for DHT-enabled P2P networks. Both schemes are
integrated in a standard compliant hybrid group communication architecture,
bridging the gap between overlay and underlay as well as between inter- and
intra-domain multicast.
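The shared-prefix-tree idea can be sketched as a recursive split of the member set by successive id bits; for balanced ids this is what bounds the distribution tree's depth, and hence per-member cost, logarithmically in group size. The bitstring encoding below is a toy assumption:

```python
def multicast(prefix, members, deliver):
    """Toy shared prefix tree: members are fixed-length id bitstrings.
    Each tree node (identified by a prefix) splits its group by the next
    id bit and forwards down both subtrees until groups are singletons."""
    group = [m for m in members if m.startswith(prefix)]
    if not group:
        return
    if len(group) == 1 or len(prefix) == len(group[0]):
        for m in group:
            deliver(m)  # leaf of the distribution tree
        return
    multicast(prefix + "0", group, deliver)
    multicast(prefix + "1", group, deliver)
```

Joins and leaves only touch the path from the affected leaf to the nearest branching prefix, which is the source of the predictable logarithmic cost.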
Efficient Algorithms for Large-Scale Image Analysis
This work develops highly efficient algorithms for analyzing large images. Applications include object-based change detection and screening. The algorithms are 10-100 times as fast as existing software, sometimes even outperforming FPGA/GPU hardware, because they are designed to suit the computer architecture. This thesis describes the implementation details and the underlying algorithm-engineering methodology, so that both may also be applied to other applications.
Atum: Scalable Group Communication Using Volatile Groups
This paper presents Atum, a group communication middleware for a large, dynamic, and hostile environment. At the heart of Atum lies the novel concept of volatile groups: small, dynamic groups of nodes, each executing a state machine replication protocol, organized in a flexible overlay. Using volatile groups, Atum scatters faulty nodes evenly among groups, and then masks each individual fault inside its group. To broadcast messages among volatile groups, Atum runs a gossip protocol across the overlay. We report on our synchronous and asynchronous (eventually synchronous) implementations of Atum, as well as on three representative applications that we build on top of it: a publish/subscribe platform, a file sharing service, and a data streaming system. We show that (a) Atum can grow at an exponential rate beyond 1000 nodes and disseminate messages in polylogarithmic time (conveying good scalability); (b) it smoothly copes with 18% of nodes churning every minute; and (c) it is impervious to arbitrary faults, suffering no performance decay despite 5.8% Byzantine nodes in a system of 850 nodes.
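The polylogarithmic dissemination claim can be illustrated with a toy push-gossip simulation over group indices; a set of integers stands in for the volatile groups, and the fanout and seed are arbitrary choices, not Atum's parameters:

```python
import random

def gossip_rounds(n_groups, fanout=3, seed=1):
    """Toy push gossip among groups: in each round, every informed group
    pushes the message to `fanout` uniformly random groups.  Returns the
    number of rounds until all groups are informed."""
    rng = random.Random(seed)
    informed = {0}  # group 0 originates the broadcast
    rounds = 0
    while len(informed) < n_groups:
        new = set()
        for _g in informed:
            for _ in range(fanout):
                new.add(rng.randrange(n_groups))
        informed |= new
        rounds += 1
    return rounds
```

Because the informed set roughly multiplies each round, coverage of n groups takes O(log n) rounds in expectation, matching the polylogarithmic dissemination time reported above.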