Qd-tree: Learning Data Layouts for Big Data Analytics
Corporations today collect data at an unprecedented and accelerating scale,
making the need to run queries on large datasets increasingly important.
Technologies such as columnar block-based data organization and compression
have become standard practice in most commercial database systems. However, the
problem of best assigning records to data blocks on storage is still open. For
example, today's systems usually partition data by arrival time into row
groups, or range/hash partition the data based on selected fields. For a given
workload, however, such techniques are unable to optimize for the important
metric of the number of blocks accessed by a query. This metric directly
relates to the I/O cost, and therefore performance, of most analytical queries.
Further, they are unable to exploit additional available storage to drive this
metric down further.
In this paper, we propose a new framework called a query-data routing tree,
or qd-tree, to address this problem, and present two algorithms for its
construction based on greedy and deep reinforcement learning techniques.
Experiments over benchmark and real workloads show that a qd-tree can provide
physical speedups of more than an order of magnitude compared to current
blocking schemes, and can reach within 2X of the lower bound for data skipping
based on selectivity, while providing complete semantic descriptions of created
blocks.
Comment: ACM SIGMOD 202
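The "number of blocks accessed" metric in this abstract can be illustrated with a small sketch. It is not the qd-tree algorithm: it assumes a single numeric field, fixed-size blocks, and per-block min/max metadata, and all function names (`make_blocks`, `blocks_accessed`) are hypothetical. It compares an arrival-order layout with a layout sorted on the queried field.

```python
import random

def make_blocks(values, block_size):
    """Split values into consecutive blocks and keep each block's (min, max)."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]

def blocks_accessed(block_ranges, lo, hi):
    """Count blocks whose [min, max] range overlaps the query range [lo, hi]."""
    return sum(1 for bmin, bmax in block_ranges if bmin <= hi and bmax >= lo)

# Assumed toy data: 10,000 records with one integer field, in arrival order.
rng = random.Random(0)
values = [rng.randrange(1000) for _ in range(10_000)]

# Layout 1: blocks formed in arrival order (values are mixed within each block).
arrival_layout = make_blocks(values, block_size=100)
# Layout 2: blocks formed after sorting on the queried field.
sorted_layout = make_blocks(sorted(values), block_size=100)

query = (200, 249)  # range predicate: 200 <= field <= 249
arrival_cost = blocks_accessed(arrival_layout, *query)
sorted_cost = blocks_accessed(sorted_layout, *query)
print("arrival-order blocks accessed:", arrival_cost)
print("sorted-layout blocks accessed:", sorted_cost)
```

Sorting on one field only helps queries that filter on that field. The abstract's point is that a layout learned from the whole workload can reduce this count across many different predicates.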
Efficient Latent Semantic Extraction from Cross Domain Data with Declarative Language
With large amounts of data continuously generated by intelligent devices, efficient analysis of huge data collections to unearth valuable insights has become one of the most elusive challenges for both academia and industry. The key elements of a scalable analysis framework should include (1) an intuitive interface to describe the desired outcome, (2) a well-crafted model that integrates all available information sources to derive the optimal outcome, and (3) an efficient algorithm that performs the data integration and extraction within a reasonable amount of time. In this dissertation, we address these challenges by proposing (1) a cross-language interface for the succinct expression of recursive queries, (2) a domain-specific neural network model that can incorporate information from multiple modalities, and (3) a sample-efficient training method that can be used even for classifiers with an extremely large number of output classes. Our contributions in this thesis are thus threefold. First, for the ubiquitous recursive queries in advanced data analytics, we design, on top of BigDatalog and Apache Spark, a succinct and expressive analytics tool encapsulating the functionality and classical algorithms of Datalog, a quintessential logic programming language. We provide the Logical Library (LLib), a Spark MLlib-like high-level API supporting a wide range of recursive algorithms, and the Logical DataFrame (LFrame), an extension to Spark DataFrame supporting both relational and logical operations. LLib and LFrame enable smooth collaboration between logical applications and other Spark libraries, as well as cross-language logical programming in Scala, Java, or Python. Second, we use variants of recurrent neural networks (RNNs) to incorporate valuable sequential information overlooked by conventional work in two different domains: Spoken Language Understanding (SLU) and Internet Embedding (IE).
In SLU, we address the problem caused by relying solely on the first-best interpretation (hypothesis) of an audio command through a series of new architectures comprising bidirectional LSTM and pooling layers that jointly use the texts or embedding vectors of the other hypotheses, which are usually neglected but carry valuable information missing from the first-best hypothesis. In IE, we propose DIP, an extension of RNN, to build an internet coordinate system from IP address sequences, which conventional distance-based internet embedding algorithms also overlook but which encode structural information about the network. Both DIP and the integration of all hypotheses bring significant performance improvements for the corresponding downstream tasks. Finally, we investigate the training algorithm for multi-class classifiers with a large number of output classes, which are common in deep neural networks and typically implemented as a softmax final layer with one output neuron per class. To avoid the expensive computation of the intractable softmax normalizing constant for each training data point, we analyze the well-known negative sampling method and improve it into the amplified negative sampling algorithm, which achieves much higher performance at lower training cost.
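The cost argument behind negative sampling can be made concrete with a minimal sketch. It shows the standard negative sampling loss, not the amplified variant proposed in the dissertation, and it assumes uniform sampling of negatives. The function names and the toy scores are hypothetical.

```python
import math
import random

def softmax_loss(scores, target):
    """Full softmax cross-entropy: the normalizer sums over every class."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def negative_sampling_loss(scores, target, num_negatives, rng):
    """Logistic loss on the target plus a few uniformly sampled negative classes."""
    negatives = []
    while len(negatives) < num_negatives:
        c = rng.randrange(len(scores))
        if c != target:  # reject the true class
            negatives.append(c)
    loss = -log_sigmoid(scores[target])
    loss -= sum(log_sigmoid(-scores[c]) for c in negatives)
    return loss, negatives

# Assumed toy setup: 50,000 output classes with random scores (logits).
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
target = 7

full = softmax_loss(scores, target)  # touches all 50,000 scores
sampled, negatives = negative_sampling_loss(scores, target, 5, rng)  # touches 6 scores
print("full softmax loss:", full)
print("negative sampling loss:", sampled, "using negatives", negatives)
```

The two losses are different objectives, so their values are not directly comparable. The point of the sketch is only that the sampled loss reads a handful of scores per training example instead of all of them.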
Efficient compression of motion compensated residuals
EThOS - Electronic Theses Online Service, GB, United Kingdom
A Taxonomy for and Analysis of Anonymous Communications Networks
Any entity operating in cyberspace is susceptible to debilitating attacks. With cyber attacks intended to gather intelligence and disrupt communications rapidly replacing the threat of conventional and nuclear attacks, a new age of warfare is at hand. In 2003, the United States acknowledged that the speed and anonymity of cyber attacks make distinguishing among the actions of terrorists, criminals, and nation states difficult. Even President Obama’s Cybersecurity Chief-elect recognizes the challenge of increasingly sophisticated cyber attacks. Now through April 2009, the White House is reviewing federal cyber initiatives to protect US citizen privacy rights. Indeed, the rising quantity and ubiquity of new surveillance technologies in cyberspace enable instant, undetectable, and unsolicited information collection about entities. Hence, anonymity and privacy are becoming increasingly important issues. Anonymization enables entities to protect their data and systems from a diverse set of cyber attacks and preserves privacy. This research provides a systematic analysis of anonymity degradation, preservation and elimination in cyberspace to enhance the security of information assets. This includes discovery/obfuscation of identities and actions of/from potential adversaries. First, novel taxonomies are developed for classifying and comparing well-established anonymous networking protocols. These expand the classical definition of anonymity and capture the peer-to-peer and mobile ad hoc anonymous protocol family relationships. Second, a unique synthesis of state-of-the-art anonymity metrics is provided. This significantly aids an entity’s ability to reliably measure changing anonymity levels, thereby increasing its ability to defend against cyber attacks. Finally, a novel epistemic-based mathematical model is created to characterize how an adversary reasons with knowledge to degrade anonymity. This offers multiple anonymity property representations and well-defined logical proofs to ensure the accuracy and correctness of current and future anonymous network protocol design.
Monitoring Computer Systems: An Intelligent Approach
Monitoring modern computer systems is increasingly difficult due to their peculiar characteristics. To cope with this situation, the dissertation develops an approach to intelligent monitoring. The resulting model consists of three major designs: representing targets, controlling data collection, and autonomously refining monitoring performance. The model explores a more declarative object-oriented model by introducing virtual objects to dynamically compose abstract representations, while it treats conventional hard-wired hierarchies and predefined object classes as primitive structures. Taking the representational framework as a basis for reasoning, the design for controlling mechanisms adopts default reasoning backed up with ordered constraints, so that the amount of data collected, the level of detail, the semantics, and the resolution of observation can be appropriately controlled. The refining mechanisms classify invoked knowledge and update the classified knowledge in terms of the feedback from monitoring. The approach is designed first and then formally specified. Applications of the resulting model are examined and an operational prototype is implemented. Thus the dissertation establishes a basis for an approach to intelligent monitoring, one which would be equipped to deal effectively with the difficulties that arise in monitoring modern computer systems.
The 1995 Science Information Management and Data Compression Workshop
This document is the proceedings of the 'Science Information Management and Data Compression Workshop,' which was held on October 26-27, 1995, at the NASA Goddard Space Flight Center, Greenbelt, Maryland. The Workshop explored promising computational approaches for handling the collection, ingestion, archival, and retrieval of large quantities of data in future Earth and space science missions. It consisted of fourteen presentations covering a range of information management and data compression approaches that are being or have been integrated into actual or prototypical Earth or space science data information systems, or that hold promise for such an application. The Workshop was organized by James C. Tilton and Robert F. Cromp of the NASA Goddard Space Flight Center.
Signal fingerprinting and machine learning framework for UAV detection and identification.
Advances in technology have led to creative and innovative inventions. One such invention is the unmanned aerial vehicle (UAV). UAVs (also known as drones) are now an intrinsic part of our society because their applications are becoming ubiquitous in industries ranging from transportation and logistics to environmental monitoring, among others. Alongside their numerous benign applications, the emergence of UAVs has added a new dimension to privacy and security issues. There are few or no strict regulations on who can purchase or own a UAV. For this reason, nefarious actors can take advantage of these aircraft to intrude into restricted or private areas. A UAV detection and identification system is one way of detecting and identifying the presence of a UAV in an area. UAV detection and identification systems employ different sensing techniques such as radio frequency (RF) signals, video, sounds, and thermal imaging for detecting an intruding UAV. The passive (stealthy) nature of RF sensing, its ability to identify the UAV flight mode (i.e., flying, hovering, videoing, etc.), and its capability to detect a UAV beyond visual line-of-sight (BVLOS) or at marginal line-of-sight make RF sensing techniques promising for UAV detection and identification. Moreover, there is constant communication between a UAV and its ground station (i.e., flight controller). The RF signals emitted by a UAV or UAV flight controller can be exploited for UAV detection and identification. Hence, in this work, an RF-based UAV detection and identification system is proposed and investigated. In RF signal fingerprinting research, the transient and steady state of the RF signals can be used to extract a unique signature.
The first part of this work uses two different wavelet analytic transforms (i.e., continuous wavelet transform and wavelet scattering transform) to investigate and analyze the characteristics or impacts of using either state for UAV detection and identification. Coefficient-based and image-based signatures are proposed for each of the wavelet analysis transforms to detect and identify a UAV. One of the challenges of using RF sensing is that a UAV's communication links operate in the industrial, scientific, and medical (ISM) band. Several devices such as Bluetooth and WiFi operate in the ISM band as well, so discriminating UAVs from other ISM devices is not a trivial task. A semi-supervised anomaly detection approach is explored and proposed in this research to differentiate UAVs from Bluetooth and WiFi devices. Both time-frequency analytical approaches and unsupervised deep neural network techniques (i.e., denoising autoencoder) are used differently for feature extraction. Finally, a hierarchical classification framework for UAV identification is proposed for the identification of the type of unmanned aerial system signal (UAV or UAV controller signal), the UAV model, and the operational mode of the UAV. This is a shift from a flat classification approach. The hierarchical learning approach provides a level-by-level classification that can be useful for identifying an intruding UAV. The proposed frameworks described here can be extended to the detection of rogue RF devices in an environment.
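The level-by-level classification described in this abstract can be sketched as a tree of classifiers, one per node of the label hierarchy. The sketch below is illustrative only: a nearest-centroid classifier stands in for whatever models the thesis uses, and the two-dimensional features and the label names (`uav`, `controller`, `model_a`, `model_b`, `hovering`, `flying`) are invented for the example.

```python
import math
from collections import defaultdict

class NearestCentroid:
    """Toy stand-in classifier: predicts the label with the closest centroid."""

    def fit(self, samples, labels):
        sums = defaultdict(lambda: [0.0] * len(samples[0]))
        counts = defaultdict(int)
        for x, y in zip(samples, labels):
            counts[y] += 1
            for i, v in enumerate(x):
                sums[y][i] += v
        self.centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda y: math.dist(x, self.centroids[y]))

def fit_hierarchy(samples, label_paths):
    """Train one classifier per label prefix (internal node of the label tree)."""
    groups = defaultdict(lambda: ([], []))
    for x, path in zip(samples, label_paths):
        for depth in range(len(path)):
            xs, ys = groups[path[:depth]]
            xs.append(x)
            ys.append(path[depth])
    return {prefix: NearestCentroid().fit(xs, ys) for prefix, (xs, ys) in groups.items()}

def predict_path(models, x):
    """Classify level by level: signal type, then model, then operational mode."""
    path = ()
    while path in models:
        path += (models[path].predict(x),)
    return path

# Hypothetical 2-D signatures, each labeled (signal type, model[, flight mode]).
samples = [(0.0, 0.0), (0.2, 0.1), (0.0, 1.0), (0.1, 1.2),
           (2.0, 0.0), (2.1, 0.2), (5.0, 5.0), (5.2, 5.1)]
label_paths = [("uav", "model_a", "hovering"), ("uav", "model_a", "hovering"),
               ("uav", "model_a", "flying"), ("uav", "model_a", "flying"),
               ("uav", "model_b", "hovering"), ("uav", "model_b", "hovering"),
               ("controller", "model_a"), ("controller", "model_a")]

models = fit_hierarchy(samples, label_paths)
print(predict_path(models, (0.1, 0.9)))
print(predict_path(models, (5.1, 5.0)))
```

Compared with a flat classifier over every (type, model, mode) combination, each node here only has to separate its own children, and a prediction can stop at whatever depth the hierarchy defines for that branch.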