
    Sketch *-metric: Comparing Data Streams via Sketching

    12 pages, double columns. In this paper, we consider the problem of estimating the distance between any two large data streams under a small-space constraint. This problem is of utmost importance in data-intensive monitoring applications where input streams are generated rapidly. These streams need to be processed on the fly and accurately in order to quickly detect any deviation from nominal behavior. We present a new metric, the Sketch ⋆-metric, which defines a distance between updatable summaries (or sketches) of large data streams. An important feature of the Sketch ⋆-metric is that, given a measure on the entire initial data streams, it preserves the axioms of that measure on the sketches (non-negativity, identity, symmetry, and the triangle inequality, as well as specific properties of f-divergences). Extensive experiments conducted on both synthetic traces and real data validate the robustness and accuracy of the Sketch ⋆-metric.
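
    To make the setting concrete, here is a minimal, hypothetical sketch of comparing two streams through their summaries only: each stream is condensed into a Count-Min sketch and a heuristic L1-style distance is computed from the two counter tables alone. This is merely an illustration of sketch-level comparison under assumed parameters (depth, width, hash construction); it is not the Sketch ⋆-metric defined in the paper.

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: a depth x width counter matrix with per-row salted hashes."""
    def __init__(self, depth=4, width=256, seed=42):
        rng = random.Random(seed)
        self.depth, self.width = depth, width
        # One (a, b) salt pair per row; identical seeds across sketches keep them comparable.
        self.salts = [(rng.randrange(1, 1 << 31), rng.randrange(0, 1 << 31)) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.salts[row]
        return (a * hash(item) + b) % self.width

    def update(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._bucket(r, item)] += count

def sketch_l1_distance(s1, s2):
    """Heuristic L1-style distance computed only from the sketches: per row, sum the
    absolute bucket differences, then take the minimum over rows."""
    assert s1.salts == s2.salts, "sketches must share hash functions to be comparable"
    row_dists = [sum(abs(a - b) for a, b in zip(r1, r2))
                 for r1, r2 in zip(s1.table, s2.table)]
    return min(row_dists)

# Toy usage: two synthetic streams with slightly different item distributions.
stream_a = [f"ip-{i % 50}" for i in range(10_000)]
stream_b = [f"ip-{i % 60}" for i in range(10_000)]
sa, sb = CountMinSketch(), CountMinSketch()
for x in stream_a:
    sa.update(x)
for x in stream_b:
    sb.update(x)
print("sketch-level distance:", sketch_l1_distance(sa, sb))
```

    Because both sketches share the same hash functions, the distance is computed without revisiting the original streams, which is the small-space setting the paper targets.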

    Essays on risk and uncertainty in financial decision making: Bayesian inference of multi-factor affine term structure models and dynamic optimal portfolio choices for robust preferences

    Thesis (Ph.D.)--Boston University. This thesis studies model inference about risk and decision making under model uncertainty in two specific settings. The first part of the thesis develops a Bayesian Markov Chain Monte Carlo (MCMC) estimation method for multi-factor affine term structure models. Affine term structure models are popular because they provide closed-form solutions for the valuation of fixed income securities. Efficient estimation methods for parameters of these models, however, are not readily available. The MCMC algorithms developed provide more accurate estimates, compared with alternative estimation methods. The superior performance of the MCMC algorithms is first documented in a simulation study. Convergence of the algorithm used to sample posterior distributions is documented in numerical experiments. The Bayesian MCMC methodology is then applied to yield data. The in-sample pricing errors obtained are significantly smaller than those of alternative methods. A Bayesian forecast analysis documents the significantly superior predictive power of the MCMC approach. Finally, Bayesian model selection criteria are discussed. Incorporating aspects of model uncertainty for the optimal allocation of risk has become an important topic in finance. The second part of the thesis considers an optimal dynamic portfolio choice problem for an ambiguity-averse investor. It introduces new preferences that allow the separation of risk and ambiguity aversion. The novel representation is based on generalized divergence measures that capture richer forms of model uncertainty than traditional relative entropy measures. The novel preferences are shown to have a homothetic stochastic differential utility representation. Based on this representation, optimal portfolio policies are derived using numerical schemes for forward-backward stochastic differential equations. The optimal portfolio policy is shown to contain new hedging motives induced by the investor's attitude toward model uncertainty. Ambiguity concerns introduce additional horizon effects, boost effective risk aversion, and overall reduce optimal investment in risky assets. These findings have important implications for the design of optimal portfolios in the presence of model uncertainty.
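
    As a toy, hedged illustration of Bayesian MCMC estimation for a term structure model (far simpler than the multi-factor affine models and samplers developed in the thesis), the snippet below runs a random-walk Metropolis sampler for the long-run mean of a one-factor Vasicek short-rate model observed without error; the model, priors, and parameter values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a toy short-rate path from a one-factor Vasicek model (Euler scheme).
kappa, theta_true, sigma, dt, n = 0.5, 0.04, 0.01, 1 / 252, 2000
r = np.empty(n)
r[0] = 0.03
for t in range(1, n):
    r[t] = r[t - 1] + kappa * (theta_true - r[t - 1]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()

def log_likelihood(theta):
    """Gaussian transition density of the Euler-discretized Vasicek model; kappa and sigma known."""
    mean = r[:-1] + kappa * (theta - r[:-1]) * dt
    sd = sigma * np.sqrt(dt)
    return np.sum(-0.5 * ((r[1:] - mean) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi))

def log_prior(theta):
    """Weakly informative normal prior centred at 5%."""
    return -0.5 * ((theta - 0.05) / 0.05) ** 2

# Random-walk Metropolis over the long-run mean theta.
draws, theta, step = [], 0.02, 0.002
log_post = log_likelihood(theta) + log_prior(theta)
for _ in range(5000):
    prop = theta + step * rng.standard_normal()
    log_post_prop = log_likelihood(prop) + log_prior(prop)
    if np.log(rng.uniform()) < log_post_prop - log_post:
        theta, log_post = prop, log_post_prop
    draws.append(theta)

burned = np.array(draws[1000:])
print(f"posterior mean of theta: {burned.mean():.4f} (true value {theta_true})")
```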

    ENABLING TECHNIQUES FOR EXPRESSIVE FLOW FIELD VISUALIZATION AND EXPLORATION

    Flow visualization plays an important role in many scientific and engineering disciplines such as climate modeling, turbulent combustion, and automobile design. The most common method for flow visualization is to display integral flow lines such as streamlines computed from particle tracing. Effective streamline visualization should capture flow patterns and display them with appropriate density, so that critical flow information can be visually acquired. In this dissertation, we present several approaches that facilitate expressive flow field visualization and exploration. First, we design a unified information-theoretic framework to model streamline selection and viewpoint selection as symmetric problems. Two interrelated information channels are constructed between a pool of candidate streamlines and a set of sample viewpoints. Based on these information channels, we define streamline information and viewpoint information to select the best streamlines and viewpoints, respectively. Second, we present a focus+context framework to magnify small features and reduce occlusion around them while compacting the context region in a full view. This framework partitions the volume into blocks and deforms them to guide streamline repositioning. The desired deformation is formulated into energy terms and achieved by minimizing the energy function. Third, measuring the similarity of integral curves is fundamental to many tasks such as feature detection, pattern querying, streamline clustering, and hierarchical exploration. We introduce FlowString, which extracts shape-invariant features from streamlines to form an alphabet of characters and encodes each streamline into a string. The similarity of two streamline segments then becomes a specially designed edit distance between two strings. Leveraging the suffix tree, FlowString provides a string-based method for exploratory streamline analysis and visualization. A universal alphabet is learned from multiple data sets to capture basic flow patterns that exist in a variety of flow fields. This allows easy comparison and efficient query across data sets. Fourth, for the exploration of vascular data sets, which contain a series of vector fields together with multiple scalar fields, we design a web-based approach for users to investigate the relationships among different properties guided by histograms. The vessel structure is mapped from the 3D volume space to a 2D graph, which allows more efficient interaction and effective visualization on websites. A segmentation scheme is proposed to divide the vessel structure based on a user-specified property to further explore the distribution of that property over space.
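
    FlowString's alphabet is learned from shape-invariant streamline features; purely as a hedged illustration of the string-encoding idea, the sketch below quantizes turning angles along a streamline into an invented five-character alphabet and compares two curves with a standard edit distance. The encoding rule, thresholds, and characters are not those of FlowString.

```python
import numpy as np

def encode(streamline, step=10):
    """Encode a polyline (N x 2 array) as a string by quantizing the signed turning angle
    at every `step`-th vertex: 'L'/'l' left turns, 's' straight, 'r'/'R' right turns."""
    pts = np.asarray(streamline, dtype=float)[::step]
    vecs = np.diff(pts, axis=0)
    chars = []
    for v0, v1 in zip(vecs[:-1], vecs[1:]):
        cross = v0[0] * v1[1] - v0[1] * v1[0]
        dot = v0[0] * v1[0] + v0[1] * v1[1]
        angle = np.degrees(np.arctan2(cross, dot))
        if angle > 30:
            chars.append('L')
        elif angle > 5:
            chars.append('l')
        elif angle >= -5:
            chars.append('s')
        elif angle >= -30:
            chars.append('r')
        else:
            chars.append('R')
    return ''.join(chars)

def edit_distance(a, b):
    """Classic Levenshtein distance computed with a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

# Toy usage: a half-circle arc versus a nearly straight, slightly wavy line.
t = np.linspace(0, np.pi, 200)
arc = np.column_stack([np.cos(t), np.sin(t)])
line = np.column_stack([t, 0.02 * np.sin(5 * t)])
print(edit_distance(encode(arc), encode(line)))
```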

    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity for IT professionals and researchers alike, and it is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
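
    As a small illustration of the "simpler results" mentioned above (null counts, distinct counts, data types, frequent values), the snippet below computes a single-column profile with pandas; the example table is made up.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Single-column profiling: inferred dtype, null count, distinct count, most frequent value."""
    rows = []
    for col in df.columns:
        s = df[col]
        top = s.mode(dropna=True)
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "nulls": int(s.isna().sum()),
            "distinct": int(s.nunique(dropna=True)),
            "most_frequent": top.iloc[0] if not top.empty else None,
        })
    return pd.DataFrame(rows)

# Toy usage on a made-up table; in practice the frame would come from, e.g., pd.read_csv.
df = pd.DataFrame({"id": [1, 2, 3, 3], "city": ["Berlin", None, "Berlin", "Potsdam"]})
print(profile(df))
```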

    Automata Theory on Sliding Windows

    In a recent paper we analyzed the space complexity of streaming algorithms whose goal is to decide membership of a sliding window in a fixed language. For the class of regular languages we proved a space trichotomy theorem: for every regular language the optimal space bound is either constant, logarithmic, or linear. In this paper we continue this line of research: we present natural characterizations for the constant- and logarithmic-space classes and establish tight relationships to the concept of language growth. We also analyze the space complexity with respect to automaton size and prove almost matching lower and upper bounds. Finally, we consider the decision problem of whether a language given by a DFA/NFA admits a sliding-window algorithm using logarithmic/constant space.
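
    As a toy illustration of the two lower ends of the trichotomy (not taken from the paper), the snippet below decides sliding-window membership for two simple regular languages: Σ*a, which needs only constant space (the most recent symbol), and a*, which over a fixed window of size n needs only O(log n) bits (the position of the most recent non-'a' symbol). The window convention and initial padding are assumptions of the example.

```python
def window_endswith_a(stream):
    """Fixed-size sliding-window membership for Sigma*·a (window content ends with 'a').
    Only the most recent symbol is stored: O(1) space, regardless of the window size."""
    last = None
    for symbol in stream:
        last = symbol
        yield last == 'a'

def window_all_a(stream, n):
    """Fixed-size sliding-window membership for a* with window size n.
    It suffices to remember the index of the most recent non-'a' symbol: O(log n) bits.
    Positions before the stream start are treated as padding symbols 'a'."""
    last_bad = -n - 1              # most recent position holding a symbol != 'a'
    for i, symbol in enumerate(stream):
        if symbol != 'a':
            last_bad = i
        # The window covers positions i-n+1 .. i; it is all 'a's iff last_bad fell out of it.
        yield last_bad < i - n + 1

print(list(window_endswith_a("abbaab")))
# [True, False, False, True, True, False]
print(list(window_all_a("aababaaaa", 3)))
# [True, True, False, False, False, False, False, True, True]
```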

    Mining complex data in highly streaming environments

    Data is growing at a rapid rate because of advanced hardware and software technologies and platforms such as e-health systems, sensor networks, and social media. One of the challenging problems is storing, processing, and transferring this big data in an efficient and effective way. One solution to tackle these challenges is to construct synopses by means of data summarization techniques. Motivated by the fact that without summarization, processing, analyzing, and communicating this vast amount of data is inefficient, this thesis introduces new summarization frameworks with the main goals of reducing communication costs and accelerating data mining processes in different application scenarios. Specifically, we study the following big data summarization techniques: (i) dimensionality reduction, (ii) clustering, and (iii) histograms, considering their importance and wide use in various areas and domains. In our work, we propose three different frameworks using these summarization techniques to cover three different aspects of big data, "Volume", "Velocity", and "Variety", in centralized and decentralized platforms. We use dimensionality reduction techniques for summarizing large 2D arrays, and clustering and histograms for processing multiple data streams. With respect to the importance and rapid growth of emerging e-health applications such as tele-radiology and tele-medicine, which require fast, low-cost, and often lossless access to massive amounts of medical images and data over band-limited channels, our first framework attempts to summarize streams of large-volume medical images (e.g. X-rays) for the purpose of compression. Significant amounts of correlation and redundancy exist across different medical images. These can be extracted and used as a data summary to achieve better compression, and consequently less storage and less communication overhead on the network. We propose a novel memory-assisted compression framework as a learning-based universal coding scheme, which can be used to complement any existing algorithm to further eliminate redundancies/similarities across images. This approach is motivated by the fact that, often in medical applications, massive amounts of correlated images from the same family are available as training data for learning the dependencies and deriving appropriate reference or synopsis models. The models can then be used for compression of any new image from the same family. In particular, dimensionality reduction techniques such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) are applied to a set of images from the training data to form the required reference models. The proposed memory-assisted compression allows each image to be processed independently of other images, and hence allows individual image access and transmission. In the second part of our work, we investigate the problem of summarizing distributed multidimensional data streams using clustering. We devise a distributed clustering framework, DistClusTree, that extends the centralized ClusTree approach. The main difficulty in distributed clustering is balancing communication costs and clustering quality. We tackle this in DistClusTree by combining spatial index summaries and online tracking for efficient local and global incremental clustering. We demonstrate through extensive experiments the efficacy of the framework in terms of communication costs and approximate clustering quality.
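
    As a greatly simplified, hypothetical sketch of the reference-model idea only (the actual framework targets medical images and couples the learned model with a full compression pipeline), the snippet below learns a PCA basis from stand-in "training" images and represents a new image from the same family by a few coefficients plus a residual; the sizes, the random placeholder data, and the coefficient/residual split are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "training" images from one family: random 64x64 arrays acting as placeholders
# for, e.g., an archive of X-rays of the same body region.
train = rng.random((200, 64 * 64)).astype(np.float32)
mean = train.mean(axis=0)

# Learn a low-dimensional reference model with PCA (top-k principal components via SVD).
k = 16
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
basis = vt[:k]                      # k x 4096 reference model, shared by encoder and decoder

def encode(image):
    """Project a new image from the same family onto the shared basis; only the k
    coefficients and the residual need to be transmitted."""
    coeffs = basis @ (image - mean)
    residual = image - (mean + basis.T @ coeffs)
    return coeffs, residual

def decode(coeffs, residual):
    return mean + basis.T @ coeffs + residual

new_image = rng.random(64 * 64).astype(np.float32)
coeffs, residual = encode(new_image)
assert np.allclose(decode(coeffs, residual), new_image, atol=1e-4)
print("coefficients:", coeffs.shape, "residual energy:", float(np.sum(residual ** 2)))
```

    With genuinely correlated images from the same family, the residual carries far less energy than the raw pixels and therefore compresses much better; with the random placeholders used here it does not, which is expected.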
    In the last part, we use a multidimensional index structure to merge distributed summaries in the form of a centralized histogram, another widely used summarization technique, with application to approximate range-query answering. We propose the index-based Distributed Mergeable Summaries (iDMS) framework, based on kd-trees, which addresses these challenges with generative data models: Gaussian mixture models (GMMs) and a generative adversarial network (GAN). iDMS maintains a global approximate kd-tree at a central site via GMMs or GANs upon new arrivals of streaming data at local sites. Experimental results validate the effectiveness and efficiency of iDMS against baseline distributed settings in terms of approximation error and communication costs.
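
    As a rough, hedged sketch of one ingredient of such a framework, namely shipping a small generative model instead of raw data, the snippet below fits a tiny Gaussian mixture at each local site, sends only the mixture parameters and counts to a coordinator, and answers an approximate range-count query from the merged mixtures. The site data, the query range, and the use of scikit-learn's GaussianMixture are assumptions for the example; this is not the iDMS kd-tree machinery itself.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Two local sites with 1-D streaming batches (placeholders for real streams).
site_data = [rng.normal(0.0, 1.0, 5000), rng.normal(3.0, 0.5, 3000)]

# Each site summarizes its batch with a tiny GMM and ships only (weights, means, stds, count).
summaries = []
for data in site_data:
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data.reshape(-1, 1))
    summaries.append((gmm.weights_, gmm.means_.ravel(), np.sqrt(gmm.covariances_).ravel(), len(data)))

def approx_range_count(summaries, lo, hi):
    """Coordinator-side estimate of how many points fall in [lo, hi], using only the mixtures."""
    total = 0.0
    for weights, means, stds, n in summaries:
        mass = sum(w * (norm.cdf(hi, m, s) - norm.cdf(lo, m, s))
                   for w, m, s in zip(weights, means, stds))
        total += n * mass
    return total

lo, hi = 2.0, 4.0
exact = sum(((d >= lo) & (d <= hi)).sum() for d in site_data)
print(f"approximate count: {approx_range_count(summaries, lo, hi):.0f}  exact count: {exact}")
```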
