On the information theory of clustering, registration, and blockchains
Progress in data science depends on the collection and storage of large volumes of reliable data, efficient and consistent inference based on this data, and trust in computations performed by untrusted peers. Information theory provides the means to analyze statistical inference algorithms, inspires the design of statistically consistent learning algorithms, and informs the design of large-scale systems for information storage and sharing. In this thesis, we focus on the problems of reliability, universality, integrity, trust, and provenance in data storage, distributed computing, and information processing algorithms, and develop technical solutions and mathematical insights using information-theoretic tools.
In unsupervised information processing we consider the problems of data clustering and image registration. In particular, we evaluate the performance of the max mutual information method for image registration by studying its error exponent and prove its universal asymptotic optimality. We further extend this to design the max multiinformation method for universal multi-image registration and prove its universal asymptotic optimality. We then evaluate the non-asymptotic performance of image registration to understand the effects of the properties of the image transformations and the channel noise on the algorithms.
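The max mutual information idea described above can be illustrated with a minimal sketch. It assumes a toy setting that is not taken from the thesis: images are 1D sequences of discrete pixel values, and the only transformations considered are cyclic shifts. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def register_by_max_mi(reference, observed):
    """Return the cyclic shift of `observed` that maximizes the
    empirical mutual information with `reference`.

    Assumption: the transformation family is the set of cyclic shifts.
    """
    n = len(reference)

    def score(shift):
        shifted = observed[shift:] + observed[:shift]
        return empirical_mutual_information(reference, shifted)

    return max(range(n), key=score)


# Usage: the observed sequence is a shifted copy whose pixel values were
# relabeled by an unknown one-to-one map, so direct pixel comparison fails.
reference = [0, 1, 1, 0, 2, 2, 1, 0, 0, 2, 1, 2]
remap = {0: 2, 1: 0, 2: 1}
n = len(reference)
observed = [remap[reference[(i - 5) % n]] for i in range(n)]
recovered_shift = register_by_max_mi(reference, observed)
```

Because mutual information is unchanged by a one-to-one relabeling of pixel values, the sketch does not need to know how the observed intensities relate to the reference ones, which is the sense in which such a method is called universal.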
In data clustering we study the problem of independence clustering of sources using multivariate information functionals. In particular, we define consistent image clustering algorithms using the cluster information, and define a new multivariate information functional called illum information that inspires other independence clustering methods. We also consider the problem of clustering objects based on labels provided by temporary and long-term workers in a crowdsourcing platform. Here we define budget-optimal universal clustering algorithms using distributional identicality and temporal dependence in the responses of workers.
For the problem of reliable data storage, we consider the use of blockchain systems, and design secure distributed storage codes to reduce the cost of cold storage of blockchain ledgers. Additionally, we use dynamic zone allocation strategies to enhance the integrity and confidentiality of these systems, and frame optimization problems for designing codes applicable to cloud storage and data insurance.
Finally, for the problem of establishing trust in computations over untrusting peer-to-peer networks, we develop a large-scale blockchain system by defining the validation protocols and compression scheme to facilitate an efficient audit of computations that can be shared in a trusted manner across peers over the immutable blockchain ledger. We evaluate the system on some simple synthetic computational experiments and highlight its capacity to identify anomalous computations and enhance computational integrity.
Statistical Recovery of Discrete, Geometric and Invariant Structures
The main objective of the workshop was to bring together researchers in mathematical statistics and related areas in order to discuss recent advances and problems associated with statistical recovery of geometric and invariant structures. Topics include adaptive estimation, confidence sets and testing techniques, as well as statistical algorithms for geometrical structure recovery and data analysis.
Improved integration of information to reduce subsurface model bias
Subsurface modeling deals with data-related issues like cognitive and sampling biases, and model-related challenges including statistical assumptions, misspecification, and algorithmic biases. These challenges introduce four critical implications during subsurface modeling. Firstly, subsurface sampling is subject to sampling bias, which compromises statistical representativeness. Secondly, analog selection methodologies rely on multivariate statistics and expert judgment that overlook spatial information and data dimensionality. Thirdly, subsurface inferential workflows that utilize dimensionality reduction seldom provide repeatable frameworks that maintain model stability and are invariant to Euclidean transformations. Lastly, deep learning methods for dimensionality reduction, characterized as black-box models, lack interpretability and robust evaluation metrics, increasing susceptibility to algorithmic bias. Consequently, neglecting these challenges in subsurface modeling could lead to erroneous predictions, inconsistent inferences, diminished model reliability, and suboptimal decision-making that impacts project economics.
This dissertation integrates information within subsurface models to reduce model bias and significantly improve their accuracy, robustness, and generalizability. First, I create spatial declustering methods to debias spatial datasets with single and multiscale preferential sampling in stationary populations. Second, I introduce a novel geostatistics-based machine learning method for identifying subsurface resource analogs that integrate spatial information in subsurface datasets with high dimensionality. Next, I efficiently combine machine learning and computational geometry methods to stabilize lower dimensional spaces for uncertainty quantification and interpretation. Finally, I create a methodology to assess, evaluate, and interpret the stability of deep learning latent feature spaces.
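To make the idea of debiasing preferentially sampled data concrete, here is a minimal sketch of classical cell declustering, a standard geostatistical technique. It is not the dissertation's own method, and the function names, the 2D coordinates, and the fixed cell size are assumptions made for illustration.

```python
from collections import Counter
from math import floor


def cell_declustering_weights(points, cell_size):
    """Weight each sample inversely to the number of samples sharing
    its grid cell, so densely sampled areas do not dominate.

    Weights are normalized to sum to the number of samples.
    """
    cells = [(floor(x / cell_size), floor(y / cell_size)) for x, y in points]
    counts = Counter(cells)
    n_samples = len(points)
    n_occupied = len(counts)
    # Every occupied cell receives the same total weight.
    return [n_samples / (n_occupied * counts[c]) for c in cells]


def weighted_mean(values, weights):
    """Declustered mean of the sampled values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Usage: three clustered samples in a high-valued area, one isolated sample.
points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (5.5, 5.5)]
values = [10.0, 10.0, 10.0, 2.0]
weights = cell_declustering_weights(points, cell_size=1.0)
naive_mean = sum(values) / len(values)
declustered_mean = weighted_mean(values, weights)
```

In this toy case the naive mean is pulled toward the clustered high values, while the declustered mean gives the two occupied cells equal influence. In practice the result depends on the chosen cell size, which this sketch simply fixes.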
These novel methodologies demonstrate the importance of improved techniques for information integration in subsurface modeling and show better results than naïve methods. They yield objective sampling debiasing in spatially stationary populations with single or multiple data scales, improving statistical representativity. The results also show better generalization and accurate identification of spatial analogs in high-dimensional datasets. Moreover, the methods yield Euclidean transformation-invariant lower-dimensional spaces, ensuring unique and repeatable solutions that improve model reliability and interpretability and allow rational comparisons. Finally, the results indicate that deep learning models for dimensionality reduction exhibit algorithmic biases and instabilities, including sample, structural, and inferential instability, affecting their reliability and interpretability. Together, these innovations ultimately reduce model bias and significantly improve subsurface modeling.
Petroleum and Geosystems Engineering
IDENTIFICATION OF COVER SONGS USING INFORMATION THEORETIC MEASURES OF SIMILARITY
13 pages, 5 figures, 4 tables. v3: Accepted version
Autonomous Exploration of Large-Scale Natural Environments
This thesis addresses issues which arise when using robotic platforms to explore large-scale, natural environments. Two main problems are identified: the volume of data collected by autonomous platforms and the complexity of planning surveys in large environments. Autonomous platforms are able to rapidly accumulate large data sets. The volume of data that must be processed is often too large for human experts to analyse exhaustively in a practical amount of time or in a cost-effective manner. This burden can create a bottleneck in the process of converting observations into scientifically relevant data. Although autonomous platforms can collect precisely navigated, high-resolution data, they are typically limited by finite battery capacities, data storage and computational resources. Deployments are also limited by project budgets and time frames. These constraints make it impractical to sample large environments exhaustively. To use the limited resources effectively, trajectories which maximise the amount of information gathered from the environment must be designed. This thesis addresses these problems. Three primary contributions are presented: a new classifier designed to accept probabilistic training targets rather than discrete training targets; a semi-autonomous pipeline for creating models of the environment; and an offline method for autonomously planning surveys. These contributions allow large data sets to be processed with minimal human intervention and promote efficient allocation of resources. In this thesis environmental models are established by learning the correlation between data extracted from a digital elevation model (DEM) of the seafloor and habitat categories derived from in-situ images. The DEM of the seafloor is collected using ship-borne multibeam sonar and the in-situ images are collected using an autonomous underwater vehicle (AUV). 
While the thesis specifically focuses on mapping and exploring marine habitats with an AUV, the research applies equally to other applications such as aerial and terrestrial environmental monitoring and planetary exploration.
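One common way to train a classifier on probabilistic rather than discrete targets is to minimize cross-entropy against the full target distribution. The sketch below illustrates that general idea only. The abstract does not say how the thesis's classifier is built, so the softmax model and all function names here are assumptions.

```python
from math import exp, log


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between a probabilistic target and the prediction.

    `target_probs` is a distribution over classes (sums to 1), for
    example the fraction of annotators choosing each habitat class.
    """
    probs = softmax(logits)
    return -sum(t * log(p) for t, p in zip(target_probs, probs) if t > 0)


def soft_cross_entropy_grad(logits, target_probs):
    """Gradient of the loss with respect to the logits: p - t."""
    return [p - t for p, t in zip(softmax(logits), target_probs)]


# Usage: an uncertain label (70% class 0, 30% class 1) and an
# uninformed prediction (equal scores for both classes).
loss = soft_cross_entropy([0.0, 0.0], [0.7, 0.3])
grad = soft_cross_entropy_grad([0.0, 0.0], [0.7, 0.3])
```

With a one-hot target this reduces to the usual cross-entropy for discrete labels, so the same training loop can handle both kinds of target.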