    On the information theory of clustering, registration, and blockchains

    Progress in data science depends on the collection and storage of large volumes of reliable data, efficient and consistent inference based on this data, and trusting such computations made by untrusted peers. Information theory provides the means to analyze statistical inference algorithms, inspires the design of statistically consistent learning algorithms, and informs the design of large-scale systems for information storage and sharing. In this thesis, we focus on the problems of reliability, universality, integrity, trust, and provenance in data storage, distributed computing, and information processing algorithms and develop technical solutions and mathematical insights using information-theoretic tools. In unsupervised information processing we consider the problems of data clustering and image registration. In particular, we evaluate the performance of the max mutual information method for image registration by studying its error exponent and prove its universal asymptotic optimality. We further extend this to design the max multiinformation method for universal multi-image registration and prove its universal asymptotic optimality. We then evaluate the non-asymptotic performance of image registration to understand the effects of the properties of the image transformations and the channel noise on the algorithms. In data clustering we study the problem of independence clustering of sources using multivariate information functionals. In particular, we define consistent image clustering algorithms using the cluster information, and define a new multivariate information functional called illum information that inspires other independence clustering methods. We also consider the problem of clustering objects based on labels provided by temporary and long-term workers in a crowdsourcing platform. Here we define budget-optimal universal clustering algorithms using distributional identicality and temporal dependence in the responses of workers. 
For the problem of reliable data storage, we consider the use of blockchain systems, and design secure distributed storage codes to reduce the cost of cold storage of blockchain ledgers. Additionally, we use dynamic zone allocation strategies to enhance the integrity and confidentiality of these systems, and frame optimization problems for designing codes applicable for cloud storage and data insurance. Finally, for the problem of establishing trust in computations over untrusting peer-to-peer networks, we develop a large-scale blockchain system by defining the validation protocols and compression scheme to facilitate an efficient audit of computations that can be shared in a trusted manner across peers over the immutable blockchain ledger. We evaluate the system over some simple synthetic computational experiments and highlight its capacity in identifying anomalous computations and enhancing computational integrity.
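The max mutual information method for image registration mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the 1-D "images", the cyclic-shift transformation family, the plug-in estimator, and all function names are assumptions chosen for illustration. The idea is to undo each candidate transformation and keep the one that maximizes the empirical mutual information between the two images.

```python
import math
import random
from collections import Counter


def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired pixel values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def cyclic_shift(pixels, k):
    """Hypothetical transformation family: rotate a 1-D image left by k."""
    k %= len(pixels)
    return pixels[k:] + pixels[:k]


def register_max_mi(reference, observed, candidate_shifts):
    """Return the candidate shift whose inverse maximizes empirical MI."""
    return max(
        candidate_shifts,
        key=lambda k: empirical_mutual_information(
            reference, cyclic_shift(observed, -k)
        ),
    )


if __name__ == "__main__":
    random.seed(0)
    reference = [random.randrange(4) for _ in range(400)]
    # Assumed observation model: relabelled intensities, 10% noisy pixels,
    # then an unknown cyclic shift of 7 positions.
    relabel = {0: 2, 1: 0, 2: 3, 3: 1}
    noisy = [
        relabel[p] if random.random() > 0.1 else random.randrange(4)
        for p in reference
    ]
    observed = cyclic_shift(noisy, 7)
    print(register_max_mi(reference, observed, range(20)))
```

Because mutual information does not change when intensities are relabelled, the sketch needs no model of how pixel values in one image map to the other, which is one reading of "universal" in the abstract.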

    IDENTIFICATION OF COVER SONGS USING INFORMATION THEORETIC MEASURES OF SIMILARITY

    13 pages, 5 figures, 4 tables. v3: Accepted version

    Autonomous Exploration of Large-Scale Natural Environments

    This thesis addresses issues which arise when using robotic platforms to explore large-scale, natural environments. Two main problems are identified: the volume of data collected by autonomous platforms and the complexity of planning surveys in large environments. Autonomous platforms are able to rapidly accumulate large data sets. The volume of data that must be processed is often too large for human experts to analyse exhaustively in a practical amount of time or in a cost-effective manner. This burden can create a bottleneck in the process of converting observations into scientifically relevant data. Although autonomous platforms can collect precisely navigated, high-resolution data, they are typically limited by finite battery capacities, data storage and computational resources. Deployments are also limited by project budgets and time frames. These constraints make it impractical to sample large environments exhaustively. To use the limited resources effectively, trajectories which maximise the amount of information gathered from the environment must be designed. This thesis addresses these problems. Three primary contributions are presented: a new classifier designed to accept probabilistic training targets rather than discrete training targets; a semi-autonomous pipeline for creating models of the environment; and an offline method for autonomously planning surveys. These contributions allow large data sets to be processed with minimal human intervention and promote efficient allocation of resources. In this thesis environmental models are established by learning the correlation between data extracted from a digital elevation model (DEM) of the seafloor and habitat categories derived from in-situ images. The DEM of the seafloor is collected using ship-borne multibeam sonar and the in-situ images are collected using an autonomous underwater vehicle (AUV). 
While the thesis specifically focuses on mapping and exploring marine habitats with an AUV, the research applies equally to other applications such as aerial and terrestrial environmental monitoring and planetary exploration.