
    Data Reduction and Deep-Learning Based Recovery for Geospatial Visualization and Satellite Imagery

    The storage, retrieval, and distribution of data are critical aspects of big data management. Data scientists and decision-makers often need to share large datasets and decide whether to archive or delete historical data to cope with resource constraints. As a consequence, there is an urgent need to reduce storage and transmission requirements. One way to mitigate such problems is to reduce big datasets to smaller ones, which not only lowers storage requirements but also lightens the load of transfers over the network. High-dimensional data often exhibit high repetitiveness and recurring patterns across dimensions. Data carefully prepared by removing redundancies, together with a machine learning model capable of reconstructing the whole dataset from its reduced version, can improve storage scalability and data transfer and speed up the overall data management pipeline. In this thesis, we explore data reduction strategies for big datasets while ensuring that the data can be transferred and used ubiquitously by all stakeholders, i.e., that the entire dataset can be reconstructed with high quality whenever necessary. One of our data reduction strategies follows a straightforward uniform pattern, which guarantees a minimum of 75% data size reduction. We also propose a novel variance-based reduction technique, which focuses on removing only redundant data and offers an additional 1% to 2% deletion rate. We adopted various traditional machine learning and deep learning approaches for high-quality reconstruction and evaluated our pipelines on big geospatial data and satellite imagery. Among them, our deep learning approaches performed very well both quantitatively and qualitatively and were able to reconstruct high-quality features. We also show how to leverage temporal data for better reconstruction. For uniform deletion, the observed reconstruction accuracy is as high as 98.75% on average for spatial meteorological data (e.g., soil moisture and albedo) and 99.09% for satellite imagery. When the deletion rate is pushed further with the variance-based deletion method, the decrease in accuracy remains within 1% for spatial meteorological data and 7% for satellite imagery.
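The abstract does not spell out the uniform deletion pattern. As a minimal sketch, assuming the pattern keeps one cell out of every 2×2 block of a grid with even dimensions, the 75% figure falls out directly. The function names (`uniform_reduce`, `nearest_reconstruct`, `reduction_rate`) and the nearest-cell reconstruction are hypothetical illustrations; the thesis uses learned models for reconstruction, not this baseline.

```python
# Hypothetical sketch: uniform deletion that keeps every second row and column.
# This is an assumed pattern, not the thesis's confirmed method.

def uniform_reduce(grid, step=2):
    """Keep every `step`-th row and column of a 2-D list."""
    return [row[::step] for row in grid[::step]]


def nearest_reconstruct(reduced, rows, cols, step=2):
    """Rebuild a rows x cols grid by copying the kept cell of each block.

    A naive baseline standing in for the learned reconstruction model.
    """
    return [
        [reduced[r // step][c // step] for c in range(cols)]
        for r in range(rows)
    ]


def reduction_rate(rows, cols, step=2):
    """Fraction of cells deleted by uniform_reduce."""
    kept = len(range(0, rows, step)) * len(range(0, cols, step))
    return 1 - kept / (rows * cols)


# Toy 8 x 8 grid whose cell value encodes its position (row * 10 + col).
grid = [[r * 10 + c for c in range(8)] for r in range(8)]
reduced = uniform_reduce(grid)
restored = nearest_reconstruct(reduced, 8, 8)

print(len(reduced), len(reduced[0]))  # 4 4
print(reduction_rate(8, 8))           # 0.75
```

With odd dimensions this particular pattern keeps slightly more than a quarter of the cells, so the 75% figure holds here only for even-sized grids.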

    Doctor of Philosophy

    Dissertation. Machine learning is the science of building predictive models from data that automatically improve based on past experience. To learn these models, traditional learning algorithms require labeled data. They also require that the entire dataset fit in the memory of a single machine. Labeled data are available or can be acquired for small and moderately sized datasets, but curating large datasets can be prohibitively expensive. Similarly, massive datasets are usually too large to fit into the memory of a single machine. An alternative is to distribute the dataset over multiple machines. Distributed learning, however, poses new challenges, as most existing machine learning techniques are inherently sequential. Additionally, these distributed approaches have to be designed with the resource limitations of real-world settings in mind, chief among them inter-machine communication. With the advent of big datasets, machine learning algorithms face new challenges. Their design is no longer limited to minimizing some loss function; it must also account for other resources that are critical when learning at scale. In this thesis, we explore different models and measures for learning with limited, budgeted resources. What budgetary constraints are posed by modern datasets? Can we reuse or combine existing machine learning paradigms to address these challenges at scale? How do the cost metrics change when we shift to distributed models for learning? These are some of the questions investigated in this thesis. The answers hold the key to addressing some of the challenges faced when learning on massive datasets. In the first part of this thesis, we present three budgeted scenarios that deal with scarcity of labeled data and limited computational resources. The goal is to leverage information transferred from related domains to learn under budgetary constraints. Our proposed techniques comprise semisupervised transfer, online transfer, and active transfer. In the second part of this thesis, we study distributed learning with limited communication. We present initial sampling-based results and propose communication protocols for learning distributed linear classifiers.

    Heterogeneous unsupervised domain adaptation based on fuzzy feature fusion

    © 2017 IEEE. Domain adaptation is a transfer learning approach that has been widely studied in the last decade. However, existing works still have two limitations: 1) the feature spaces of the domains are homogeneous, and 2) the target domain has at least a few labeled instances. Both limitations significantly restrict the domain adaptation approach when knowledge is transferred across domains, especially in the current era of big data. To address both issues, this paper proposes a novel fuzzy-based heterogeneous unsupervised domain adaptation approach. This approach maps the feature spaces of the source and target domains onto the same latent space constructed by fuzzy features. In the new feature space, the label spaces of the two domains are maintained to reduce the probability of negative transfer. The proposed approach delivers superior performance over current benchmarks, and the heterogeneous unsupervised domain adaptation (HeUDA) method provides a promising means of giving a learning system the associative ability to judge unknown things using related knowledge.

    Artificial intelligence: reflecting on the past and looking towards the next paradigm shift

    Artificial intelligence (AI) has undergone major advances over the past decades, propelled by key innovations in machine learning and the availability of big data and computing power. This paper surveys the historical progress of AI from its origins in logic-based systems like the Logic Theorist to recent deep learning breakthroughs like Bidirectional Encoder Representations from Transformers (BERT), Generative Pretrained Transformer 3 (GPT-3) and Large Language Model Meta AI (LLaMA). The early rule-based systems using handcrafted expertise gave way to statistical learning techniques and neural networks trained on large datasets. Milestones like AlexNet and AlphaGo established deep learning as a dominant AI approach. Transfer learning enabled models pre-trained on diverse corpora to excel at specialised downstream tasks. The scope of AI expanded from niche applications like playing chess to multifaceted capabilities in computer vision, natural language processing and dialogue agents. However, current AI still needs to catch up to human intelligence in aspects like reasoning, creativity, and empathy. Addressing limitations around real-world knowledge, biases, and transparency remains vital for further progress and aligning AI with human values. This survey provides a comprehensive overview of the evolution of AI and documents innovations that shaped its advancement over the past six decades.

    Glyph: Fast and Accurately Training Deep Neural Networks on Encrypted Data

    Big data is one of the cornerstones of enabling and training deep neural networks (DNNs). Lacking the necessary expertise, average users who want to benefit from their data have to upload private data to big data companies they may not trust. Because of compliance, legal, or privacy constraints, most users are willing to contribute only encrypted data, and they lack the interest or resources to take part in training DNNs in the cloud. To train a DNN on encrypted data in a completely non-interactive way, a recent work proposes a fully homomorphic encryption (FHE)-based technique that implements all activations in the neural network with Brakerski-Gentry-Vaikuntanathan (BGV)-based lookup tables. However, such inefficient lookup-table-based activations significantly prolong the training latency of privacy-preserving DNNs. In this paper, we propose Glyph, an FHE-based scheme to train DNNs quickly and accurately on encrypted data by switching between the TFHE (Fast Fully Homomorphic Encryption over the Torus) and BGV cryptosystems. Glyph uses logic-operation-friendly TFHE to implement nonlinear activations, while adopting vectorial-arithmetic-friendly BGV to perform multiply-accumulate (MAC) operations. Glyph further applies transfer learning to the training of DNNs to improve test accuracy and reduce the number of ciphertext-ciphertext MAC operations in convolutional layers. Our experimental results show that Glyph obtains state-of-the-art test accuracy while reducing training latency by 99% over the prior FHE-based technique on various encrypted datasets. Comment: 10 pages, 8 figures