3,351 research outputs found

    A systematic overview on methods to protect sensitive data provided for various analyses

    In view of the various methodological developments regarding the protection of sensitive data, especially with respect to privacy-preserving computation and federated learning, a conceptual categorization and comparison between various methods stemming from different fields is often desired. More concretely, it is important to provide guidance for practitioners, who often lack an overview of suitable approaches for certain scenarios, whether it is differential privacy for interactive queries, k-anonymity methods and synthetic data generation for data publishing, or secure federated analysis for multiparty computation without sharing the data itself. Here, we provide an overview based on central criteria describing a context for privacy-preserving data handling, which allows informed decisions in view of the many alternatives. Besides guiding the practice, this categorization of concepts and methods is intended as a step towards a comprehensive ontology for anonymization. We emphasize throughout the paper that there is no panacea and that context matters.
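    As a concrete illustration of one family of methods covered by the overview, the minimal sketch below checks whether a table satisfies k-anonymity over a set of quasi-identifiers, i.e., whether every combination of quasi-identifier values occurs at least k times. The column names and records are hypothetical and not drawn from the paper.

```python
# Minimal sketch: checking k-anonymity over a set of quasi-identifiers.
# The column names and the example table are hypothetical illustrations.
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values
    appears in at least k records."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "537**", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "537**", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "537**", "age_band": "40-49", "diagnosis": "flu"},
]

# False: the ("537**", "40-49") group contains only one record.
print(is_k_anonymous(records, ["zip", "age_band"], k=2))
```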

    Information Preserving Processing of Noisy Handwritten Document Images

    Many pre-processing techniques that normalize artifacts and clean noise induce anomalies due to discretization of the document image. Important information that could be used at later stages may be lost. A proposed composite-model framework takes into account pre-printed information, user-added data, and digitization characteristics. Its benefits are demonstrated by experiments with statistically significant results. Separating pre-printed ruling lines from user-added handwriting shows how ruling lines impact people's handwriting and how they can be exploited for identifying writers. Ruling line detection based on multi-line linear regression reduces the mean error of counting them from 0.10 to 0.03, 6.70 to 0.06, and 0.13 to 0.02, compared to an HMM-based approach on three standard test datasets, thereby reducing human correction time by 50%, 83%, and 72% on average. On 61 page images from 16 rule-form templates, the precision and recall of form cell recognition are increased by 2.7% and 3.7%, compared to a cross-matrix approach. Compensating for and exploiting ruling lines during feature extraction rather than pre-processing raises the writer identification accuracy from 61.2% to 67.7% on a 61-writer noisy Arabic dataset. Similarly, counteracting page-wise skew by subtracting it or transforming contours in a continuous coordinate system during feature extraction improves the writer identification accuracy. An implementation study of contour-hinge features reveals that utilizing the full probability distribution function matrix improves the writer identification accuracy from 74.9% to 79.5%.
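    The multi-line regression idea can be sketched compactly: instead of fitting each ruling line independently, all lines share one slope (the page's dominant skew) and receive individual intercepts, recovered in a single least-squares solve. The grouping of pixels into lines and the toy points below are assumptions for illustration; the thesis's actual detector is more elaborate.

```python
# Hedged sketch of joint multi-line regression: fit one shared slope and a
# per-line intercept to pixel points already grouped by ruling line.
import numpy as np

def fit_parallel_lines(points, labels, n_lines):
    """points: (N, 2) array of (x, y) pixel coordinates;
    labels: (N,) integer line index for each point.
    Returns (shared_slope, per_line_intercepts)."""
    x, y = points[:, 0], points[:, 1]
    A = np.zeros((len(points), 1 + n_lines))
    A[:, 0] = x                                   # shared slope column
    A[np.arange(len(points)), 1 + labels] = 1.0   # one-hot intercept columns
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]

pts = np.array([[0, 10.1], [50, 10.6], [0, 30.0], [50, 30.4]])
slope, intercepts = fit_parallel_lines(pts, np.array([0, 0, 1, 1]), n_lines=2)
print(slope, intercepts)  # slope ~0.009, intercepts ~10.13 and ~29.98
```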

    A Machine Learning Approach to Indoor Localization Data Mining

    Indoor positioning systems are increasingly commonplace in various environments and produce large quantities of data. They are used in industrial applications, robotics, and asset and employee tracking, just to name a few use cases. The growing amount of data and the accelerating progress of machine learning open up many new possibilities for analyzing this data in ways that were not conceivable or relevant before. This thesis introduces connected concepts and implementations to answer the question of how this data can be utilized. The data gathered in this thesis originates from an indoor positioning system deployed in a retail environment, but the discussed methods can be applied generally. The issue is approached by first introducing the concept of machine learning and, more generally, artificial intelligence, and how they work on a general level. A deeper dive is done into subfields and algorithms that are relevant to the data mining task at hand. Indoor positioning system basics are also briefly discussed to create a base understanding of the realistic capabilities and constraints of these kinds of systems. These methods and previous knowledge from the literature are put to the test with the freshly gathered data. An algorithm based on an existing example from the literature was tested and improved upon with the new data. A novel method to cluster and classify movement patterns is introduced, utilizing deep learning to create embedded representations of the trajectories in a more complex learning pipeline; this type of learning is often referred to as deep clustering. The results are promising, and both methods produce useful high-level representations of the complex dataset that can help a human operator discern the relevant patterns from raw data and that can be used as input for subsequent supervised and unsupervised learning steps. Several factors related to optimizing the learning pipeline, such as regularization, were also researched, and the results are presented as visualizations. The research found that a pipeline consisting of a CNN autoencoder followed by a classic clustering algorithm such as DBSCAN produces useful results in the form of trajectory clusters, and that regularization such as an L1 penalty improves this performance. The research done in this thesis presents useful algorithms for processing raw, noisy localization data from indoor environments that can be used for further implementations in both industrial applications and academia.
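    A minimal sketch of the deep-clustering pipeline described above: a small 1-D convolutional autoencoder embeds fixed-length (x, y) trajectories, and DBSCAN then clusters the learned embeddings. The architecture sizes, training settings, and random stand-in data are illustrative assumptions, not the thesis's actual configuration.

```python
# Hedged sketch: CNN autoencoder embeddings + DBSCAN (deep clustering).
import torch
import torch.nn as nn
from sklearn.cluster import DBSCAN

class TrajAutoencoder(nn.Module):
    def __init__(self, traj_len=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * traj_len, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 2 * traj_len),
            nn.Unflatten(1, (2, traj_len)),
        )

    def forward(self, x):              # x: (batch, 2, traj_len)
        z = self.encoder(x)
        return self.decoder(z), z

model = TrajAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
trajectories = torch.randn(256, 2, 64)   # stand-in for real tracking data

for _ in range(50):                      # reconstruction training loop
    recon, _ = model(trajectories)
    loss = nn.functional.mse_loss(recon, trajectories)
    # An L1 penalty on weights or latent codes could be added here,
    # mirroring the regularization the thesis investigates.
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    _, embeddings = model(trajectories)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embeddings.numpy())
```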

    Remote Sensing of Icebergs in Greenland's Fjords and Coastal Waters

    Increases in ocean water temperature are implicated in driving recent accelerated rates of mass loss from the Greenland Ice Sheet. Icebergs provide a key tool for gaining insight into ice-ocean interactions and until recently have been relatively understudied. Here we develop several methods that exploit icebergs visible in optical satellite imagery to provide insight on the ice-ocean environment and explore how iceberg datasets can be used to examine the physics of iceberg decay and parent glacier properties. First, a semi-automated algorithm, which includes a machine learning-based cloud mask, is applied to six years (2000-2002 and 2013-2015) of the Landsat archive to derive iceberg size distributions for Disko Bay. These data show an increase in the total number of icebergs and suggest a change in the shape of the iceberg size distribution, concurrent with a shift in the dominant calving style of Sermeq Kujalleq (Jakobshavn Isbrae), their parent glacier. Second, bathymetry is qualitatively and quantitatively inferred using icebergs as drifters; regions of iceberg drifting and stranding indicate relative bathymetric lows and highs, respectively. To quantify water depth in shallow regions, iceberg draft is inferred from iceberg freeboard under the assumption of hydrostatic equilibrium where very high-resolution stereo image pairs of icebergs are available to construct digital elevation models. Although this results in water depths with relatively large uncertainties, the method provides valuable quantitative data in regions where bathymetric observations are unavailable, improving our understanding of sill locations and the consequent ability of warm ocean waters to reach glacier termini. Third, we use the iceberg datasets derived using the previously described methods to probe the spatial patterns of iceberg size distributions. Rigorous discrimination between power law and lognormal size distributions is challenging, but our datasets corroborate the idea that as icebergs move farther from the parent glacier and the primary control on iceberg size transitions from fracture to melting, their size distribution shifts from power law to lognormal. Overall, our analysis suggests that future thorough investigations of iceberg size distributions will serve as a valuable tool to gain insights into the physics of iceberg decay and properties of the parent glacier.
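    The hydrostatic step lends itself to a worked example. A freely floating iceberg displaces its own weight of seawater, so the draft follows from the freeboard as draft = freeboard × ρ_ice / (ρ_water − ρ_ice). The sketch below uses typical literature densities, which may differ from the paper's exact values.

```python
# Hedged sketch: iceberg draft from freeboard under hydrostatic equilibrium.
# Density values are typical literature numbers, not the paper's exact choices.
RHO_ICE = 900.0      # kg/m^3, glacial ice
RHO_WATER = 1025.0   # kg/m^3, seawater

def draft_from_freeboard(freeboard_m):
    """Keel depth below the waterline for a freely floating iceberg."""
    return freeboard_m * RHO_ICE / (RHO_WATER - RHO_ICE)

# A 10 m freeboard (measurable from a stereo DEM) implies a ~72 m draft,
# which bounds the water depth where the iceberg runs aground.
print(draft_from_freeboard(10.0))  # -> 72.0
```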

    Probe-based visual analysis of geospatial simulations

    This work documents the design, development, refinement, and evaluation of probes as an interaction technique for expanding both the usefulness and usability of geospatial visualizations, specifically those of simulations. Existing applications that allow the visualization of, and interaction with, geospatial simulations and their results generally present views of the data that restrict the user to a single perspective. When zoomed out, local trends and anomalies become suppressed and lost; when zoomed in, spatial awareness and comparison between regions become limited. The probe-based interaction model integrates coordinated visualizations within individual probe interfaces, which depict the local data in user-defined regions-of-interest. It is especially useful when dealing with complex simulations or analyses where behavior in various localities differs from other localities and from the system as a whole. The technique has been incorporated into a number of geospatial simulations and visualization tools. In each of these applications, and in general, probe-based interaction enhances spatial awareness, improves inspection and comparison capabilities, expands the range of scopes, and facilitates collaboration among multiple users. The great freedom afforded to users in defining regions-of-interest can cause modifiable areal unit problems to affect the reliability of analyses without the user's knowledge, leading to misleading results. However, by automatically alerting the user to these potential issues, and providing them with tools to help adjust their selections, these unforeseen problems can be revealed, and even corrected.
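    The core probe abstraction can be suggested in a few lines: a user-defined region-of-interest that extracts the local subset of simulation output for its own coordinated view. The Probe class and record layout below are hypothetical illustrations, not the tool's actual API.

```python
# Hedged sketch of a probe: a region-of-interest over simulation output.
from dataclasses import dataclass

@dataclass
class Probe:
    x_min: float; x_max: float
    y_min: float; y_max: float

    def local_data(self, records):
        """Records falling inside this probe's bounding box; each probe's
        coordinated visualizations would be driven by this local subset."""
        return [r for r in records
                if self.x_min <= r["x"] <= self.x_max
                and self.y_min <= r["y"] <= self.y_max]

simulation_output = [{"x": 1.0, "y": 2.0, "value": 0.7},
                     {"x": 9.0, "y": 9.0, "value": 0.1}]
probe = Probe(0.0, 5.0, 0.0, 5.0)
print(probe.local_data(simulation_output))  # only the first record
```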

    Fool's Gold: An Illustrated Critique of Differential Privacy

    Differential privacy has taken the privacy community by storm. Computer scientists developed this technique to allow researchers to submit queries to databases without being able to glean sensitive information about the individuals described in the data. Legal scholars champion differential privacy as a practical solution to the competing interests in research and confidentiality, and policymakers are poised to adopt it as the gold standard for data privacy. It would be a disastrous mistake. This Article provides an illustrated guide to the virtues and pitfalls of differential privacy. While the technique is suitable for a narrow set of research uses, the great majority of analyses would produce results that are beyond absurd: average income in the negative millions or correlations well above 1.0, for example. The legal community mistakenly believes that differential privacy can offer the benefits of data research without sacrificing privacy. In fact, differential privacy will usually produce either very wrong research results or very useless privacy protections. Policymakers and data stewards will have to rely on a mix of approaches: perhaps differential privacy where it is well suited to the task, and other disclosure prevention techniques in the great majority of situations where it isn't.
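    To make the Article's utility worry concrete, here is a hedged sketch (not the Article's own code) of a Laplace-mechanism mean-income query: when one person's income can be very large, the noise calibrated to that sensitivity can dwarf the true mean for small groups, producing exactly the negative-millions averages the critique describes. All numbers are illustrative.

```python
# Hedged sketch: a differentially private mean via the Laplace mechanism.
import random

def dp_mean_income(incomes, income_cap, epsilon):
    """With incomes clipped to income_cap, the mean's sensitivity is
    income_cap / n, so Laplace noise of scale income_cap / (n * epsilon)
    is added to the true mean."""
    n = len(incomes)
    true_mean = sum(min(v, income_cap) for v in incomes) / n
    noise = random.expovariate(1) - random.expovariate(1)  # standard Laplace
    return true_mean + noise * (income_cap / n) / epsilon

incomes = [52_000, 61_000, 48_000, 75_000, 58_000]   # five hypothetical people
# A $10M cap (so one very rich outlier stays hidden) and a strict epsilon
# give a noise scale of 10M / (5 * 0.1) = $20M: answers in the negative
# millions are routine, illustrating the utility collapse described above.
print(dp_mean_income(incomes, income_cap=10_000_000, epsilon=0.1))
```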