355 research outputs found

    Supervised learning using a symmetric bilinear form for record linkage

    Record linkage links records in two different files that correspond to the same individuals. These algorithms are used for database integration. In data privacy, they are used to evaluate the disclosure risk of a protected data set by linking records that belong to the same individual. The degree of success when linking the original (unprotected) data with the protected data gives an estimate of the disclosure risk. In this paper we propose a new parameterized aggregation operator and a supervised learning method for disclosure risk assessment. The parameterized operator is a symmetric bilinear form and the supervised learning method is formalized as an optimization problem. The target of the optimization problem is to find the values of the aggregation parameters that maximize the number of re-identifications (correct links). We evaluate and compare our proposal with other non-parameterized variations of record linkage, such as those using the Mahalanobis distance and the Euclidean distance (one of the most used approaches for this purpose). Additionally, we compare it with previously presented parameterized aggregation operators for record linkage, such as the weighted mean and the Choquet integral. From these comparisons we show that the proposed aggregation operator outperforms, or at least achieves results similar to, the other parameterized operators. We also study which conditions the optimization problem must satisfy for the described aggregation functions to be considered metric functions.
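
    The idea of distance-based record linkage with a symmetric bilinear form can be illustrated with a small sketch. This is not the paper's method: the matrix `weighted` below is picked by hand rather than learned by optimization, the data are invented, and all function names are hypothetical. It only shows how a parameterized form (a - b)^T Σ (a - b) can change the number of correct links compared with the plain squared Euclidean distance (Σ = identity).

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def bilinear_distance(a: Vector, b: Vector, sigma: Matrix) -> float:
    """Return (a - b)^T * sigma * (a - b) for a symmetric matrix sigma."""
    diff = [x - y for x, y in zip(a, b)]
    n = len(diff)
    return sum(diff[i] * sigma[i][j] * diff[j] for i in range(n) for j in range(n))


def link_records(original: Sequence[Vector], protected: Sequence[Vector],
                 sigma: Matrix) -> List[int]:
    """For each original record, return the index of the closest protected record."""
    return [
        min(range(len(protected)),
            key=lambda j: bilinear_distance(a, protected[j], sigma))
        for a in original
    ]


def correct_links(original: Sequence[Vector], protected: Sequence[Vector],
                  sigma: Matrix) -> int:
    """Count re-identifications, assuming record i in both files is the same individual."""
    links = link_records(original, protected, sigma)
    return sum(1 for i, j in enumerate(links) if i == j)


# Toy data (invented): the second attribute is heavily perturbed by the
# protection method, the first one only slightly.
original = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
protected = [(1.1, 21.0), (2.1, 29.0), (2.9, 11.0)]

identity = [[1.0, 0.0], [0.0, 1.0]]    # squared Euclidean distance
weighted = [[1.0, 0.0], [0.0, 0.001]]  # hand-picked, down-weights attribute 2

print(correct_links(original, protected, identity))
print(correct_links(original, protected, weighted))
```

    In a supervised setting, the entries of the matrix would be the parameters to optimize so that the count returned by `correct_links` is as large as possible on the training pairs.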

    Embedding Approaches for Relational Data

    Embedding methods for finding latent representations of data are very important tools for unsupervised and supervised machine learning as well as information visualisation. Over the years, such methods have continually progressed towards the ability to capture and analyse the structure and latent characteristics of larger and more complex data. In this thesis, we examine the problem of developing efficient and reliable embedding methods for revealing, understanding, and exploiting the different aspects of relational data. We split our work into three parts, each dealing with a different relational data structure. In the first part, we handle the weighted bipartite relational structure. Based on the relational measurements between two groups of heterogeneous objects, our goal is to generate low-dimensional representations of these two different types of objects in a unified common space. We propose a novel method that models the embedding of each object type symmetrically to the other type, subject to flexible scale constraints and weighting parameters. The embedding generation relies on an efficient optimisation carried out using matrix decomposition. We also propose a simple way of measuring the conformity between the original object relations and the ones re-estimated from the embeddings, in order to achieve model selection by identifying the optimal model parameters with a simple search procedure. We show that our proposed method achieves consistently better or on-par results on multiple synthetic datasets and real-world ones from the text mining domain when compared with existing embedding generation approaches. In the second part of this thesis, we focus on multi-relational data, where objects are interlinked by various relation types. 
Embedding approaches are very popular in this field; they typically encode objects and relation types with hidden representations and use the operations between them to compute positive scalars corresponding to the linkages' likelihood scores. In this work, we aim at further improving the existing embedding techniques by taking into account the multiple facets of the different patterns and behaviours of each relation type. To the best of our knowledge, this is the first latent representation model in this field that considers relational representations to be dependent on the objects they relate. The multi-modality of the relation type over different objects is effectively formulated as a projection matrix over the space spanned by the object vectors. Two large benchmark knowledge bases are used to evaluate the performance with respect to the link prediction task, and a new test data partition scheme is proposed to offer a better understanding of the behaviour of a link prediction model. In the last part of this thesis, a much more complex relational structure is considered. In particular, we aim at developing novel embedding methods for jointly modelling the linkage structure and objects' attributes. Traditionally, the link prediction task is carried out on either the linkage structure or the objects' attributes alone, which ignores their semantic connections and is insufficient for handling complex link prediction tasks. Thus, our goal in this work is to build a reliable model that can fuse both sources of information to improve link prediction. The key idea of our approach is to encode both the linkage validities and the nodes' neighbourhood information into embedding-based conditional probabilities. 
Another important aspect of our proposed algorithm is that we utilise a margin-based contrastive training process for encoding the linkage structure, which relies on a more appropriate assumption and dramatically reduces the number of training links. In the experiments, our proposed method improves the link prediction performance on three citation/hyperlink datasets when compared with methods relying on only the nodes' attributes or the linkage structure, and it also achieves much better performance than the state of the art.
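
    As a rough illustration of what a margin-based contrastive objective for link prediction can look like, the sketch below scores a link by the dot product of two node embeddings and penalizes a negative (non-existent) link whose score comes within a margin of a positive link's score. The scoring function, the embeddings, and the names `score` and `margin_loss` are assumptions made for this example; the thesis may define its objective differently.

```python
from typing import Dict, Sequence, Tuple

Embedding = Sequence[float]
Link = Tuple[str, str]


def score(emb: Dict[str, Embedding], link: Link) -> float:
    """Score a link as the dot product of its two node embeddings (an assumption)."""
    u, v = emb[link[0]], emb[link[1]]
    return sum(x * y for x, y in zip(u, v))


def margin_loss(emb: Dict[str, Embedding], positive: Link, negative: Link,
                margin: float = 1.0) -> float:
    """Hinge loss: zero once the positive link outscores the negative by `margin`."""
    return max(0.0, margin - score(emb, positive) + score(emb, negative))


# Invented toy embeddings: a and b are linked, a and c are not.
emb = {"a": (1.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0)}

print(margin_loss(emb, ("a", "b"), ("a", "c")))              # margin already satisfied
print(margin_loss(emb, ("a", "b"), ("a", "c"), margin=2.0))  # margin violated
```

    Training would adjust the embeddings by gradient descent on the sum of such losses over sampled positive and negative link pairs, so only pairs that violate the margin contribute to the update.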

    Automatic privacy and utility evaluation of anonymized documents via deep learning

    Text anonymization methods are evaluated by comparing their outputs with human-based anonymizations through standard information retrieval (IR) metrics. On the one hand, the residual disclosure risk is quantified with the recall metric, which gives the proportion of re-identifying terms successfully detected by the anonymization algorithm. On the other hand, the preserved utility is measured with the precision metric, which accounts for the proportion of masked terms that were also annotated by the human experts. Nevertheless, because these evaluation metrics were meant for information retrieval rather than privacy-oriented tasks, they suffer from several drawbacks. First, they assume a unique ground truth, and this does not hold for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, annotation-based evaluation relies on human judgements, which are inherently subjective and may be prone to errors. Finally, both metrics weight terms uniformly, thereby ignoring the fact that some terms may influence the disclosure risk or utility preservation much more than others. To overcome these drawbacks, in this thesis we propose two novel methods to evaluate both the disclosure risk and the utility preserved in anonymized texts. Our approach leverages deep learning methods to perform this evaluation automatically, thereby not requiring human annotations. For assessing disclosure risks, we propose using a re-identification attack, which we define as a multi-class classification task built on top of state-of-the-art language models. To make it feasible, the attack has been designed to capture the means and computational resources expected to be available at the attacker's end. For utility assessment, we propose a method that measures the information loss incurred during the anonymization process, which relies on neural masked language modeling. 
We illustrate the effectiveness of our methods by evaluating the disclosure risk and retained utility of several well-known techniques and tools for text anonymization on a common dataset. Empirical results show significant privacy risks for all of them (including manual anonymization) and consistently proportional utility preservation.
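
    The conventional IR-style evaluation that the abstract criticizes can be sketched in a few lines. The example below treats the human annotation and the system output as sets of terms; the terms and the function name `recall_precision` are invented for illustration. It also makes the uniform-weighting drawback visible: every term counts the same, whether it is a full name or a profession.

```python
from typing import Set, Tuple


def recall_precision(masked: Set[str], annotated: Set[str]) -> Tuple[float, float]:
    """Return (recall, precision) of system-masked terms against human annotations.

    Recall is used as a proxy for residual disclosure risk (higher recall,
    lower risk); precision is used as a proxy for preserved utility.
    """
    hits = len(masked & annotated)
    recall = hits / len(annotated) if annotated else 0.0
    precision = hits / len(masked) if masked else 0.0
    return recall, precision


# Invented example: terms the experts marked as re-identifying,
# and terms the anonymization tool actually masked.
annotated = {"Alice", "Paris", "1984", "nurse"}
masked = {"Alice", "Paris", "hospital"}

recall, precision = recall_precision(masked, annotated)
print(f"recall={recall:.2f} precision={precision:.2f}")
```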

    Enhanced clustering analysis pipeline for performance analysis of parallel applications

    Cluster analysis is widely used to group data into the same cluster when they are similar according to specific metrics. We can use cluster analysis to group the CPU bursts of a parallel application, that is, the regions of each process between communication calls or calls to the parallel runtime. The resulting clusters are the different computational trends or phases that appear in the application. These clusters are useful to understand the behavior of the computation part of the application and to focus the analyses on those that present performance issues. Although density-based clustering algorithms are a powerful and efficient tool to summarize this type of information, their traditional user-guided clustering methodology has many shortcomings in dealing with the complexity of data, the diversity of data structures, the high dimensionality of data, and the dramatic increase in the amount of data. Consequently, the majority of DBSCAN-like algorithms have difficulty handling high-dimensional and/or multi-density data, and they are sensitive to their hyper-parameter configuration. Furthermore, extracting insight from the obtained clusters is an intuitive and manual task. To mitigate these weaknesses, we propose a new unified approach that replaces user-guided clustering with an automated clustering analysis pipeline, called the Enhanced Cluster Identification and Interpretation (ECII) pipeline. To build the pipeline, we propose novel techniques including Robust Independent Feature Selection, Feature Space Curvature Map, Organization Component Analysis, and hyper-parameter tuning for feature selection, density homogenization, cluster interpretation, and model selection, which are the main components of our machine learning pipeline. This thesis contributes four new techniques to the Machine Learning field with a particular use case in the Performance Analytics field. 
The first contribution is a novel unsupervised approach for feature selection on noisy data, called Robust Independent Feature Selection (RIFS). Specifically, we choose a feature subset that contains most of the underlying information, using the same criteria as independent component analysis. Simultaneously, the noise is separated as an independent component. The second contribution of the thesis is a parametric multilinear transformation method to homogenize cluster densities while preserving the topological structure of the dataset, called Feature Space Curvature Map (FSCM). We present a new Gravitational Self-organizing Map to model the feature space curvature by plugging the concepts of gravity and fabric of space into the Self-organizing Map algorithm to mathematically describe the density structure of the data. To homogenize the cluster density, we introduce a novel mapping mechanism to project the data from the non-Euclidean curved space to a new Euclidean flat space. The third contribution is a novel topology-based method to study potentially complex high-dimensional categorized data by quantifying their shapes and extracting fine-grained insights from them to interpret the clustering result. We introduce our Organization Component Analysis (OCA) method for the automatic study of arbitrary cluster shapes without an assumption about the data distribution. Finally, to tune the DBSCAN hyper-parameters, we propose a new tuning mechanism that combines techniques from the machine learning and optimization domains, and we embed it in the ECII pipeline. Using this cluster analysis pipeline with the CPU burst data of a parallel application, we provide the developer/analyst with a high-quality SPMD computation structure detection with the added value that reflects the fine grain of the computation regions.
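
    The sensitivity of DBSCAN to its hyper-parameters on multi-density data can be reproduced with a minimal, pure-Python version of the standard algorithm. This is plain DBSCAN, not the ECII pipeline or any of its components, and the one-dimensional points are invented: one tight group and one loose group.

```python
from math import dist
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbours(i: int) -> List[int]:
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return [label if label is not None else NOISE for label in labels]


# A dense group (spacing 0.1) and a sparse group (spacing 1.0).
points = [(0.0,), (0.1,), (0.2,), (0.3,), (10.0,), (11.0,), (12.0,), (13.0,)]

print(dbscan(points, eps=0.15, min_pts=2))  # sparse group is lost as noise
print(dbscan(points, eps=1.5, min_pts=2))   # both groups are recovered
```

    With the smaller `eps` the sparse group is discarded as noise, which is the kind of behaviour that motivates density homogenization and automated hyper-parameter tuning.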

    An architecture for secure data management in medical research and aided diagnosis

    Official Doctoral Programme in Information and Communication Technologies. 5032V01. [Abstract] The General Data Protection Regulation (GDPR) was implemented on 25 May 2018 and is considered the most important development in data privacy regulation in the last 20 years. Heavy fines are defined for violating those rules, and this is not something that healthcare centers can afford to ignore. The main goal of this thesis is to study and propose a secure/integration layer for healthcare data curators, where connectivity between isolated systems (locations), unification of records in a patient-centric view, and data sharing with consent approval are the cornerstones of the proposed architecture. 
This proposal empowers the data subject with a central role, which allows them to control their identity, privacy profiles and access grants. It aims to minimize the fear of legal liability when sharing medical records by using anonymisation and making patients responsible for securing their own medical records, while preserving the patient's quality of treatment. Our main hypothesis is: are the Distributed Ledger and Self-Sovereign Identity concepts a natural symbiosis to solve the GDPR challenges in the context of healthcare? Solutions are required so that clinicians and researchers can maintain their collaboration workflows without compromising regulations. The proposed architecture accomplishes those objectives in a decentralized environment by adopting isolated data privacy profiles.

    A Novel Privacy Disclosure Risk Measure and Optimizing Privacy Preserving Data Publishing Techniques

    A tremendous amount of individual-level data is generated each day, with a wide variety of uses. This data often contains sensitive information about individuals, which can be disclosed by "adversaries". Even when direct identifiers such as social security numbers are masked, an adversary may be able to recognize an individual's identity for a data record by looking at the values of quasi-identifiers (QID), known as identity disclosure, or can uncover sensitive attributes (SA) about an individual through attribute disclosure. In the data privacy field, multiple disclosure risk measures have been proposed. These share two drawbacks: they do not consider identity and attribute disclosure concurrently, and they make restrictive assumptions about an adversary's knowledge and disclosure target by assuming that certain attributes are QIDs and others are SAs, with a clear boundary between them. In this study, we present a Flexible Adversary Disclosure Risk (FADR) measure that addresses these limitations by presenting a single combined metric of identity and attribute disclosure, and by considering all scenarios for an adversary's knowledge and disclosure targets while providing the flexibility to model a specific disclosure preference. In addition, we employ the FADR measure to develop our novel "RU Generalization" algorithm, which anonymizes a sensitive dataset so that the data can be published for public access while preserving the privacy of individuals in the dataset. The challenge is to preserve privacy without incurring excessive information loss. Our RU Generalization algorithm is a greedy heuristic algorithm that aims at minimizing the combination of both disclosure risk and information loss to obtain an optimized anonymized dataset. We have conducted a set of experiments on a benchmark dataset from the 1994 Census database to evaluate both our FADR measure and RU Generalization algorithm. 
We have shown the robustness of our FADR measure and the effectiveness of our RU Generalization algorithm by comparing with the benchmark anonymization algorithm.
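
    To make the two kinds of disclosure concrete, the sketch below computes two simple, commonly used per-group quantities on a toy generalized table: an identity disclosure risk of 1/k for a group of k records sharing the same quasi-identifier values, and an attribute disclosure risk equal to the share of the most frequent sensitive value in that group. This is not the FADR measure, which the abstract describes as combining both risks without fixing which attributes are QIDs and SAs; the table and the function name `disclosure_risks` are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def disclosure_risks(records: Sequence[Record], qids: Sequence[str],
                     sensitive: str) -> Dict[Tuple[str, ...], Tuple[float, float]]:
    """Map each QID group to (identity risk, attribute risk).

    Identity risk is 1/k for a group of k indistinguishable records.
    Attribute risk is the fraction of the group holding its most common
    sensitive value, i.e. the adversary's best guess accuracy.
    """
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in qids)
        groups[key].append(record[sensitive])

    risks = {}
    for key, values in groups.items():
        identity = 1.0 / len(values)
        attribute = Counter(values).most_common(1)[0][1] / len(values)
        risks[key] = (identity, attribute)
    return risks


# Invented, already generalized toy table.
records = [
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "cancer"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
]

risks = disclosure_risks(records, qids=("age", "zip"), sensitive="disease")
for group, (identity, attribute) in risks.items():
    print(group, identity, attribute)
```

    The first group shows why the two risks need to be considered together: each record hides among two, yet the sensitive value is fully revealed because both records share it.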

    Modelling and Detecting Faults of Permanent Magnet Synchronous Motors in Dynamic Operations

    Paper VI is excluded from the dissertation until the article is published. Permanent magnet synchronous motors (PMSMs) have played a key role in commercial and industrial applications, e.g. electric vehicles and wind turbines. They are popular due to their high efficiency, simple control and large torque-to-size ratio, although they are expensive. A fault will eventually occur in an operating PMSM, caused either by improper maintenance or by wear from thermal and mechanical stresses. The most frequent PMSM faults are bearing faults, short circuits and eccentricity. PMSMs may also suffer from demagnetisation, which is unique to permanent magnet machines. Condition monitoring or fault diagnosis schemes are necessary for detecting and identifying these faults early in their incipient state, e.g. partial demagnetisation and inter-turn short circuit. Successful fault classification will ensure safe operations, speed up the maintenance process and decrease unexpected downtime and cost. Research in recent years has been drawn towards fault analysis under dynamic operating conditions, i.e. variable load and speed. Most of these techniques have focused on the use of voltage, current and torque, while magnetic flux density in the air gap or the proximity of the motor has not yet been fully exploited. This dissertation focuses on two main research topics in modelling and diagnosis of faulty PMSMs in dynamic operations. The first problem is to decrease the computational burden of modelling and analysis techniques. The first contributions are new and faster methods for computing the permeance network model and quadratic time-frequency distributions. Reducing their computational burden makes them more attractive in analysis or fault diagnosis. The second contribution is to expand the model description of a simpler model. This can be achieved through a field reconstruction model with a magnet library and a description of both magnet defects and inter-turn short circuits. 
The second research topic is to simplify the installation and complexity of fault diagnosis schemes in PMSMs. The aim is to reduce the number of sensors required by fault diagnosis schemes, regardless of operation profiles. Conventional methods often rely on either steady-state or predefined operation profiles, e.g. start-up. A fault diagnosis scheme robust to any speed changes is desirable since a fault can then be detected regardless of operations. The final contribution is the implementation of reinforcement learning in an active learning scheme to address the imbalanced dataset problem. Samples from a faulty PMSM are often initially unavailable and expensive to acquire. Reinforcement learning with a weighted reward function might balance the dataset to enhance the trained fault classifier's performance.

    Machine Learning Methods for Brain Image Analysis

    Understanding how the brain functions and quantifying compound interactions between complex synaptic networks inside the brain remain some of the most challenging problems in neuroscience. Lack or abundance of data, shortage of manpower, and heterogeneity of data from various species all add complexity to an already perplexing problem. Vast amounts of brain data need to be processed automatically, yet with an accuracy close to manual human-level performance. These automated methods need to generalize well to be able to accommodate data from different species. Also, novel approaches and techniques are becoming a necessity to reveal the correlations between different data modalities in the brain at the global level. In this dissertation, I mainly focus on two problems: automatic segmentation of brain electron microscopy (EM) images and stacks, and integrative analysis of gene expression and synaptic connectivity in the brain. I propose to use deep learning algorithms for the 2D segmentation of EM images. I designed an automated pipeline with novel insights that was able to achieve state-of-the-art performance on the segmentation of the Drosophila brain. I also propose a novel technique for 3D segmentation of EM image stacks that can be trained end-to-end with no prior knowledge of the data. This technique was evaluated in an ongoing online challenge for 3D segmentation of neurites, where it achieved accuracy close to that of a second human observer. Later, I employed ensemble learning methods to perform the first systematic integrative analysis of the genome and connectome in the mouse brain at both the regional and voxel levels. I show that the connectivity signals can be predicted from the gene expression signatures with extremely high accuracy. Furthermore, I show that only a certain fraction of genes is responsible for this predictive aspect. 
Rich functional and cellular analyses of these genes are detailed to validate these findings.