9 research outputs found
Sistema de anonimización de datos estructurados
The most common approaches used in industry to protect private data impair its utility for analytics. For this reason, this work proposes Anonylitics, a system for the anonymization of structured data that preserves the distribution of numerical data while guaranteeing privacy. The proposal makes it possible to retain information that is useful for business data analytics, as evidenced by a validation in which two real data sets were anonymized, demonstrating the potential of the system and its algorithms.
Magíster en Ingeniería de Sistemas y Computación
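The abstract does not specify Anonylitics' algorithms. As an illustration of the general idea only (hiding individual numeric values while roughly preserving their distribution), the sketch below uses univariate microaggregation, a standard anonymization technique assumed here for demonstration; the function name `microaggregate`, the group size `k`, and the salary data are all hypothetical.

```python
import numpy as np

def microaggregate(values, k=10):
    """Univariate microaggregation: sort the records, form groups of k,
    and replace each value with its group mean. Individual values are
    hidden, yet the overall distribution shape (and the mean exactly)
    is preserved -- a common distribution-aware anonymization technique,
    not necessarily the one used by Anonylitics."""
    order = np.argsort(values)
    out = np.empty_like(values, dtype=float)
    for start in range(0, len(values), k):
        idx = order[start:start + k]
        out[idx] = values[idx].mean()
    return out

# Hypothetical numeric attribute (e.g. salaries) to anonymize
rng = np.random.default_rng(42)
salaries = rng.lognormal(mean=10, sigma=0.5, size=1000)
anon = microaggregate(salaries, k=10)
```

Each record becomes indistinguishable from at least k-1 others within its group, while aggregate statistics computed over `anon` stay close to those of the original column.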
Supporting Autonomic Management of Clouds: Service Clustering with Random Forest
A promising solution for the management of services in clouds, as fostered by autonomic computing, is to resort to self-management. However, the obfuscation of the underlying details of services in cloud computing, partly due to privacy requirements, affects the effectiveness of autonomic managers. Data-driven approaches, in particular those relying on service clustering based on machine learning techniques, can assist autonomic management and support decisions concerning, e.g., the scheduling and deployment of services. Unfortunately, applying such approaches is further complicated by the coexistence of different types of data within the information provided by the monitoring of cloud systems: both continuous (e.g., CPU load) and categorical (e.g., VM instance type) data are available. Current approaches deal with this problem in a heuristic fashion. In this paper, instead, we propose an approach that uses all types of data and learns, in a data-driven fashion, the similarities and patterns among the services. More specifically, we design an unsupervised formulation of random forest to calculate service similarities and provide them as input to a clustering algorithm. For the sake of efficiency, and to meet the dynamism requirement of autonomic clouds, our methodology consists of two steps: 1) off-line clustering and 2) on-line prediction. Using datasets from real-world clouds, we demonstrate the superiority of our solution with respect to others and validate the accuracy of the on-line prediction. Moreover, to show the applicability of our approach, we devise a service scheduler that uses similarity among services, and evaluate its performance in a cloud test-bed using realistic data.
Visualisation of Large-Scale Call-Centre Data
The contact centre industry employs 4% of the entire United Kingdom and United States' working population and generates gigabytes of operational data that require analysis, to provide insight and to improve efficiency. This thesis is the result of a collaboration with QPC Limited, who provide data collection and analysis products for call centres. They provided a large data set featuring almost 5 million calls to be analysed. This thesis utilises novel visualisation techniques to create tools for the exploration of the large, complex call centre data set and to facilitate unique observations into the data.

A survey of information visualisation books is presented, providing a thorough background of the field. Following this, a feature-rich application that visualises large call centre data sets using scatterplots supporting millions of points is presented. The application utilises both CPU and GPU acceleration for processing and filtering, and is exhibited with millions of call events.

This is expanded upon with the use of glyphs to depict agent behaviour in a call centre. A technique is developed to cluster overlapping glyphs into a single parent glyph dependent on zoom level and a customisable distance metric. This hierarchical glyph represents the mean value of all child agent glyphs, removing overlap and reducing visual clutter. A novel technique for visualising individually tailored glyphs using a Graphics Processing Unit is also presented, and demonstrated rendering over 100,000 glyphs at interactive frame rates. An open-source code example is provided for reproducibility.

Finally, a novel interaction and layout method is introduced for improving the scalability of chord diagrams to visualise call transfers. An exploration of sketch-based methods for showing multiple links and direction is made, and a sketch-based brushing technique for filtering is proposed. Feedback from domain experts in the call centre industry is reported for all applications developed.
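The zoom-dependent glyph clustering described above can be approximated as follows. This greedy single-pass sketch is an assumption for illustration, not the thesis's published algorithm; `merge_dist` and the running-mean parent placement are hypothetical choices.

```python
import numpy as np

def cluster_glyphs(positions, zoom, merge_dist=40.0):
    """Greedy merge of overlapping glyphs: glyphs whose screen-space
    distance at the current zoom level falls below `merge_dist` collapse
    into one parent glyph drawn at the mean position of its children,
    removing overlap and reducing clutter. A simplification of the
    thesis's hierarchical scheme."""
    screen = np.asarray(positions, dtype=float) * zoom
    parents = []  # each entry: [mean screen position, child indices]
    for i, p in enumerate(screen):
        for parent in parents:
            centre, members = parent
            if np.linalg.norm(p - centre) < merge_dist:
                members.append(i)
                parent[0] = screen[members].mean(axis=0)  # update running mean
                break
        else:
            parents.append([p, [i]])
    return parents

# Zooming out merges nearby agent glyphs; zooming in splits them apart.
agents = [[0, 0], [1, 1], [100, 100]]
coarse = cluster_glyphs(agents, zoom=0.1)  # all three collapse together
fine = cluster_glyphs(agents, zoom=10)     # the distant agent stays separate
```

A real implementation would recompute the clustering (or walk a precomputed hierarchy) whenever the zoom level changes, and render each parent glyph from the aggregated values of its children.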
BRHIM - Base de Registros Hospitalares para Informações e Metadados
The risks of re-identifying hospital data are high, and there is demand for such data in projects for the development and validation of Artificial Intelligence (AI). This work addresses the main methods of preparing hospital records for observational studies, with the aim of assessing the risk of re-identification and the impact that the information loss produced by anonymization has on AI results. A review of the subject is presented first, followed by two articles, always considering the use of hospital records in epidemiological studies. The first article proposes a domain ontology that defines a scope for addressing anonymization: the types of attacks, the types of data and attributes, the privacy models, the kinds of AI use, and the different study designs are presented. An example instance of the ontology was built in the Web Protégé tool, made available by Stanford University for constructing ontologies, which allows the ontology to be replicated. The second article defines a hospital-record preparation recipe with 5 steps that implements pseudonymization, de-identification, and anonymization of data, and compares the effects of these steps on an AI application. To this end, a Datathon event was held to develop an AI predictor of in-hospital mortality. Comparing the AI results obtained on the original data and on the anonymized data showed a difference of less than 1% in AUC-ROC, while the risk of a patient being identified was reduced by 95%, demonstrating that the preparation can be systematized, adding privacy and quantifying the information loss in order to make them transparent.
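Of the three preparation levels mentioned (pseudonymization, de-identification, anonymization), the first is the simplest to sketch. The keyed-hash pseudonymization below is a common practice assumed here for illustration, not necessarily the recipe's actual step; the key and the `pseudonymize` helper are hypothetical.

```python
import hmac
import hashlib

# Hypothetical secret: in practice this would come from a key-management
# system, never be stored with the data, and be rotated per policy.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(patient_id: str) -> str:
    """Deterministic pseudonym via HMAC-SHA256: the same patient always
    maps to the same token (so records can still be linked across tables),
    but the mapping cannot be reversed without the key."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

token = pseudonymize("patient-0042")
```

De-identification and anonymization go further, removing or generalizing quasi-identifiers (dates, postcodes, rare diagnoses), which is where the information loss measured against the AUC-ROC in the abstract arises.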
Evaluating human-centered approaches for geovisualization
Working with two small groups of domain experts, I evaluate human-centered approaches to application development which are applicable to geovisualization, following an ISO 13407 taxonomy that covers context of use, eliciting requirements, and design. These approaches include field studies and contextual analysis of subjects' context; establishing requirements using a template, via a lecture to communicate geovisualization to subjects, and by communicating subjects' context to geovisualization experts with a scenario; autoethnography to understand the geovisualization design process; wireframe, paper, and digital interactive prototyping with alternative protocols; and a decision-making process for prioritising application improvements. I find that the acquisition and use of real user data is key, and that a template approach and teaching subjects about visualization tools and interactions both fail to elicit useful requirements for a visualization application. Consulting geovisualization experts with a scenario of user context and samples of user data does yield suggestions for tools and interactions of use to a visualization designer. The complex and composite natures of both the visualization and human-centered domains, together with user context, make design challenging. Wireframe, paper, and digital interactive prototypes mediate between the user and visualization domains successfully, eliciting exploratory behaviour and suggestions to improve the prototypes. Paper prototypes are particularly successful at eliciting suggestions, especially novel visualization improvements. Decision-making techniques prove useful for prioritising different possible improvements, although domain subjects select data-related features over more novel alternatives and rank these more inconsistently.

The research concludes that understanding subject context of use and data is important and occurs throughout the process of engagement with domain experts, and that standard requirements-elicitation techniques are unsuccessful for geovisualization. Engaging subjects at an early stage with simple prototypes incorporating real subject data, and moving to successively more complex prototypes, holds the best promise for creating successful geovisualization applications.
Actas de las VI Jornadas Nacionales (JNIC2021 LIVE)
These conference proceedings have become a meeting forum for the most relevant actors in the field of cybersecurity in Spain. They not only present some of the leading scientific work across the various areas of cybersecurity, but also pay special attention to training and educational innovation in cybersecurity, as well as to connections with industry through technology-transfer proposals. Indeed, this year the Transfer Programme introduces some changes to its operation and development, designed to improve it and make it more valuable to the entire cybersecurity research community.