16 research outputs found
Advances in knowledge discovery and data mining Part II
19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II
Advances in Information Security and Privacy
With the recent pandemic emergency, many people are spending their days working remotely and have increased their use of digital resources for both work and entertainment. As a result, the amount of digital information handled online has increased dramatically, and so has the number of attacks, breaches, and hacks. This Special Issue aims to establish the state of the art in protecting information by mitigating information risks. It does so by presenting both surveys on specific topics and original approaches and solutions to specific problems. In total, 16 papers have been published in this Special Issue.
Automated Assessment of the Aftermath of Typhoons Using Social Media Texts
Disasters are among the major threats to economies and human societies, causing substantial losses of human life, property, and infrastructure. Understanding, preventing, and reducing such disasters has been a persistent endeavor, and the spread of social media offers new opportunities to enhance disaster management through crowdsourcing. However, social media data is also characterized by extreme brevity, heavy noise, and informal language. The existing literature either has not fully addressed these disadvantages or devotes vast manual effort to tackling them.
The major focus of this research is on constructing a holistic framework to exploit social media data in typhoon damage assessment. The scope of this research covers data collection, relevance classification, location extraction and damage assessment while assorted approaches are utilized to overcome the disadvantages of social media data. Moreover, a semi-supervised or unsupervised approach is prioritized in forming the framework to minimize manual intervention.
In data collection, a query expansion strategy is adopted to improve the recall of typhoon-relevant information retrieval. Multiple filtering strategies are developed to screen the keywords and keep them relevant to the search topics as the keyword set is updated. A classifier based on a convolutional neural network is presented for relevance classification, with hashtags and word clusters as extra input channels to augment the information. In location extraction, a model is constructed by integrating Bidirectional Long Short-Term Memory and Conditional Random Fields. Feature noise correction layers and label smoothing are leveraged to handle the noisy training data. Finally, a multi-instance multi-label classifier identifies the damage relations in four categories, and the damage categories of a message are combined with its damage description score to obtain a damage severity score for the message.
A case study is conducted to verify the effectiveness of the framework. The outcomes indicate that the approaches and models developed in this study significantly improve the classification of social media texts, especially under semi-supervised or unsupervised learning. Moreover, the damage assessment results from social media data are remarkably consistent with the official statistics, which demonstrates the practicality of the proposed damage scoring scheme.
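The abstract above mentions label smoothing as one way to cope with noisy training labels. The following is a minimal sketch of the standard uniform label smoothing formula, not the thesis's own implementation; the function name `smooth_labels` and the default `epsilon` are assumptions chosen for illustration.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Uniform label smoothing: move a fraction epsilon of the probability
    mass from the hard target and spread it evenly over all k classes.

    one_hot -- list of 0/1 values with exactly one 1 (the hard label)
    epsilon -- smoothing strength in [0, 1]; 0.1 is an illustrative default
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


# Hypothetical 4-class tag set for a token, e.g. (O, B-LOC, I-LOC, other).
soft = smooth_labels([0, 1, 0, 0], epsilon=0.1)
```

Training against the softened targets penalizes over-confident predictions, which can reduce the influence of individual mislabeled examples.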
Preserving Individual Privacy with Personal Data Management Systems (Préserver la vie privée des individus grâce aux Systèmes Personnels de Gestion des Données)
Riding the wave of smart disclosure initiatives and new privacy-protection regulations, the Personal Cloud paradigm is emerging through a myriad of solutions that let users gather and manage their whole digital life. On the bright side, this opens the way to novel value-added services that cross multiple data sources of a given person or cross the data of multiple people. Yet this paradigm shift towards user empowerment raises fundamental questions about whether the functionalities and the data management and protection techniques offered by existing solutions are appropriate for lay users. Our work addresses these questions on three levels. First, we review, compare and analyze personal cloud alternatives in terms of the functionalities they provide and the threat models they target. From this analysis, we derive a general set of functionality and security requirements that any Personal Data Management System (PDMS) should consider. We then identify the challenges of implementing such a PDMS and propose a preliminary design for an extensive and secure PDMS reference architecture satisfying the considered requirements. Second, we focus on personal computations for a specific hardware PDMS instance (i.e., a secure token with NAND Flash mass storage). In this context, we propose a scalable embedded full-text search engine to index large document collections and manage tag-based access control policies. Third, we address the problem of collective computations in a fully distributed architecture of PDMSs.
We discuss the system and security requirements and propose protocols to enable distributed query processing with strong security guarantees against an attacker controlling many colluding corrupted nodes.
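The abstract above pairs full-text search with tag-based access control. As a rough illustration of how the two can combine, here is a toy in-memory inverted index whose results are filtered by tags. It is a sketch under stated assumptions, not the thesis's embedded engine: the class `TaggedIndex`, its methods, and the policy "a document is visible if it carries at least one allowed tag" are all hypothetical, and nothing here models NAND Flash constraints.

```python
from collections import defaultdict


class TaggedIndex:
    """Toy inverted index with tag-based result filtering (hypothetical)."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of document ids
        self._tags = {}                    # document id -> frozenset of tags

    def add(self, doc_id, text, tags):
        """Index a document's whitespace-separated terms and record its tags."""
        self._tags[doc_id] = frozenset(tags)
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query, allowed_tags):
        """Return sorted ids of documents that contain every query term
        and carry at least one tag from allowed_tags (assumed policy)."""
        terms = query.lower().split()
        if not terms:
            return []
        # Conjunctive query: intersect the posting sets of all terms.
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        allowed = set(allowed_tags)
        return sorted(d for d in hits if self._tags[d] & allowed)


index = TaggedIndex()
index.add(1, "bank statement march", {"finance"})
index.add(2, "bank holiday photos", {"family"})
visible = index.search("bank", {"finance"})
```

In this sketch the filter runs after the term lookup, so a caller never sees identifiers of documents outside its allowed tags.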
Extensible metadata management framework for personal data lake
Ordinary Internet users today are inundated with a deluge of diverse data generated and siloed in a variety of digital services, applications, and a growing body of personal computing devices as we enter the era of the Internet of Things. Alongside potential privacy compromises, users face increasing difficulties in managing their data and are losing control over it. There appears to be a de facto agreement in business and scientific fields that users can gain critical new value and interesting insight from analysing their own data, if only it can be freed from its silos and combined with other data in meaningful ways. This thesis takes the point of view that users should have an easy-to-use modern personal data management solution that enables them to centralise and efficiently manage their data by themselves, under their full control, in their best interests, with minimal time and effort. In that direction, we describe the basic architecture of a management solution that is designed based on solid theoretical foundations and state-of-the-art big data technologies. This solution (called Personal Data Lake, PDL) collects the data of a user from a plurality of heterogeneous personal data sources and stores it in a highly scalable schema-less storage repository. To simplify the user experience of PDL, we propose a novel extensible metadata management framework (MMF) that: (i) annotates heterogeneous data with rich lineage and semantic metadata, (ii) exploits the garnered metadata for automating data management workflows in PDL, with extensive focus on data integration, and (iii) facilitates the use and reuse of the stored data for various purposes by querying it on the metadata level, either directly by the user or through third-party personal analytics services.
We first show how the proposed MMF is positioned in the PDL architecture, and then describe its principal components. Specifically, we introduce a simple yet effective lineage manager for tracking the provenance of personal data in PDL. We then introduce an ontology-based data integration component called SemLinker, which comprises two new algorithms: the first generates graph-based representations that express the native schemas of (semi-)structured personal data, and the second metamodels the extracted representations to a common extensible ontology. SemLinker outputs are utilised by MMF to generate user-tailored unified views that are optimised for querying heterogeneous personal data through low-level SPARQL or high-level SQL-like queries. Next, we introduce an unsupervised automatic keyphrase extraction algorithm called SemCluster that specialises in extracting thematically important keyphrases from unstructured data and associating each keyphrase with ontological information drawn from an extensible WordNet-based ontology. SemCluster outputs serve as semantic metadata and are utilised by MMF to annotate unstructured contents in PDL, thus enabling various management functionalities such as relationship discovery and semantic search. Finally, we describe how MMF can be utilised to perform holistic integration of personal data and to query it jointly in its native representations.
Computational Methods in Multi-Messenger Astrophysics using Gravitational Waves and High Energy Neutrinos
This dissertation describes advancements made in computational methods for multi-messenger astrophysics (MMA) using gravitational waves (GW) and neutrinos during Advanced LIGO (aLIGO)’s first through third observing runs (O1-O3) and, looking forward, describes novel computational techniques suited to the challenges of both the burgeoning MMA field and high-performance computing as a whole.
The first two chapters provide an overview of MMA as it pertains to gravitational wave/high energy neutrino (GWHEN) searches, including a summary of expected astrophysical sources as well as GW, neutrino, and gamma-ray detectors used in their detection. These are followed in the third chapter by an in-depth discussion of LIGO’s timing system, particularly the diagnostic subsystem, describing both its role in MMA searches and the author’s contributions to the system itself.
The fourth chapter provides a detailed description of the Low-Latency Algorithm for Multi-messenger Astrophysics (LLAMA), the GWHEN pipeline developed by the author and used in O2 and O3. Relevant past multi-messenger searches are described first, followed by the O2 and O3 analysis methods, the pipeline’s performance, scientific results, and finally, an in-depth account of the library’s structure and functionality. In particular, the author’s high-performance multi-order coordinates (MOC) HEALPix image analysis library, HPMOC, is described. HPMOC increases performance of HEALPix image manipulations by several orders of magnitude vs. naive single-resolution approaches while presenting a simple high-level interface and should prove useful for diverse future MMA searches. The performance improvements it provides for LLAMA are also covered.
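For readers unfamiliar with multi-order HEALPix maps, the sketch below shows the NUNIQ indexing convention commonly used for multi-order coverage maps, in which a pixel at resolution order `k` with NESTED index `ipix` is stored as the single integer `4 * 4**k + ipix`. This is an illustration of that general convention only; it is not HPMOC's interface, and the function names are assumptions made for this example.

```python
def nuniq_encode(order: int, ipix: int) -> int:
    """Pack a HEALPix (order, NESTED pixel index) pair into one NUNIQ integer."""
    npix = 12 * 4**order  # number of pixels on the sphere at this order
    if order < 0 or not 0 <= ipix < npix:
        raise ValueError("pixel index out of range for this order")
    return 4 * 4**order + ipix


def nuniq_decode(uniq: int) -> tuple[int, int]:
    """Recover (order, ipix) from a NUNIQ integer."""
    if uniq < 4:
        raise ValueError("NUNIQ indices start at 4")
    # order = floor(log2(uniq / 4) / 2), computed with integer arithmetic.
    order = ((uniq // 4).bit_length() - 1) // 2
    return order, uniq - 4 * 4**order


def children_at(order: int, ipix: int, finer_order: int) -> range:
    """NESTED indices at finer_order covered by one coarser pixel.

    In NESTED ordering each pixel splits into four consecutive children,
    so a coarse pixel maps to one contiguous index range.
    """
    if finer_order < order:
        raise ValueError("finer_order must be >= order")
    scale = 4 ** (finer_order - order)
    return range(ipix * scale, (ipix + 1) * scale)


uniq = nuniq_encode(1, 47)            # last pixel at order 1
order, ipix = nuniq_decode(uniq)      # round trip back to (1, 47)
```

Because one coarse pixel stands in for a contiguous block of fine pixels, a multi-order map can describe smooth regions of the sky with far fewer entries than a single-resolution map.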
The final chapter of this dissertation builds on the approaches taken in developing HPMOC, presenting several novel methods for efficiently storing and analyzing large data sets, with applications to MMA and other data-intensive fields. A family of depth-first multi-resolution orderings of HEALPix images (DEPTH9, DEPTH19, and DEPTH40) is defined, along with algorithms and use cases where these orderings can improve on current approaches, including high-speed streaming calculations suitable for serverless compute or FPGAs.
For performance-constrained analyses on HEALPix data (e.g. image analysis in multi-messenger search pipelines) using SIMD processors, breadth-first data structures can provide short-circuiting calculations in a data-parallel way on compressed data; a simple compression method is described with application to further improving LLAMA performance.
A new storage scheme and associated algorithms for efficiently compressing and contracting tensors of varying sparsity are presented; these demuxed tensors (D-Tensors) have equivalent asymptotic time and space complexity to optimal representations of both dense and sparse matrices, and could be used as a universal drop-in replacement to reduce code complexity and developer effort while improving performance of existing non-optimized numerical code. Finally, the big bucket hash table (B-Table), a novel type of hash table making guarantees on data layout (vs. load factor), is described, along with optimizations it allows for (like hardware acceleration, online rebuilds, and hard realtime applications) that are not possible with existing hash table approaches. These innovations are presented in the hope that some will prove useful for improving future MMA searches and other data-intensive applications.
ICTERI 2020: ICT in Education, Research, and Industrial Applications. Integration, Harmonization, and Knowledge Transfer 2020: Proceedings of the 16th International Conference. Volume II: Workshops. Kharkiv, Ukraine, October 06-10, 2020.
This volume represents the proceedings of the Workshops co-located with the 16th International Conference on ICT in Education, Research, and Industrial Applications, held in Kharkiv, Ukraine, in October 2020. It comprises 101 contributed papers that were carefully peer-reviewed and selected from 233 submissions for the six workshops: RMSEBT, TheRMIT, ITER, 3L-Person, CoSinE, MROL. The volume is structured in six parts, each presenting the contributions for a particular workshop. The topical scope of the volume is aligned with the thematic tracks of ICTERI 2020: (I) Advances in ICT Research; (II) Information Systems: Technology and Applications; (III) Academia/Industry ICT Cooperation; and (IV) ICT in Education.