157 research outputs found

    Share your Model instead of your Data: Privacy Preserving Mimic Learning for Ranking

    Get PDF
    Deep neural networks have become a primary tool for solving problems in many fields. They are also used for addressing information retrieval problems and show strong performance in several tasks. Training these models requires large, representative datasets and for most IR tasks, such data contains sensitive information from users. Privacy and confidentiality concerns prevent many data owners from sharing the data, thus today the research community can only benefit from research on large-scale datasets in a limited manner. In this paper, we discuss privacy preserving mimic learning, i.e., using predictions from a privacy preserving trained model instead of labels from the original sensitive training data as a supervision signal. We present the results of preliminary experiments in which we apply the idea of mimic learning and privacy preserving mimic learning for the task of document re-ranking as one of the core IR tasks. This research is a step toward laying the ground for enabling researchers from data-rich environments to share knowledge learned from actual users' data, which should facilitate research collaborations.Comment: SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR'17)}{}{August 7--11, 2017, Shinjuku, Tokyo, Japa

    Contributions to Lifelogging Protection In Streaming Environments

    Get PDF
    Tots els dies, més de cinc mil milions de persones generen algun tipus de dada a través d'Internet. Per accedir a aquesta informació, necessitem utilitzar serveis de recerca, ja siguin motors de cerca web o assistents personals. A cada interacció amb ells, el nostre registre d'accions, logs, s'utilitza per oferir una millor experiència. Per a les empreses, també són molt valuosos, ja que ofereixen una forma de monetitzar el servei. La monetització s'aconsegueix venent dades a tercers, però, els logs de consultes podrien exposar informació confidencial de l'usuari (identificadors, malalties, tendències sexuals, creences religioses) o usar-se per al que es diu "life-logging ": Un registre continu de les activitats diàries. La normativa obliga a protegir aquesta informació. S'han proposat prèviament sistemes de protecció per a conjunts de dades tancats, la majoria d'ells treballant amb arxius atòmics o dades estructurades. Desafortunadament, aquests sistemes no s'adapten quan es fan servir en el creixent entorn de dades no estructurades en temps real que representen els serveis d'Internet. Aquesta tesi té com objectiu dissenyar tècniques per protegir la informació confidencial de l'usuari en un entorn no estructurat d’streaming en temps real, garantint un equilibri entre la utilitat i la protecció de dades. S'han fet tres propostes per a una protecció eficaç dels logs. La primera és un nou mètode per anonimitzar logs de consultes, basat en k-anonimat probabilística i algunes eines de desanonimització per determinar fuites de dades. El segon mètode, s'ha millorat afegint un equilibri configurable entre privacitat i usabilitat, aconseguint una gran millora en termes d'utilitat de dades. La contribució final es refereix als assistents personals basats en Internet. La informació generada per aquests dispositius es pot considerar "life-logging" i pot augmentar els riscos de privacitat de l'usuari. Es proposa un esquema de protecció que combina anonimat de logs i signatures sanitizables.Todos los días, más de cinco mil millones de personas generan algún tipo de dato a través de Internet. Para acceder a esa información, necesitamos servicios de búsqueda, ya sean motores de búsqueda web o asistentes personales. En cada interacción con ellos, nuestro registro de acciones, logs, se utiliza para ofrecer una experiencia más útil. Para las empresas, también son muy valiosos, ya que ofrecen una forma de monetizar el servicio, vendiendo datos a terceros. Sin embargo, los logs podrían exponer información confidencial del usuario (identificadores, enfermedades, tendencias sexuales, creencias religiosas) o usarse para lo que se llama "life-logging": Un registro continuo de las actividades diarias. La normativa obliga a proteger esta información. Se han propuesto previamente sistemas de protección para conjuntos de datos cerrados, la mayoría de ellos trabajando con archivos atómicos o datos estructurados. Desafortunadamente, esos sistemas no se adaptan cuando se usan en el entorno de datos no estructurados en tiempo real que representan los servicios de Internet. Esta tesis tiene como objetivo diseñar técnicas para proteger la información confidencial del usuario en un entorno no estructurado de streaming en tiempo real, garantizando un equilibrio entre utilidad y protección de datos. Se han hecho tres propuestas para una protección eficaz de los logs. La primera es un nuevo método para anonimizar logs de consultas, basado en k-anonimato probabilístico y algunas herramientas de desanonimización para determinar fugas de datos. El segundo método, se ha mejorado añadiendo un equilibrio configurable entre privacidad y usabilidad, logrando una gran mejora en términos de utilidad de datos. La contribución final se refiere a los asistentes personales basados en Internet. La información generada por estos dispositivos se puede considerar “life-logging” y puede aumentar los riesgos de privacidad del usuario. Se propone un esquema de protección que combina anonimato de logs y firmas sanitizables.Every day, more than five billion people generate some kind of data over the Internet. As a tool for accessing that information, we need to use search services, either in the form of Web Search Engines or through Personal Assistants. On each interaction with them, our record of actions via logs, is used to offer a more useful experience. For companies, logs are also very valuable since they offer a way to monetize the service. Monetization is achieved by selling data to third parties, however query logs could potentially expose sensitive user information: identifiers, sensitive data from users (such as diseases, sexual tendencies, religious beliefs) or be used for what is called ”life-logging”: a continuous record of one’s daily activities. Current regulations oblige companies to protect this personal information. Protection systems for closed data sets have previously been proposed, most of them working with atomic files or structured data. Unfortunately, those systems do not fit when used in the growing real-time unstructured data environment posed by Internet services. This thesis aims to design techniques to protect the user’s sensitive information in a non-structured real-time streaming environment, guaranteeing a trade-off between data utility and protection. In this regard, three proposals have been made in efficient log protection. The first is a new method to anonymize query logs, based on probabilistic k-anonymity and some de-anonymization tools to determine possible data leaks. A second method has been improved in terms of a configurable trade-off between privacy and usability, achieving a great improvement in terms of data utility. Our final contribution concerns Internet-based Personal Assistants. The information generated by these devices is likely to be considered life-logging, and it can increase the user’s privacy risks. The proposal is a protection scheme that combines log anonymization and sanitizable signatures

    Universal Workload-based Graph Partitioning and Storage Adaption for Distributed RDF Stores

    Get PDF
    The publication of machine-readable information has been significantly increasing both in the magnitude and complexity of the embedded relations. The Resource Description Framework(RDF) plays a big role in modeling and linking web data and their relations. In line with that important role, dedicated systems were designed to store and query the RDF data using a special queering language called SPARQL similar to the classic SQL. However, due to the high size of the data, several federated working nodes were used to host a distributed RDF store. The data needs to be partitioned, assigned, and stored in each working node. After partitioning, some of the data needs to be replicated in order to avoid the communication cost, and balance the loads for better system throughput. Since replications require more storage space, the important two questions are: what data to replicate? And how much? The answer to the second question is related to other storage-space requirements at each working node like indexes and cache. In order to efficiently answer SPARQL queries, each working node needs to put its share of data into multiple indexes. Those indexes have a data-wide size and consume a considerable amount of storage space. In this context, the same two questions about replications are also raised about indexes. The third storage-consuming structure is the join cache. It is a special index where the frequent join results are cached and save a considerable amount of running time on the cost of high storage space consumption. Again, the same two questions of replication and indexes are applicable to the join-cache. In this thesis, we present a universal adaption approach to the storage of a distributed RDF store. The system aims to find optimal data assignments to the different indexes, replications, and join cache within the limited storage space. To achieve this, we present a cost model based on the workload that often contains frequent patterns. The workload is dynamically analyzed to evaluate predefined rules. Those rules tell the system about the benefits and costs of assigning which data to what structure. The objective is to have better query execution time. Besides the storage adaption, the system adapts its processing resources with the queries' arrival rate. The aim of this adaption is to have better parallelization per query while still provides high system throughput

    Decentralized Personal Data Marketplaces: How Participation in a DAO Can Support the Production of Citizen-Generated Data

    Get PDF
    Big Tech companies operating in a data-driven economy offer services that rely on their users’ personal data and usually store this personal information in “data silos” that prevent transparency about their use and opportunities for data sharing for public interest. In this paper, we present a solution that promotes the development of decentralized personal data marketplaces, exploiting the use of Distributed Ledger Technologies (DLTs), Decentralized File Storages (DFS) and smart contracts for storing personal data and managing access control in a decentralized way. Moreover, we focus on the issue of a lack of efficient decentralized mechanisms in DLTs and DFSs for querying a certain type of data. For this reason, we propose the use of a hypercube-structured Distributed Hash Table (DHT) on top of DLTs, organized for efficient processing of multiple keyword-based queries on the ledger data. We test our approach with the implementation of a use case regarding the creation of citizen-generated data based on direct participation and the involvement of a Decentralized Autonomous Organization (DAO). The performance evaluation demonstrates the viability of our approach for decentralized data searches, distributed authorization mechanisms and smart contract exploitation

    On relational learning and discovery in social networks: a survey

    Get PDF
    The social networking scene has evolved tremendously over the years. It has grown in relational complexities that extend a vast presence onto popular social media platforms on the internet. With the advance of sentimental computing and social complexity, relationships which were once thought to be simple have now become multi-dimensional and widespread in the online scene. This explosion in the online social scene has attracted much research attention. The main aims of this work revolve around the knowledge discovery and datamining processes of these feature-rich relations. In this paper, we provide a survey of relational learning and discovery through popular social analysis of different structure types which are integral to applications within the emerging field of sentimental and affective computing. It is hoped that this contribution will add to the clarity of how social networks are analyzed with the latest groundbreaking methods and provide certain directions for future improvements

    Tunable Security for Deployable Data Outsourcing

    Get PDF
    Security mechanisms like encryption negatively affect other software quality characteristics like efficiency. To cope with such trade-offs, it is preferable to build approaches that allow to tune the trade-offs after the implementation and design phase. This book introduces a methodology that can be used to build such tunable approaches. The book shows how the proposed methodology can be applied in the domains of database outsourcing, identity management, and credential management

    Privaatsuskaitse tehnoloogiaid äriprotsesside kaeveks

    Get PDF
    Protsessikaeve tehnikad võimaldavad organisatsioonidel analüüsida protsesside täitmise käigus tekkivaid logijälgi eesmärgiga leida parendusvõimalusi. Nende tehnikate eelduseks on, et nimetatud logijälgi koondavad sündmuslogid on andmeanalüütikutele analüüside läbi viimiseks kättesaadavad. Sellised sündmuslogid võivad sisaldada privaatset informatsiooni isikute kohta kelle jaoks protsessi täidetakse. Sellistel juhtudel peavad organisatsioonid rakendama privaatsuskaitse tehnoloogiaid (PET), et võimaldada analüütikul sündmuslogi põhjal järeldusi teha, samas säilitades isikute privaatsust. Kuigi PET tehnikad säilitavad isikute privaatsust organisatsiooni siseselt, muudavad nad ühtlasi sündmuslogisid sellisel viisil, mis võib viia analüüsi käigus valede järeldusteni. PET tehnikad võivad lisada sündmuslogidesse sellist uut käitumist, mille esinemine ei ole reaalses sündmuslogis võimalik. Näiteks võivad mõned PET tehnikad haigla sündmuslogi anonüümimisel lisada logijälje, mille kohaselt patsient külastas arsti enne haiglasse saabumist. Käesolev lõputöö esitab privaatsust säilitavate lähenemiste komplekti nimetusega privaatsust säilitav protsessikaeve (PPPM). PPPM põhiline eesmärk on leida tasakaal võimaliku sündmuslogi analüüsist saadava kasu ja analüüsile kohaldatavate privaatsusega seonduvate regulatsioonide (näiteks GDPR) vahel. Lisaks pakub käesolev lõputöö lahenduse, mis võimaldab erinevatel organisatsioonidel protsessikaevet üle ühise andmete terviku rakendada, ilma oma privaatseid andmeid üksteisega jagamata. Käesolevas lõputöös esitatud tehnikad on avatud lähtekoodiga tööriistadena kättesaadavad. Nendest tööriistadest esimene on Amun, mis võimaldab sündmuslogi omanikul sündmuslogi anonüümida enne selle analüütikule jagamist. Teine tööriist on Libra, mis pakub täiendatud võimalusi kasutatavuse ja privaatsuse tasakaalu leidmiseks. Kolmas tööriist on Shareprom, mis võimaldab organisatsioonidele ühiste protsessikaartide loomist sellisel viisil, et ükski osapool ei näe teiste osapoolte andmeid.Process Mining Techniques enable organizations to analyze process execution traces to identify improvement opportunities. Such techniques need the event logs (which record process execution) to be available for data analysts to perform the analysis. These logs contain private information about the individuals for whom a process is being executed. In such cases, organizations need to deploy Privacy-Enhancing Technologies (PETs) to enable the analyst to drive conclusions from the event logs while preserving the privacy of individuals. While PETs techniques preserve the privacy of individuals inside the organization, they work by perturbing the event logs in such a way that may lead to misleading conclusions of the analysis. They may inject new behaviors into the event logs that are impossible to exist in real-life event logs. For example, some PETs techniques anonymize a hospital event log by injecting a trace that a patient may visit a doctor before checking in inside the hospital. In this thesis, we propose a set of privacy-preserving approaches that we call Privacy-Preserving Process Mining (PPPM) approaches to strike a balance between the benefits an analyst can get from analyzing these event logs and the requirements imposed on them by privacy regulations (e.g., GDPR). Also, in this thesis, we propose an approach that enables organizations to jointly perform process mining over their data without sharing their private information. The techniques proposed in this thesis have been proposed as open-source tools. The first tool is Amun, enabling an event log publisher to anonymize their event log before sharing it with an analyst. The second tool is called Libra, which provides an enhanced utility-privacy tradeoff. The third tool is Shareprom, which enables organizations to construct process maps jointly in such a manner that no party learns the data of the other parties.https://www.ester.ee/record=b552434
    corecore