699 research outputs found

    Counteracting Bloom Filter Encoding Techniques for Private Record Linkage

    Get PDF
    Record Linkage is a process of combining records representing same entity spread across multiple and different data sources, primarily for data analytics. Traditionally, this could be performed with comparing personal identifiers present in data (e.g., given name, surname, social security number etc.). However, sharing information across databases maintained by disparate organizations leads to exchange of personal information pertaining to an individual. In practice, various statutory regulations and policies prohibit the disclosure of such identifiers. Private record linkage (PRL) techniques have been implemented to execute record linkage without disclosing any information about other dissimilar records. Various techniques have been proposed to implement PRL, including cryptographically secure multi-party computational protocols. However, these protocols have been debated over the scalability factors as they are computationally extensive by nature. Bloom filter encoding (BFE) for private record linkage has become a topic of recent interest in the medical informatics community due to their versatility and ability to match records approximately in a manner that is (ostensibly) privacy-preserving. It also has the advantage of computing matches directly in plaintext space making them much faster than their secure mutli-party computation counterparts. The trouble with BFEs lies in their security guarantees: by their very nature BFEs leak information to assist in the matching process. Despite this known shortcoming, BFEs continue to be studied in the context of new heuristically designed countermeasures to address known attacks. A new class of set-intersection attack is proposed in this thesis which re-examines the security of BFEs by conducting experiments, demonstrating an inverse relationship between security and accuracy. With real-world deployment of BFEs in the health information sector approaching, the results from this work will generate renewed discussion around the security of BFEs as well as motivate research into new, more efficient multi-party protocols for private approximate matching

    A survey on Data Extraction and Data Duplication Detection

    Get PDF
    Text mining, also known as Intelligent Text Analysis is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature Extraction is one of the important techniques in data reduction to discover the most important features. Processing massive amount of data stored in a unstructured form is a challenging task. Several pre-processing methods and algorithms are needed to extract useful features from huge amount of data. Dealing with collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. This Paper review the literature on duplicate detection and data fusion (remov e and replace duplicates).The survey provides existing text mining techniques to extract relevant features, detect duplicates and to replace the duplicate data to get fine grained knowledge to the user

    Scalable Quality Assessment of Linked Data

    Get PDF
    In a world where the information economy is booming, poor data quality can lead to adverse consequences, including social and economical problems such as decrease in revenue. Furthermore, data-driven indus- tries are not just relying on their own (proprietary) data silos, but are also continuously aggregating data from different sources. This aggregation could then be re-distributed back to “data lakes”. However, this data (including Linked Data) is not necessarily checked for its quality prior to its use. Large volumes of data are being exchanged in a standard and interoperable format between organisations and published as Linked Data to facilitate their re-use. Some organisations, such as government institutions, take a step further and open their data. The Linked Open Data Cloud is a witness to this. However, similar to data in data lakes, it is challenging to determine the quality of this heterogeneous data, and subsequently to make this information explicit to data consumers. Despite the availability of a number of tools and frameworks to assess Linked Data quality, the current solutions do not aggregate a holistic approach that enables both the assessment of datasets and also provides consumers with quality results that can then be used to find, compare and rank datasets’ fitness for use. In this thesis we investigate methods to assess the quality of (possibly large) linked datasets with the intent that data consumers can then use the assessment results to find datasets that are fit for use, that is; finding the right dataset for the task at hand. Moreover, the benefits of quality assessment are two-fold: (1) data consumers do not need to blindly rely on subjective measures to choose a dataset, but base their choice on multiple factors such as the intrinsic structure of the dataset, therefore fostering trust and reputation between the publishers and consumers on more objective foundations; and (2) data publishers can be encouraged to improve their datasets so that they can be re-used more. Furthermore, our approach scales for large datasets. In this regard, we also look into improving the efficiency of quality metrics using various approximation techniques. However the trade-off is that consumers will not get the exact quality value, but a very close estimate which anyway provides the required guidance towards fitness for use. The central point of this thesis is not on data quality improvement, nonetheless, we still need to understand what data quality means to the consumers who are searching for potential datasets. This thesis looks into the challenges faced to detect quality problems in linked datasets presenting quality results in a standardised machine-readable and interoperable format for which agents can make sense out of to help human consumers identifying the fitness for use dataset. Our proposed approach is more consumer-centric where it looks into (1) making the assessment of quality as easy as possible, that is, allowing stakeholders, possibly non-experts, to identify and easily define quality metrics and to initiate the assessment; and (2) making results (quality metadata and quality reports) easy for stakeholders to understand, or at least interoperable with other systems to facilitate a possible data quality pipeline. Finally, our framework is used to assess the quality of a number of heterogeneous (large) linked datasets, where each assessment returns a quality metadata graph that can be consumed by agents as Linked Data. In turn, these agents can intelligently interpret a dataset’s quality with regard to multiple dimensions and observations, and thus provide further insight to consumers regarding its fitness for use

    MOLECULAR AND ECOLOGICAL ASPECTS OF THE INTERACTIONS BETWEEN \u3ci\u3eAUREOCOCCUS ANOPHAGEFFERENS\u3c/i\u3e AND ITS GIANT VIRUS

    Get PDF
    Viruses are increasingly being recognized as an important biotic component of all ecosystems including agents that control the rapid ecological events that are harmful algal blooms (HABS). Aureococcus anophagefferens is a pelagophyte which causes recurrent ecosystem devastating brown tide blooms along the east coast of the USA and has recently spread to China and South Africa. It has been suggested that a large virus (AaV) is possibly an important agent for demise of brown tide blooms. This observation is consistent with the recognition of a number of other giant viruses modulating algal blooms in marine systems. In this dissertation, we investigated both the molecular underpinnings of Aureococcus-AaV interactions and the dynamics of AaV and the associated viral community in situ. We determined the genome sequence and phylogenetic history of AaV using high throughput sequencing approach and revealed it’s intertwined evolutionary history with the host and other organisms. Building upon the available genome of AaV and its host, we took an RNA-seq approach to provide insights on the physiological state of the AaV-infected Aureococcus ‘virocell’ that is geared towards virus production. In situ activity of AaV was detected by targeted amplicon and high throughput community RNA sequencing (metatranscriptomics) from Quantuck Bay, NY, a site with recurrent brown tide blooms. AaV and associated giant algal viruses in the Mimiviridae clade were found to respond to environmental changes, indicating that this newly recognized phylogenetic group is an important contributor to the eukaryotic phytoplankton dynamics. Analyzing time series metatranscriptomics from two distinct coastal sites recovered diverse viruses infecting microeukaryotes (including AaV) as part of interacting networks of viruses and microeukaryotes. Results from these studies testify AaV as an important factor for brown tide bloom demise, reveals the molecular underpinnings of AaV-host interactions and establishes the ecological relevance of Mimivirus-like algal viruses. We also provide foundation for using metatranscriptomics as an important tool in marine virus ecology – capable of recovering associations among coexisting marine microeukaryotes and viruses

    Leveraging content properties to optimize distributed storage systems

    Get PDF
    Les fournisseurs de services de cloud computing, les réseaux sociaux et les entreprises de gestion des données ont assisté à une augmentation considérable du volume de données qu'ils reçoivent chaque jour. Toutes ces données créent des nouvelles opportunités pour étendre la connaissance humaine dans des domaines comme la santé, l'urbanisme et le comportement humain et permettent d'améliorer les services offerts comme la recherche, la recommandation, et bien d'autres. Ce n'est pas par accident que plusieurs universitaires mais aussi les médias publics se référent à notre époque comme l'époque Big Data . Mais ces énormes opportunités ne peuvent être exploitées que grâce à de meilleurs systèmes de gestion de données. D'une part, ces derniers doivent accueillir en toute sécurité ce volume énorme de données et, d'autre part, être capable de les restituer rapidement afin que les applications puissent bénéficier de leur traite- ment. Ce document se concentre sur ces deux défis relatifs aux Big Data . Dans notre étude, nous nous concentrons sur le stockage de sauvegarde (i) comme un moyen de protéger les données contre un certain nombre de facteurs qui peuvent les rendre indisponibles et (ii) sur le placement des données sur des systèmes de stockage répartis géographiquement, afin que les temps de latence perçue par l'utilisateur soient minimisés tout en utilisant les ressources de stockage et du réseau efficacement. Tout au long de notre étude, les données sont placées au centre de nos choix de conception dont nous essayons de tirer parti des propriétés de contenu à la fois pour le placement et le stockage efficace.Cloud service providers, social networks and data-management companies are witnessing a tremendous increase in the amount of data they receive every day. All this data creates new opportunities to expand human knowledge in fields like healthcare and human behavior and improve offered services like search, recommendation, and many others. It is not by accident that many academics but also public media refer to our era as the Big Data era. But these huge opportunities come with the requirement for better data management systems that, on one hand, can safely accommodate this huge and constantly increasing volume of data and, on the other, serve them in a timely and useful manner so that applications can benefit from processing them. This document focuses on the above two challenges that come with Big Data . In more detail, we study (i) backup storage systems as a means to safeguard data against a number of factors that may render them unavailable and (ii) data placement strategies on geographically distributed storage systems, with the goal to reduce the user perceived latencies and the network and storage resources are efficiently utilized. Throughout our study, data are placed in the centre of our design choices as we try to leverage content properties for both placement and efficient storage.RENNES1-Bibl. électronique (352382106) / SudocSudocFranceF

    Classifiers and machine learning techniques for image processing and computer vision

    Get PDF
    Orientador: Siome Klein GoldensteinTese (doutorado) - Universidade Estadual de Campinas, Instituto da ComputaçãoResumo: Neste trabalho de doutorado, propomos a utilizaçãoo de classificadores e técnicas de aprendizado de maquina para extrair informações relevantes de um conjunto de dados (e.g., imagens) para solução de alguns problemas em Processamento de Imagens e Visão Computacional. Os problemas de nosso interesse são: categorização de imagens em duas ou mais classes, detecçãao de mensagens escondidas, distinção entre imagens digitalmente adulteradas e imagens naturais, autenticação, multi-classificação, entre outros. Inicialmente, apresentamos uma revisão comparativa e crítica do estado da arte em análise forense de imagens e detecção de mensagens escondidas em imagens. Nosso objetivo é mostrar as potencialidades das técnicas existentes e, mais importante, apontar suas limitações. Com esse estudo, mostramos que boa parte dos problemas nessa área apontam para dois pontos em comum: a seleção de características e as técnicas de aprendizado a serem utilizadas. Nesse estudo, também discutimos questões legais associadas a análise forense de imagens como, por exemplo, o uso de fotografias digitais por criminosos. Em seguida, introduzimos uma técnica para análise forense de imagens testada no contexto de detecção de mensagens escondidas e de classificação geral de imagens em categorias como indoors, outdoors, geradas em computador e obras de arte. Ao estudarmos esse problema de multi-classificação, surgem algumas questões: como resolver um problema multi-classe de modo a poder combinar, por exemplo, caracteríisticas de classificação de imagens baseadas em cor, textura, forma e silhueta, sem nos preocuparmos demasiadamente em como normalizar o vetor-comum de caracteristicas gerado? Como utilizar diversos classificadores diferentes, cada um, especializado e melhor configurado para um conjunto de caracteristicas ou classes em confusão? Nesse sentido, apresentamos, uma tecnica para fusão de classificadores e caracteristicas no cenário multi-classe através da combinação de classificadores binários. Nós validamos nossa abordagem numa aplicação real para classificação automática de frutas e legumes. Finalmente, nos deparamos com mais um problema interessante: como tornar a utilização de poderosos classificadores binarios no contexto multi-classe mais eficiente e eficaz? Assim, introduzimos uma tecnica para combinação de classificadores binarios (chamados classificadores base) para a resolução de problemas no contexto geral de multi-classificação.Abstract: In this work, we propose the use of classifiers and machine learning techniques to extract useful information from data sets (e.g., images) to solve important problems in Image Processing and Computer Vision. We are particularly interested in: two and multi-class image categorization, hidden messages detection, discrimination among natural and forged images, authentication, and multiclassification. To start with, we present a comparative survey of the state-of-the-art in digital image forensics as well as hidden messages detection. Our objective is to show the importance of the existing solutions and discuss their limitations. In this study, we show that most of these techniques strive to solve two common problems in Machine Learning: the feature selection and the classification techniques to be used. Furthermore, we discuss the legal and ethical aspects of image forensics analysis, such as, the use of digital images by criminals. We introduce a technique for image forensics analysis in the context of hidden messages detection and image classification in categories such as indoors, outdoors, computer generated, and art works. From this multi-class classification, we found some important questions: how to solve a multi-class problem in order to combine, for instance, several different features such as color, texture, shape, and silhouette without worrying about the pre-processing and normalization of the combined feature vector? How to take advantage of different classifiers, each one custom tailored to a specific set of classes in confusion? To cope with most of these problems, we present a feature and classifier fusion technique based on combinations of binary classifiers. We validate our solution with a real application for automatic produce classification. Finally, we address another interesting problem: how to combine powerful binary classifiers in the multi-class scenario more effectively? How to boost their efficiency? In this context, we present a solution that boosts the efficiency and effectiveness of multi-class from binary techniques.DoutoradoEngenharia de ComputaçãoDoutor em Ciência da Computaçã

    Phenotypic analysis of the Plp1 gene overexpressing mouse model #72 : implications for demyelination and remyelination failure

    Get PDF
    Duplication of the proteolipid protein (PLP1) gene, which encodes the most abundant protein of central nervous system (CNS) myelin, is the most common cause of Pelizaeus Merzbacher disease (PMD). Various animal models have been generated to study the effect of Plp1 gene overexpression on oligodendrocyte and myelin sheath integrity. The #72 line harbours 3 additional copies of the murine Plp1 gene per haploidic chromosomal set. Homozygous #72 mice appear phenotypically normal until three months of age, after which they develop seizures leading to premature death at around 4 months of age. An earlier study examining the optic nerve showed a progressive demyelination accompanied by marked microglial and astrocytic responses. Using electron microscopy and immunohistochemistry, I demonstrated that initial myelination of the #72 corpus callosum was followed by a progressive demyelination, probably mediated by a distal “dying back” phenomenon of the myelin sheath. No evidence of effective remyelination was observed despite the presence and proliferation of oligodendrocyte progenitor cells (OPCs). A marked increase in density and reactivity of microglia/macrophages and astrocytes, and the occurrence of axonal swellings, accompanied the demyelination. In situ and in vitro evaluation of adult #72 OPCs provided evidence of impaired OPC differentiation. Transplantation of neurospheres (NS) into adult #72 mouse corpus callosum confirmed that axons were capable of undergoing remyelination. Furthermore, NS transplanted into neonatal CNS integrated into the parenchyma and survived up to 120 days, demonstrating the potential of early cell replacement therapy. Taking advantage of the spatially distinct pathologies between the retinal and chiasmal region of the #72 optic nerve, I evaluated the capability of diffusion weighted MRI to identify lesion type. I found significant differences between #72 and wild type optic nerves, as well as between the two distinct pathological regions within the #72 optic nerve. These results confirm the potential of the #72 mouse to serve as a model to study chronic demyelination. The study also demonstrates the utility of the #72 mouse to evaluate cell transplant strategies for the treatment of chronic CNS white matter lesions and PMD. Additionally, DW MRI has potential as a modality capable of diagnosing myelin-related white matter changes, and may be applicable to the clinical setting

    Visual attention and swarm cognition for off-road robots

    Get PDF
    Tese de doutoramento, Informática (Engenharia Informática), Universidade de Lisboa, Faculdade de Ciências, 2011Esta tese aborda o problema da modelação de atenção visual no contexto de robôs autónomos todo-o-terreno. O objectivo de utilizar mecanismos de atenção visual é o de focar a percepção nos aspectos do ambiente mais relevantes à tarefa do robô. Esta tese mostra que, na detecção de obstáculos e de trilhos, esta capacidade promove robustez e parcimónia computacional. Estas são características chave para a rapidez e eficiência dos robôs todo-o-terreno. Um dos maiores desafios na modelação de atenção visual advém da necessidade de gerir o compromisso velocidade-precisão na presença de variações de contexto ou de tarefa. Esta tese mostra que este compromisso é resolvido se o processo de atenção visual for modelado como um processo auto-organizado, cuja operação é modulada pelo módulo de selecção de acção, responsável pelo controlo do robô. Ao fechar a malha entre o processo de selecção de acção e o de percepção, o último é capaz de operar apenas onde é necessário, antecipando as acções do robô. Para fornecer atenção visual com propriedades auto-organizadas, este trabalho obtém inspiração da Natureza. Concretamente, os mecanismos responsáveis pela capacidade que as formigas guerreiras têm de procurar alimento de forma auto-organizada, são usados como metáfora na resolução da tarefa de procurar, também de forma auto-organizada, obstáculos e trilhos no campo visual do robô. A solução proposta nesta tese é a de colocar vários focos de atenção encoberta a operar como um enxame, através de interacções baseadas em feromona. Este trabalho representa a primeira realização corporizada de cognição de enxame. Este é um novo campo de investigação que procura descobrir os princípios básicos da cognição, inspeccionando as propriedades auto-organizadas da inteligência colectiva exibida pelos insectos sociais. Logo, esta tese contribui para a robótica como disciplina de engenharia e para a robótica como disciplina de modelação, capaz de suportar o estudo do comportamento adaptável.Esta tese aborda o problema da modelação de atenção visual no contexto de robôs autónomos todo-o-terreno. O objectivo de utilizar mecanismos de atenção visual é o de focar a percepção nos aspectos do ambiente mais relevantes à tarefa do robô. Esta tese mostra que, na detecção de obstáculos e de trilhos, esta capacidade promove robustez e parcimónia computacional. Estas são características chave para a rapidez e eficiência dos robôs todo-o-terreno. Um dos maiores desafios na modelação de atenção visual advém da necessidade de gerir o compromisso velocidade-precisão na presença de variações de contexto ou de tarefa. Esta tese mostra que este compromisso é resolvido se o processo de atenção visual for modelado como um processo auto-organizado, cuja operação é modulada pelo módulo de selecção de acção, responsável pelo controlo do robô. Ao fechar a malha entre o processo de selecção de acção e o de percepção, o último é capaz de operar apenas onde é necessário, antecipando as acções do robô. Para fornecer atenção visual com propriedades auto-organizadas, este trabalho obtém inspi- ração da Natureza. Concretamente, os mecanismos responsáveis pela capacidade que as formi- gas guerreiras têm de procurar alimento de forma auto-organizada, são usados como metáfora na resolução da tarefa de procurar, também de forma auto-organizada, obstáculos e trilhos no campo visual do robô. A solução proposta nesta tese é a de colocar vários focos de atenção encoberta a operar como um enxame, através de interacções baseadas em feromona. Este trabalho representa a primeira realização corporizada de cognição de enxame. Este é um novo campo de investigação que procura descobrir os princípios básicos da cognição, ins- peccionando as propriedades auto-organizadas da inteligência colectiva exibida pelos insectos sociais. Logo, esta tese contribui para a robótica como disciplina de engenharia e para a robótica como disciplina de modelação, capaz de suportar o estudo do comportamento adaptável.Fundação para a Ciência e a Tecnologia (FCT,SFRH/BD/27305/2006); Laboratory of Agent Modelling (LabMag

    Prediction of Host-Microbe Interactions from Community High-Throughput Sequencing Data

    Get PDF
    Microbial ecology is a diverse field, with a broad range of taxa, habitats, and trophic structures studied. Many of the major areas of research were developed independently, each with their own unique methods and standards, and their own questions and focus. This has changed in recent decades with the widespread implementation of culture-independent techniques, which exploit mechanisms shared by all life, regardless of habitat. In particular, high-throughput sequencing of environmentally isolated DNA and RNA has done much to expand our knowledge of the planet’s microbial diversity and has allowed us to explore the complex interplay between community members. Additionally, metatranscriptomic data can be used to parse relationships between individual members of the community, allowing researchers to propose hypotheses that can be tested in a laboratory or field setting. However, use of this technology is still relatively young, and there is a considerable need for broader consideration of its pitfalls, as well as the development of novel approaches that allow those without a computational background or with fewer resources to navigate its challenges and reap its rewards. To address these needs, we have developed targeted computational approaches that simplify next-generation sequencing datasets to a more manageable size, and we have used these techniques to address specific questions in environmental ecosystems. In a dataset sequenced for the purpose of identifying ecological factors that drive Microcystis aeruginosa to dominate cyanobacterial harmful algal blooms worldwide, we used a targeted approach to predict replication and lysogenic dormancy in bacteriophage. We used RNA-seq data to characterize viral diversity in the Sphagnum peat bog microbiome, identifying a wealth of novel viruses and proposing several host-virus pairs. We were able to assemble and describe the genome of a freshwater giant virus as well as that of a virophage that may infect it, and we used our techniques to describe its activity in publicly available datasets. Lastly, we have extended our efforts into the realm of medicine where we showed the influence exerted by the mouse gut microbiome on the host immune response to malaria, identifying several genes that may play a key role in reducing disease severity
    corecore