5,554 research outputs found

    TIARA: Trust Management, Intrusion-tolerance, Accountability, and Reconstitution Architecture

    The last 20 years have led to unprecedented improvements in chip density and system performance, fueled mainly by Moore's Law. During the same time, system and application software have bloated, leading to unmanageable complexity, vulnerability to attack, rigidity, and a lack of robustness and accountability. These problems arise from the fact that all key elements of the computational environment, from hardware through system software and middleware to application code, regard the world as consisting of unconstrained "raw seething bits". No element of the entire stack is responsible for enforcing over-arching conventions of memory structuring or access control. Outsiders may easily penetrate the system by exploiting vulnerabilities (e.g. buffer overflows) arising from this lack of basic constraints. Attacks are not easily contained, whether they originate from the clever outsider who penetrates the defenses or from the insider who exploits existing privileges. Finally, because there are no facilities for tracing the provenance of data, even when an attack is detected it is difficult if not impossible to tell which data are traceable to the attack and which data may still be trusted.
 We have abundant computational resources that allow us to fix these critical problems using a combination of hardware, system software, and programming language technology. In this report, we describe the TIARA project, which is using these resources to design a new computer system that is less vulnerable, more tolerant of intrusions, capable of recovering from attacks, and accountable for its actions. TIARA provides these capabilities without significant impact on overall system performance. It achieves these goals through the judicious use of a modest amount of extra, but reasonably general-purpose, hardware that is dedicated to tracking the provenance of data at a very fine-grained level, to enforcing access control policies, and to constructing a coherent object-oriented model of memory. This hardware runs in parallel with the main data paths of the system and operates on a set of extra bits tagging each word with data-type, bounds, access control, and provenance information. Operations that violate the intended invariants are trapped, while normal results are tagged with information derived from the tags of the input operands.
 This hardware level provides fine-grained support for a series of software layers that enable a variety of comprehensive access control policies, self-adaptive computing, and fine-grained recovery processing. The first of these software layers establishes a consistent object-oriented level of computing, while higher layers establish wrappers that may not be bypassed, access controls, and data provenance tracking. At the highest level we create the "plan level" of computing, in which code is executed in parallel with an abstract model (or executable specification) of the system that checks whether the code behaves as intended.
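    A minimal Python sketch of the word-tagging idea described above, using made-up names (TaggedWord, tagged_add, tagged_load) and a purely software model; the actual TIARA mechanism is hardware running alongside the main data path. Operations that violate a type or bounds invariant trap, and normal results carry provenance derived from their operands.

        # Illustrative software model of TIARA-style tagged words (hypothetical names).
        from dataclasses import dataclass, field


        class TagTrap(Exception):
            """Raised when an operation violates a tag-enforced invariant."""


        @dataclass(frozen=True)
        class TaggedWord:
            value: int
            dtype: str                                  # data-type tag, e.g. "int" or "pointer"
            provenance: frozenset = field(default_factory=frozenset)


        def tagged_add(a: TaggedWord, b: TaggedWord) -> TaggedWord:
            # Trap instead of silently combining raw bits of incompatible types.
            if a.dtype != "int" or b.dtype != "int":
                raise TagTrap(f"cannot add {a.dtype} and {b.dtype}")
            # The result's provenance is the union of the operands' provenance.
            return TaggedWord(a.value + b.value, "int", a.provenance | b.provenance)


        def tagged_load(words, index: TaggedWord, lower: int, upper: int) -> TaggedWord:
            # A bounds check turns a would-be buffer overflow into a trap.
            if not (lower <= index.value < upper):
                raise TagTrap(f"index {index.value} outside [{lower}, {upper})")
            return words[index.value]


        if __name__ == "__main__":
            salary = TaggedWord(1000, "int", frozenset({"hr-db"}))
            bonus = TaggedWord(200, "int", frozenset({"payroll-app"}))
            total = tagged_add(salary, bonus)
            print(total.value, sorted(total.provenance))    # 1200 ['hr-db', 'payroll-app']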

    A provenance metadata model integrating ISO geospatial lineage and the OGC WPS: conceptual model and implementation

    Nowadays, there are still gaps in the description of provenance metadata. These gaps prevent the capture of comprehensive provenance that is useful for reuse and reproducibility. In addition, the lack of automated tools for capturing provenance hinders the broad generation and compilation of provenance information. This work presents a provenance engine (PE) that captures and represents provenance information using a combination of the Web Processing Service (WPS) standard and the ISO 19115 geospatial lineage model. The PE, developed within the MiraMon GIS & RS software, automatically records detailed information about sources and processes. The PE also includes a metadata editor that shows a graphical representation of the provenance and allows users to complement provenance information by adding missing processes or deleting redundant process steps or sources, thus building a consistent geospatial workflow. One use case is presented to demonstrate the usefulness and effectiveness of the PE: the generation of a radiometric pseudo-invariant areas bench for the Iberian Peninsula. This remote-sensing use case shows how provenance can be captured automatically, even in a complex, non-sequential flow, and how essential it is to automation and replication tasks when working with very large amounts of geospatial data.
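    As a rough sketch of what automatically recorded lineage can look like, the snippet below models sources and process steps loosely in the spirit of the ISO 19115 lineage elements (LI_Lineage, LI_ProcessStep, LI_Source). It is not the MiraMon provenance engine, it does not speak WPS, and the class names and example workflow are invented for illustration.

        # Loose, illustrative analogue of ISO 19115 lineage recording (not the MiraMon PE).
        from dataclasses import dataclass, field
        from datetime import datetime, timezone
        from typing import List


        @dataclass
        class Source:
            description: str                    # identifier of an input dataset or layer


        @dataclass
        class ProcessStep:
            description: str                    # what the process did
            date_time: str
            sources: List[Source] = field(default_factory=list)


        @dataclass
        class Lineage:
            statement: str
            process_steps: List[ProcessStep] = field(default_factory=list)

            def add_step(self, description: str, sources: List[Source]) -> None:
                # Each executed process is appended with a timestamp and its inputs.
                step = ProcessStep(description, datetime.now(timezone.utc).isoformat(), sources)
                self.process_steps.append(step)


        if __name__ == "__main__":
            lineage = Lineage("Pseudo-invariant areas workflow (illustrative)")
            scene = Source("input reflectance layer (hypothetical)")
            lineage.add_step("radiometric correction (hypothetical process)", [scene])
            lineage.add_step("thresholding of corrected reflectance", [scene])
            for step in lineage.process_steps:
                print(step.date_time, "-", step.description)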

    Reproducibility and reuse of experiments in eScience: workflows, ontologies and scripts

    Advisors: Claudia Maria Bauzer Medeiros, Yolanda Gil. Doctoral thesis - Universidade Estadual de Campinas, Instituto de Computação.
 Abstract: Scripts and Scientific Workflow Management Systems (SWfMSs) are common approaches for automating the execution flow of processes and data analysis in scientific (computational) experiments. Although widely used in many disciplines, scripts are hard to understand, adapt, reuse, and reproduce. For this reason, several solutions have been proposed to aid experiment reproducibility in script-based environments. However, they neither allow the experiment to be fully documented nor help when third parties want to reuse just part of the code. SWfMSs, on the other hand, help documentation and reuse by supporting scientists in the design and execution of their experiments, which are specified and run as interconnected (reusable) workflow components (a.k.a. building blocks). While workflows are better than scripts for understandability and reuse, they still require additional documentation. During experiment design, scientists frequently create workflow variants, e.g., by changing workflow components. Reuse and reproducibility require understanding and tracking variant provenance, a time-consuming task. This thesis aims to support the reproducibility and reuse of computational experiments. To meet these challenges, we address two research problems: (1) understanding a computational experiment, and (2) extending a computational experiment. Our work towards solving these problems led us to choose workflows and ontologies as the answer to both. The main contributions of this thesis are thus: (i) to present the requirements for converting script-based experiments into reproducible research; (ii) to propose a methodology that guides scientists through the process of converting script-based experiments into reproducible workflow research objects; (iii) to design and implement features for the quality assessment of computational experiments; (iv) to design and implement W2Share, a framework supporting the conversion methodology, which exploits tools and standards developed by the scientific community to promote reuse and reproducibility; (v) to design and implement OntoSoft-VFF, a framework for capturing information about software and workflow components that helps scientists manage workflow exploration and evolution. Our work is showcased via use cases in Molecular Dynamics, Bioinformatics and Weather Forecasting.
 Doctorate in Computer Science. Grants: 2013/08293-7, 2014/23861-4, 2017/03570-3, FAPES
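    As a toy illustration of the script-to-workflow idea discussed in the thesis (not W2Share or OntoSoft-VFF themselves), the sketch below breaks a monolithic script into named components with explicit inputs and outputs and records which component produced each value; this kind of explicit structure and provenance is what makes parts of an experiment understandable and reusable. All component names are hypothetical.

        # Toy script-to-workflow decomposition with per-value provenance (hypothetical names).
        from typing import Callable, Dict, List, Tuple

        Step = Tuple[str, Callable[[Dict], Dict]]   # (component name, function over a shared context)


        def run_workflow(steps: List[Step], context: Dict) -> Dict:
            """Run components in order, recording which component produced each output key."""
            provenance: Dict[str, str] = {}
            for name, func in steps:
                for key, value in func(context).items():
                    context[key] = value
                    provenance[key] = name          # track the producing component
            context["_provenance"] = provenance
            return context


        # Stand-ins for stages that would otherwise be tangled inside one script.
        def load_data(ctx: Dict) -> Dict:
            return {"records": [1.0, 2.0, 3.0, 4.0]}


        def compute_mean(ctx: Dict) -> Dict:
            records = ctx["records"]
            return {"mean": sum(records) / len(records)}


        if __name__ == "__main__":
            result = run_workflow([("load_data", load_data), ("compute_mean", compute_mean)], {})
            print(result["mean"], result["_provenance"])    # 2.5 {'records': 'load_data', 'mean': 'compute_mean'}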

    Towards Building a Knowledge Base of Monetary Transactions from a News Collection

    We address the problem of extracting structured representations of economic events from a large corpus of news articles, using a combination of natural language processing and machine learning techniques. The developed techniques allow for the semi-automatic population of a financial knowledge base, which, in turn, may be used to support a range of data mining and exploration tasks. The key challenge we face in this domain is that the same event is often reported multiple times, with varying correctness of details. We address this challenge by first collecting all information pertinent to a given event from the entire corpus, then considering all possible representations of the event, and finally using a supervised learning method to rank these representations by their associated confidence scores. A main innovative element of our approach is that it jointly extracts and stores all attributes of the event as a single representation (quintuple). Using a purpose-built test set, we demonstrate that our supervised learning approach achieves a 25% improvement in F1-score over baseline methods that consider the earliest, the latest, or the most frequent reporting of the event.
 Comment: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '17), 2017
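    The candidate-generation and ranking step can be sketched as follows. The paper trains a supervised ranker over candidate event representations; in this illustration a simple agreement count across reports stands in for the learned confidence score, and the reports, attribute names and values are all invented.

        # Illustrative candidate quintuples ranked by a stand-in confidence score.
        from collections import Counter
        from itertools import product

        # Several reports of the same (fictional) transaction, with conflicting details.
        reports = [
            {"buyer": "AcmeCo", "seller": "Globex", "asset": "plant", "amount": 100, "currency": "USD"},
            {"buyer": "AcmeCo", "seller": "Globex", "asset": "plant", "amount": 110, "currency": "USD"},
            {"buyer": "Acme",   "seller": "Globex", "asset": "plant", "amount": 100, "currency": "USD"},
        ]

        attributes = ["buyer", "seller", "asset", "amount", "currency"]
        values = {a: sorted({r[a] for r in reports}, key=str) for a in attributes}
        counts = {a: Counter(r[a] for r in reports) for a in attributes}

        # Enumerate every possible quintuple over the observed attribute values...
        candidates = [dict(zip(attributes, combo))
                      for combo in product(*(values[a] for a in attributes))]

        # ...then rank the candidates; a trained model would supply this score instead.
        ranked = sorted(candidates,
                        key=lambda c: sum(counts[a][c[a]] for a in attributes),
                        reverse=True)

        print(ranked[0])    # the quintuple whose attribute values have the most support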

    Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires

    The Data Documentation Initiative-Lifecycle (DDI-L) standard introduced a robust metadata model to support the capture of questionnaire content and flow and, through its support for versioning and provenance objects such as BasedOn, encouraged the reuse of existing question items. However, the dearth of questionnaire banks that include both question text and response domains has limited the ecosystem supporting the development of DDI-ready Computer Assisted Interviewing (CAI) tools. Archives hold this information in the PDFs associated with surveys, but extracting it efficiently into DDI-Lifecycle is a significant challenge.
 While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift.
 This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires in PDF form. Using CLOSER Discovery as a ‘training and test dataset’, a number of machine learning approaches have been explored to classify text parsed from questionnaires so that it can be output as valid DDI items for inclusion in a DDI-L compliant repository.
 The developed ML pipeline adopts a continuous build-and-integrate approach, with processes in place to track the various combinations of structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments. Tangible outputs include a map of the various metadata and model parameters to the corresponding evaluation metric values, which enables model tuning as well as transparent management of data and experiments.
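    The classification step can be approximated with a standard text-classification pipeline. The sketch below uses scikit-learn on invented example lines and labels; it is not the CLOSER Discovery pipeline, and the parameter/metric record printed at the end only gestures at the kind of experiment tracking the abstract describes.

        # Toy question/non-question classifier over parsed questionnaire lines.
        from sklearn.pipeline import Pipeline
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import f1_score

        # Invented lines parsed from a questionnaire PDF: 1 = question item, 0 = other text.
        lines = [
            "How often do you exercise each week?",
            "Please turn to the next page.",
            "What is your current employment status?",
            "Interviewer: record the respondent's answer below.",
            "Do you currently smoke cigarettes?",
            "Section B continues overleaf.",
        ]
        labels = [1, 0, 1, 0, 1, 0]

        params = {"tfidf__ngram_range": (1, 2), "clf__C": 1.0}
        pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
        pipeline.set_params(**params)
        pipeline.fit(lines, labels)

        # Record the parameter combination alongside its metric, echoing the idea of
        # tracking model/parameter combinations against evaluation metrics.
        score = float(f1_score(labels, pipeline.predict(lines)))
        print({"params": params, "train_f1": round(score, 3)})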
