5,554 research outputs found
TIARA: Trust Management, Intrusion-tolerance, Accountability, and Reconstitution Architecture
The last 20 years have led to unprecedented improvements in chipdensity and system performance fueled mainly by Moore's Law. Duringthe same time, system and application software have bloated, leadingto unmanageable complexity, vulnerability to attack, rigidity and lackof robustness and accountability. These problems arise from the factthat all key elements of the computational environment, from hardwarethrough system software and middleware to application code regard theworld as consisting of unconstrained ``raw seething bits''. No elementof the entire stack is responsible for enforcing over-archingconventions of memory structuring or access control. Outsiders mayeasily penetrate the system by exploiting vulnerabilities (e.g. bufferoverflows) arising from this lack of basic constraints. Attacks arenot easily contained, whether they originate from the clever outsiderwho penetrates the defenses or from the insider who exploits existingprivileges. Finally, because there are no facilities for tracing theprovenance of data, even when an attack is detected, it is difficultif not impossible to tell which data are traceable to the attack andwhat data may still be trusted. We have abundant computational resources allowing us to fix thesecritical problems using a combination of hardware, system software,and programming language technology: In this report, we describe theTIARAproject, which is using these resources to design a newcomputer system thatis less vulnerable, more tolerant of intrusions, capable of recoveryfrom attacks, and accountable for their actions. TIARA provides thesecapabilities without significant impact on overall system performance. Itachieves these goals through the judicious use of a modest amountof extra, but reasonably generable purpose, hardware that is dedicatedto tracking the provenance of data at a very fine grained level, toenforcing access control policies, and to constructing a coherentobject-oriented model of memory. This hardware runs in parallel withthe main data-paths of the system and operates on a set of extra bitstagging each word with data-type, bounds, access control andprovenance information. Operations that violate the intendedinvariants are trapped, while normal results are tagged withinformation derived from the tags of the input operands.This hardware level provides fine-grained support for a series ofsoftware layers that enable a variety of comprehensive access controlpolicies, self-adaptive computing, and fine-grained recoveryprocessing. The first of these software layers establishes aconsistent object-oriented level of computing while higher layersestablish wrappers that may not be bypassed, access controls, dataprovenance tracking. At the highest level we create the ``planlevel'' of computing in which code is executed in parallel with anabstract model (or executable specification) of the system that checkswhether the code behaves as intended
A provenance metadata model integrating ISO geospatial lineage and the OGC WPS : conceptual model and implementation
Nowadays, there are still some gaps in the description of provenance metadata. These gaps prevent the capture of comprehensive provenance, useful for reuse and reproducibility. In addition, the lack of automated tools for capturing provenance hinders the broad generation and compilation of provenance information. This work presents a provenance engine (PE) that captures and represents provenance information using a combination of the Web Processing Service (WPS) standard and the ISO 19115 geospatial lineage model. The PE, developed within the MiraMon GIS & RS software, automatically records detailed information about sources and processes. The PE also includes a metadata editor that shows a graphical representation of the provenance and allows users to complement provenance information by adding missing processes or deleting redundant process steps or sources, thus building a consistent geospatial workflow. One use case is presented to demonstrate the usefulness and effectiveness of the PE: the generation of a radiometric pseudo-invariant areas bench for the Iberian Peninsula. This remote-sensing use case shows how provenance can be automatically captured, also in a non-sequential complex flow, and its essential role in the automation and replication tasks in work with very large amounts of geospatial data
Reprodutibilidade e reuso de experimentos em eScience : workflows, ontologias e scripts
Orientadores: Claudia Maria Bauzer Medeiros, Yolanda GilTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Scripts e Sistemas Gerenciadores de Workflows CientĂficos (SGWfC) sĂŁo abordagens comumente utilizadas para automatizar o fluxo de processos e anĂĄlise de dados em experimentos cientĂficos computacionais. Apesar de amplamente usados em diversas disciplinas, scripts sĂŁo difĂceis de entender, adaptar, reusar e reproduzir. Por esta razĂŁo, diversas soluçÔes tĂȘm sido propostas para auxiliar na reprodutibilidade de experimentos que utilizam ambientes baseados em scripts. PorĂ©m, estas soluçÔes nĂŁo permitem a documentação completa do experimento, nem ajudam quando outros cientistas querem reusar apenas parte do cĂłdigo do script. SGWfCs, por outro lado, ajudam na documentação e reuso atravĂ©s do suporte aos cientistas durante a modelagem e execução dos seus experimentos, que sĂŁo especificados e executados como componentes interconectados (reutilizĂĄveis) de workflows. Enquanto workflows sĂŁo melhores que scripts para entendimento e reuso dos experimentos, eles tambĂ©m exigem documentação adicional. Durante a modelagem de um experimento, cientistas frequentemente criam variantes de workflows, e.g., mudando componentes do workflow. Reuso e reprodutibilidade exigem o entendimento e rastreamento da proveniĂȘncia das variantes, uma tarefa que consome muito tempo. Esta tese tem como objetivo auxiliar na reprodutibilidade e reuso de experimentos computacionais. Para superar estes desafios, nĂłs lidamos com dois problemas de pesquisas: (1) entendimento de um experimento computacional, e (2) extensĂŁo de um experimento computacional. Nosso trabalho para resolver estes problemas nos direcionou na escolha de workflows e ontologias como respostas para ambos os problemas. As principais contribuiçÔes desta tese sĂŁo: (i) apresentar os requisitos para a conversĂŁo de experimentos baseados em scripts em experimentos reprodutĂveis; (ii) propor uma metodologia que guia o cientista durante o processo de conversĂŁo de experimentos baseados em scripts em workflow research objects reprodutĂveis. (iii) projetar e implementar funcionalidades para avaliação da qualidade de experimentos computacionais; (iv) projetar e implementar o W2Share, um arcabouço para auxiliar a metodologia de conversĂŁo, que explora ferramentas e padrĂ”es que foram desenvolvidos pela comunidade cientĂfica para promover o reuso e reprodutibilidade; (v) projetar e implementar o OntoSoft-VFF, um arcabouço para captura de informação sobre software e componentes de workflow para auxiliar cientistas a gerenciarem a exploração e evolução de workflows. Nosso trabalho Ă© apresentado via casos de uso em DinĂąmica Molecular, BioinformĂĄtica e PrevisĂŁo do TempoAbstract: Scripts and Scientific Workflow Management Systems (SWfMSs) are common approaches that have been used to automate the execution flow of processes and data analysis in scientific (computational) experiments. Although widely used in many disciplines, scripts are hard to understand, adapt, reuse, and reproduce. For this reason, several solutions have been proposed to aid experiment reproducibility for script-based environments. However, they neither allow to fully document the experiment nor do they help when third parties want to reuse just part of the code. SWfMSs, on the other hand, help documentation and reuse by supporting scientists in the design and execution of their experiments, which are specified and run as interconnected (reusable) workflow components (a.k.a. building blocks). While workflows are better than scripts for understandability and reuse, they still require additional documentation. During experiment design, scientists frequently create workflow variants, e.g., by changing workflow components. Reuse and reproducibility require understanding and tracking variant provenance, a time-consuming task. This thesis aims to support reproducibility and reuse of computational experiments. To meet these challenges, we address two research problems: (1) understanding a computational experiment, and (2) extending a computational experiment. Our work towards solving these problems led us to choose workflows and ontologies to answer both problems. The main contributions of this thesis are thus: (i) to present the requirements for the conversion of script to reproducible research; (ii) to propose a methodology that guides the scientists through the process of conversion of script-based experiments into reproducible workflow research objects; (iii) to design and implement features for quality assessment of computational experiments; (iv) to design and implement W2Share, a framework to support the conversion methodology, which exploits tools and standards that have been developed by the scientific community to promote reuse and reproducibility; (v) to design and implement OntoSoft-VFF, a framework for capturing information about software and workflow components to support scientists manage workflow exploration and evolution. Our work is showcased via use cases in Molecular Dynamics, Bioinformatics and Weather ForecastingDoutoradoCiĂȘncia da ComputaçãoDoutor em CiĂȘncia da Computação2013/08293-7, 2014/23861-4, 2017/03570-3FAPES
Towards Building a Knowledge Base of Monetary Transactions from a News Collection
We address the problem of extracting structured representations of economic
events from a large corpus of news articles, using a combination of natural
language processing and machine learning techniques. The developed techniques
allow for semi-automatic population of a financial knowledge base, which, in
turn, may be used to support a range of data mining and exploration tasks. The
key challenge we face in this domain is that the same event is often reported
multiple times, with varying correctness of details. We address this challenge
by first collecting all information pertinent to a given event from the entire
corpus, then considering all possible representations of the event, and
finally, using a supervised learning method, to rank these representations by
the associated confidence scores. A main innovative element of our approach is
that it jointly extracts and stores all attributes of the event as a single
representation (quintuple). Using a purpose-built test set we demonstrate that
our supervised learning approach can achieve 25% improvement in F1-score over
baseline methods that consider the earliest, the latest or the most frequent
reporting of the event.Comment: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital
Libraries (JCDL '17), 201
Recommended from our members
The National Transport Data Framework
Report by Professor Peter Landshoff (Cambridge University) and
Professor John Polak (Imperial College London) on a project for
the Department for Transport.
emails: [email protected] [email protected] NTDF is designed to be a resource for data owners to deposit descriptions
into a central catalogue, so that people can search for data and find data
and understand their characteristics. The value of this is to individuals, to
commercial organizations, and to public bodies. For example, services that
provide better information to travellers will help to make their journey
less stressful and persuade them to make more use of public transport.
Transport operators need very diverse information to help them
plan developments to their services: demographic, geographical, economic etc.
And policy makers need a similar range of information to help them decide
how to divide their budget and afterwards to evaluate how valuable it has
been.This work was supported by the Department for Transport (DfT)
Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires
Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow, and encouraged through support for versioning and provenancing, objects such as BasedOn for the reuse of existing question items. However, the dearth of questionnaire banks including both question text and response domains has meant that an ecosystem to support the development of DDI ready Computer Assisted Interviewing (CAI) tools has been limited. Archives hold the information in PDFs associated with surveys but extracting that in an efficient manner into DDI-Lifecycle is a significant challenge.
While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift.
This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires as PDFs. Using CLOSER Discovery as a âtraining and test datasetâ, a number of machine learning approaches have been explored to classify parsed text from questionnaires to be output as valid DDI items for inclusion in a DDI-L compliant repository.
The developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments. Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metricsâ values, which enable model tuning as well as transparent management of data and experiments
Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires
Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow, and encouraged through support for versioning and provenancing, objects such as BasedOn for the reuse of existing question items. However, the dearth of questionnaire banks including both question text and response domains has meant that an ecosystem to support the development of DDI ready Computer Assisted Interviewing (CAI) tools has been limited. Archives hold the information in PDFs associated with surveys but extracting that in an efficient manner into DDI-Lifecycle is a significant challenge.
While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift.
This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires as PDFs. Using CLOSER Discovery as a âtraining and test datasetâ, a number of machine learning approaches have been explored to classify parsed text from questionnaires to be output as valid DDI items for inclusion in a DDI-L compliant repository.
The developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments. Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metricsâ values, which enable model tuning as well as transparent management of data and experiments
- âŠ