Machine Learning Algorithm for the Scansion of Old Saxon Poetry
Several scholars have designed tools to perform automatic scansion of poetry in many languages, but none of these tools handle Old Saxon or Old English. This project is a first attempt to create such a tool for these languages. We implemented a Bidirectional Long Short-Term Memory (BiLSTM) model to perform automatic scansion of Old Saxon and Old English poems. Since this model uses supervised learning, we manually annotated the Heliand manuscript and used the resulting corpus as a labeled dataset to train the model. The evaluation of the algorithm reached 97% accuracy and a 99% weighted average for precision, recall and F1 score. In addition, we tested the model on verses from the Old Saxon Genesis and from The Battle of Brunanburh, and observed that it predicted almost all Old Saxon metrical patterns correctly but misclassified the majority of the Old English input verses.
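To make the modelling choice concrete, here is a minimal sketch of a BiLSTM sequence labeller of the kind the abstract describes, treating scansion as per-token metrical-label prediction. The names, dimensions and label count are illustrative assumptions, not the authors' implementation.

```python
# Minimal BiLSTM sequence labeller for scansion-style tagging
# (illustrative sketch, not the authors' code).
import torch
import torch.nn as nn

class BiLSTMScanner(nn.Module):
    def __init__(self, vocab_size, n_labels, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)   # one label per token

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))         # (batch, seq_len, 2*hidden)
        return self.out(h)                           # per-token label logits

# Toy usage: 3 verses of 10 syllable tokens, 4 hypothetical metrical labels.
model = BiLSTMScanner(vocab_size=1000, n_labels=4)
logits = model(torch.randint(1, 1000, (3, 10)))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 4),
                             torch.randint(0, 4, (3, 10)).reshape(-1))
loss.backward()
```

In such a setup each input token receives one metrical label, and the bidirectional pass lets the prediction for a syllable depend on both the preceding and following context of the verse.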
Evaluating FAIR Digital Object and Linked Data as distributed object systems
FAIR Digital Object (FDO) is an emerging concept, highlighted by the European Open Science Cloud (EOSC) as a potential candidate for building an ecosystem of machine-actionable research outputs. In this work we systematically evaluate FDO and its implementations as a global distributed object system, using five different conceptual frameworks that cover interoperability, middleware, FAIR principles, EOSC requirements and the FDO guidelines themselves. We compare the FDO approach with established Linked Data practices and the existing Web architecture, and provide a brief history of the Semantic Web while discussing why these technologies may have been difficult to adopt for FDO purposes. We conclude with recommendations for both the Linked Data and FDO
communities to further their adaptation and alignment.
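As a concrete point of comparison between the two approaches, the existing Web architecture already lets a client ask a persistent identifier for machine-actionable metadata via HTTP content negotiation. The sketch below shows this for a DOI; the identifier is hypothetical, and the call assumes network access and a resolver that honours the Accept header.

```python
# Retrieving machine-actionable metadata for a research output through
# plain HTTP content negotiation (sketch; the DOI is illustrative).
import urllib.request

req = urllib.request.Request(
    "https://doi.org/10.5281/zenodo.1234567",     # hypothetical PID
    headers={"Accept": "application/ld+json"},    # ask for JSON-LD metadata
)
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Content-Type"))       # negotiated representation
    print(resp.read()[:200])                      # first bytes of the record
```

An FDO implementation instead typically resolves the identifier to a separate PID record describing the object and its operations, which is one of the architectural differences the evaluation examines.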
Enriching information extraction pipelines in clinical decision support systems
Multicentre health studies are important to increase the impact of medical research
findings due to the number of subjects that they are able to engage. To simplify the execution of these studies, the data-sharing process should be effortless, for instance, through the use of interoperable databases. However, achieving this interoperability is still an ongoing research topic, notably due to data governance and privacy issues. In the first stage of this work, we propose several methodologies to optimise the harmonisation pipelines of health databases. This work focused on harmonising heterogeneous data sources into a standard data schema, namely the OMOP CDM, which has been developed and promoted by the OHDSI community. We validated our proposal using datasets of Alzheimer's disease patients from distinct institutions. In the following stage, aiming to enrich the information stored in OMOP CDM databases, we investigated solutions to extract clinical concepts from unstructured narratives, using information retrieval and natural language processing techniques. The validation was performed on datasets provided in scientific challenges, namely the National NLP Clinical Challenges (n2c2). In the final stage, we aimed to simplify the protocol execution of multicentre studies by proposing novel solutions for profiling, publishing and facilitating the discovery of databases. Some of the developed solutions are currently being used in three European projects aiming to create federated networks of health databases across Europe.
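As a small illustration of the harmonisation step, the sketch below maps a hypothetical source diagnosis record onto a few columns of the OMOP CDM condition_occurrence table. The source layout, the helper name to_condition_occurrence and the single-entry concept mapping are assumptions for illustration; a real pipeline would use the full OHDSI standardised vocabularies.

```python
# Harmonising a heterogeneous source record into (part of) an OMOP CDM
# condition_occurrence row. Field names follow the CDM; the source format
# and the one-entry concept map are illustrative assumptions.
from datetime import date

# Hypothetical vocabulary lookup: source ICD-10 code -> standard concept_id.
CONCEPT_MAP = {"G30.9": 378419}   # Alzheimer's disease (example mapping)

def to_condition_occurrence(source_row, person_id):
    return {
        "person_id": person_id,
        "condition_concept_id": CONCEPT_MAP.get(source_row["dx_code"], 0),
        "condition_start_date": date.fromisoformat(source_row["dx_date"]),
        "condition_source_value": source_row["dx_code"],
    }

print(to_condition_occurrence({"dx_code": "G30.9", "dx_date": "2015-03-02"}, 42))
```

Once every institution exposes its data in this shared schema, the same analysis code and study protocol can run unchanged at each site, which is what makes the federated multicentre setting tractable.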
Supporting complex workflows for data-intensive discovery reliably and efficiently
Scientific workflows have emerged as well-established pillars of large-scale computational science, serving to formalize and structure massive amounts of complex, heterogeneous data and to accelerate scientific progress. Scientists of diverse domains can analyze their data by constructing scientific workflows, a useful paradigm for managing complex scientific computations. A workflow can analyze terabyte-scale datasets, contain numerous individual tasks, and coordinate heterogeneous tasks with the help of scientific workflow management systems (SWfMSs). However, even for expert users, workflow creation is a complex task due to the dramatic growth of tools and data heterogeneity. Scientists are now more willing to publicly share scientific datasets and analysis pipelines in the interest of open science, and as sharing of research data and resources increases in scientific communities, scientists can reuse existing workflows shared in several workflow repositories. Unfortunately, several challenges can prevent scientists from reusing those workflows, which undermines the purpose of a community-oriented knowledge base.

In this thesis, we first identify the repositories that scientists use to share and reuse scientific workflows. Among them, we find that Galaxy repositories host numerous workflows and that Galaxy is the most widely used SWfMS. After selecting the Galaxy repositories, we attempt to explore the workflows and encounter several challenges in reusing them. We classify each workflow's reusability status (reusable/non-reusable) and, based on the effort level, further categorize the reusable workflows (reusable without modification, easily reusable, moderately difficult to reuse, and difficult to reuse). Upon failure, we record the challenges that prevented reusability; upon success, we list the actions taken. The challenges preventing reusability include tool upgrading, unavailable tool support, design flaws, incomplete workflows, failure to load a workflow, etc. The actions needed to overcome them include identifying proper input datasets, updating or upgrading tools, finding alternative support for obsolete tools, debugging to find and fix the tools and connections causing issues, modifying tool connections, etc. These challenges and our action list offer guidelines for future workflow composers to create better workflows with enhanced reusability.

A SWfMS stores provenance data at different phases of a workflow life cycle, which can help workflow construction. This provenance data enables reproducibility and knowledge reuse in the scientific community, but it is usually many times larger than the workflow and input data, and managing it grows in complexity with large-scale applications. In our second study, we document the challenges of provenance management and reuse in e-science, focusing primarily on scientific workflow approaches, by exploring different SWfMSs and provenance management systems, and we investigate ways to overcome these challenges.

Creating a workflow is difficult but essential for data-intensive complex analysis, and existing workflows face several challenges to reuse. In our third study, we therefore build a recommendation system that uses machine learning to recommend tools, helping scientists create optimal, error-free, and efficient workflows from the existing reusable workflows in Galaxy workflow repositories.

The findings from our studies and the proposed techniques have the potential to simplify data-intensive analysis, ensuring reliability and efficiency.
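As an illustration of the kind of recommendation the third study targets, the sketch below learns next-tool suggestions from bigram counts over the tool sequences of existing workflows. The tool names and the simple frequency model are stand-ins for the thesis's machine-learning approaches, not its actual implementation.

```python
# Next-tool recommender learned from existing workflows' tool sequences,
# using bigram counts (an illustrative stand-in for a learned model).
from collections import Counter, defaultdict

workflows = [                      # hypothetical Galaxy tool sequences
    ["fastqc", "trimmomatic", "bwa_mem", "samtools_sort"],
    ["fastqc", "trimmomatic", "hisat2", "featurecounts"],
    ["fastqc", "bwa_mem", "samtools_sort", "freebayes"],
]

following = defaultdict(Counter)
for wf in workflows:
    for prev_tool, next_tool in zip(wf, wf[1:]):
        following[prev_tool][next_tool] += 1

def recommend(tool, k=2):
    """Return the k tools most often used right after `tool`."""
    return [t for t, _ in following[tool].most_common(k)]

print(recommend("fastqc"))   # -> ['trimmomatic', 'bwa_mem']
```

A trained sequence model would generalize beyond exact bigrams, but the interface is the same: given the partial workflow composed so far, rank candidate tools for the next step.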
Advanced Computer Technologies for Integrated Agro-Hydrologic Systems Modeling: Coupled Crop and Hydrologic Models for Agricultural Intensification Impacts Assessment
Coupling hydrologic and crop models is increasingly becoming an important task in agro-hydrologic systems studies. Whether for resource conservation or cropping-system improvement, the complex interactions between the hydrologic regime and crop management components require an integrative approach in order to be fully understood. Nevertheless, the literature offers limited resources on model coupling aimed at environmental scientists: most guides are written primarily for computer specialists, which makes them hard to follow and apply. To address this gap, we present an extensive study of crop and hydrologic model coupling that targets agro-hydrologic modeling studies in their integrative complexity. The primary focus is to understand the relationship between agricultural intensification and its impacts on the hydrologic balance. We provide documentation, classifications, applications and references for the available technologies and development trends. We applied the results of this investigation by coupling the DREAM hydrologic model with the DSSAT crop model. Both models were upgraded, in the source code (DREAM) or the operational base (DSSAT), for interoperability and parallelization. The resulting model operates on a grid basis at a daily time step. The model was applied in southern Italy to analyze the effect of fertilizer application on runoff generation between 2000 and 2013. The results show a significant impact of nitrogen application on water yield: nearly 71.5 thousand cubic meters of rainwater per kilogram of nitrogen per hectare are lost through a reduction of the runoff coefficient. Furthermore, a significant correlation between nitrogen application amounts and runoff is found on a yearly basis, with a Pearson's coefficient of 0.93.
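To make the coupling scheme concrete, the sketch below runs a toy hydrologic bucket model and a toy crop-uptake model over a grid at a daily time step, exchanging soil-water storage each day. The stub functions and numbers are illustrative assumptions standing in for DREAM and DSSAT; the per-cell independence is what makes the grid loop amenable to the parallelization mentioned above.

```python
# Schematic daily coupling loop: a hydrologic step updates per-cell water
# storage, then a crop step consumes it. Toy stubs, not DREAM/DSSAT.
def hydrologic_step(cell, rain_mm):
    """Toy bucket water balance: add rain, spill excess storage as runoff."""
    cell["storage_mm"] += rain_mm
    runoff_mm = max(cell["storage_mm"] - cell["capacity_mm"], 0.0)
    cell["storage_mm"] -= runoff_mm
    return runoff_mm

def crop_step(cell, demand_mm=4.0):
    """Toy crop water uptake: transpire up to demand, limited by storage."""
    uptake_mm = min(demand_mm, cell["storage_mm"])
    cell["storage_mm"] -= uptake_mm
    return uptake_mm

# A 4-cell grid run over a hypothetical 5-day rainfall series.
grid = [{"storage_mm": 50.0, "capacity_mm": 120.0} for _ in range(4)]
rainfall_mm = [0.0, 12.5, 3.0, 0.0, 30.0]

for day, rain in enumerate(rainfall_mm):
    for cell in grid:                        # cells are independent, so this
        runoff = hydrologic_step(cell, rain)  # inner loop parallelizes naturally
        uptake = crop_step(cell)             # crop model sees updated moisture
    print(f"day {day}: storage {grid[0]['storage_mm']:.1f} mm")
```

In the real coupled system the exchange would also carry nutrient state (e.g. applied nitrogen) from the crop side into the runoff calculation, which is the pathway behind the fertilizer-runoff effect reported above.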