
    Effective and Efficient Similarity Search in Scientific Workflow Repositories

    Scientific workflows have become a valuable tool for large-scale data processing and analysis. This has led to the creation of specialized online repositories to facilitate workflow sharing and reuse. Over time, these repositories have grown to sizes that call for advanced methods to support workflow discovery, in particular for similarity search. Effective similarity search requires both high-quality algorithms for the comparison of scientific workflows and efficient strategies for indexing, searching, and ranking of search results. Yet, the graph structure of scientific workflows poses severe challenges to each of these steps. Here, we present a complete system for effective and efficient similarity search in scientific workflow repositories, based on the Layer Decomposition approach to scientific workflow comparison. Layer Decomposition specifically accounts for the directed dataflow underlying scientific workflows and, compared to other state-of-the-art methods, delivers the best results for similarity search at comparably low runtimes. Stacking Layer Decomposition with even faster, structure-agnostic approaches allows us to use proven, off-the-shelf tools for workflow indexing to further reduce runtimes and scale similarity search to the sizes of current repositories.
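    The stacking strategy described above can be pictured as a two-stage retrieval pipeline. The following is a minimal, hypothetical sketch (the names and the Jaccard pre-filter are assumptions for illustration, not the paper's exact design): a cheap, structure-agnostic measure prunes the repository to a small candidate set, and only the survivors are re-ranked with the more expensive structure-aware measure.

```python
def jaccard(a, b):
    """Cheap, structure-agnostic similarity on sets of module labels."""
    return len(a & b) / len(a | b) if a | b else 0.0

def stacked_search(query, repository, structural_sim, k_filter=100, k_final=10):
    """Two-stage ("stacked") similarity search: cheap filter, then structural re-ranking."""
    # Stage 1: fast filter over the whole repository using label sets only.
    candidates = sorted(repository,
                        key=lambda wf: jaccard(query.labels, wf.labels),
                        reverse=True)[:k_filter]
    # Stage 2: expensive structure-aware re-ranking on the candidates only.
    return sorted(candidates,
                  key=lambda wf: structural_sim(query, wf),
                  reverse=True)[:k_final]
```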

    Layer Decomposition: An Effective Structure-based Approach for Scientific Workflow Similarity

    Scientific workflows have become a valuable tool for large-scale data processing and analysis. This has led to the creation of specialized online repositories to facilitate workflow sharing and reuse. Over time, these repositories have grown to sizes that call for advanced methods to support workflow discovery, in particular for effective similarity search. Here, we present a novel and intuitive workflow similarity measure that is based on layer decomposition. Layer decomposition accounts for the directed dataflow underlying scientific workflows, a property which has not been adequately considered in previous methods. We comparatively evaluate our algorithm using a gold standard for 24 query workflows from a repository of almost 1500 scientific workflows, and show that it a) delivers the best results for similarity search, b) has a much lower runtime than other, often highly complex competitors in structure-aware workflow comparison, and c) can be stacked easily with even faster, structure-agnostic approaches to further reduce runtime while retaining result quality.
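    As a rough illustration of the general idea behind layer decomposition: a workflow DAG can be split into an ordered sequence of layers, where each layer contains the tasks whose inputs are all produced by earlier layers, and two workflows can then be compared by aligning their layer sequences. The sketch below computes such a layering; it reflects the general technique, not necessarily the authors' exact algorithm.

```python
from collections import defaultdict

def layer_decomposition(nodes, edges):
    """Split a workflow DAG into layers following the dataflow direction:
    layer 0 holds all source tasks; each later layer holds the tasks whose
    predecessors all appear in earlier layers."""
    preds = defaultdict(set)
    for u, v in edges:
        preds[v].add(u)
    remaining, placed, layers = set(nodes), set(), []
    while remaining:
        layer = {n for n in remaining if preds[n] <= placed}
        if not layer:
            raise ValueError("not a DAG: cycle detected")
        layers.append(sorted(layer))
        placed |= layer
        remaining -= layer
    return layers
```

    For example, `layer_decomposition(['a', 'b', 'c'], [('a', 'b'), ('a', 'c')])` yields `[['a'], ['b', 'c']]`.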

    Could dementia be detected from UK primary care patients’ records by simple automated methods earlier than by the treating physician? A retrospective case-control study

    Background: Timely diagnosis of dementia is a policy priority in the United Kingdom (UK). Primary care physicians receive incentives to diagnose dementia; however, 33% of patients are still not receiving a diagnosis. We explored automating early detection of dementia using data from patients’ electronic health records (EHRs). We investigated: a) how early a machine-learning model could accurately identify dementia before the physician; b) if models could be tuned for dementia subtype; and c) what the best clinical features were for achieving detection. Methods: Using EHRs from the Clinical Practice Research Datalink in a case-control design, we selected patients aged >65y with a diagnosis of dementia recorded in 2000-2012 (cases) and matched them 1:1 to controls; we also identified subsets of Alzheimer’s and vascular dementia patients. Using 77 coded concepts recorded in the 5 years before diagnosis, we trained random forest classifiers, and evaluated models using the Area Under the Receiver Operating Characteristic Curve (AUC). We examined models by: year prior to diagnosis, subtype, and the most important features contributing to classification. Results: 95,202 patients (median age 83y; 64.8% female) were included (50% dementia cases). Classification of dementia cases and controls was poor 2-5 years prior to physician-recorded diagnosis (AUC range 0.55-0.65) but good in the year before (AUC: 0.84). Features indicating increasing cognitive and physical frailty dominated models 2-5 years before diagnosis; in the final year, initiation of the dementia diagnostic pathway (symptoms, screening and referral) explained the sudden increase in accuracy. No substantial differences were seen between all-cause dementia and subtypes. Conclusions: Automated detection of dementia earlier than the treating physician may be problematic if using only primary care data. Future work should investigate more complex modelling, benefits of linking multiple sources of healthcare data and monitoring devices, or contextualising the algorithm to those cases that the GP would need to investigate.
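    The modelling setup in the Methods maps directly onto a standard scikit-learn pattern. The sketch below is a minimal illustration under assumed inputs (a feature matrix `X` of the 77 coded-concept features per patient and binary labels `y` for case vs. matched control); it is not the study's published pipeline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_dementia_classifier(X, y, feature_names):
    """Train a random forest on coded-concept features; report AUC and the
    features contributing most to classification."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_tr, y_tr)
    # AUC from predicted probabilities of the positive (dementia) class.
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    # Rank features by impurity-based importance, as a proxy for the
    # "most important features" analysis described in the abstract.
    top_features = sorted(zip(feature_names, clf.feature_importances_),
                          key=lambda t: t[1], reverse=True)[:10]
    return auc, top_features
```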

    Reproducibility of scientific workflows execution using cloud-aware provenance (ReCAP)

    Provenance of scientific workflows has been considered a means of providing workflow reproducibility. However, the provenance approaches adopted so far are not applicable in the context of the Cloud because the provenance trace lacks the Cloud information. This paper presents a novel approach that collects the Cloud-aware provenance and represents it as a graph. The workflow execution reproducibility on the Cloud is determined by comparing the workflow provenance at three levels, i.e., workflow structure, execution infrastructure and workflow outputs. The experimental evaluation shows that the implemented approach can detect changes in the provenance traces and the outputs produced by the workflow.
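    The three-level comparison can be summarized as a conjunction of checks. Below is a hypothetical sketch; the field names are assumptions about what a Cloud-aware provenance record might contain, not the ReCAP schema.

```python
def is_reproduced(original, rerun):
    """Compare two provenance records at the three levels described above."""
    structure_ok = original.workflow_graph == rerun.workflow_graph    # same tasks and dataflow edges
    infrastructure_ok = (original.vm_flavour == rerun.vm_flavour      # same Cloud resource configuration
                         and original.vm_image == rerun.vm_image)
    outputs_ok = original.output_hashes == rerun.output_hashes        # identical produced files
    return structure_ok and infrastructure_ok and outputs_ok
```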

    An APRI+ALBI Based Multivariable Model as Preoperative Predictor for Posthepatectomy Liver Failure.

    OBJECTIVE AND BACKGROUND Clinically significant posthepatectomy liver failure (PHLF B+C) remains the main cause of mortality after major hepatic resection. This study aimed to establish a multivariable model (MVM) based on APRI+ALBI, the aspartate aminotransferase-to-platelet ratio index (APRI) combined with the albumin-bilirubin grade (ALBI), to predict PHLF and to compare its performance to indocyanine green clearance (ICG-R15 or ICG-PDR) and albumin-ICG evaluation (ALICE). METHODS 12,056 patients from the National Surgical Quality Improvement Program (NSQIP) database were used to generate an MVM to predict PHLF B+C. The model was determined using stepwise backwards elimination. Performance of the model was tested using receiver operating characteristic curve analysis and validated in an international cohort of 2,525 patients. In 620 patients, the APRI+ALBI MVM, trained in the NSQIP cohort, was compared with MVMs based on other liver function tests (ICG clearance, ALICE) by comparing the areas under the curve (AUC). RESULTS An MVM including APRI+ALBI, age, sex, tumor type and extent of resection was found to predict PHLF B+C with an AUC of 0.77, with comparable performance in the validation cohort (AUC 0.74). In direct comparison with other MVMs based on more expensive and time-consuming liver function tests (ICG clearance, ALICE), the APRI+ALBI MVM demonstrated equal predictive potential for PHLF B+C. A smartphone application for calculation of the APRI+ALBI MVM was designed. CONCLUSION Risk assessment via the APRI+ALBI MVM for PHLF B+C increases preoperative predictive accuracy and represents a universally available and cost-effective risk assessment prior to hepatectomy, facilitated by a freely available smartphone app.
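    For reference, APRI and ALBI are computed from routine laboratory values using the standard published formulas; the sketch below encodes those formulas. The function names and the grading helper are illustrative, and the abstract does not state which units or cut-offs the study itself applied.

```python
import math

def apri(ast_u_l, ast_uln_u_l, platelets_10e9_l):
    """AST-to-platelet ratio index: (AST / upper limit of normal) / platelets * 100."""
    return (ast_u_l / ast_uln_u_l) / platelets_10e9_l * 100

def albi_score(bilirubin_umol_l, albumin_g_l):
    """Albumin-bilirubin score: 0.66 * log10(bilirubin) - 0.085 * albumin."""
    return 0.66 * math.log10(bilirubin_umol_l) - 0.085 * albumin_g_l

def albi_grade(score):
    """Standard ALBI grade cut-offs: grade 1 <= -2.60 < grade 2 <= -1.39 < grade 3."""
    if score <= -2.60:
        return 1
    if score <= -1.39:
        return 2
    return 3
```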

    Similarity measures for scientific workflows

    Over the last decade, scientific workflows have gained attention as a valuable tool to create reproducible in-silico experiments. Specialized online repositories have emerged which allow such workflows to be shared and reused by the scientific community. With the increasing size of these repositories, methods to compare scientific workflows regarding their functional similarity become a necessity. To allow duplicate detection, similarity search, or clustering, similarity measures for scientific workflows are an essential prerequisite. This thesis investigates similarity measures for scientific workflows. We carry out four consecutive research tasks: First, we closely investigate the relevant properties of scientific workflows regarding their similarity and identify characteristics of re-use of their components. Second, we review and dissect existing approaches to scientific workflow comparison into a defined set of subtasks necessary in the process of workflow comparison, and re-implement previous approaches to each subtask. We create a large gold-standard corpus of expert ratings on workflow similarity, with more than 2400 ratings provided for 485 pairs of workflows by 15 workflow experts from 6 institutions. For the first time, this allows a comprehensive, comparative evaluation of different scientific workflow similarity measures, confirming some previous findings, but rejecting others. Third, we propose and evaluate a novel method for scientific workflow comparison. We show that this novel method provides results of both higher quality and higher consistency than previous approaches, and can easily be stacked and ensembled with other approaches for still better performance and higher speed. Fourth, we show how our findings can be leveraged to implement a search engine using off-the-shelf tools that performs fast, high-quality similarity search for scientific workflows at repository scale, a premier area of application for similarity measures for scientific workflows.
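    A gold-standard corpus like the one described above is typically used by checking how well a measure's scores correlate with the experts' ratings. This is a hedged sketch of such an evaluation; the rank-correlation choice and data layout are assumptions, not necessarily the thesis' protocol.

```python
from scipy.stats import spearmanr

def agreement_with_experts(measure, rated_pairs):
    """rated_pairs: iterable of (workflow_a, workflow_b, mean_expert_rating)."""
    predicted = [measure(a, b) for a, b, _ in rated_pairs]
    expert = [rating for _, _, rating in rated_pairs]
    # Rank correlation between the measure's scores and expert judgments.
    return spearmanr(predicted, expert).correlation
```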

    Experiences from Developing the Domain-Specific Entity Search Engine GeneView

    GeneView is a semantic search engine for the Life Sciences. Unlike traditional search engines, GeneView searches indexed documents not only at the textual (syntactic) level, but analyzes texts upon import to recognize and properly handle biomedical entities, relationships between those entities, and the structure of documents. This allows for a number of advanced features required to work effectively with scientific texts, such as precise search despite large numbers of synonyms and homonyms, entity disambiguation, ranking of documents by entity content, linking to structured knowledge about entities, and user-friendly highlighting of recognized entities. As of now, GeneView indexes approximately 21.4 million abstracts and 358,000 full texts with more than 200 million entities of 11 different types and more than 100,000 relationships of three different types. In this paper, we describe the architecture underlying the system with a focus on the complex pipeline of advanced NLP and information extraction tools necessary for achieving the above functionality. We also discuss open challenges in developing and maintaining a semantic search engine over a large (though not web-scale) corpus.
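    The key idea behind entity-level search is to index documents by normalized entity identifiers rather than surface strings, so that every synonym of an entity resolves to the same posting list. The sketch below illustrates this pattern in a few lines; it is a schematic assumption, not GeneView's actual index implementation.

```python
from collections import defaultdict

class EntityIndex:
    """Toy inverted index keyed by normalized entity IDs instead of words."""
    def __init__(self, synonym_to_id):
        # e.g. {"p53": "GeneID:7157", "TP53": "GeneID:7157"} from an NER dictionary
        self.synonym_to_id = synonym_to_id
        self.postings = defaultdict(set)  # entity ID -> set of document IDs

    def add(self, doc_id, recognized_entity_ids):
        # recognized_entity_ids would come from the NLP pipeline at import time
        for entity_id in recognized_entity_ids:
            self.postings[entity_id].add(doc_id)

    def search(self, query_term):
        # any synonym maps to one canonical ID, so "p53" and "TP53" hit the same documents
        entity_id = self.synonym_to_id.get(query_term)
        return self.postings.get(entity_id, set())
```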

    DeepTable: a permutation invariant neural network for table orientation classification

    Tables are a common way to present information in an intuitive and concise manner. They are used extensively in media such as scientific articles or web pages. Automatically analyzing the content of tables bears special challenges. One of the most basic tasks is determining the orientation of a table: in column tables, each column represents one entity, with the different attribute values present in the different rows; row tables are the reverse; and matrix tables give information on pairs of entities. In this paper, we address the problem of classifying a given table into one of the three layouts horizontal (for row tables), vertical (for column tables), and matrix. We describe DeepTable, a novel method based on deep neural networks designed for learning from sets. Contrary to previous state-of-the-art methods, this basis makes DeepTable invariant to the permutation of rows or columns, which is a highly desirable property, as in most tables the order of rows and columns does not carry specific information. We evaluate our method using a silver standard corpus of 5500 tables extracted from biomedical articles where the layout was determined heuristically. DeepTable outperforms previous methods in both precision and recall on our corpus. In a second evaluation, we manually labeled a corpus of 300 tables and were able to confirm that DeepTable reaches superior performance in the table layout classification task. The code and resources introduced here are available at https://github.com/Marhabibi/DeepTable.
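    A permutation-invariant table classifier can be built in the Deep-Sets style: embed each cell, pool across columns to get row representations and across rows to get column representations, then classify the pooled table vector. The PyTorch sketch below shows this pattern; the layer sizes and the exact pooling scheme are illustrative assumptions, not the published DeepTable architecture.

```python
import torch
import torch.nn as nn

class TableOrientationNet(nn.Module):
    """3-way classifier: horizontal, vertical, or matrix layout."""
    def __init__(self, cell_dim, hidden=64, n_classes=3):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(cell_dim, hidden), nn.ReLU())
        self.row_fc = nn.Linear(hidden, hidden)   # encodes per-row summaries
        self.col_fc = nn.Linear(hidden, hidden)   # encodes per-column summaries
        self.rho = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, cells):                          # cells: (n_rows, n_cols, cell_dim)
        h = self.phi(cells)                            # embed every cell independently
        rows = torch.relu(self.row_fc(h.mean(dim=1)))  # mean over columns -> one vector per row
        cols = torch.relu(self.col_fc(h.mean(dim=0)))  # mean over rows -> one vector per column
        table = torch.cat([rows.mean(dim=0), cols.mean(dim=0)])  # pool rows/columns away
        return self.rho(table)                         # logits over the three layouts
```

    Because every pooling step is a mean, shuffling the rows or columns of `cells` leaves the output unchanged, which is exactly the invariance property described in the abstract.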

    (Re)Use in Public Scientific Workflow Repositories

    Scientific workflows have been introduced to enhance reproducibility, sharing and reuse of in-silico experiments. Faced with increasing numbers of workflows available in public repositories, users have a crucial need for assistance in workflow discovery. Identifying the functional elements shared between workflows, and thus determining similarity between workflows, is then a key point. In this paper, we present the results of a study we performed on 898 workflows from myExperiment. Our contribution is fourfold: (i) we discuss the critical problem of identifying workflows and workflow elements, (ii) we provide detailed analysis about the frequencies of re-used elements across workflows, (iii) we consider, for the first time, the problem of cross-author reuse and (iv) we highlight characteristics shared between reused elements.
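    Cross-author reuse as described above boils down to counting, for each workflow element, how many distinct authors' workflows contain it. A hedged sketch follows; the attribute names are assumptions about the repository metadata, not the study's actual data model.

```python
from collections import defaultdict

def cross_author_reuse(workflows):
    """Return the elements used by more than one author, with author counts."""
    authors_per_element = defaultdict(set)
    for wf in workflows:
        for element in wf.elements:   # e.g. normalized processor/service identifiers
            authors_per_element[element].add(wf.author)
    return {elem: len(authors)
            for elem, authors in authors_per_element.items()
            if len(authors) > 1}
```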