
    SPARQLGX in Action: Efficient Distributed Evaluation of SPARQL with Apache Spark

    We demonstrate SPARQLGX, our implementation of a distributed SPARQL evaluator. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures.

    Low-Latency Sliding Window Algorithms for Formal Languages

    Low-latency sliding window algorithms for regular and context-free languages are studied, where latency refers to the worst-case time spent for a single window update or query. For every regular language L it is shown that there exists a constant-latency solution that supports adding and removing symbols independently on both ends of the window (the so-called two-way variable-size model). We prove that this result extends to all visibly pushdown languages. For deterministic 1-counter languages we present an O(log n) latency sliding window algorithm for the two-way variable-size model, where n refers to the window size. We complement these results with a conditional lower bound: there exists a fixed real-time deterministic context-free language L such that, assuming the OMV (online matrix vector multiplication) conjecture, there is no sliding window algorithm for L with latency n^(1/2-ε) for any ε > 0, even in the most restricted sliding window model (one-way fixed-size model). The above-mentioned results all refer to the unit-cost RAM model with logarithmic word size. For regular languages we also present a refined picture using word sizes O(1), O(log log n), and O(log n).
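    To make the setting concrete, here is a minimal Python sketch of sliding-window membership for a regular language. It is not the algorithm from the abstract: it uses the folklore two-stack queue over DFA transition functions, so it only supports appending on the right and removing on the left, and its constant cost per operation is amortized rather than worst-case. The class name `SlidingWindowDFA` and the example automaton are assumptions made for illustration.

```python
class SlidingWindowDFA:
    """Window membership for a DFA via a two-stack queue (amortized O(|Q|) per op)."""

    def __init__(self, num_states, delta, start, accepting):
        self.n = num_states
        self.delta = delta            # symbol -> tuple mapping each state to its successor
        self.start = start
        self.accepting = set(accepting)
        self.identity = tuple(range(num_states))
        self.front = []               # (function, aggregate); top of stack is the oldest symbol
        self.back = []                # (function, aggregate); top of stack is the newest symbol

    def _compose(self, f, g):
        # Apply f first, then g.
        return tuple(g[f[q]] for q in range(self.n))

    def push(self, symbol):
        """Append a symbol on the right end of the window."""
        f = self.delta[symbol]
        prev = self.back[-1][1] if self.back else self.identity
        self.back.append((f, self._compose(prev, f)))

    def pop(self):
        """Remove the oldest symbol from the left end of the window."""
        if not self.front:
            # Move everything to the front stack, rebuilding aggregates in window order.
            while self.back:
                f, _ = self.back.pop()
                prev = self.front[-1][1] if self.front else self.identity
                self.front.append((f, self._compose(f, prev)))
        if not self.front:
            raise IndexError("pop from empty window")
        self.front.pop()

    def accepts(self):
        """Return True if the current window content is in the language."""
        fa = self.front[-1][1] if self.front else self.identity
        ba = self.back[-1][1] if self.back else self.identity
        return self._compose(fa, ba)[self.start] in self.accepting


# Hypothetical example DFA: words over {a, b} that contain "ab" as a factor.
CONTAINS_AB = {"a": (1, 1, 2), "b": (0, 2, 2)}
```

    Each symbol is pushed and moved at most once, which is where the amortized bound comes from; obtaining worst-case constant latency with updates on both ends, as the abstract claims, needs a more careful construction.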

    A Circuit-Based Approach to Efficient Enumeration

    We study the problem of enumerating the satisfying valuations of a circuit while bounding the delay, i.e., the time needed to compute each successive valuation. We focus on the class of structured d-DNNF circuits originally introduced in knowledge compilation, a sub-area of artificial intelligence. We propose an algorithm for these circuits that enumerates valuations with linear preprocessing and delay linear in the Hamming weight of each valuation. Moreover, valuations of constant Hamming weight can be enumerated with linear preprocessing and constant delay. Our results yield a framework for efficient enumeration that applies to all problems whose solutions can be compiled to structured d-DNNFs. In particular, we use it to recapture classical results in database theory, for factorized database representations and for MSO evaluation. This gives an independent proof of constant-delay enumeration for MSO formulae with first-order free variables on bounded-treewidth structures

    Extensive Gene Remodeling in the Viral World: New Evidence for Nongradual Evolution in the Mobilome Network

    Complex nongradual evolutionary processes such as gene remodeling are difficult to model, to visualize, and to investigate systematically. Despite these challenges, the creation of composite (or mosaic) genes by combination of genetic segments from unrelated gene families was established as an important adaptive phenomenon in eukaryotic genomes. In contrast, almost no general studies have been conducted to quantify composite genes in viruses. Although viral genome mosaicism has been well described, the extent of gene mosaicism and its rules of emergence remain largely unexplored. Applying methods from graph theory to inclusive similarity networks, and using data from more than 3,000 complete viral genomes, we provide the first demonstration that composite genes in viruses are 1) functionally biased, 2) involved in key aspects of the arms race between cells and viruses, and 3) classifiable into two distinct types of composite genes in all viral classes. Beyond the quantification of the widespread recombination of genes among different viruses of the same class, we also report a striking sharing of genetic information between viruses of different classes and with different nucleic acid types. This latter discovery provides novel evidence for the existence of a large and complex mobilome network, which appears partly bound by the sharing of genetic information and by the formation of composite genes between mobile entities with different genetic material. Considering that there are around 10^31 viruses on the planet, gene remodeling appears to be a hugely significant way of generating and moving novel sequences between different kinds of organisms on Earth.

    Ranked Enumeration of MSO Logic on Words

    In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are processed on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are returned, and this order usually depends on the techniques used and not on the relevance for the user. In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that makes it possible to express MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and adds ranking to them. The main technical result of the paper is an algorithm for efficiently enumerating all the solutions of formulae in increasing order of cost, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm lies in its use of functional data structures, in particular an extension of functional Brodal queues suited to the ranked enumeration of MSO on words.

    Tailored vertex ordering for faster triangle listing in large graphs

    Listing triangles is a fundamental graph problem with many applications, and large graphs require fast algorithms. Vertex ordering allows the orientation of edges from lower to higher vertex indices, and state-of-the-art triangle listing algorithms use this to accelerate their execution and to bound their time complexity. Yet, only basic orderings have been tested. In this paper, we show that studying the precise cost of algorithms instead of their bounded complexity leads to faster solutions. We introduce cost functions that link ordering properties with the running time of a given algorithm. We prove that their minimization is NP-hard and propose heuristics to obtain new orderings with different trade-offs between cost reduction and ordering time. Using datasets with up to two billion edges, we show that our heuristics accelerate the listing of triangles by an average of 38% when the ordering is already given as an input, and 16% when the ordering time is included. Comment: 11 pages, 4 figures. Open-source C++ code available at: https://github.com/lecfab/vol
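    The role of vertex ordering described above can be sketched in a few lines of Python. This is an illustrative assumption rather than the paper's method: it uses the basic degree ordering (the kind of baseline the abstract says has already been tested), not the cost-driven heuristics the authors propose, and the function name `list_triangles` is hypothetical.

```python
from collections import defaultdict


def list_triangles(edges):
    """List each triangle once by orienting edges from lower to higher rank."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                    # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Basic ordering: rank vertices by (degree, vertex id).
    ordered = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(ordered)}

    # Keep only out-neighbours, i.e. neighbours with a strictly higher rank.
    out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    triangles = []
    for u in adj:
        for v in out[u]:
            # A common out-neighbour w closes the triangle u -> v -> w.
            for w in out[u] & out[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return sorted(triangles)
```

    Because every edge points from lower to higher rank, a triangle is reported only from its lowest-ranked vertex, so no deduplication is needed. Changing the `key` function is the hook where a different ordering would be plugged in.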

    Low-Latency Sliding Window Algorithms for Formal Languages

    Low-latency sliding window algorithms for regular and context-free languages are studied, where latency refers to the worst-case time spent for a single window update or query. For every regular language L it is shown that there exists a constant-latency solution that supports adding and removing symbols independently on both ends of the window (the so-called two-way variable-size model). We prove that this result extends to all visibly pushdown languages. For deterministic 1-counter languages we present an O(log n) latency sliding window algorithm for the two-way variable-size model, where n refers to the window size. We complement these results with a conditional lower bound: there exists a fixed real-time deterministic context-free language L such that, assuming the OMV (online matrix vector multiplication) conjecture, there is no sliding window algorithm for L with latency n^(1/2-ε) for any ε > 0, even in the most restricted sliding window model (one-way fixed-size model). The above-mentioned results all refer to the unit-cost RAM model with logarithmic word size. For regular languages we also present a refined picture using word sizes O(1), O(log log n), and O(log n). Comment: A short version will be presented at the conference FSTTCS 202

    Accessibility and quality of drug company disclosures of payments to healthcare professionals and organisations in 37 countries: A European policy review

    Objectives: To examine the accessibility and quality of drug company payment data in Europe. Design: Comparative policy review of payment data in countries with different regulatory approaches to disclosure. Setting: 37 European countries. Participants: European Federation of Pharmaceutical Industries and Associations, its trade group and their drug company members; eurosfordocs.eu, an independent database integrating payments disclosed by companies and trade groups; regulatory bodies overseeing payment disclosure. Main outcome measures: Regulatory approaches to disclosure (self-regulation, public regulation, combination of the two); data accessibility (format, structure, searchability, customisable summary statistics, downloadability) and quality (spectrum of disclosed characteristics, payment aggregation, inclusion of taxes, recipient or donor identifiers). Results: Of 30 countries with self-regulation, five had centralised databases, with Disclosure UK displaying the highest accessibility and quality. In 23 of the remaining countries with self-regulation and available data, disclosures were published in the portable document format (PDF) on individual company websites, preventing the public from understanding payment patterns. Eurosfordocs.eu had greater accessibility than any industry-run database, but the match between the value of payments integrated in eurosfordocs.eu and summarised separately by industry in seven countries ranged between 56% and 100% depending on country. Eurosfordocs.eu shared quality shortcomings with the underlying industry data, including ambiguities in identifying payments and their recipients. Public regulation was found in 15 countries, used either alone (3), in combination (4) or in parallel with (8) self-regulation. Of these countries, 13 established centralised databases with widely ranging accessibility and quality, and sharing some shortcomings with the industry-run databases. The French database, Transparence Santé, had the highest accessibility and quality, exceeding that of Disclosure UK. Conclusions: The accessibility and quality of payment data disclosed in European countries are typically low, hindering investigation of financial conflicts of interest. Some improvements are straightforward but reaching the standards characterising the widely researched US Open Payments database requires major regulatory change.

    Good practices for clinical data warehouse implementation: a case study in France

    Real World Data (RWD) holds great promise for improving the quality of care. However, specific infrastructures and methodologies are required to derive robust knowledge and bring innovations to the patient. Drawing upon the national case study of the governance of the 32 French regional and university hospitals, we highlight key aspects of modern Clinical Data Warehouses (CDWs): governance, transparency, types of data, data reuse, technical tools, documentation and data quality control processes. Semi-structured interviews and a review of reported studies on French CDWs were conducted from March to November 2022. Out of 32 regional and university hospitals in France, 14 have a CDW in production, 5 are experimenting, 5 have a prospective CDW project, and 8 did not have any CDW project at the time of writing. The implementation of CDWs in France dates from 2011 and accelerated in late 2020. From this case study, we draw some general guidelines for CDWs. The current orientation of CDWs towards research requires efforts in governance stabilization, standardization of data schemas, and development of data quality and data documentation. Particular attention must be paid to the sustainability of the warehouse teams and to multi-level governance. The transparency of the studies and of the data transformation tools must improve to allow successful multi-centric data reuse as well as innovations in routine care. Comment: 16 page

    The SPARQLGX System for Distributed Evaluation of SPARQL Queries

    SPARQL is the W3C standard query language for querying data expressed in the Resource Description Framework (RDF). The increasing amounts of data available in the RDF format raise a major need and research interest in building efficient and scalable distributed SPARQL query evaluators. In this context, we propose SPARQLGX: an implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries efficiently. SPARQLGX relies on an automated translation of SPARQL queries into optimized executable Spark code. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments which show how SPARQLGX compares to state-of-the-art implementations and we show that our approach scales better than other systems in terms of supported dataset size. With its simple design, SPARQLGX represents an interesting alternative in several scenarios
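    As a rough illustration of what evaluating a SPARQL basic graph pattern as a sequence of joins looks like, here is a small single-machine Python sketch. It is not SPARQLGX's generated Spark code: grouping triples by predicate, requiring a constant predicate in every pattern, and all function names are assumptions made for this example.

```python
from collections import defaultdict


def partition_by_predicate(triples):
    """Group (subject, predicate, object) triples into per-predicate (s, o) lists."""
    parts = defaultdict(list)
    for s, p, o in triples:
        parts[p].append((s, o))
    return parts


def is_var(term):
    return term.startswith("?")


def match_pattern(parts, pattern):
    """Yield variable bindings for one triple pattern (predicate must be a constant)."""
    s, p, o = pattern
    for subj, obj in parts.get(p, []):
        binding = {}
        ok = True
        for term, value in ((s, subj), (o, obj)):
            if is_var(term):
                # The same variable used twice must bind to the same value.
                if binding.get(term, value) != value:
                    ok = False
                    break
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            yield binding


def evaluate_bgp(parts, patterns):
    """Join the triple patterns one after another on their shared variables."""
    solutions = [{}]
    for pattern in patterns:
        matches = list(match_pattern(parts, pattern))
        solutions = [
            {**sol, **m}
            for sol in solutions
            for m in matches
            if all(sol.get(k, v) == v for k, v in m.items())
        ]
    return solutions
```

    For example, with triples `("alice", "knows", "bob")` and `("bob", "age", "30")`, the patterns `("?x", "knows", "?y")` and `("?y", "age", "?a")` join on `?y`. In a distributed setting the nested loop would be replaced by a distributed join on the shared variables.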