
    Engineering Crowdsourced Stream Processing Systems

    A crowdsourced stream processing (CSP) system is one that incorporates crowdsourced tasks in the processing of a data stream. This can be seen as enabling crowdsourcing work to be applied to a sample of large-scale data at high speed or, equivalently, enabling stream processing to employ human intelligence. It also substantially expands the capabilities of data processing systems. Engineering a CSP system requires combining human and machine computation elements. From a general systems theory perspective, this means taking into account inherited as well as emerging properties of both these elements. In this paper, we position CSP systems within a broader taxonomy, outline a series of design principles and evaluation metrics, present an extensible framework for their design, and describe several design patterns. We showcase the capabilities of CSP systems through a case study that applies our proposed framework to the design and analysis of a real system (AIDR) that classifies social media messages during time-critical crisis events. Results show that, compared to a pure stream processing system, AIDR achieves higher data classification accuracy, while compared to a pure crowdsourcing solution, it makes better use of human workers by requiring much less manual effort.
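
    The following is a minimal sketch of the hybrid design pattern the abstract alludes to and the AIDR case study exemplifies: a stream classifier that handles high-confidence items by machine, routes low-confidence items to a crowd queue, and folds the resulting human labels back into the model. The class name, method names, and threshold are illustrative assumptions, not AIDR's actual API.

```python
# Hypothetical sketch of a machine/crowd hybrid stream classifier,
# not AIDR's actual implementation.
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class HybridStreamClassifier:
    model: object                        # assumed interface: classify(), update()
    confidence_threshold: float = 0.8    # illustrative cutoff
    crowd_queue: Queue = field(default_factory=Queue)

    def process(self, item):
        """Label one stream item; defer to the crowd when unsure."""
        label, confidence = self.model.classify(item)
        if confidence >= self.confidence_threshold:
            return label                 # machine path: fast and cheap
        self.crowd_queue.put(item)       # human path: slower but accurate
        return None                      # the label arrives asynchronously

    def on_crowd_label(self, item, label):
        """Fold a human-provided label back into the machine model."""
        self.model.update(item, label)
```

    The point of the pattern is that human effort is spent only on the items the machine cannot confidently handle, which is how a CSP system can beat a pure crowdsourcing solution on worker effort while beating a pure stream processor on accuracy.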

    DataWash: an advanced Snowflake data quality tool powered by Snowpark

    The increasing need for data accuracy and completeness in today's organizations has highlighted the importance of data quality management. To address this need, DataWash has emerged as an advanced data quality tool powered by Snowpark that provides organizations with a comprehensive solution for improving the quality of their data in Snowflake. The tool supports scheduled batch execution and ad hoc, on-demand analysis, generating a Power BI report for easy visualization of data quality metrics. The suite of modules provided by DataWash can handle a wide range of data quality issues, such as data duplication, inconsistencies, and compliance with data standards. In essence, this bachelor's thesis aims to develop DataWash as an advanced data quality tool that helps organizations improve the accuracy and reliability of their data, by exploring its capabilities and cost-effectiveness, evaluating its performance on real-world datasets, and benchmarking it against leading data quality tools on the market.
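
    As an illustration of what one DataWash-style quality module might look like, here is a minimal Snowpark sketch that computes two common metrics: the duplicate-key ratio and per-column null ratios. The function names and metric choices are assumptions for illustration; this is not DataWash's actual code.

```python
# Hypothetical DataWash-style quality checks over a Snowflake table,
# assuming an already-configured Snowpark session.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col


def duplication_metrics(session: Session, table: str, key_cols: list[str]) -> dict:
    """Row count, distinct key count, and the share of duplicated keys."""
    df = session.table(table)
    total = df.count()
    distinct = df.select(*key_cols).distinct().count()
    return {
        "total_rows": total,
        "distinct_keys": distinct,
        "duplicate_ratio": 0.0 if total == 0 else (total - distinct) / total,
    }


def null_metrics(session: Session, table: str, columns: list[str]) -> dict:
    """Fraction of NULL values per column."""
    df = session.table(table)
    total = df.count() or 1  # avoid division by zero on empty tables
    return {c: df.filter(col(c).is_null()).count() / total for c in columns}
```

    Metrics like these are exactly what the Power BI report mentioned above would visualize, whether produced by a scheduled batch run or an ad hoc invocation.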

    The inference of gene trees with species trees

    Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes are duplicated, lost, or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models that describe the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. In this article we review the various models that have been used to describe the relationship between gene trees and species trees. These models account for gene duplication and loss, transfer, or incomplete lineage sorting. Some of them consider several types of events together, but none currently considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution.
    Comment: Review article in relation to the "Mathematical and Computational Evolutionary Biology" conference, Montpellier, 201
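
    To make the gene tree-species tree relationship concrete, here is a minimal sketch of the classic LCA reconciliation step used in duplication-loss models: each gene-tree node is mapped to the lowest common ancestor of its descendants' species, and a node whose mapping coincides with a child's mapping is flagged as a duplication. The tree encoding (binary gene trees, dict-based parent and child maps) is an illustrative assumption.

```python
# Minimal LCA reconciliation sketch: infer duplication vs. speciation
# events by mapping a binary gene tree onto a species tree.

def ancestors(parent, node):
    """Path from `node` up to the species-tree root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(parent, a, b):
    """Lowest common ancestor of species-tree nodes a and b."""
    anc_a = set(ancestors(parent, a))
    for n in ancestors(parent, b):
        if n in anc_a:
            return n
    raise ValueError("nodes share no common ancestor")


def reconcile(gene_root, gene_children, leaf_species, species_parent):
    """Map gene-tree nodes to species-tree nodes and label internal events."""
    mapping, events = {}, {}

    def visit(g):
        kids = gene_children.get(g)
        if not kids:                         # gene leaf: its sampled species
            mapping[g] = leaf_species[g]
            return mapping[g]
        left, right = (visit(k) for k in kids)
        mapping[g] = lca(species_parent, left, right)
        # Duplication iff the node maps to the same species node as a child.
        events[g] = "duplication" if mapping[g] in (left, right) else "speciation"
        return mapping[g]

    visit(gene_root)
    return mapping, events


# Tiny example: ((human, chimp), mouse) species tree, congruent gene tree.
species_parent = {"human": "primate", "chimp": "primate",
                  "primate": "root", "mouse": "root"}
gene_children = {"g_root": ("g1", "g2"), "g1": ("h1", "c1")}
leaf_species = {"h1": "human", "c1": "chimp", "g2": "mouse"}
mapping, events = reconcile("g_root", gene_children, leaf_species, species_parent)
# mapping["g1"] == "primate"; both internal nodes are speciations here.
```

    Probabilistic duplication-loss, transfer, and incomplete-lineage-sorting models reviewed in the article generalize this deterministic mapping by scoring many possible reconciliations rather than committing to the parsimonious one.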

    RDF-TR: Exploiting structural redundancies to boost RDF compression

    The number and volume of semantic data have grown impressively over the last decade, promoting compression as an essential tool for RDF preservation, sharing, and management. In contrast to universal compressors, RDF compression techniques are able to detect and exploit specific forms of redundancy in RDF data. Thus, state-of-the-art RDF compressors excel at exploiting syntactic and semantic redundancies, i.e., repetitions in the serialization format and information that can be inferred implicitly. However, little attention has been paid to the existence of structural patterns within the RDF dataset, i.e., structural redundancy. In this paper, we analyze structural regularities in real-world datasets and identify three schema-based sources of redundancy that underpin the schema-relaxed nature of RDF. We then propose RDF-Tr (RDF Triples Reorganizer), a preprocessing technique that discovers and removes this kind of redundancy before the RDF dataset is effectively compressed. In particular, RDF-Tr groups subjects that are described by the same predicates and locally re-codes the objects related to these predicates. Finally, we integrate RDF-Tr with two RDF compressors, HDT and k2-triples. Our experiments show that applying RDF-Tr improves the effectiveness of these compressors by up to 2.3 times, outperforming the most prominent state-of-the-art techniques.
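
    A minimal sketch of the core RDF-Tr idea (illustrative, not the authors' implementation): group subjects by the set of predicates that describe them, then re-code each predicate's objects with small local IDs so that a downstream compressor such as HDT or k2-triples sees long runs of repeated structure.

```python
# Hypothetical RDF-Tr-style reorganization of a triple set.
from collections import defaultdict


def rdf_tr(triples):
    """triples: iterable of (subject, predicate, object) strings."""
    by_subject = defaultdict(dict)
    for s, p, o in triples:
        by_subject[s].setdefault(p, []).append(o)

    # 1. Group subjects by predicate signature (the sorted set of
    #    predicates that describe them) - the structural regularity.
    families = defaultdict(list)
    for s, preds in by_subject.items():
        families[tuple(sorted(preds))].append(s)

    # 2. Within each family, re-code objects per predicate with dense
    #    local IDs, so repeated objects become small repeated integers.
    recoded = {}
    for sig, subjects in families.items():
        dictionaries = {p: {} for p in sig}
        rows = []
        for s in sorted(subjects):
            row = []
            for p in sig:
                d = dictionaries[p]
                row.append([d.setdefault(o, len(d)) for o in by_subject[s][p]])
            rows.append((s, row))
        recoded[sig] = (dictionaries, rows)
    return recoded
```

    Because every subject in a family shares one predicate layout, the predicate list needs to be stored only once per family, and the locally re-coded object IDs are far more repetitive than the original terms - which is what boosts the downstream compressor.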

    Keeping Authorities "Honest or Bust" with Decentralized Witness Cosigning

    The secret keys of critical network authorities - such as time, name, certificate, and software update services - represent high-value targets for hackers, criminals, and spy agencies wishing to use these keys secretly to compromise other hosts. To protect authorities and their clients proactively from undetected exploits and misuse, we introduce CoSi, a scalable witness cosigning protocol that ensures every authoritative statement is validated and publicly logged by a diverse group of witnesses before any client will accept it. A statement S collectively signed by W witnesses assures clients that S has been seen, and not immediately found erroneous, by those W observers. Even if S is compromised in a fashion not readily detectable by the witnesses, CoSi still guarantees S's exposure to public scrutiny, forcing secrecy-minded attackers to risk that the compromise will soon be detected by one of the W witnesses. Because clients can verify collective signatures efficiently without communication, CoSi protects clients' privacy and offers the first transparency mechanism effective against persistent man-in-the-middle attackers who control a victim's Internet access, the authority's secret key, and several witnesses' secret keys. CoSi builds on existing cryptographic multisignature methods, scaling them to support thousands of witnesses via signature aggregation over efficient communication trees. A working prototype demonstrates CoSi in the context of timestamping and logging authorities, enabling groups of over 8,000 distributed witnesses to cosign authoritative statements in under two seconds.
    Comment: 20 pages, 7 figures
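
    The multisignature core that CoSi scales up can be sketched in a few lines. The toy below implements a Schnorr-style collective signature: witnesses aggregate commitments, derive a shared challenge, and sum their responses, so a client verifies a single signature regardless of the number of cosigners. It uses toy group parameters, a flat loop instead of CoSi's communication tree, and no rogue-key defense; it is an educational sketch under those stated assumptions, not the CoSi protocol itself.

```python
# Toy Schnorr collective signature (educational sketch only).
import hashlib
import secrets

P, Q, G = 2039, 1019, 4            # tiny Schnorr group: G has prime order Q mod P


def h(commit: int, msg: bytes) -> int:
    """Fiat-Shamir challenge derived from the aggregate commitment and message."""
    return int.from_bytes(
        hashlib.sha256(commit.to_bytes(4, "big") + msg).digest(), "big") % Q


def keygen():
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)          # (secret key, public key)


def cosign(keys, msg: bytes):
    """All witnesses co-sign msg; returns (challenge, aggregate response)."""
    nonces = [secrets.randbelow(Q - 1) + 1 for _ in keys]
    commit = 1
    for v in nonces:                # commitment phase: V = prod g^v_i mod P
        commit = commit * pow(G, v, P) % P
    c = h(commit, msg)              # challenge phase: shared challenge
    r = sum((v - c * x) % Q for v, (x, _) in zip(nonces, keys)) % Q
    return c, r                     # response phase, summed into one value


def verify(pubs, msg: bytes, sig) -> bool:
    """One cheap check, however many witnesses cosigned."""
    c, r = sig
    agg_pub = 1
    for X in pubs:
        agg_pub = agg_pub * X % P
    commit = pow(G, r, P) * pow(agg_pub, c, P) % P   # reconstruct V
    return h(commit, msg) == c


keys = [keygen() for _ in range(5)]
sig = cosign(keys, b"authoritative statement S")
assert verify([X for _, X in keys], b"authoritative statement S", sig)
```

    CoSi's contribution is making this round practical at scale: the commitment and response sums are aggregated up a spanning tree of witnesses, so thousands of cosigners add only logarithmic communication depth while the verifier's work stays constant.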