Engineering Crowdsourced Stream Processing Systems
A crowdsourced stream processing system (CSP) is a system that incorporates
crowdsourced tasks in the processing of a data stream. This can be seen as
enabling crowdsourcing work to be applied on a sample of large-scale data at
high speed, or equivalently, enabling stream processing to employ human
intelligence. It also leads to a substantial expansion of the capabilities of
data processing systems. Engineering a CSP system requires the combination of
human and machine computation elements. From a general systems theory
perspective, this means taking into account inherited as well as emerging
properties from both these elements. In this paper, we position CSP systems
within a broader taxonomy, outline a series of design principles and evaluation
metrics, present an extensible framework for their design, and describe several
design patterns. We showcase the capabilities of CSP systems by performing a
case study that applies our proposed framework to the design and analysis of a
real system (AIDR) that classifies social media messages during time-critical
crisis events. Results show that compared to a pure stream processing system,
AIDR can achieve a higher data classification accuracy, while compared to a
pure crowdsourcing solution, the system makes better use of human workers by
requiring substantially less manual work effort.
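The hybrid machine/human pipeline the abstract describes can be sketched minimally: a machine classifier handles confident items at stream speed, while low-confidence items are sampled into a queue for human annotation. Everything here (the threshold value, the toy classifier, the `crowd_label` callback) is an illustrative assumption, not AIDR's actual design.

```python
from collections import deque

CONFIDENCE_THRESHOLD = 0.8  # assumed tuning knob, not a value from the paper

def machine_classify(message):
    """Toy stand-in for a trained classifier: returns (label, confidence)."""
    relevant = "flood" in message.lower()
    return ("relevant" if relevant else "irrelevant", 0.9 if relevant else 0.5)

def process_stream(messages, crowd_label):
    """Route each stream item to the machine path or to a crowd queue."""
    results, crowd_queue = [], deque()
    for msg in messages:
        label, conf = machine_classify(msg)
        if conf >= CONFIDENCE_THRESHOLD:
            results.append((msg, label, "machine"))
        else:
            crowd_queue.append(msg)  # only a sample of the stream needs humans
    # Crowd workers label the uncertain sample (simulated by the callback).
    while crowd_queue:
        msg = crowd_queue.popleft()
        results.append((msg, crowd_label(msg), "crowd"))
    return results
```

The point of the sketch is the division of labor: human effort is spent only on the uncertain fraction of the stream, which is how such a system can beat a pure-crowdsourcing design on effort while beating a pure-machine design on accuracy.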
DataWash : an advanced snowflake data quality tool powered by Snowpark
The increasing need for data accuracy and completeness in today's organizations has highlighted the importance of data quality management. To address this need, DataWash has emerged as an advanced data quality tool powered by Snowpark that provides organizations with a comprehensive solution for improving the quality of their data in Snowflake. The tool provides scheduled batch execution and ad hoc, on-demand analysis capabilities, generating a Power BI report for easy visualization of data quality metrics. The suite of modules provided by DataWash can handle a wide range of data quality issues, such as data duplication, inconsistencies, and compliance with data standards. In essence, this bachelor's thesis aims to develop DataWash as an advanced data quality tool to help organizations improve the accuracy and reliability of their data by exploring its capabilities and cost-effectiveness, evaluating its performance using real-world datasets, and benchmarking it against leading data quality tools on the market.
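The three issue classes the abstract names (duplication, completeness, standards compliance) can each be reduced to a simple metric over records. The following is a generic sketch of such a profiler in plain Python, not DataWash's Snowpark implementation; the field names and the email regex are illustrative assumptions.

```python
import re

def profile_quality(rows, key_fields, email_field):
    """Compute simple data-quality metrics over a list of dict records."""
    n = len(rows)
    # Duplication: share of rows whose key fields repeat an earlier row.
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            dupes += 1
        seen.add(key)
    # Completeness: share of non-empty values across all cells.
    cells = [v for r in rows for v in r.values()]
    complete = sum(v is not None and v != "" for v in cells) / len(cells)
    # Standards compliance: share of values matching an expected format.
    email_ok = sum(
        bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get(email_field) or ""))
        for r in rows
    ) / n
    return {"duplicate_rate": dupes / n,
            "completeness": complete,
            "email_compliance": email_ok}
```

A real tool would push these aggregations down into the warehouse (as Snowpark does) rather than pulling rows into Python, but the metrics themselves are the same.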
The inference of gene trees with species trees
Molecular phylogeny has focused mainly on improving models for the
reconstruction of gene trees based on sequence alignments. Yet, most
phylogeneticists seek to reveal the history of species. Although the histories
of genes and species are tightly linked, they are seldom identical, because
genes duplicate, are lost or horizontally transferred, and because alleles can
co-exist in populations for periods that may span several speciation events.
Building models describing the relationship between gene and species trees can
thus improve the reconstruction of gene trees when a species tree is known, and
vice-versa. Several approaches have been proposed to solve the problem in one
direction or the other, but in general neither gene trees nor species trees are
known. Only a few studies have attempted to jointly infer gene trees and
species trees. In this article we review the various models that have been used
to describe the relationship between gene trees and species trees. These models
account for gene duplication and loss, transfer or incomplete lineage sorting.
Some of them consider several types of events together, but none exists
currently that considers the full repertoire of processes that generate gene
trees along the species tree. Simulations as well as empirical studies on
genomic data show that combining gene tree-species tree models with models of
sequence evolution improves gene tree reconstruction. In turn, these better
gene trees provide a better basis for studying genome evolution or
reconstructing ancestral chromosomes and ancestral gene sequences. We predict
that gene tree-species tree methods that can deal with genomic data sets will
be instrumental to advancing our understanding of genomic evolution.Comment: Review article in relation to the "Mathematical and Computational
Evolutionary Biology" conference, Montpellier, 201
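One of the processes the review lists, incomplete lineage sorting, has a closed form in the simplest case: for three species and a rooted species tree whose internal branch has length T in coalescent units, the gene tree matches the species tree with probability 1 - (2/3)e^(-T), and each of the two discordant topologies has probability (1/3)e^(-T) (a standard multispecies-coalescent result). A small sketch:

```python
import math

def gene_tree_probs(T):
    """Gene-tree topology probabilities for a rooted 3-species tree under
    the multispecies coalescent; T is the internal branch length measured
    in coalescent units."""
    discordant = math.exp(-T) / 3.0      # each of the two minority topologies
    concordant = 1.0 - 2.0 * discordant  # topology matching the species tree
    return concordant, discordant
```

At T = 0 all three topologies are equally likely (the species tree carries no signal), while for long internal branches discordance vanishes, which is why short, deep branches are where joint gene tree/species tree models pay off most.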
RDF-TR: Exploiting structural redundancies to boost RDF compression
The number and volume of semantic data have grown impressively over the last decade, promoting compression as an essential tool for RDF preservation, sharing and management. In contrast to universal compressors, RDF compression techniques are able to detect and exploit specific forms of redundancy in RDF data. Thus, state-of-the-art RDF compressors excel at exploiting syntactic and semantic redundancies, i.e., repetitions in the serialization format and information that can be inferred implicitly. However, little attention has been paid to the existence of structural patterns within the RDF dataset; i.e. structural redundancy. In this paper, we analyze structural regularities in real-world datasets, and show three schema-based sources of redundancies that underpin the schema-relaxed nature of RDF. Then, we propose RDF-Tr (RDF Triples Reorganizer), a preprocessing technique that discovers and removes this kind of redundancy before the RDF dataset is effectively compressed. In particular, RDF-Tr groups subjects that are described by the same predicates, and locally re-codes the objects related to these predicates. Finally, we integrate
RDF-Tr with two RDF compressors, HDT and k2-triples. Our experiments show that using RDF-Tr with these compressors improves their effectiveness by up to 2.3 times, outperforming the most prominent state-of-the-art techniques.
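The core idea, grouping subjects that share a predicate set and re-coding their objects with small local identifiers, can be sketched as follows. This is a simplified illustration of the structural grouping RDF-Tr exploits, not the actual RDF-Tr encoding or its integration with HDT or k2-triples.

```python
from collections import defaultdict

def group_by_predicate_signature(triples):
    """Group subjects described by the same predicate set: the structural
    regularity that RDF-Tr discovers before compression."""
    by_subject = defaultdict(dict)
    for s, p, o in triples:
        by_subject[s].setdefault(p, []).append(o)
    families = defaultdict(list)
    for s, po in by_subject.items():
        signature = tuple(sorted(po))  # the subject's predicate set
        families[signature].append((s, po))
    return families

def local_object_ids(family):
    """Within one family, re-code objects per predicate with small local
    ids, so a downstream compressor sees short, repetitive id sequences."""
    dictionaries = defaultdict(dict)  # predicate -> object -> local id
    encoded = []
    for s, po in family:
        row = {}
        for p, objects in po.items():
            d = dictionaries[p]
            row[p] = [d.setdefault(o, len(d)) for o in objects]
        encoded.append((s, row))
    return dictionaries, encoded
```

Subjects in the same family then differ only in their object ids, which is exactly the kind of repetition a syntactic compressor can exploit.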
Keeping Authorities "Honest or Bust" with Decentralized Witness Cosigning
The secret keys of critical network authorities - such as time, name,
certificate, and software update services - represent high-value targets for
hackers, criminals, and spy agencies wishing to use these keys secretly to
compromise other hosts. To protect authorities and their clients proactively
from undetected exploits and misuse, we introduce CoSi, a scalable witness
cosigning protocol ensuring that every authoritative statement is validated and
publicly logged by a diverse group of witnesses before any client will accept
it. A statement S collectively signed by W witnesses assures clients that S has
been seen, and not immediately found erroneous, by those W observers. Even if S
is compromised in a fashion not readily detectable by the witnesses, CoSi still
guarantees S's exposure to public scrutiny, forcing secrecy-minded attackers to
risk that the compromise will soon be detected by one of the W witnesses.
Because clients can verify collective signatures efficiently without
communication, CoSi protects clients' privacy, and offers the first
transparency mechanism effective against persistent man-in-the-middle attackers
who control a victim's Internet access, the authority's secret key, and several
witnesses' secret keys. CoSi builds on existing cryptographic multisignature
methods, scaling them to support thousands of witnesses via signature
aggregation over efficient communication trees. A working prototype
demonstrates CoSi in the context of timestamping and logging authorities,
enabling groups of over 8,000 distributed witnesses to cosign authoritative
statements in under two seconds.
Comment: 20 pages, 7 figures
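The client-side contract described above, that a statement is accepted only if enough distinct witnesses validated it, can be sketched with toy signatures. CoSi actually uses Schnorr-style aggregate signatures collected over a communication tree, so one short signature verifies all witnesses at once; the flat HMAC loop below is only an illustrative stand-in for that machinery.

```python
import hmac
import hashlib

def sign(key, statement):
    """Toy 'signature': an HMAC tag. Real CoSi uses aggregate Schnorr
    signatures, which clients verify without per-witness work."""
    return hmac.new(key, statement, hashlib.sha256).digest()

def cosign(statement, witness_keys):
    """Collect one cosignature per witness (flat loop; CoSi fans this
    out over a tree to scale to thousands of witnesses)."""
    return {i: sign(k, statement) for i, k in enumerate(witness_keys)}

def client_accepts(statement, cosignatures, witness_keys, threshold):
    """Accept only if at least `threshold` distinct witnesses validated
    the statement, so a secretly compromised authority cannot act alone."""
    valid = sum(
        hmac.compare_digest(sig, sign(witness_keys[i], statement))
        for i, sig in cosignatures.items()
    )
    return valid >= threshold
```

Even in this toy form the security argument is visible: an attacker holding the authority's key must also forge or suppress most witness cosignatures, and any statement that does circulate has been seen, and logged, by the witnesses who signed it.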