5,809 research outputs found
Data generator for evaluating ETL process quality
Obtaining the right set of data for evaluating the fulfillment of different quality factors in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to different privacy constraints, while manually providing a synthetic set of data is known as a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, having a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing the plethora of possible test cases. To facilitate such demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable to enable end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework and here we report our experimental findings showing the effectiveness and scalability of our approach.Peer ReviewedPostprint (author's final draft
IIFA: Modular Inter-app Intent Information Flow Analysis of Android Applications
Android apps cooperate through message passing via intents. However, when
apps do not have identical sets of privileges inter-app communication (IAC) can
accidentally or maliciously be misused, e.g., to leak sensitive information
contrary to users expectations. Recent research considered static program
analysis to detect dangerous data leaks due to inter-component communication
(ICC) or IAC, but suffers from shortcomings with respect to precision,
soundness, and scalability. To solve these issues we propose a novel approach
for static ICC/IAC analysis. We perform a fixed-point iteration of ICC/IAC
summary information to precisely resolve intent communication with more than
two apps involved. We integrate these results with information flows generated
by a baseline (i.e. not considering intents) information flow analysis, and
resolve if sensitive data is flowing (transitively) through components/apps in
order to be ultimately leaked. Our main contribution is the first fully
automatic sound and precise ICC/IAC information flow analysis that is scalable
for realistic apps due to modularity, avoiding combinatorial explosion: Our
approach determines communicating apps using short summaries rather than
inlining intent calls, which often requires simultaneously analyzing all tuples
of apps. We evaluated our tool IIFA in terms of scalability, precision, and
recall. Using benchmarks we establish that precision and recall of our
algorithm are considerably better than prominent state-of-the-art analyses for
IAC. But foremost, applied to the 90 most popular applications from the Google
Playstore, IIFA demonstrated its scalability to a large corpus of real-world
apps. IIFA reports 62 problematic ICC-/IAC-related information flows via two or
more apps/components
Efficient Subgraph Matching on Billion Node Graphs
The ability to handle large scale graph data is crucial to an increasing
number of applications. Much work has been dedicated to supporting basic graph
operations such as subgraph matching, reachability, regular expression
matching, etc. In many cases, graph indices are employed to speed up query
processing. Typically, most indices require either super-linear indexing time
or super-linear indexing space. Unfortunately, for very large graphs,
super-linear approaches are almost always infeasible. In this paper, we study
the problem of subgraph matching on billion-node graphs. We present a novel
algorithm that supports efficient subgraph matching for graphs deployed on a
distributed memory store. Instead of relying on super-linear indices, we use
efficient graph exploration and massive parallel computing for query
processing. Our experimental results demonstrate the feasibility of performing
subgraph matching on web-scale graph data.Comment: VLDB201
Tupleware: Redefining Modern Analytics
There is a fundamental discrepancy between the targeted and actual users of
current analytics frameworks. Most systems are designed for the data and
infrastructure of the Googles and Facebooks of the world---petabytes of data
distributed across large cloud deployments consisting of thousands of cheap
commodity machines. Yet, the vast majority of users operate clusters ranging
from a few to a few dozen nodes, analyze relatively small datasets of up to a
few terabytes, and perform primarily compute-intensive operations. Targeting
these users fundamentally changes the way we should build analytics systems.
This paper describes the design of Tupleware, a new system specifically aimed
at the challenges faced by the typical user. Tupleware's architecture brings
together ideas from the database, compiler, and programming languages
communities to create a powerful end-to-end solution for data analysis. We
propose novel techniques that consider the data, computations, and hardware
together to achieve maximum performance on a case-by-case basis. Our
experimental evaluation quantifies the impact of our novel techniques and shows
orders of magnitude performance improvement over alternative systems
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
New results on rewrite-based satisfiability procedures
Program analysis and verification require decision procedures to reason on
theories of data structures. Many problems can be reduced to the satisfiability
of sets of ground literals in theory T. If a sound and complete inference
system for first-order logic is guaranteed to terminate on T-satisfiability
problems, any theorem-proving strategy with that system and a fair search plan
is a T-satisfiability procedure. We prove termination of a rewrite-based
first-order engine on the theories of records, integer offsets, integer offsets
modulo and lists. We give a modularity theorem stating sufficient conditions
for termination on a combinations of theories, given termination on each. The
above theories, as well as others, satisfy these conditions. We introduce
several sets of benchmarks on these theories and their combinations, including
both parametric synthetic benchmarks to test scalability, and real-world
problems to test performances on huge sets of literals. We compare the
rewrite-based theorem prover E with the validity checkers CVC and CVC Lite.
Contrary to the folklore that a general-purpose prover cannot compete with
reasoners with built-in theories, the experiments are overall favorable to the
theorem prover, showing that not only the rewriting approach is elegant and
conceptually simple, but has important practical implications.Comment: To appear in the ACM Transactions on Computational Logic, 49 page
- …