PerfXplain: Debugging MapReduce Job Performance
While users today have access to many tools that assist in performing large-scale
data analysis tasks, understanding the performance characteristics of
their parallel computations, such as MapReduce jobs, remains difficult. We
present PerfXplain, a system that enables users to ask questions about the
relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain
provides a new query language for articulating performance queries and an
algorithm for generating explanations from a log of past MapReduce job
executions. We formally define the notion of an explanation together with three
metrics, relevance, precision, and generality, that measure explanation
quality. We present the explanation-generation algorithm based on techniques
related to decision-tree building. We evaluate the approach on a log of past
executions on Amazon EC2, and show that our approach can generate quality
explanations, outperforming two naive explanation-generation methods.
Comment: VLDB201
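The decision-tree connection can be illustrated with a toy sketch. This is not PerfXplain's actual algorithm — the job log, attribute names, and single-predicate explanation below are invented for illustration — but it shows the core idea the abstract describes: given a log of job pairs labeled by relative runtime, pick the configuration predicate with the highest information gain, exactly as a decision-tree learner does at its root split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label multiset."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_predicate(log, attrs):
    """Pick the (attribute, value) predicate with the highest information
    gain over the 'slower' label -- the root split of a decision tree."""
    base = entropy([r["slower"] for r in log])
    best, best_gain = None, -1.0
    for a in attrs:
        for v in {r[a] for r in log}:
            yes = [r["slower"] for r in log if r[a] == v]
            no = [r["slower"] for r in log if r[a] != v]
            gain = base - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(log)
            if gain > best_gain:
                best, best_gain = (a, v), gain
    return best

# Invented log of job pairs: each row records how the two configurations
# differed and whether the second job of the pair ran slower.
log = [
    {"fewer_reducers": True,  "larger_input": False, "slower": True},
    {"fewer_reducers": True,  "larger_input": True,  "slower": True},
    {"fewer_reducers": False, "larger_input": True,  "slower": False},
    {"fewer_reducers": False, "larger_input": False, "slower": False},
]
attr, value = best_predicate(log, ["fewer_reducers", "larger_input"])
print(attr)  # fewer_reducers: the difference that best explains the slowdowns
```

In the full system, repeated splits of this kind yield compound explanations, which the paper's relevance, precision, and generality metrics then score.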
Believe It or Not: Adding Belief Annotations to Databases
We propose a database model that allows users to annotate data with belief
statements. Our motivation comes from scientific database applications where a
community of users is working together to assemble, revise, and curate a shared
data repository. As the community accumulates knowledge and the database
content evolves over time, it may contain conflicting information and members
can disagree on the information it should store. For example, Alice may believe
that a tuple should be in the database, whereas Bob disagrees. He may also
insert the reason why he thinks Alice believes the tuple should be in the
database, and explain what he thinks the correct tuple should be instead.
We propose a formal model for Belief Databases that interprets users'
annotations as belief statements. These annotations can refer both to the base
data and to other annotations. We give a formal semantics based on a fragment
of multi-agent epistemic logic and define a query language over belief
databases. We then prove a key technical result, stating that every belief
database can be encoded as a canonical Kripke structure. We use this structure
to describe a relational representation of belief databases, and give an
algorithm for translating queries over the belief database into standard
relational queries. Finally, we report early experimental results with our
prototype implementation on synthetic data.
Comment: 17 pages, 10 figures
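A minimal sketch, with invented agents and facts, can make the annotation model concrete: a belief statement attaches either to a base tuple or to another belief statement, so nested beliefs such as "Bob believes that Alice believes t" arise naturally, and storing beliefs as plain tuples hints at why queries can be translated into standard relational queries.

```python
# A belief is a pair (agent, subject); the subject is either a base tuple
# or another belief, so annotations can refer to other annotations.
fact = ("gene42", "chr7")             # an invented base tuple

belief_db = {
    ("alice", fact),                  # Alice believes the tuple holds
    ("bob", ("alice", fact)),         # Bob believes that Alice believes it
    ("bob", ("gene42", "chr9")),      # Bob's own, conflicting, belief
}

def believes(agent, subject, db):
    return (agent, subject) in db

def beliefs_of(agent, db):
    """All statements (facts or nested beliefs) the agent holds."""
    return [s for a, s in db if a == agent]

assert believes("bob", ("alice", fact), belief_db)      # a nested belief
assert ("gene42", "chr9") in beliefs_of("bob", belief_db)
```

The real model is far richer — the paper grounds these annotations in a fragment of multi-agent epistemic logic and a canonical Kripke structure — but the flat relational encoding above reflects the representation the translation algorithm targets.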
QueryVis: Logic-based diagrams help users understand complicated SQL queries faster
Understanding the meaning of existing SQL queries is critical for code
maintenance and reuse. Yet SQL can be hard to read, even for expert users or
the original creator of a query. We conjecture that it is possible to capture
the logical intent of queries in \emph{automatically-generated visual diagrams}
that can help users understand the meaning of queries faster and more
accurately than SQL text alone. We present initial steps in that direction with
visual diagrams that are based on the first-order logic foundation of SQL and
can capture the meaning of deeply nested queries. Our diagrams build upon a
rich history of diagrammatic reasoning systems in logic and were designed using
a large body of human-computer interaction best practices: they are
\emph{minimal} in that no visual element is superfluous; they are
\emph{unambiguous} in that no two queries with different semantics map to the
same visualization; and they \emph{extend} previously existing visual
representations of relational schemata and conjunctive queries in a natural
way. An experimental evaluation involving 42 users on Amazon Mechanical Turk
shows that with only a 2--3 minute static tutorial, participants could
interpret queries meaningfully faster with our diagrams than when reading SQL
alone. Moreover, we have evidence that our visual diagrams result in
participants making fewer errors than with SQL. We believe that more regular
exposure to diagrammatic representations of SQL can give rise to a
\emph{pattern-based} and thus more intuitive use and re-use of SQL. All details
on the experimental study, the evaluation stimuli, raw data, analyses, and
source code are available at https://osf.io/mycr2
Comment: Full version of paper appearing in SIGMOD 202
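To make the "first-order logic foundation" concrete, here is a small invented sketch (not QueryVis itself — the class names and rendering are illustrative only) of the kind of logical intermediate form a diagram could be generated from: a nested NOT EXISTS query represented as a structure rather than as SQL text.

```python
from dataclasses import dataclass

@dataclass
class Quantified:
    exists: bool      # an EXISTS subquery, or its negation (NOT EXISTS)
    table: str
    condition: str

def to_formula(out_table, clauses):
    """Render a query's logical skeleton as a set-builder formula."""
    parts = [("∃" if c.exists else "¬∃") + f"{c.table}({c.condition})"
             for c in clauses]
    return f"{{ t ∈ {out_table} | " + " ∧ ".join(parts) + " }"

# "Sailors with no reservation":
#   SELECT * FROM Sailor s
#   WHERE NOT EXISTS (SELECT * FROM Reserves r WHERE r.sid = s.sid)
formula = to_formula("Sailor", [Quantified(False, "Reserves", "r.sid = s.sid")])
print(formula)  # { t ∈ Sailor | ¬∃Reserves(r.sid = s.sid) }
```

A diagram generator can lay out such a structure spatially instead of printing it, which is where the minimality and unambiguity properties described above come in.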
Leveraging Usage History to Enhance Database Usability
Thesis (Ph.D.)--University of Washington, 2012

More so than ever before, large datasets are being collected and analyzed throughout a variety of disciplines. Examples include social networking data, software logs, scientific data, web clickstreams, sensor network data, and more. As such, there is a wide range of users interacting with these large datasets, ranging from scientists, to data analysts, to sociologists, to market researchers. These users are experts in their domain and understand their data extensively, but are not database experts. Database systems are scalable and efficient, but are notoriously difficult to use. In this work, we aim to address this challenge by leveraging usage history. From usage history, we can extract knowledge about the multitude of users' experiences with the database. Consequently, this knowledge allows us to build smarter systems that better cater to the users' needs. We address different aspects of the database usability problem and develop three complementary systems.

First, we aim to ease the query formulation process. We build the SnipSuggest system, an autocompletion tool for SQL queries. It provides on-the-go, context-aware assistance in the query composition process.

The second challenge we address is that of query debugging. Query debugging is a painful process, in part because executing queries directly over a large database is slow, while manually creating small test databases is burdensome to users. We present the second contribution of this dissertation: SIQ (Sample-based Interactive Querying). SIQ is a system for automatically selecting a 'good' small sample of the underlying input database to allow queries to execute in real time, thus supporting interactive query debugging.

Third, once a user has successfully constructed the right query, they must execute it. However, executing and understanding the performance of a query on a large-scale, parallel database system can be difficult even for experts.
Our third contribution, PerfXplain, is a tool for explaining the performance of a MapReduce job running on a shared-nothing cluster. Namely, it aims to answer the question of why one job was slower than another. PerfXplain analyzes the MapReduce log files from past runs to better understand the correlation between the properties of pairs of jobs and their relative runtimes.
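SnipSuggest's context-aware ranking can be sketched roughly as follows. The query log, snippet strings, and scoring below are invented for illustration, and the real system is considerably richer, but the core intuition is this: treat each past query as a set of snippets, and rank candidate additions by their conditional probability given the snippets already present in the partial query.

```python
def suggest(partial, query_log, k=3):
    """Rank candidate snippets by P(snippet | partial ⊆ past query)."""
    matching = [q for q in query_log if partial <= q]
    counts = {}
    for q in matching:
        for s in q - partial:
            counts[s] = counts.get(s, 0) + 1
    return sorted(((s, c / len(matching)) for s, c in counts.items()),
                  key=lambda sc: -sc[1])[:k]

# Invented log: each past query reduced to its set of snippets.
query_log = [
    {"FROM Experiments", "WHERE quality > 0.9", "GROUP BY run_id"},
    {"FROM Experiments", "WHERE quality > 0.9"},
    {"FROM Experiments", "GROUP BY run_id"},
    {"FROM Runs"},
]
for snippet, p in suggest({"FROM Experiments"}, query_log):
    print(f"{snippet}: {p:.2f}")
```

Because the context is the set of snippets already typed, the suggestions change as the query grows — the "on-the-go" assistance described above.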
Towards correcting input data errors probabilistically using integrity constraints
Mobile and pervasive applications frequently rely on devices such as RFID antennas or sensors (light, temperature, motion) to provide them with information about the physical world. These devices, however, are unreliable. They produce streams of information where portions of data may be missing, duplicated, or erroneous. The current state of the art is to correct errors locally (e.g., range constraints for temperature readings) or to use spatial/temporal correlations (e.g., smoothing temperature readings). However, errors are often apparent only in a global setting, e.g., missed readings of objects that are known to be present, or exit readings from a parking garage without matching entry readings. In this paper, we present StreamClean, a system for correcting input data errors automatically using application-defined global integrity constraints. Because it is frequently impossible to make corrections with certainty, we propose a probabilistic approach, where the system assigns to each input tuple the probability that it is correct. We show that StreamClean handles a large class of input data errors and corrects them sufficiently fast to keep up with the input rates of many mobile and pervasive applications. We also show that the probabilities assigned by StreamClean correspond to a user's intuitive notion of correctness.
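As a rough illustration of the idea — not StreamClean's actual probability assignment, and with an invented reliability prior and garage scenario — a global constraint such as "every exit reading needs a matching entry" lets a system down-weight readings that no local range check could ever flag.

```python
RELIABILITY = 0.95  # assumed prior that any single reading is correct

def assign_probabilities(readings):
    """Tag each (kind, car) reading with a correctness probability:
    an 'exit' with no earlier 'entry' violates the global constraint."""
    inside = set()
    annotated = []
    for kind, car in readings:
        if kind == "entry":
            inside.add(car)
            annotated.append((kind, car, RELIABILITY))
        else:
            ok = car in inside
            annotated.append((kind, car, RELIABILITY if ok else 1 - RELIABILITY))
            inside.discard(car)
    return annotated

stream = [("entry", "car1"), ("exit", "car1"), ("exit", "car2")]
for kind, car, p in assign_probabilities(stream):
    print(kind, car, round(p, 2))
```

The unmatched exit for car2 ends up with a low probability, even though in isolation it looks like a perfectly valid reading.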
Probabilistic Event Extraction from RFID Data
We present PEEX, a system that enables applications to define and extract meaningful probabilistic high-level events from RFID data. PEEX effectively copes with errors in the data and the inherent ambiguity of event extraction.
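The probabilistic flavor can be illustrated with a toy calculation — the events, probabilities, and independence assumptions below are invented, and PEEX's event language and inference are more general — showing how noisy low-level read probabilities combine into a probability for a high-level event.

```python
def prob_all(ps):
    """P(composite event), assuming independent constituent events."""
    p = 1.0
    for x in ps:
        p *= x
    return p

def prob_any(ps):
    """Noisy-or: P(tag seen at all), given several independent read attempts."""
    miss = 1.0
    for x in ps:
        miss *= 1.0 - x
    return 1.0 - miss

# Two antennas each detect Alice's badge at the meeting-room door with
# probability 0.6; Bob's badge is read with probability 0.8.
p_alice = prob_any([0.6, 0.6])        # ≈ 0.84: either antenna suffices
p_meeting = prob_all([p_alice, 0.8])  # high-level event: "both entered the room"
print(round(p_meeting, 3))
```

Rather than discarding uncertain readings, the composite event carries an explicit probability that downstream applications can threshold or rank on.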