Web-based multi-party computation with application to anonymous aggregate compensation analytics
We describe the definition, design, implementation, and deployment of a multi-party computation protocol and supporting web-based infrastructure. The protocol and infrastructure constitute a software application that allows groups of cooperating parties, such as companies or other organizations, to collect aggregate data for statistical analysis without revealing the data of individual participants. The application was developed specifically to support a Boston Women's Workforce Council (BWWC) study of the gender wage gap among employers within the Greater Boston Area. The application was deployed successfully to collect aggregate statistical data pertaining to compensation levels across genders and demographics at a number of participating organizations.
We would like to acknowledge all the members of the Boston Women's Workforce Council (BWWC), and to thank in particular Christina M. Knowles and Katie A. Johnston, who led the effort to organize participants and deploy the protocol as part of the 100% Talent: The Boston Women's Compact effort [1, 2]. We would also like to acknowledge the Boston University Initiative on Cities, and in particular Executive Director Katherine Lusk, who brought this potential application of secure multi-party computation to our attention. Both the BWWC and the Initiative on Cities contributed funding to complete this work. We would also like to acknowledge the Hariri Institute at Boston University for contributing research and software development resources. Support was also provided in part by Smart-city Cloud-based Open Platform and Ecosystem (SCOPE), an NSF Division of Industrial Innovation and Partnerships PFI:BIC project under award #1430145, and by Modular Approach to Cloud Security (MACS), an NSF CISE CNS SaTC Frontier project under award #1414119.
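A minimal sketch of how an aggregate can be computed without exposing individual inputs, assuming additive secret sharing over a prime modulus. The abstract does not describe the protocol's construction, so the technique, the modulus, the three-way split, the sample figures, and the names `share` and `aggregate` are illustrative assumptions, not the deployed BWWC protocol.

```python
import secrets

# Assumed modulus for illustration: a Mersenne prime, far larger than any total.
MODULUS = 2**61 - 1


def share(value, n_shares, modulus=MODULUS):
    """Split value into n_shares random-looking pieces that sum to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_shares - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def aggregate(all_shares, modulus=MODULUS):
    """Sum share j of every participant separately, then combine the partial sums.

    all_shares[i][j] is share j of participant i. Whoever holds column j sees
    only uniformly random-looking numbers, never an individual input.
    """
    n_columns = len(all_shares[0])
    partial_sums = [
        sum(participant[j] for participant in all_shares) % modulus
        for j in range(n_columns)
    ]
    return sum(partial_sums) % modulus


# Hypothetical inputs, one per participating organization.
salaries = [52000, 61000, 58500]
all_shares = [share(s, 3) for s in salaries]
total = aggregate(all_shares)
```

Only the combined total is reconstructed. Any single share, or any single column of shares, reveals nothing about an individual input on its own.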
Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structure-aware algorithms to solve them.
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster than its competitors
whenever they do not run out of memory, exceed a 24-hour timeout, or encounter
internal design limitations.
Comment: 61 pages, 9 figures, 2 tables
Fault-Tolerant Aggregation: Flow-Updating Meets Mass-Distribution
Flow-Updating (FU) is a fault-tolerant technique that has proved to be
efficient in practice for the distributed computation of aggregate functions in
communication networks where individual processors do not have access to global
information. Previous distributed aggregation protocols, based on repeated
sharing of input values (or mass) among processors, sometimes called
Mass-Distribution (MD) protocols, are not resilient to communication failures
(or message loss) because such failures yield a loss of mass. In this paper, we
present a protocol which we call Mass-Distribution with Flow-Updating (MDFU).
We obtain MDFU by applying FU techniques to classic MD. We analyze the
convergence time of MDFU showing that stochastic message loss produces low
overhead. This is the first convergence proof of an FU-based algorithm. We
evaluate MDFU experimentally, comparing it with previous MD and FU protocols,
and verifying the behavior predicted by the analysis. Finally, given that MDFU
incurs a fixed deviation proportional to the message-loss rate, we adjust the
accuracy of MDFU heuristically in a new protocol called MDFU with Linear
Prediction (MDFU-LP). The evaluation shows that both MDFU and MDFU-LP behave
very well in practice, even under high rates of message loss and even when
the input values change dynamically.
Comment: 18 pages, 5 figures, To appear in OPODIS 201
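The mass-loss problem mentioned in the abstract can be illustrated with a small simulation. This is a sketch of a classic push-sum style Mass-Distribution round, not of MDFU or Flow-Updating. The function name `mass_distribution`, the random choice of recipient, and the loss model (each message dropped independently with probability `loss_rate`) are assumptions made for illustration.

```python
import random


def mass_distribution(values, rounds, loss_rate, seed=0):
    """Simulate push-sum style averaging with independent message loss.

    Each node holds a value mass s and a weight mass w. Every round it keeps
    half of both and sends the other half to a randomly chosen node. A dropped
    message destroys the mass it carried.
    """
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        inbox_s = [0.0] * n
        inbox_w = [0.0] * n
        for i in range(n):
            j = rng.randrange(n)              # recipient (may be i itself)
            half_s, half_w = s[i] / 2, w[i] / 2
            s[i], w[i] = half_s, half_w       # keep one half locally
            if rng.random() >= loss_rate:     # message delivered
                inbox_s[j] += half_s
                inbox_w[j] += half_w
            # otherwise the message, and the mass in it, is lost
        for i in range(n):
            s[i] += inbox_s[i]
            w[i] += inbox_w[i]
    return s, w


values = [10, 20, 30, 40, 50, 60, 70, 80]
s_ok, w_ok = mass_distribution(values, rounds=30, loss_rate=0.0)
s_lossy, w_lossy = mass_distribution(values, rounds=30, loss_rate=0.3)
estimates_ok = [si / wi for si, wi in zip(s_ok, w_ok)]
```

Without loss, the total of `s` stays equal to the sum of the inputs, which is what allows each local ratio `s[i] / w[i]` to approach the true average. With loss, the totals shrink every time a message is dropped, so that guarantee no longer holds.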
PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation
Online aggregation provides estimates of the final result of a computation
during the actual processing. The user can stop the computation as soon as the
estimate is accurate enough, typically early in the execution. This allows for
the interactive data exploration of the largest datasets. In this paper we
introduce the first framework for parallel online aggregation in which the
estimation virtually does not incur any overhead on top of the actual
execution. We define a generic interface to express any estimation model that
abstracts completely the execution details. We design a novel estimator
specifically targeted at parallel online aggregation. When executed by the
framework over a massive TPC-H instance, the estimator provides
accurate confidence bounds early in the execution even when the cardinality of
the final result is seven orders of magnitude smaller than the dataset size and
without incurring overhead.
Comment: 36 pages
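To make the idea of an estimate with confidence bounds concrete, here is a minimal single-node sketch of online aggregation for an average. It is not the parallel estimator of the paper. It assumes the rows arrive in random order and uses a normal-approximation interval; the name `online_mean`, the reporting interval, and the synthetic data are hypothetical.

```python
import math
import random


def online_mean(stream, z=1.96, report_every=1000):
    """Yield (rows_seen, estimate, half_width) while scanning the stream.

    Uses Welford's running mean and variance, so no rows are stored. The
    interval estimate +/- half_width is a normal approximation that assumes
    the rows seen so far are a random sample of the whole dataset.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1 and n % report_every == 0:
            sample_variance = m2 / (n - 1)
            half_width = z * math.sqrt(sample_variance / n)
            yield n, mean, half_width


# Synthetic data standing in for a column being aggregated.
rng = random.Random(1)
data = [rng.gauss(100.0, 15.0) for _ in range(10_000)]
reports = list(online_mean(data))
```

A caller can stop consuming the generator as soon as `half_width` is small enough for its purposes, which is the early-termination behavior the abstract describes.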
Flaw Selection Strategies for Partial-Order Planning
Several recent studies have compared the relative efficiency of alternative
flaw selection strategies for partial-order causal link (POCL) planning. We
review this literature, and present new experimental results that generalize
the earlier work and explain some of the discrepancies in it. In particular, we
describe the Least-Cost Flaw Repair (LCFR) strategy developed and analyzed by
Joslin and Pollack (1994), and compare it with other strategies, including
Gerevini and Schubert's (1996) ZLIFO strategy. LCFR and ZLIFO make very
different, and apparently conflicting, claims about the most effective way to
reduce search-space size in POCL planning. We resolve this conflict, arguing
that much of the benefit that Gerevini and Schubert ascribe to the LIFO
component of their ZLIFO strategy is better attributed to other causes. We show
that for many problems, a strategy that combines least-cost flaw selection with
the delay of separable threats will be effective in reducing search-space size,
and will do so without excessive computational overhead. Although such a
strategy thus provides a good default, we also show that certain domain
characteristics may reduce its effectiveness.
Comment: See http://www.jair.org/ for an online appendix and other files
accompanying this article.