Incorporating Road Networks into Territory Design
Given a set of basic areas, the territory design problem asks to create a
predefined number of territories, each containing at least one basic area, such
that an objective function is optimized. Desired properties of territories
often include a reasonable balance, compact form, contiguity and small average
journey times, which are usually encoded in the objective function or formulated
as constraints. We address the territory design problem by developing graph
theoretic models that also consider the underlying road network. The derived
graph models enable us to tackle the territory design problem by modifying
graph partitioning algorithms and mixed integer programming formulations so
that the objective of the planning problem is taken into account. We test and
compare the algorithms on several real-world instances.
Efficient regularized isotonic regression with application to gene--gene interaction search
Isotonic regression is a nonparametric approach for fitting monotonic models
to data that has been widely studied from both theoretical and practical
perspectives. However, this approach encounters computational and statistical
overfitting issues in higher dimensions. To address both concerns, we present
an algorithm, which we term Isotonic Recursive Partitioning (IRP), for isotonic
regression based on recursively partitioning the covariate space through
solution of progressively smaller "best cut" subproblems. This creates a
regularized sequence of isotonic models of increasing model complexity that
converges to the global isotonic regression solution. The models along the
sequence are often more accurate than the unregularized isotonic regression
model because of the complexity control they offer. We quantify this complexity
control through estimation of degrees of freedom along the path. Success of the
regularized models in prediction and IRP's favorable computational properties
are demonstrated through a series of simulated and real data experiments. We
discuss application of IRP to the problem of searching for gene--gene
interactions and epistasis, and demonstrate it on data from genome-wide
association studies of three common diseases.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS504 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
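As background for the abstract above, the one-dimensional isotonic fit can be sketched with the classic pool-adjacent-violators procedure. This is not the IRP algorithm the authors describe, which targets higher-dimensional covariate spaces; the function name `pava` and its signature are illustrative assumptions.

```python
def pava(y, w=None):
    """Least-squares non-decreasing fit to y (1-D), via pool-adjacent-violators.

    Illustrative sketch only: this is the textbook 1-D method, not IRP.
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block holds [weighted mean, total weight, number of points pooled].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool neighbouring blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand each pooled block back to one fitted value per input point.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit
```

Calling `pava([1, 3, 2, 4])` pools the violating pair 3, 2 into their mean and leaves the other points unchanged.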
Improving Table Compression with Combinatorial Optimization
We study the problem of compressing massive tables within the
partition-training paradigm introduced by Buchsbaum et al. [SODA'00], in which
a table is partitioned by an off-line training procedure into disjoint
intervals of columns, each of which is compressed separately by a standard,
on-line compressor like gzip. We provide a new theory that unifies previous
experimental observations on partitioning and heuristic observations on column
permutation, all of which are used to improve compression rates. Based on the
theory, we devise the first on-line training algorithms for table compression,
which can be applied to individual files, not just continuously operating
sources; and also a new, off-line training algorithm, based on a link to the
asymmetric traveling salesman problem, which improves on prior work by
rearranging columns prior to partitioning. We demonstrate these results
experimentally. On various test files, the on-line algorithms provide 35-55%
improvement over gzip with negligible slowdown; the off-line reordering
provides up to 20% further improvement over partitioning alone. We also show
that a variation of the table compression problem is MAX-SNP hard.
Comment: 22 pages, 2 figures, 5 tables, 23 references. Extended abstract
appears in Proc. 13th ACM-SIAM SODA, pp. 213-222, 200
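The partitioning idea in the abstract above, compressing disjoint intervals of columns separately, can be sketched as follows. This is a minimal illustration under stated assumptions: the table is a list of fixed-width byte rows, the cut points in `boundaries` are supplied by the caller rather than learned by any training procedure, and Python's `zlib` (DEFLATE) stands in for gzip. The function names are hypothetical and not taken from the paper.

```python
import zlib


def compress_by_column_intervals(rows, boundaries):
    """Compress each column interval [lo, hi) of a fixed-width table separately.

    rows: list of bytes objects, all of the same length (assumed).
    boundaries: increasing cut points, e.g. [0, 10, 12] (assumed given).
    """
    pieces = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        # Gather the interval's bytes from every row, then compress them together.
        column_block = b"".join(row[lo:hi] for row in rows)
        pieces.append(zlib.compress(column_block))
    return pieces


def decompress_column_intervals(pieces, boundaries, n_rows):
    """Invert compress_by_column_intervals and rebuild the original rows."""
    blocks = [zlib.decompress(piece) for piece in pieces]
    widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
    rows = []
    for r in range(n_rows):
        parts = [block[r * width:(r + 1) * width]
                 for block, width in zip(blocks, widths)]
        rows.append(b"".join(parts))
    return rows
```

Choosing the cut points and the column order is the part the paper addresses; the sketch only shows what a given partition does once it has been chosen.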
An intelligent assistant for exploratory data analysis
In this paper we present an account of the main features of SNOUT, an intelligent assistant for exploratory data analysis (EDA) of social science survey data that incorporates a range of data mining techniques. EDA has much in common with existing data mining techniques: its main objective is to help an investigator reach an understanding of the important relationships in a data set rather than simply develop predictive models for selected variables. Brief descriptions of a number of novel techniques developed for use in SNOUT are presented. These include heuristic variable level inference and classification, automatic category formation, the use of similarity trees to identify groups of related variables, interactive decision tree construction and model selection using a genetic algorithm.