Darwinian Data Structure Selection
Data structure selection and tuning is laborious but can vastly improve an
application's performance and memory footprint. Some data structures share a
common interface and enjoy multiple implementations. We call them Darwinian
Data Structures (DDS), since we can subject their implementations to survival
of the fittest. We introduce ARTEMIS, a multi-objective, cloud-based,
search-based optimisation framework that automatically finds an optimal, tuned DDS
modulo a test suite, then changes the application to use that DDS. ARTEMIS
achieves substantial performance improvements for \emph{every} project in a
corpus of Java projects drawn from the DaCapo benchmark, from popular GitHub
projects, and from uniformly sampled GitHub projects. For execution time, CPU usage, and memory
consumption, ARTEMIS finds at least one solution that improves \emph{all}
measures for () of the projects. The median improvement across
the best solutions is , , for runtime, memory and CPU
usage.
These aggregate results understate ARTEMIS's potential impact. Some of the
benchmarks it improves are libraries or utility functions. Two examples are
gson, a ubiquitous Java serialization framework, and xalan, Apache's XML
transformation tool. ARTEMIS improves gson by \%, and for
memory, runtime, and CPU; ARTEMIS improves xalan's memory consumption by
\%. \emph{Every} client of these projects will benefit from these
performance improvements.
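The core idea behind Darwinian data structures is easy to illustrate outside of ARTEMIS itself. The sketch below (in Python rather than Java, with a made-up workload and candidate set) times interchangeable implementations of a queue-like structure behind one interface and keeps the fittest; ARTEMIS additionally tunes constructor arguments and optimises memory and CPU alongside runtime.

    import time
    from collections import deque

    # Hypothetical candidate implementations of one "Darwinian" data structure:
    # both expose the operations the workload needs.
    CANDIDATES = {"list": list, "deque": deque}

    def workload(seq):
        # Toy workload: append at the back, pop from the front.
        for i in range(50_000):
            seq.append(i)
        while seq:
            seq.pop(0) if isinstance(seq, list) else seq.popleft()

    def fittest(candidates, run):
        # Time the workload against each implementation and return the fastest.
        timings = {}
        for name, ctor in candidates.items():
            start = time.perf_counter()
            run(ctor())
            timings[name] = time.perf_counter() - start
        return min(timings, key=timings.get), timings

    if __name__ == "__main__":
        best, timings = fittest(CANDIDATES, workload)
        print("fittest implementation:", best, timings)

Here the deque implementation wins because popping from the front of a Python list is linear; a different workload could reverse the verdict, which is exactly why selection modulo a test suite matters.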
Structure Selection from Streaming Relational Data
Statistical relational learning techniques have been successfully applied in
a wide range of relational domains. In most of these applications, the human
designers capitalized on their background knowledge by following a
trial-and-error trajectory, where relational features are manually defined by a
human engineer, parameters are learned for those features on the training data,
the resulting model is validated, and the cycle repeats as the engineer adjusts
the set of features. This paper seeks to streamline application development in
large relational domains by introducing a light-weight approach that
efficiently evaluates relational features on pieces of the relational graph
that are streamed to it one at a time. We evaluate our approach on two social
media tasks and demonstrate that it leads to more accurate models that are
learned faster.
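A minimal sketch of the streaming evaluation idea, under the assumption that each relational feature can be scored on one graph fragment at a time (the feature and label_of interfaces below are hypothetical, not the paper's API):

    from dataclasses import dataclass
    from typing import Any, Callable, Dict, Iterable

    @dataclass
    class StreamingScore:
        hits: int = 0
        total: int = 0

        @property
        def accuracy(self) -> float:
            return self.hits / self.total if self.total else 0.0

    def evaluate_stream(fragments: Iterable[Any],
                        features: Dict[str, Callable[[Any], bool]],
                        label_of: Callable[[Any], bool]) -> Dict[str, StreamingScore]:
        # Update per-feature statistics one graph fragment at a time,
        # so the full relational graph never has to be materialised.
        scores = {name: StreamingScore() for name in features}
        for frag in fragments:                  # fragments arrive one at a time
            y = label_of(frag)
            for name, feat in features.items():
                scores[name].total += 1
                if feat(frag) == y:             # does the feature predict the label?
                    scores[name].hits += 1
        return scores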
Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix
both in terms of rows (samples, training examples) as well as columns
(features). By employing the concepts of p-values of conditional independence
tests and meta-analysis techniques, PFBP manages to rely only on computations
local to a partition while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size, linear scalability with respect to the number of
features and processing cores, while dominating other competitive algorithms in
its class.
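The meta-analysis step can be sketched as follows, assuming Fisher's method for combining per-partition p-values and a fixed alpha for Early Dropping; the actual PFBP uses conditional independence tests and bootstrap-based early decisions, so this only illustrates the combine-and-drop idea.

    import math
    from scipy import stats

    def combine_pvalues_fisher(pvalues):
        # Fisher's method: -2 * sum(log p_i) follows a chi-squared distribution
        # with 2k degrees of freedom under the null hypothesis.
        k = len(pvalues)
        statistic = -2.0 * sum(math.log(max(p, 1e-300)) for p in pvalues)
        return stats.chi2.sf(statistic, df=2 * k)

    def early_dropping(local_pvalues_per_feature, alpha=0.05):
        # Drop features whose combined, cross-partition evidence of association
        # with the target is weak; they are not reconsidered in later iterations.
        kept, dropped = {}, []
        for feature, local_ps in local_pvalues_per_feature.items():
            combined = combine_pvalues_fisher(local_ps)
            if combined <= alpha:
                kept[feature] = combined
            else:
                dropped.append(feature)
        return kept, dropped

Because each partition only contributes a local p-value, the expensive statistical work stays local to a worker and only a handful of numbers per feature cross the network.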
Information-based objective functions for active data selection
Learning can be made more efficient if we can actively select particularly salient data points. Within a Bayesian learning framework, objective functions are discussed that measure the expected informativeness of candidate measurements. Three alternative specifications of what we want to gain information about lead to three different criteria for data selection. All these criteria depend on the assumption that the hypothesis space is correct, which may prove to be their main weakness.
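One such criterion, the expected reduction in posterior entropy over the hypothesis space, can be written down directly. In the sketch below a discrete hypothesis space and binary-outcome measurements are simplifying assumptions made here, not part of the paper; inputs are NumPy arrays.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def expected_information_gain(prior, likelihoods):
        # prior: probabilities over hypotheses; likelihoods: P(outcome=1 | hypothesis)
        # for one candidate measurement. Returns the expected entropy reduction.
        p1 = float(np.dot(prior, likelihoods))      # predictive P(outcome = 1)
        p0 = 1.0 - p1
        post1 = prior * likelihoods / p1 if p1 > 0 else prior
        post0 = prior * (1.0 - likelihoods) / p0 if p0 > 0 else prior
        expected_posterior_entropy = p1 * entropy(post1) + p0 * entropy(post0)
        return entropy(prior) - expected_posterior_entropy

    def select_measurement(prior, candidate_likelihoods):
        # Pick the candidate measurement expected to be most informative.
        gains = [expected_information_gain(prior, lik) for lik in candidate_likelihoods]
        return int(np.argmax(gains))

Note that the gain is computed under the model's own posterior, which is precisely why the criterion inherits the assumption that the hypothesis space is correct.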
Feature subset selection and ranking for data dimensionality reduction
A new unsupervised forward orthogonal search (FOS) algorithm is introduced for feature selection and ranking. In the new algorithm, features are selected in a stepwise way, one at a time, by estimating the capability of each specified candidate feature subset to represent the overall features in the measurement space. A squared correlation function is employed as the criterion to measure the dependency between features and this makes the new algorithm easy to implement. The forward orthogonalization strategy, which combines good effectiveness with high efficiency, enables the new algorithm to produce efficient feature subsets with a clear physical interpretation.
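A simplified reading of the procedure (not the authors' exact formulation): at each step the feature with the largest average squared correlation with all features is selected, and the remaining features are orthogonalised against it so that later picks capture only variation not yet represented.

    import numpy as np

    def forward_orthogonal_search(X, n_select):
        # X: (n_samples, n_features) data matrix; returns indices of selected features.
        R = X - X.mean(axis=0)                 # centre the features
        selected = []
        for _ in range(n_select):
            norms = np.linalg.norm(R, axis=0)
            norms[norms == 0] = np.inf         # ignore fully-explained features
            # average squared correlation of each candidate with all features
            C = (R.T @ R) / np.outer(norms, norms)
            scores = np.mean(C ** 2, axis=1)
            scores[selected] = -np.inf         # never re-pick a selected feature
            j = int(np.argmax(scores))
            selected.append(j)
            q = R[:, j] / np.linalg.norm(R[:, j])
            R = R - np.outer(q, q @ R)         # Gram-Schmidt style deflation
        return selected

The Gram-Schmidt style deflation is what keeps the search both cheap and non-redundant: once a feature's direction is removed, correlated features contribute little to later scores.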
