SHADHO: Massively Scalable Hardware-Aware Distributed Hyperparameter Optimization
Computer vision is experiencing an AI renaissance, in which machine learning
models are expediting important breakthroughs in academic research and
commercial applications. Effectively training these models, however, is not
trivial due in part to hyperparameters: user-configured values that control a
model's ability to learn from data. Existing hyperparameter optimization
methods are highly parallel but make no effort to balance the search across
heterogeneous hardware or to prioritize searching high-impact spaces. In this
paper, we introduce a framework for massively Scalable Hardware-Aware
Distributed Hyperparameter Optimization (SHADHO). Our framework calculates the
relative complexity of each search space and monitors performance on the
learning task over all trials. These metrics are then used as heuristics to
assign hyperparameters to distributed workers based on their hardware. We first
demonstrate that our framework achieves double the throughput of a standard
distributed hyperparameter optimization framework by optimizing SVM for MNIST
using 150 distributed workers. We then conduct model search with SHADHO over
the course of one week using 74 GPUs across two compute clusters to optimize
U-Net for a cell segmentation task, discovering 515 models that achieve a lower
validation loss than standard U-Net. Comment: 10 pages, 6 figures
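The assignment idea in the abstract above can be illustrated with a minimal sketch. It is not SHADHO's actual implementation: the complexity proxy (log of the number of joint configurations), the priority proxy (spread of observed losses), the `SearchSpace` class, and the `(name, speed)` worker tuples are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SearchSpace:
    name: str
    # Hypothetical: number of discrete choices for each hyperparameter.
    domain_sizes: list
    # Losses observed so far for trials drawn from this space.
    losses: list = field(default_factory=list)

    def complexity(self):
        # Assumed proxy: log of the number of joint configurations.
        return sum(math.log(n) for n in self.domain_sizes)

    def priority(self):
        # Assumed proxy: spread of observed losses. Spaces with fewer
        # than two results are ranked first so they get explored.
        if len(self.losses) < 2:
            return math.inf
        return max(self.losses) - min(self.losses)


def assign(spaces, workers):
    """Pair the highest-ranked spaces with the fastest workers.

    `workers` is a list of (name, relative_speed) tuples, an assumed
    stand-in for real hardware information.
    """
    ranked_spaces = sorted(
        spaces, key=lambda s: (s.priority(), s.complexity()), reverse=True
    )
    ranked_workers = sorted(workers, key=lambda w: w[1], reverse=True)
    return {w[0]: s.name for w, s in zip(ranked_workers, ranked_spaces)}


spaces = [
    SearchSpace("svm_linear", [10], [0.30, 0.35]),
    SearchSpace("svm_rbf", [10, 10], [0.20, 0.50]),
]
workers = [("cpu-node", 1.0), ("gpu-node", 8.0)]
assignment = assign(spaces, workers)
```

Here the larger, higher-variance space is sent to the faster worker; a real scheduler would also re-rank as new trial results arrive.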
The Value-of-Information in Matching with Queues
We consider the problem of \emph{optimal matching with queues} in dynamic
systems and investigate the value-of-information. In such systems, the
operators match tasks and resources stored in queues, with the objective of
maximizing the system utility of the matching reward profile, minus the average
matching cost. This problem appears in many practical systems and the main
challenges are the no-underflow constraints, and the lack of matching-reward
information and system dynamics statistics. We develop two online matching
algorithms, Learning-aided Reward optimAl Matching (LRAM) and Dual-LRAM, to
effectively resolve both challenges. Both algorithms are equipped with a
learning module for estimating the matching-reward information, while
Dual-LRAM incorporates an additional module for learning the system dynamics.
We show that both algorithms achieve close-to-optimal utility performance,
while Dual-LRAM achieves a faster convergence speed and a better delay than
LRAM, with bounds that depend on the maximum estimation errors for reward and
system dynamics. Our results reveal that information of different
system components can play very different roles in algorithm performance and
provide a systematic way for designing joint learning-control algorithms for
dynamic systems.
Managing Uncertainty: A Case for Probabilistic Grid Scheduling
The Grid technology is evolving into a global, service-orientated
architecture, a universal platform for delivering future high demand
computational services. Strong adoption of the Grid and the utility computing
concept is leading to an increasing number of Grid installations running a wide
range of applications of different size and complexity. In this paper we
address the problem of delivering deadline/economy based scheduling in a
heterogeneous application environment using statistical properties of
historical job executions and their associated metadata. This approach is
motivated by a study of the six-month computational load generated by Grid
applications in a
multi-purpose Grid cluster serving a community of twenty e-Science projects.
The observed job statistics, resource utilisation and user behaviour are
discussed in the context of management approaches and models most suitable for
supporting a probabilistic and autonomous scheduling architecture.