36 research outputs found
Compact Random Feature Maps
Kernel approximation using randomized feature maps has recently gained a lot
of interest. In this work, we identify that previous approaches for polynomial
kernel approximation create maps that are rank deficient, and therefore do not
utilize the capacity of the projected feature space effectively. To address
this challenge, we propose compact random feature maps (CRAFTMaps) to
approximate polynomial kernels more concisely and accurately. We prove the
error bounds of CRAFTMaps demonstrating their superior kernel reconstruction
performance compared to the previous approximation schemes. We show how
structured random matrices can be used to efficiently generate CRAFTMaps, and
present a single-pass algorithm using CRAFTMaps to learn non-linear multi-class
classifiers. We present experiments on multiple standard data-sets with
performance competitive with state-of-the-art results. Comment: 9 page
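To make the idea of a randomized feature map for a polynomial kernel concrete, here is a minimal sketch. It is not the CRAFTMaps construction, which the abstract does not spell out. It assumes the simpler, well-known product-of-Rademacher-projections features for the homogeneous kernel (x·y)^p, and the names `make_feature_map`, `dim`, `degree` and `n_features` are illustrative.

```python
import random

def make_feature_map(dim, degree, n_features, seed=0):
    """Return z such that dot(z(x), z(y)) approximates dot(x, y) ** degree.

    Assumption: each feature is a product of `degree` random +1/-1
    projections of the input; this is a generic random polynomial
    feature map, not the compact map proposed in the abstract.
    """
    rng = random.Random(seed)
    signs = [[[rng.choice((-1.0, 1.0)) for _ in range(dim)]
              for _ in range(degree)]
             for _ in range(n_features)]
    scale = n_features ** -0.5  # so the inner product averages over features

    def feature_map(x):
        feats = []
        for factors in signs:
            prod = 1.0
            for w in factors:
                prod *= sum(wi * xi for wi, xi in zip(w, x))
            feats.append(scale * prod)
        return feats

    return feature_map

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Toy comparison of the exact and approximated degree-2 kernel.
x = [0.6, 0.8, 0.0]
y = [0.8, 0.6, 0.0]
z = make_feature_map(dim=3, degree=2, n_features=5000)
exact = dot(x, y) ** 2
approx = dot(z(x), z(y))
```

The approximation error shrinks roughly as one over the square root of `n_features`. The abstract's point is that maps of this kind spend that feature budget inefficiently, which is what motivates a more compact construction.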
Fast Query-Optimized Kernel-Machine Classification
A recently developed algorithm performs kernel-machine classification via incremental approximate nearest support vectors. The algorithm implements support-vector machines (SVMs) at speeds 10 to 100 times those attainable by use of conventional SVM algorithms. The algorithm offers potential benefits for classification of images, recognition of speech, recognition of handwriting, and diverse other applications in which there are requirements to discern patterns in large sets of data. SVMs constitute a subset of kernel machines (KMs), which have become popular as models for machine learning and, more specifically, for automated classification of input data on the basis of labeled training data. While similar in many ways to k-nearest-neighbors (k-NN) models and artificial neural networks (ANNs), SVMs tend to be more accurate. Using representations that scale only linearly in the numbers of training examples, while exploring nonlinear (kernelized) feature spaces that are exponentially larger than the original input dimensionality, KMs elegantly and practically overcome the classic curse of dimensionality. However, the price that one must pay for the power of KMs is that query-time complexity scales linearly with the number of training examples, making KMs often orders of magnitude more computationally expensive than are ANNs, decision trees, and other popular machine learning alternatives. The present algorithm treats an SVM classifier as a special form of a k-NN. The algorithm is based partly on an empirical observation that one can often achieve the same classification as that of an exact KM by using only a small fraction of the nearest support vectors (SVs) of a query. The exact KM output is a weighted sum over the kernel values between the query and the SVs. In this algorithm, the KM output is approximated with a k-NN classifier, the output of which is a weighted sum only over the kernel values involving k selected SVs.
Before query time, statistics are gathered about how misleading the output of the k-NN model can be, relative to the outputs of the exact KM for a representative set of examples, for each possible k from 1 to the total number of SVs. From these statistics, upper and lower thresholds are derived for each step k. These thresholds identify output levels for which the particular variant of the k-NN model already leans so strongly positively or negatively that a reversal in sign is unlikely, given the weaker SV neighbors still remaining. At query time, the partial output of each query is incrementally updated, stopping as soon as it exceeds the predetermined statistical thresholds of the current step. For an easy query, stopping can occur as early as step k = 1. For more difficult queries, stopping might not occur until nearly all SVs are touched. A key empirical observation is that this approach can tolerate very approximate nearest-neighbor orderings. In experiments, SVs and queries were projected to a subspace comprising the top few principal-component dimensions and neighbor orderings were computed in that subspace. This approach ensured that the overhead of the nearest-neighbor computations was insignificant, relative to that of the exact KM computation.
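The threshold scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the published algorithm: the thresholds are taken to be the most extreme partial outputs seen on a calibration set, neighbors are ordered by exact distance rather than in a principal-component subspace, and the names `rbf`, `running_output`, `fit_thresholds` and `classify` are hypothetical.

```python
import math

def rbf(a, b, gamma=1.0):
    """Gaussian kernel; any other kernel could be substituted."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def running_output(query, svs, coefs, bias):
    """Yield the partial KM output after each SV, nearest SV first."""
    def sqdist(i):
        return sum((q - s) ** 2 for q, s in zip(query, svs[i]))
    total = bias
    for i in sorted(range(len(svs)), key=sqdist):
        total += coefs[i] * rbf(query, svs[i])  # kernel evaluated lazily
        yield total

def fit_thresholds(calibration, svs, coefs, bias):
    """Assumed rule: upper[k] is the highest step-k partial output among
    calibration examples whose exact output is negative; lower[k] is the
    lowest among those whose exact output is non-negative."""
    n = len(svs)
    upper, lower = [-math.inf] * n, [math.inf] * n
    for q in calibration:
        partials = list(running_output(q, svs, coefs, bias))
        for k, p in enumerate(partials):
            if partials[-1] < 0:
                upper[k] = max(upper[k], p)
            else:
                lower[k] = min(lower[k], p)
    return upper, lower

def classify(query, svs, coefs, bias, upper, lower):
    """Return (label, number of SVs touched), stopping as early as possible."""
    steps, total = 0, bias
    for steps, total in enumerate(running_output(query, svs, coefs, bias), 1):
        if total > upper[steps - 1]:
            return 1, steps
        if total < lower[steps - 1]:
            return -1, steps
    return (1 if total >= 0 else -1), steps
```

With thresholds built this way, every calibration example receives the same label as the exact KM. Agreement on unseen queries depends on how representative the calibration set is.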
Automated Knowledge Discovery From Simulators
A computational method, SimLearn, has been devised to facilitate efficient knowledge discovery from simulators. Simulators are complex computer programs used in science and engineering to model diverse phenomena such as fluid flow, gravitational interactions, coupled mechanical systems, and nuclear, chemical, and biological processes. SimLearn uses active-learning techniques to efficiently address the "landscape characterization problem." In particular, SimLearn tries to determine which regions in "input space" lead to a given output from the simulator, where "input space" refers to an abstraction of all the variables going into the simulator, e.g., initial conditions, parameters, and interaction equations. Landscape characterization can be viewed as an attempt to invert the forward mapping of the simulator and recover the inputs that produce a particular output. Given that a single simulation run can take days or weeks to complete even on a large computing cluster, SimLearn attempts to reduce costs by reducing the number of simulations needed to effect discoveries. Unlike conventional data-mining methods that are applied to static predefined datasets, SimLearn involves an iterative process in which a most informative dataset is constructed dynamically by using the simulator as an oracle. On each iteration, the algorithm models the knowledge it has gained through previous simulation trials and then chooses which simulation trials to run next. Running these trials through the simulator produces new data in the form of input-output pairs. The overall process is embodied in an algorithm that combines support vector machines (SVMs) with active learning. SVMs use learning from examples (the examples are the input-output pairs generated by running the simulator) and a principle called maximum margin to derive predictors that generalize well to new inputs. 
In SimLearn, the SVM plays the role of modeling the knowledge that has been gained through previous simulation trials. Active learning is used to determine which new input points would be most informative if their output were known. The selected input points are run through the simulator to generate new information that can be used to refine the SVM. The process is then repeated. SimLearn carefully balances exploration (semi-randomly searching around the input space) versus exploitation (using the current state of knowledge to conduct a tightly focused search). During each iteration, SimLearn uses not one, but an ensemble of SVMs. Each SVM in the ensemble is characterized by different hyperparameters that control various aspects of the learned predictor, for example, whether the predictor is constrained to be very smooth (nearby points in input space lead to similar output predictions) or whether the predictor is allowed to be "bumpy." The various SVMs will have different preferences about which input points they would like to run through the simulator next. SimLearn includes a formal mechanism for balancing the ensemble SVM preferences so that a single choice can be made for the next set of trials.
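The iterative loop described above (model, choose a trial, run the simulator, refine) can be sketched in a few lines. Everything here is a stand-in chosen for brevity and is not SimLearn itself: `simulator` is a hypothetical toy oracle, `score` is a kernel-weighted vote used in place of an SVM, the three `widths` play the role of differing smoothness hyperparameters, and picking the point with the most ensemble disagreement is only one assumed way of combining the ensemble's preferences.

```python
import math
import random

def simulator(x):
    """Hypothetical cheap oracle: +1 inside a band of the input space."""
    return 1 if 0.3 < x < 0.6 else -1

def score(x, labeled, width):
    """Kernel-weighted vote over labeled points (stand-in for one SVM)."""
    return sum(y * math.exp(-((x - xi) / width) ** 2) for xi, y in labeled)

def active_learn(pool, n_init=4, n_iter=20, widths=(0.02, 0.05, 0.2),
                 explore=0.2, seed=0):
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    labeled = [(x, simulator(x)) for x in pool[:n_init]]
    pool = pool[n_init:]

    def ambiguity(i):
        # Smaller is more informative: first ensemble disagreement,
        # then closeness of every member to its decision boundary.
        scores = [score(pool[i], labeled, w) for w in widths]
        votes = sum(1 if s >= 0 else -1 for s in scores)
        return (abs(votes), sum(abs(s) for s in scores))

    for _ in range(n_iter):
        if rng.random() < explore:
            pick = rng.randrange(len(pool))             # exploration
        else:
            pick = min(range(len(pool)), key=ambiguity)  # exploitation
        x = pool.pop(pick)
        labeled.append((x, simulator(x)))                # one simulator run
    return labeled

# Example: 24 simulator runs in total on a grid of 101 candidate inputs.
trials = active_learn([i / 100 for i in range(101)])
```

Each iteration costs exactly one simulator call, which is the quantity the method aims to minimize when a single run can take days.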
HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale Visual Recognition
In image classification, visual separability between different object
categories is highly uneven, and some categories are more difficult to
distinguish than others. Such difficult categories demand more dedicated
classifiers. However, existing deep convolutional neural networks (CNNs) are
trained as flat N-way classifiers, and few efforts have been made to leverage
the hierarchical structure of categories. In this paper, we introduce
hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category
hierarchy. An HD-CNN separates easy classes using a coarse category classifier
while distinguishing difficult classes using fine category classifiers. During
HD-CNN training, component-wise pretraining is followed by global finetuning
with a multinomial logistic loss regularized by a coarse category consistency
term. In addition, conditional executions of fine category classifiers and
layer parameter compression make HD-CNNs scalable for large-scale visual
recognition. We achieve state-of-the-art results on both CIFAR100 and
large-scale ImageNet 1000-class benchmark datasets. In our experiments, we
build up three different HD-CNNs and they lower the top-1 error of the standard
CNNs by 2.65%, 3.1% and 1.1%, respectively. Comment: Add new results on ImageNet using VGG-16-layer building block ne
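The coarse-to-fine routing with conditional execution can be illustrated with a small sketch. The abstract does not state how the fine classifiers' outputs are combined, so this sketch assumes they are averaged with weights given by the coarse probabilities. The function `hd_predict`, the `threshold` parameter and the toy classifiers are hypothetical.

```python
def hd_predict(x, coarse_clf, fine_clfs, threshold=0.1):
    """Combine fine-category predictions, weighted by coarse probabilities.

    Only fine classifiers whose coarse probability reaches `threshold`
    are executed (conditional execution); their weights are renormalized.
    Returns (class probabilities, indices of the classifiers that ran).
    """
    coarse = coarse_clf(x)
    active = [k for k, p in enumerate(coarse) if p >= threshold]
    if not active:  # always run at least the most likely branch
        active = [max(range(len(coarse)), key=lambda k: coarse[k])]
    weight_sum = sum(coarse[k] for k in active)
    combined = None
    for k in active:
        fine = fine_clfs[k](x)  # expensive call, skipped for unlikely branches
        w = coarse[k] / weight_sum
        if combined is None:
            combined = [w * p for p in fine]
        else:
            combined = [c + w * p for c, p in zip(combined, fine)]
    return combined, active

# Toy usage with constant "classifiers" over four fine classes.
coarse_clf = lambda x: [0.9, 0.08, 0.02]
fine_clfs = [
    lambda x: [0.7, 0.3, 0.0, 0.0],
    lambda x: [0.0, 0.0, 1.0, 0.0],
    lambda x: [0.0, 0.0, 0.0, 1.0],
]
probs, ran = hd_predict(None, coarse_clf, fine_clfs)
```

In the toy usage only the first fine classifier runs, because the other two coarse probabilities fall below the threshold.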
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
The pre-training and fine-tuning paradigm has contributed to a number of
breakthroughs in Natural Language Processing (NLP). Instead of directly
training on a downstream task, language models are first pre-trained on large
datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then
fine-tuned on task-specific data (e.g., natural language generation, text
summarization, etc.). Scaling the model and dataset size has helped improve the
performance of LLMs, but unfortunately, this has also led to highly prohibitive
computational costs. Pre-training LLMs often requires orders of magnitude more
FLOPs than fine-tuning and the model capacity often remains the same between
the two phases. To achieve training efficiency w.r.t. training FLOPs, we propose
to decouple the model capacity between the two phases and introduce Sparse
Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits
of using unstructured weight sparsity to train only a subset of weights during
pre-training (Sparse Pre-training) and then recover the representational
capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We
demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3
XL model resulting in a 2.5x reduction in pre-training FLOPs, without a
significant loss in accuracy on the downstream tasks relative to the dense
baseline. By rigorously evaluating multiple downstream tasks, we also establish
a relationship between sparsity, task complexity and dataset size. Our work
presents a promising direction to train large GPT models at a fraction of the
training FLOPs using weight sparsity, while retaining the benefits of
pre-trained textual representations for downstream tasks. Comment: Accepted to Uncertainty in Artificial Intelligence (UAI) 2023
Conference; 13 pages, 4 figures (Main Paper) + 5 pages (Supplementary
Material
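The two-phase schedule (train only a subset of weights, then let the zeroed weights learn) can be shown on a toy least-squares model. This sketch only illustrates the masking mechanics. It is not the authors' training setup: the static random mask, the plain SGD update and the names `random_mask` and `sgd` are all assumptions.

```python
import random

def random_mask(n, sparsity, rng):
    """Unstructured mask with a `sparsity` fraction of zeros (assumed static)."""
    n_zero = int(round(n * sparsity))
    mask = [0] * n_zero + [1] * (n - n_zero)
    rng.shuffle(mask)
    return mask

def sgd(weights, mask, data, lr=0.05, epochs=50):
    """Least-squares SGD that updates only the weights where mask is 1."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi * mi for wi, xi, mi in zip(w, x, mask)]
    return w

rng = random.Random(0)
true_w = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.25, 1.5]

def make_data(n):
    rows = [[rng.uniform(-1, 1) for _ in true_w] for _ in range(n)]
    return [(x, sum(t * xi for t, xi in zip(true_w, x))) for x in rows]

mask = random_mask(len(true_w), 0.75, rng)

# Phase 1, sparse pre-training: masked-out weights start at zero and stay there.
sparse_w = sgd([0.0] * len(true_w), mask, make_data(200))

# Phase 2, dense fine-tuning: every weight, including the zeroed ones, may learn.
dense_w = sgd(sparse_w, [1] * len(true_w), make_data(50))
```

In a real sparse kernel the masked weights would also be skipped in the forward and backward passes, which is where any FLOP reduction comes from. This toy version only freezes them.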
additional support from Ameritech, an Institute Partner, and from IBM.
Towards a Qualitative Theory of Safety Control: When and How to Panic Intelligently
Safe control of a physical system requires the ability to both detect and avoid dangerous situations, while striving to achieve performance goals. State-of-the-art controllers still tend to rely heavily on classical control theory with feedback, occasionally with some limited use of associational reasoning (expert systems) to help detect threatening situations and recall standard recovery procedures [Dvorak, 1987]. The need for model-based reasoning to control in novel situations and with incomplete data is widely acknowledged, but largely unaddressed to date. Existing relevant work has tended to focus on model-based monitoring to track the system state over time, with little attention to reasoning about control itself. In such work, any provisions for finding safe control actions tend to be mixtures of associational reasoning and general-purpose planning or qualitative simulation, which suffer problems of brittleness and intractability, respectively. To address those shortcomings, I propose a qualitative theory of safety control. The goal is to make explicit the kinds of intuitions that human operators use to focu