71 research outputs found
Ask-the-Expert: Active Learning Based Knowledge Discovery Using the Expert
The manual review of large data sets, whether to label unlabeled instances or to separate meaningful results from uninteresting (but statistically significant) ones, is often extremely resource intensive, especially in terms of subject matter expert (SME) time. Active learning has been shown to reduce this review time significantly. However, because active learning is an iterative process that trains a classifier on a small number of SME-provided labels at each iteration, the lack of an enabling tool can hinder the adoption of these techniques in practice, despite their labor-saving potential. In this demo we present ASK-the-Expert, an interactive tool that allows SMEs to review instances from a data set and provide labels within a single framework. ASK-the-Expert is powered by an active learning algorithm that trains a classifier in the back end. We demonstrate this system in the context of an aviation safety application, but the tool can also be adapted to work as a simple review and labeling tool, without the use of active learning.
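The iterative review loop described in this abstract can be sketched as a standard pool-based active learning cycle. The toy centroid classifier and uncertainty-sampling strategy below are illustrative stand-ins, not the actual components of ASK-the-Expert, and the SME's labels are simulated by a known labeling function:

```python
import numpy as np

class CentroidClassifier:
    """Toy classifier: predict the class of the nearer class centroid."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self
    def decision(self, X):
        # positive value means closer to the class-1 centroid
        return (np.linalg.norm(X - self.c0, axis=1)
                - np.linalg.norm(X - self.c1, axis=1))
    def predict(self, X):
        return (self.decision(X) > 0).astype(int)

def uncertainty_sampling(clf, X_pool, k=1):
    """Pick the k pool instances closest to the decision boundary."""
    return np.argsort(np.abs(clf.decision(X_pool)))[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in for SME judgments

# seed with one instance of each class, then query iteratively
labeled = [int(np.flatnonzero(y == 1)[0]), int(np.flatnonzero(y == 0)[0])]
pool = [i for i in range(len(X)) if i not in labeled]

clf = CentroidClassifier()
for _ in range(10):                        # ten review iterations
    clf.fit(X[labeled], y[labeled])
    pick = pool[int(uncertainty_sampling(clf, X[pool], k=1)[0])]
    labeled.append(pick)                   # the SME would supply y[pick]
    pool.remove(pick)

clf.fit(X[labeled], y[labeled])
accuracy = float((clf.predict(X) == y).mean())
```

The point of the loop is that only the queried instances ever need SME review; the rest of the pool stays unlabeled.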
Affinity Classification Problem by Stochastic Cellular Automata
This work introduces a new problem, the affinity classification problem,
which generalizes the density classification problem. To solve this
problem, we introduce temporally stochastic cellular automata, in which
one of two rules is stochastically applied at each step to all cells of
the automaton. Our model is defined on a 2-dimensional grid with
affection capability. We show that this model can be used in several
applications, such as modeling self-healing systems.
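A minimal sketch of the temporally stochastic update described above: at every step, one of two global rules is chosen at random and applied to all cells. The specific rule pair used here (3x3 majority vote vs. identity) and the mixing probability are placeholders, not the paper's actual rules:

```python
import numpy as np

def majority(grid):
    """Each cell takes the majority state of its 3x3 toroidal neighborhood."""
    padded = np.pad(grid, 1, mode="wrap")
    total = sum(padded[i:i + grid.shape[0], j:j + grid.shape[1]]
                for i in range(3) for j in range(3))
    return (total >= 5).astype(int)        # majority of 9 cells (incl. self)

def identity(grid):
    return grid

def step(grid, rng, p=0.7):
    """Temporally stochastic update: rule f with probability p, else rule g."""
    rule = majority if rng.random() < p else identity
    return rule(grid)

rng = np.random.default_rng(1)
grid = (rng.random((20, 20)) < 0.6).astype(int)   # initial density 0.6
for _ in range(50):
    grid = step(grid, rng)
density = float(grid.mean())
```

Note that the stochastic choice is made once per time step for the whole grid, not independently per cell; that is what "temporally" stochastic means here.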
Distributed Monitoring of the R² Statistic for Linear Regression
The problem of monitoring a multivariate linear regression model is relevant in studying the evolving relationship between a set of input variables (features) and one or more dependent target variables. This problem becomes challenging for large-scale data in a distributed computing environment when only a subset of instances is available at individual nodes and the local data changes frequently. Data centralization and periodic model recomputation can add high overhead to tasks like anomaly detection in such dynamic settings. Therefore, the goal is to develop techniques for monitoring and updating the model over the union of all nodes' data in a communication-efficient fashion. Correctness guarantees on such techniques are also often highly desirable, especially in safety-critical application scenarios. In this paper we develop DReMo, a distributed algorithm with very low resource overhead, for monitoring the quality of a regression model in terms of its coefficient of determination (R² statistic). When the nodes collectively determine that R² has dropped below a fixed threshold, the linear regression model is recomputed via a network-wide convergecast and the updated model is broadcast back to all nodes. We show empirically, using both synthetic and real data, that our proposed method is highly communication-efficient and scalable, and also provide theoretical guarantees on correctness.
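The monitoring pattern in this abstract, local statistics aggregated into a global R² that is checked against a fixed threshold, can be sketched as follows. The naive centralized summation here stands in for DReMo's communication-efficient distributed protocol and is not the paper's algorithm:

```python
import numpy as np

def local_sums(X, y, w):
    """Per-node sufficient statistics for the global R^2 of model w."""
    resid = y - X @ w
    return len(y), y.sum(), (y ** 2).sum(), (resid ** 2).sum()

def global_r2(stats):
    """Combine per-node statistics into the R^2 over the union of all data."""
    n = sum(s[0] for s in stats)
    sy = sum(s[1] for s in stats)
    syy = sum(s[2] for s in stats)
    sse = sum(s[3] for s in stats)
    sst = syy - sy ** 2 / n           # total sum of squares over all nodes
    return float(1.0 - sse / sst)

rng = np.random.default_rng(2)
w = np.array([2.0, -1.0])             # current shared regression model
nodes = []
for _ in range(4):                     # four nodes, each with local data
    X = rng.normal(size=(100, 2))
    y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)
    nodes.append((X, y))

stats = [local_sums(X, y, w) for X, y in nodes]
r2 = global_r2(stats)
needs_recompute = bool(r2 < 0.9)       # fixed monitoring threshold
```

Each node ships only four numbers regardless of its local data size, which is what makes threshold monitoring of this form cheap; when `needs_recompute` fires, the model would be refit over all nodes' data.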
Detecting Abnormal Machine Characteristics in Cloud Infrastructures
In the cloud computing environment, resources are accessed as services rather than as a product. Monitoring this system for performance is crucial because of the typical pay-per-use packages bought by users for their jobs. With the huge number of machines currently in the cloud system, it is often extremely difficult for system administrators to keep track of all machines using distributed monitoring programs such as Ganglia, which lack system health assessment and summarization capabilities. To overcome this problem, we propose a technique for automated anomaly detection using machine performance data in the cloud. Our algorithm is entirely distributed and runs locally on each computing machine in the cloud in order to rank the machines by their anomalous behavior for given jobs. There is no need to centralize any of the performance data for the analysis, and at the end of the analysis our algorithm generates error reports, thereby allowing the system administrators to take corrective actions. Experiments performed on real data sets collected for different jobs confirm that our algorithm has a low overhead for tracking anomalous machines in a cloud infrastructure.
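The local-scoring-and-ranking idea can be illustrated with a toy example in which each machine reduces its own performance samples to a single anomaly score (a mean absolute z-score here, purely as a stand-in for the paper's method), and only those scores, never the raw data, are compared across machines:

```python
import numpy as np

def anomaly_score(samples, baseline_mean, baseline_std):
    """Mean absolute z-score of one machine's local metric samples."""
    return float(np.abs((samples - baseline_mean) / baseline_std).mean())

rng = np.random.default_rng(3)
baseline_mean, baseline_std = 50.0, 5.0        # expected CPU% for this job

# five machines report a CPU-utilization trace; node3 misbehaves
machines = {f"node{i}": rng.normal(50, 5, size=200) for i in range(5)}
machines["node3"] = rng.normal(80, 5, size=200)

# each score is computed locally; only one number per machine is shared
scores = {name: anomaly_score(s, baseline_mean, baseline_std)
          for name, s in machines.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

An administrator would then inspect the machines at the top of `ranked` first.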
Interactive Multi-fidelity Learning for Cost-effective Adaptation of Language Model with Sparse Human Supervision
Large language models (LLMs) have demonstrated remarkable capabilities in
various tasks. However, their suitability for domain-specific tasks is limited
due to their immense scale at deployment, susceptibility to misinformation, and
more importantly, high data annotation costs. We propose a novel Interactive
Multi-Fidelity Learning (IMFL) framework for the cost-effective development of
small domain-specific LMs under limited annotation budgets. Our approach
formulates the domain-specific fine-tuning process as a multi-fidelity learning
problem, focusing on identifying the optimal acquisition strategy that balances
between low-fidelity automatic LLM annotations and high-fidelity human
annotations to maximize model performance. We further propose an
exploration-exploitation query strategy that enhances annotation diversity and
informativeness, incorporating two innovative designs: 1) prompt retrieval that
selects in-context examples from human-annotated samples to improve LLM
annotation, and 2) variable batch size that controls the order for choosing
each fidelity to facilitate knowledge distillation, ultimately enhancing
annotation quality. Extensive experiments on financial and medical tasks
demonstrate that IMFL achieves superior performance compared with
single-fidelity annotation. Given a limited human annotation budget, IMFL
significantly outperforms the human annotation baselines in all four tasks
and achieves performance very close to that of human annotation on two of
the tasks. These promising results suggest that the high human annotation
costs of domain-specific tasks can be significantly reduced by employing
IMFL, which uses fewer human annotations, supplemented with cheaper and
faster LLM (e.g., GPT-3.5) annotations, to achieve comparable performance.

Comment: This work has been accepted by NeurIPS 202
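The budgeted choice between low-fidelity LLM annotation and high-fidelity human annotation can be sketched as a simple acquisition schedule. All costs, batch sizes, and the humans-first ordering below are hypothetical illustrations of the multi-fidelity trade-off, not IMFL's actual acquisition strategy:

```python
def plan_annotation(budget, human_cost=1.0, llm_cost=0.05,
                    human_rounds=2, batch=16):
    """Return a list of (fidelity, batch_size) acquisitions under a budget.

    High-fidelity (human) rounds come first to seed in-context examples;
    later low-fidelity (LLM) rounds use growing batches, since each
    low-fidelity label is much cheaper.
    """
    plan, spent, rnd = [], 0.0, 0
    while True:
        fidelity = "human" if rnd < human_rounds else "llm"
        cost = (human_cost if fidelity == "human" else llm_cost) * batch
        if spent + cost > budget:
            break                       # next acquisition would overspend
        plan.append((fidelity, batch))
        spent += cost
        rnd += 1
        batch *= 2 if fidelity == "llm" else 1   # grow only cheap batches

    return plan, spent

plan, spent = plan_annotation(budget=40.0)
```

With these illustrative costs, a small number of expensive human batches is followed by progressively larger LLM batches until the budget is exhausted, mirroring the abstract's point that cheap LLM annotations can supplement sparse human supervision.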