71 research outputs found

    Ask-the-Expert: Active Learning Based Knowledge Discovery Using the Expert

    Get PDF
    Often the manual review of large data sets, either for purposes of labeling unlabeled instances or for classifying meaningful results from uninteresting (but statistically significant) ones is extremely resource intensive, especially in terms of subject matter expert (SME) time. Use of active learning has been shown to diminish this review time significantly. However, since active learning is an iterative process of learning a classifier based on a small number of SME-provided labels at each iteration, the lack of an enabling tool can hinder the process of adoption of these technologies in real-life, in spite of their labor-saving potential. In this demo we present ASK-the-Expert, an interactive tool that allows SMEs to review instances from a data set and provide labels within a single framework. ASK-the-Expert is powered by an active learning algorithm for training a classifier in the back end. We demonstrate this system in the context of an aviation safety application, but the tool can be adopted to work as a simple review and labeling tool as well, without the use of active learning

    Affinity Classification Problem by Stochastic Cellular Automata

    Full text link
    This work introduces a new problem, named as, affinity classification problem which is a generalization of the density classification problem. To solve this problem, we introduce temporally stochastic cellular automata where two rules are stochastically applied in each step on all cells of the automata. Our model is defined on 2-dimensional grid having affection capability. We show that this model can be used in several applications like modeling self-healing systems

    Distributed Monitoring of the R(sup 2) Statistic for Linear Regression

    Get PDF
    The problem of monitoring a multivariate linear regression model is relevant in studying the evolving relationship between a set of input variables (features) and one or more dependent target variables. This problem becomes challenging for large scale data in a distributed computing environment when only a subset of instances is available at individual nodes and the local data changes frequently. Data centralization and periodic model recomputation can add high overhead to tasks like anomaly detection in such dynamic settings. Therefore, the goal is to develop techniques for monitoring and updating the model over the union of all nodes data in a communication-efficient fashion. Correctness guarantees on such techniques are also often highly desirable, especially in safety-critical application scenarios. In this paper we develop DReMo a distributed algorithm with very low resource overhead, for monitoring the quality of a regression model in terms of its coefficient of determination (R2 statistic). When the nodes collectively determine that R2 has dropped below a fixed threshold, the linear regression model is recomputed via a network-wide convergecast and the updated model is broadcast back to all nodes. We show empirically, using both synthetic and real data, that our proposed method is highly communication-efficient and scalable, and also provide theoretical guarantees on correctness

    Detecting Abnormal Machine Characteristics in Cloud Infrastructures

    Get PDF
    In the cloud computing environment resources are accessed as services rather than as a product. Monitoring this system for performance is crucial because of typical pay-peruse packages bought by the users for their jobs. With the huge number of machines currently in the cloud system, it is often extremely difficult for system administrators to keep track of all machines using distributed monitoring programs such as Ganglia1 which lacks system health assessment and summarization capabilities. To overcome this problem, we propose a technique for automated anomaly detection using machine performance data in the cloud. Our algorithm is entirely distributed and runs locally on each computing machine on the cloud in order to rank the machines in order of their anomalous behavior for given jobs. There is no need to centralize any of the performance data for the analysis and at the end of the analysis, our algorithm generates error reports, thereby allowing the system administrators to take corrective actions. Experiments performed on real data sets collected for different jobs validate the fact that our algorithm has a low overhead for tracking anomalous machines in a cloud infrastructure

    Interactive Multi-fidelity Learning for Cost-effective Adaptation of Language Model with Sparse Human Supervision

    Full text link
    Large language models (LLMs) have demonstrated remarkable capabilities in various tasks. However, their suitability for domain-specific tasks, is limited due to their immense scale at deployment, susceptibility to misinformation, and more importantly, high data annotation costs. We propose a novel Interactive Multi-Fidelity Learning (IMFL) framework for the cost-effective development of small domain-specific LMs under limited annotation budgets. Our approach formulates the domain-specific fine-tuning process as a multi-fidelity learning problem, focusing on identifying the optimal acquisition strategy that balances between low-fidelity automatic LLM annotations and high-fidelity human annotations to maximize model performance. We further propose an exploration-exploitation query strategy that enhances annotation diversity and informativeness, incorporating two innovative designs: 1) prompt retrieval that selects in-context examples from human-annotated samples to improve LLM annotation, and 2) variable batch size that controls the order for choosing each fidelity to facilitate knowledge distillation, ultimately enhancing annotation quality. Extensive experiments on financial and medical tasks demonstrate that IMFL achieves superior performance compared with single fidelity annotations. Given a limited budget of human annotation, IMFL significantly outperforms the human annotation baselines in all four tasks and achieves very close performance as human annotations on two of the tasks. These promising results suggest that the high human annotation costs in domain-specific tasks can be significantly reduced by employing IMFL, which utilizes fewer human annotations, supplemented with cheaper and faster LLM (e.g., GPT-3.5) annotations to achieve comparable performance.Comment: This work has been accepted by NeurIPS 202
    corecore