1,589 research outputs found
Online Tool Condition Monitoring Based on Parsimonious Ensemble+
Accurate diagnosis of tool wear in metal turning process remains an open
challenge for both scientists and industrial practitioners because of
inhomogeneities in workpiece material, nonstationary machining settings to suit
production requirements, and nonlinear relations between measured variables and
tool wear. Common methodologies for tool condition monitoring still rely on
batch approaches which cannot cope with a fast sampling rate of metal cutting
process. Furthermore they require a retraining process to be completed from
scratch when dealing with a new set of machining parameters. This paper
presents an online tool condition monitoring approach based on Parsimonious
Ensemble+, pENsemble+. The unique feature of pENsemble+ lies in its highly
flexible principle where both ensemble structure and base-classifier structure
can automatically grow and shrink on the fly based on the characteristics of
data streams. Moreover, the online feature selection scenario is integrated to
actively sample relevant input attributes. The paper presents advancement of a
newly developed ensemble learning algorithm, pENsemble+, where online active
learning scenario is incorporated to reduce operator labelling effort. The
ensemble merging scenario is proposed which allows reduction of ensemble
complexity while retaining its diversity. Experimental studies utilising
real-world manufacturing data streams and comparisons with well known
algorithms were carried out. Furthermore, the efficacy of pENsemble was
examined using benchmark concept drift data streams. It has been found that
pENsemble+ incurs low structural complexity and results in a significant
reduction of operator labelling effort.Comment: this paper has been published by IEEE Transactions on Cybernetic
MACHINE LEARNING OPERATIONS (MLOPS) ARCHITECTURE CONSIDERATIONS FOR DEEP LEARNING WITH A PASSIVE ACOUSTIC VECTOR SENSOR
As machine learning augmented decision-making becomes more prevalent, defense applications for these techniques are needed to prevent being outpaced by peer adversaries. One area that has significant potential is deep learning applications to classify passive sonar acoustic signatures, which would accelerate tactical, operational, and strategic decision-making processes in one of the most contested and difficult warfare domains. Convolutional Neural Networks have achieved some of the greatest success in accomplishing this task; however, a full production pipeline to continually train, deploy, and evaluate acoustic deep learning models throughout their lifecycle in a realistic architecture is a barrier to further and more rapid success in this field of research. Two main contributions of this thesis are a proposed production architecture for model lifecycle management using Machine Learning Operations (MLOps) and evaluation of the same on live passive sonar stream. Using the proposed production architecture, this work evaluates model performance differences in a production setting and explores methods to improve model performance in production. Through documenting considerations for creating a platform and architecture to continuously train, deploy, and evaluate various deep learning acoustic classification models, this study aims to create a framework and recommendations to accelerate progress in acoustic deep learning classification research.Los Alamos National LabLieutenant, United States NavyApproved for public release. Distribution is unlimited
Multi-label classification via incremental clustering on an evolving data stream.
With the advancement of storage and processing technology, an enormous amount of data is collected on a daily basis in many applications. Nowadays, advanced data analytics have been used to mine the collected data for useful information and make predictions, contributing to the competitive advantages of companies. The increasing data volume, however, has posed many problems to classical batch learning systems, such as the need to retrain the model completely with the newly arrived samples or the impracticality of storing and accessing a large volume of data. This has prompted interest on incremental learning that operates on data streams. In this study, we develop an incremental online multi-label classification (OMLC) method based on a weighted clustering model. The model is made to adapt to the change of data via the decay mechanism in which each sample's weight dwindles away over time. The clustering model therefore always focuses more on newly arrived samples. In the classification process, only clusters whose weights are greater than a threshold (called mature clusters) are employed to assign labels for the samples. In our method, not only is the clustering model incrementally maintained with the revealed ground truth labels of the arrived samples, the number of predicted labels in a sample are also adjusted based on the Hoeffding inequality and the label cardinality. The experimental results show that our method is competitive compared to several well-known benchmark algorithms on six performance measures in both the stationary and the concept drift settings
Efficient Online Decision Tree Learning with Active Feature Acquisition
Constructing decision trees online is a classical machine learning problem.
Existing works often assume that features are readily available for each
incoming data point. However, in many real world applications, both feature
values and the labels are unknown a priori and can only be obtained at a cost.
For example, in medical diagnosis, doctors have to choose which tests to
perform (i.e., making costly feature queries) on a patient in order to make a
diagnosis decision (i.e., predicting labels). We provide a fresh perspective to
tackle this practical challenge. Our framework consists of an active planning
oracle embedded in an online learning scheme for which we investigate several
information acquisition functions. Specifically, we employ a surrogate
information acquisition function based on adaptive submodularity to actively
query feature values with a minimal cost, while using a posterior sampling
scheme to maintain a low regret for online prediction. We demonstrate the
efficiency and effectiveness of our framework via extensive experiments on
various real-world datasets. Our framework also naturally adapts to the
challenging setting of online learning with concept drift and is shown to be
competitive with baseline models while being more flexible
COMPOSE: Compacted object sample extraction a framework for semi-supervised learning in nonstationary environments
An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this thesis, compacted object sample extraction (COMPOSE) is introduced - a computational geometry-based framework to learn from nonstationary streaming data - where labels are unavailable (or presented very sporadically) after initialization. The feasibility and performance of the algorithm are evaluated on several synthetic and real-world data sets, which present various different scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we also compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch
Continual learning from stationary and non-stationary data
Continual learning aims at developing models that are capable of working on constantly evolving problems over a long-time horizon. In such environments, we can distinguish three essential aspects of training and maintaining machine learning models - incorporating new knowledge, retaining it and reacting to changes. Each of them poses its own challenges, constituting a compound problem with multiple goals.
Remembering previously incorporated concepts is the main property of a model that is required when dealing with stationary distributions. In non-stationary environments, models should be capable of selectively forgetting outdated decision boundaries and adapting to new concepts. Finally, a significant difficulty can be found in combining these two abilities within a single learning algorithm, since, in such scenarios, we have to balance remembering and forgetting instead of focusing only on one aspect.
The presented dissertation addressed these problems in an exploratory way. Its main goal was to grasp the continual learning paradigm as a whole, analyze its different branches and tackle identified issues covering various aspects of learning from sequentially incoming data. By doing so, this work not only filled several gaps in the current continual learning research but also emphasized the complexity and diversity of challenges existing in this domain. Comprehensive experiments conducted for all of the presented contributions have demonstrated their effectiveness and substantiated the validity of the stated claims
Dynamic Data Mining: Methodology and Algorithms
Supervised data stream mining has become an important and challenging data mining task in modern
organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples
and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions.
To address these three challenges, this thesis proposes the novel dynamic data mining (DDM)
methodology by effectively applying supervised ensemble models to data stream mining. DDM can be
loosely defined as categorization-organization-selection of supervised ensemble models. It is inspired
by the idea that although the underlying concepts in a data stream are time-varying, their distinctions
can be identified. Therefore, the models trained on the distinct concepts can be dynamically selected in
order to classify incoming examples of similar concepts.
First, following the general paradigm of DDM, we examine the different concept-drifting stream
mining scenarios and propose corresponding effective and efficient data mining algorithms.
• To address concept drift caused merely by changes of variable distributions, which we term
pseudo concept drift, base models built on categorized streaming data are organized and
selected in line with their corresponding variable distribution characteristics.
• To address concept drift caused by changes of variable and class joint distributions, which we
term true concept drift, an effective data categorization scheme is introduced. A group of
working models is dynamically organized and selected for reacting to the drifting concept.
Secondly, we introduce an integration stream mining framework, enabling the paradigm advocated by
DDM to be widely applicable for other stream mining problems. Therefore, we are able to introduce
easily six effective algorithms for mining data streams with skewed class distributions.
In addition, we also introduce a new ensemble model approach for batch learning, following the same
methodology. Both theoretical and empirical studies demonstrate its effectiveness.
Future work would be targeted at improving the effectiveness and efficiency of the proposed
algorithms. Meantime, we would explore the possibilities of using the integration framework to solve
other open stream mining research problems
Searching for Needles in the Cosmic Haystack
Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with Synthetic Majority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with less false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average with less than a 2% average reduction in classification performance. Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (\u3e95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection
- …