Harvesting Data from Advanced Technologies
Data streams are emerging everywhere: Web logs, Web page click streams, sensor data streams, and credit card transaction flows. Unlike traditional data sets, data streams are generated sequentially and arrive one item at a time rather than being available for random access before learning begins, and they are potentially so huge, or even infinite, that storing the whole data is impractical. To study learning from data streams, we target online learning, which builds a best-so-far model on the fly by sequentially feeding in newly arrived data, updates the model as needed, and then applies the learned model for accurate real-time prediction or classification in real-world applications. Several challenges arise from this scenario: first, data is not available for random access or even multiple passes; second, data imbalance is a common situation; third, the model's performance should be reasonable even when the amount of data is limited; fourth, the model should be easy to update but not updated too frequently; and finally, the model should always be ready for prediction and classification. To meet these challenges, we investigate streaming feature selection, taking advantage of mutual information and group structures among candidate features. Streaming feature selection reduces the number of features by removing noisy, irrelevant, or redundant features and selecting relevant features on the fly, and it brings tangible benefits to applications: it speeds up the learning process, improves learning accuracy, enhances generalization, and improves model interpretability. Compared with traditional feature selection, which can only handle pre-given data sets and ignores potential group structures among candidate features, streaming feature selection can handle streaming data and select meaningful, valuable feature sets, with or without group structures, on the fly.
In this research, we propose 1) a novel streaming feature selection algorithm (GFSSF, Group Feature Selection with Streaming Features) that exploits mutual information and group structures among candidate features for both group-level and individual-level feature selection from streaming data; 2) a lazy online prediction model with data fusion, feature selection, and weighting technologies for real-time traffic prediction from heterogeneous sensor data streams; 3) a lazy online learning model (LB, Live Bayes) with dynamic resampling technology to learn from imbalanced embedded mobile sensor data streams for real-time activity recognition and user recognition; and 4) a lazy-update online learning model (CMLR, Cost-sensitive Multinomial Logistic Regression) with streaming feature selection for accurate real-time classification from imbalanced and small sensor data streams. Finally, by integrating traffic flow theory, advanced sensors, data gathering, data fusion, feature selection and weighting, online learning, and visualization technologies to estimate and visualize current and future traffic, a real-time transportation prediction system named VTraffic is built for the Vermont Agency of Transportation.
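The abstract does not spell out GFSSF's selection rule, but the relevance/redundancy idea behind mutual-information-based streaming feature selection can be sketched. The following is a hypothetical toy illustration, not the paper's algorithm: each arriving feature is kept only if it is sufficiently relevant to the label and not redundant with an already-selected feature. The thresholds `rel_thresh` and `red_thresh` and the redundancy criterion are assumptions made for the sketch.

```python
import numpy as np
from collections import Counter

def mutual_info(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def stream_select(feature_stream, y, rel_thresh=0.05, red_thresh=0.9):
    """Select features on the fly (toy sketch, NOT the GFSSF algorithm):
    keep an arriving feature if it is relevant to the label y and not
    largely redundant with any already-selected feature."""
    selected = {}  # feature name -> observed values
    for name, values in feature_stream:
        rel = mutual_info(values, y)
        if rel < rel_thresh:
            continue  # noisy/irrelevant feature: discard immediately
        redundant = any(
            mutual_info(values, s) >= red_thresh * min(rel, mutual_info(s, y))
            for s in selected.values()
        )
        if not redundant:
            selected[name] = values
    return list(selected)
```

In this sketch a duplicate of an already-selected feature shares nearly all of its information with it and is therefore dropped, while an informative new feature survives both tests.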
Imbal-OL: Online Machine Learning from Imbalanced Data Streams in Real-world IoT
Typically a Neural Network (NN) is trained in data centers on historic datasets; a C source file of the trained model (the model as a char array) is then generated and flashed onto IoT devices. This standard process limits the flexibility of billions of deployed ML-powered devices: they cannot learn unseen or fresh data patterns (static intelligence) and cannot adapt to dynamic scenarios. To address this issue, Online Machine Learning (OL) algorithms are currently deployed on IoT devices, giving devices the ability to re-train themselves locally by continuously updating the last few NN layers with the unseen data patterns encountered after deployment.
In OL, catastrophic forgetting is common when NNs are trained on non-stationary data distributions. Most recent work in the OL domain rests on the implicit assumption that the distribution of local training data is balanced. In fact, the sensor data streams in real-world IoT are severely imbalanced and temporally correlated. This paper introduces Imbal-OL, a resource-friendly technique that can be used as an OL plugin to balance the class sizes in a range of data streams. When an Imbal-OL-processed stream is used for OL, models adapt faster to changes in the stream while, in parallel, catastrophic forgetting is prevented. Experimental evaluation of Imbal-OL on CIFAR datasets with ResNet-18 demonstrates its ability to deal with imperfect data streams, as it produces high-quality models even under challenging learning settings.
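The abstract does not describe Imbal-OL's internals, but one common way to balance class sizes in a stream before on-device training is a class-balanced replay buffer. The sketch below is an assumed illustration of that general idea (the eviction policy and `capacity` parameter are mine, not taken from the paper): when the buffer is full, a new sample displaces a random sample from whichever class is currently over-represented.

```python
import random
from collections import defaultdict

class BalancedReplayBuffer:
    """Toy class-balancing buffer for an imbalanced stream (an assumed
    sketch, not the Imbal-OL algorithm): when full, a new sample evicts
    a random sample from the currently largest class, so minority-class
    samples accumulate instead of being drowned out."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.per_class = defaultdict(list)  # label -> stored samples
        self.rng = random.Random(seed)

    def __len__(self):
        return sum(len(v) for v in self.per_class.values())

    def add(self, x, label):
        if len(self) < self.capacity:
            self.per_class[label].append(x)
            return
        # Buffer full: evict from the over-represented class, then insert.
        largest = max(self.per_class, key=lambda c: len(self.per_class[c]))
        victims = self.per_class[largest]
        victims.pop(self.rng.randrange(len(victims)))
        self.per_class[label].append(x)
```

Feeding a heavily skewed stream through such a buffer yields a near-balanced training set for the periodic OL update of the last few NN layers.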
Explainable Lifelong Stream Learning Based on "Glocal" Pairwise Fusion
Real-time on-device continual learning applications run on mobile phones, consumer robots, and smart appliances. Such devices have limited processing and memory storage capabilities, whereas continual learning acquires data over a long period of time. By necessity, lifelong learning algorithms have to operate under such constraints while delivering good performance. This study presents the Explainable Lifelong Learning (ExLL) model, which incorporates several important traits: 1) learning to learn, in a single pass, from streaming data with scarce examples and resources; 2) a self-organizing prototype-based architecture that expands as needed, clusters streaming data into separable groups by similarity, and preserves data against catastrophic forgetting; 3) an interpretable architecture that converts the clusters into explainable IF-THEN rules and justifies model predictions in terms of what is similar and dissimilar to the inference; and 4) inference at the global and local levels using a pairwise decision fusion process to enhance accuracy, hence ``Glocal Pairwise Fusion.'' We compare ExLL against contemporary online learning algorithms for image recognition, using the OpenLoris, F-SIOL-310, and Places datasets to evaluate several continual learning scenarios: video streams, low-sample learning, ability to scale, and imbalanced data streams. The algorithms are evaluated on accuracy, number of parameters, and experiment runtime. ExLL outperforms all algorithms on accuracy in the majority of the tested scenarios.
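The abstract's traits 2) and 3), a self-expanding prototype architecture whose clusters double as IF-THEN rules, can be illustrated with a minimal sketch. This is a generic nearest-prototype toy, not ExLL itself: the fixed `radius` threshold, the running-mean prototype update, and the rule wording are all assumptions made for the example.

```python
import math

class PrototypeRules:
    """Toy self-expanding prototype classifier (a generic sketch, not the
    ExLL model): a new prototype is created whenever no same-label
    prototype lies within `radius`; each prototype doubles as a rule
    'IF x is near this prototype THEN predict its label'."""

    def __init__(self, radius):
        self.radius = radius
        self.protos = []  # each: {'center': [...], 'count': n, 'label': y}

    def learn(self, x, label):
        for p in self.protos:
            if p['label'] == label and math.dist(x, p['center']) <= self.radius:
                # Single-pass update: fold x into the prototype's running mean.
                n = p['count']
                p['center'] = [(c * n + xi) / (n + 1)
                               for c, xi in zip(p['center'], x)]
                p['count'] = n + 1
                return
        # No nearby prototype of this label: expand the architecture.
        self.protos.append({'center': list(x), 'count': 1, 'label': label})

    def predict(self, x):
        best = min(self.protos, key=lambda p: math.dist(x, p['center']))
        return best['label']

    def rules(self):
        return [f"IF dist(x, {p['center']}) <= {self.radius} THEN y = {p['label']}"
                for p in self.protos]
```

Because every prediction is attributable to one readable rule, the model's decisions can be justified in terms of which stored prototype the input resembles.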
Evaluation methods and decision theory for classification of streaming data with temporal dependence
Predictive modeling on data streams plays an important role in modern data analysis, where data arrives continuously and must be mined in real time. In the stream setting the data distribution often evolves over time, and models that update themselves during operation are becoming the state of the art. This paper formalizes a learning and evaluation scheme for such predictive models. We theoretically analyze the evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that the commonly accepted data stream classification measures, such as classification accuracy and the Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present, and therefore should not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors on datasets with temporal dependence. We formulate the decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. We propose a combined measure of classification performance that accounts for temporal dependence, and we recommend using it as the main performance measure in the classification of streaming data.
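One way to make an accuracy-style measure honest under temporal dependence, in the spirit of this line of work, is to normalize the classifier's accuracy against a "no-change" baseline that simply repeats the previous true label, which is a strong predictor on temporally correlated streams. The sketch below illustrates that idea; the function name, the handling of the first instance, and the convention for a perfect baseline are assumptions of this example, not necessarily the paper's exact combined measure.

```python
def kappa_temporal(y_true, y_pred):
    """Kappa-style score normalized against a no-change baseline
    (illustrative sketch): 1.0 means perfect prediction, 0.0 means no
    better than always repeating the previous true label, negative
    means worse than that baseline."""
    assert len(y_true) == len(y_pred) and len(y_true) > 1
    n = len(y_true) - 1  # the first instance has no previous label
    acc = sum(t == p for t, p in zip(y_true[1:], y_pred[1:])) / n
    acc_nochange = sum(y_true[i] == y_true[i - 1]
                       for i in range(1, len(y_true))) / n
    if acc_nochange == 1.0:
        return 0.0  # assumed convention: baseline already perfect
    return (acc - acc_nochange) / (1.0 - acc_nochange)
```

On a stream where labels rarely change, plain accuracy rewards the trivial repeat-last-label strategy, whereas this normalized score assigns it exactly zero.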