A survey on online active learning
Online active learning is a paradigm in machine learning that aims to select
the most informative data points to label from a data stream. The problem of
minimizing the cost associated with collecting labeled observations has gained
a lot of attention in recent years, particularly in real-world applications
where data is only available in an unlabeled form. Annotating each observation
can be time-consuming and costly, making it difficult to obtain large amounts
of labeled data. To overcome this issue, many active learning strategies have
been proposed in recent decades, aiming to select the most informative
observations for labeling in order to improve the performance of machine
learning models. These approaches can be broadly divided into two categories:
static pool-based and stream-based active learning. Pool-based active learning
involves selecting a subset of observations from a closed pool of unlabeled
data, and it has been the focus of many surveys and literature reviews.
However, the growing availability of data streams has led to an increase in the
number of approaches that focus on online active learning, which involves
continuously selecting and labeling observations as they arrive in a stream.
This work aims to provide an overview of the most recently proposed approaches
for selecting the most informative observations from data streams in the
context of online active learning. We review the various techniques that have
been proposed and discuss their strengths and limitations, as well as the
challenges and opportunities that exist in this area of research. Our review
aims to provide a comprehensive and up-to-date overview of the field and to
highlight directions for future work.
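To make the stream-based setting concrete, the following is a minimal, illustrative sketch of margin-based uncertainty sampling on a data stream; it is not taken from the survey. The synthetic stream, the seed set, and the margin threshold of 0.5 are all assumptions chosen for illustration.

```python
# Illustrative sketch of stream-based (online) active learning via
# margin-based uncertainty sampling. Synthetic data; not from the survey.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Hypothetical 2-D stream: the true class is the sign of x0 + x1.
def next_point():
    x = rng.normal(size=2)
    return x, int(x[0] + x[1] > 0)

model = SGDClassifier(random_state=0)  # hinge loss, supports partial_fit

# Seed with a few labeled points so the model is initialized.
X0 = rng.normal(size=(10, 2))
y0 = (X0[:, 0] + X0[:, 1] > 0).astype(int)
model.partial_fit(X0, y0, classes=np.array([0, 1]))

MARGIN = 0.5  # query the oracle only when the point is near the boundary
labels_requested = 0

for _ in range(500):
    x, y = next_point()
    margin = abs(model.decision_function(x.reshape(1, -1))[0])
    if margin < MARGIN:                      # informative observation
        model.partial_fit(x.reshape(1, -1), [y])  # pay for one annotation
        labels_requested += 1

print(labels_requested)
```

The point of the sketch is the labeling economics: only observations the current model is unsure about incur annotation cost, while confident points stream past unlabeled.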
Employee Churn Prediction using Logistic Regression and Support Vector Machine
Retaining existing employees is a greater challenge for a Human Resources (HR) team than hiring new ones. For any company, losing valuable employees means losses in time, money, productivity, and trust. These losses could be reduced if HR could identify in advance which employees are planning to quit, so we investigate the employee churn problem from a machine learning perspective. We design supervised classification models using Logistic Regression and Support Vector Machine (SVM). The models are trained on the IBM HR employee dataset retrieved from https://kaggle.com and then fine-tuned to boost performance. Metrics such as precision, recall, the confusion matrix, AUC, and the ROC curve were used to compare the models. The Logistic Regression model recorded an accuracy of 0.67, sensitivity of 0.65, specificity of 0.70, Type I error of 0.30, Type II error of 0.35, and an AUC score of 0.73, whereas SVM achieved an accuracy of 0.93 with sensitivity of 0.98, specificity of 0.88, Type I error of 0.12, Type II error of 0.01, and an AUC score of 0.96.
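A pipeline of this shape can be sketched as follows. Since the Kaggle dataset is not bundled here, a synthetic imbalanced dataset stands in for the IBM HR data; the feature counts, class weights, and model settings are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a churn-style comparison of Logistic Regression vs. SVM,
# on synthetic data standing in for the IBM HR employee dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ~80/20 stay/churn split, mimicking a class-imbalanced HR problem.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

results = {}
for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("svm", SVC(probability=True, random_state=0))]:
    model = make_pipeline(StandardScaler(), clf)
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    results[name] = {
        "auc": roc_auc_score(y_te, proba),
        "sensitivity": tp / (tp + fn),  # recall on the churn class
        "specificity": tn / (tn + fp),  # recall on the stay class
    }
print(results)
```

Sensitivity and specificity are derived directly from the confusion matrix, which is why the abstract can report Type I error as 1 − specificity and Type II error as 1 − sensitivity.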
Ensemble deep learning: A review
Ensemble learning combines several individual models to obtain better
generalization performance. Currently, deep learning models with multilayer
processing architectures show better performance than shallow or traditional
classification models. Deep ensemble learning models combine the advantages of
both deep learning and ensemble learning, so that the final model has better
generalization performance. This paper reviews state-of-the-art deep ensemble
models and hence serves as an extensive summary for researchers. The ensemble
models are broadly categorised into bagging, boosting, and stacking;
negative-correlation-based deep ensemble models; explicit/implicit ensembles;
homogeneous/heterogeneous ensembles; decision fusion strategies; and
unsupervised, semi-supervised, reinforcement learning, online/incremental, and
multilabel deep ensemble models. The application of deep ensemble models in
different domains is also briefly discussed. Finally, we conclude this paper
with some future recommendations and research directions.
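As a small illustration of the decision-fusion idea the taxonomy mentions, the following sketch soft-votes over heterogeneous base models. It uses shallow scikit-learn models rather than deep networks purely to keep the example self-contained; the dataset and base-learner choices are assumptions.

```python
# Minimal sketch of decision fusion (soft voting) over a heterogeneous
# ensemble, using shallow models as stand-ins for deep base learners.
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [("lr", LogisticRegression()),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0))]

# Soft voting averages the class probabilities of all base models.
ensemble = VotingClassifier(base, voting="soft").fit(X_tr, y_tr)

scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te) for name, clf in base}
scores["ensemble"] = ensemble.score(X_te, y_te)
print(scores)
```

The same fusion pattern carries over to deep ensembles: each base network emits class probabilities, and the fused prediction averages (or weights) them.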
Class-Imbalanced Learning on Graphs: A Survey
The rapid advancement in data-driven research has increased the demand for
effective graph data analysis. However, real-world data often exhibits class
imbalance, leading to poor performance of machine learning models. To overcome
this challenge, class-imbalanced learning on graphs (CILG) has emerged as a
promising solution that combines the strengths of graph representation learning
and class-imbalanced learning. In recent years, significant progress has been
made in CILG. Anticipating that such a trend will continue, this survey aims to
offer a comprehensive understanding of the current state-of-the-art in CILG and
provide insights for future research directions. Concerning the former, we
introduce the first taxonomy of existing work and its connection to existing
imbalanced learning literature. Concerning the latter, we critically analyze
recent work in CILG and discuss urgent lines of inquiry within the topic.
Moreover, we provide a continuously maintained reading list of papers and code
at https://github.com/yihongma/CILG-Papers.
Comment: submitted to ACM Computing Surveys (CSUR).
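One of the classic class-imbalanced-learning remedies that CILG builds on is cost-sensitive reweighting of the loss. The sketch below shows it on plain tabular data standing in for imbalanced node classes; the 95/5 split and logistic model are illustrative assumptions, not a graph-specific method from the survey.

```python
# Illustrative cost-sensitive reweighting for class-imbalanced learning,
# with synthetic tabular data standing in for imbalanced node classes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# 95/5 imbalance: the minority class mimics a rare node label.
X, y = make_classification(n_samples=2000, weights=[0.95],
                           flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Plain fit vs. a fit whose loss upweights minority-class errors
# inversely to class frequency ("balanced" class weights).
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

plain_ba = balanced_accuracy_score(y_te, plain.predict(X_te))
weighted_ba = balanced_accuracy_score(y_te, weighted.predict(X_te))
print(plain_ba, weighted_ba)
```

Balanced accuracy averages per-class recall, so it exposes the minority-class degradation that plain accuracy hides; the graph-specific methods surveyed in CILG combine such reweighting and resampling ideas with graph representation learning.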
An Overview on Application of Machine Learning Techniques in Optical Networks
Today's telecommunication networks have become sources of enormous amounts of
widely heterogeneous data. This information can be retrieved from network
traffic traces, network alarms, signal quality indicators, users' behavioral
data, etc. Advanced mathematical tools are required to extract meaningful
information from these data and take decisions pertaining to the proper
functioning of the networks from the network-generated data. Among these
mathematical tools, Machine Learning (ML) is regarded as one of the most
promising methodological approaches to perform network-data analysis and enable
automated network self-configuration and fault management. The adoption of ML
techniques in the field of optical communication networks is motivated by the
unprecedented growth of network complexity faced by optical networks in the
last few years. Such complexity increase is due to the introduction of a huge
number of adjustable and interdependent system parameters (e.g., routing
configurations, modulation format, symbol rate, coding schemes, etc.) that are
enabled by the usage of coherent transmission/reception technologies, advanced
digital signal processing and compensation of nonlinear effects in optical
fiber propagation. In this paper we provide an overview of the application of
ML to optical communications and networking. We classify and survey relevant
literature dealing with the topic, and we also provide an introductory tutorial
on ML for researchers and practitioners interested in this field. Although a
good number of research papers have recently appeared, the application of ML to
optical networks is still in its infancy: to stimulate further work in this
area, we conclude the paper by proposing possible new research directions.
Real-to-Virtual Domain Unification for End-to-End Autonomous Driving
In the spectrum of vision-based autonomous driving, vanilla end-to-end models
are not interpretable and suboptimal in performance, while mediated perception
models require additional intermediate representations such as segmentation
masks or detection bounding boxes, whose annotation can be prohibitively
expensive as we move to a larger scale. More critically, all prior works fail
to deal with the notorious domain shift if we were to merge data collected from
different sources, which greatly hinders the model generalization ability. In
this work, we address the above limitations by taking advantage of virtual data
collected from driving simulators, and present DU-drive, an unsupervised
real-to-virtual domain unification framework for end-to-end autonomous driving.
It first transforms real driving data to its less complex counterpart in the
virtual domain and then predicts vehicle control commands from the generated
virtual image. Our framework has three unique advantages: 1) it maps driving
data collected from a variety of source distributions into a unified domain,
effectively eliminating domain shift; 2) the learned virtual representation is
simpler than the input real image and closer in form to the "minimum sufficient
statistic" for the prediction task, which relieves the burden of the
compression phase while optimizing the information bottleneck tradeoff and
leads to superior prediction performance; 3) it takes advantage of annotated
virtual data which is unlimited and free to obtain. Extensive experiments on
two public driving datasets and two driving simulators demonstrate the
performance superiority and interpretive capability of DU-drive.