A survey on online active learning
Online active learning is a paradigm in machine learning that aims to select
the most informative data points to label from a data stream. The problem of
minimizing the cost associated with collecting labeled observations has gained
a lot of attention in recent years, particularly in real-world applications
where data is only available in an unlabeled form. Annotating each observation
can be time-consuming and costly, making it difficult to obtain large amounts
of labeled data. To overcome this issue, many active learning strategies have
been proposed in the last decades, aiming to select the most informative
observations for labeling in order to improve the performance of machine
learning models. These approaches can be broadly divided into two categories:
static pool-based and stream-based active learning. Pool-based active learning
involves selecting a subset of observations from a closed pool of unlabeled
data, and it has been the focus of many surveys and literature reviews.
However, the growing availability of data streams has led to an increase in the
number of approaches that focus on online active learning, which involves
continuously selecting and labeling observations as they arrive in a stream.
This work aims to provide an overview of the most recently proposed approaches
for selecting the most informative observations from data streams in the
context of online active learning. We review the various techniques that have
been proposed and discuss their strengths and limitations, as well as the
challenges and opportunities that exist in this area of research. Our review
aims to provide a comprehensive and up-to-date overview of the field and to
highlight directions for future work.
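The stream-based setting described in this abstract can be illustrated with a minimal sketch. It is not taken from the survey: the uncertainty-margin query rule, the small online logistic model, and the names `OnlineLogisticModel`, `stream_active_learning`, `margin`, and `budget` are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticModel:
    """Hypothetical binary classifier updated one observation at a time."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x, y):
        # One stochastic gradient step on the log-loss.
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def stream_active_learning(stream, oracle, model, margin=0.2, budget=50):
    """Query a label only when the model is uncertain (assumed rule).

    Each observation is seen once; it is either sent to the oracle for
    labeling or discarded. Returns the number of labels requested.
    """
    queried = 0
    for x in stream:
        if queried >= budget:
            break
        p = model.predict_proba(x)
        if abs(p - 0.5) < margin:  # prediction is close to the boundary
            model.update(x, oracle(x))
            queried += 1
    return queried


# Toy usage on synthetic data: the label is 1 when x0 + x1 > 0.
rng = random.Random(0)
toy_stream = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
toy_model = OnlineLogisticModel(n_features=2)
n_labels = stream_active_learning(
    toy_stream, lambda x: 1 if x[0] + x[1] > 0 else 0, toy_model
)
print("labels requested:", n_labels)
```

The fixed margin is only one possible query rule; under a labeling budget, the threshold and the budget together decide how many observations are annotated.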
A survey on computational intelligence approaches for predictive modeling in prostate cancer
Predictive modeling in medicine involves the development of computational models which are capable of analysing large amounts of data in order to predict healthcare outcomes for individual patients. Computational intelligence approaches are suitable when the data to be modelled are too complex for conventional statistical techniques to process quickly and efficiently. These advanced approaches are based on mathematical models that have been especially developed for dealing with the uncertainty and imprecision which are typically found in clinical and biological datasets. This paper provides a survey of recent work on computational intelligence approaches that have been applied to prostate cancer predictive modeling, and considers the challenges which need to be addressed. In particular, the paper considers a broad definition of computational intelligence which includes evolutionary algorithms (also known as metaheuristic optimisation or nature-inspired optimisation algorithms), Artificial Neural Networks, Deep Learning, fuzzy-based approaches, and hybrids of these, as well as Bayesian-based approaches and Markov models. Metaheuristic optimisation approaches, such as Ant Colony Optimisation, Particle Swarm Optimisation, and Artificial Immune Network, have been utilised for optimising the performance of prostate cancer predictive models, and the suitability of these approaches is discussed.
Randomized Reference Classifier with Gaussian Distribution and Soft Confusion Matrix Applied to the Improving Weak Classifiers
This paper addresses the problem of building the RRC model using probability
distributions other than the beta distribution. More precisely, we propose to
build the RRC model using the truncated normal distribution. Heuristic
procedures for the expected value and the variance of the truncated normal
distribution are also proposed. The consequences of applying the truncated
normal distribution in the RRC model are tested using an SCM-based model. The
experimental evaluation is performed using four different base classifiers and
seven quality measures. The results show that the proposed approach is
comparable to the RRC model built using the beta distribution. Moreover, for
some base classifiers, the truncated-normal-based SCM algorithm turned out to
be better at discovering objects coming from minority classes.
Comment: arXiv admin note: text overlap with arXiv:1901.0882
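For reference, the mean and variance of a truncated normal distribution have a standard closed form, sketched below. This is the textbook formula, not the heuristic procedure proposed in the paper (which the abstract does not describe); the function name `truncated_normal_moments` and the default interval [0, 1] are assumptions made for illustration.

```python
import math


def _pdf(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_moments(mu, sigma, lower=0.0, upper=1.0):
    """Mean and variance of N(mu, sigma^2) truncated to [lower, upper]."""
    if sigma <= 0 or lower >= upper:
        raise ValueError("need sigma > 0 and lower < upper")
    alpha = (lower - mu) / sigma
    beta = (upper - mu) / sigma
    mass = _cdf(beta) - _cdf(alpha)  # probability inside the interval
    if mass <= 0.0:
        raise ValueError("interval carries negligible probability mass")
    ratio = (_pdf(alpha) - _pdf(beta)) / mass
    mean = mu + sigma * ratio
    var = sigma ** 2 * (
        1.0 + (alpha * _pdf(alpha) - beta * _pdf(beta)) / mass - ratio ** 2
    )
    return mean, var


# Example call: a normal centred in the interval keeps its mean.
print(truncated_normal_moments(0.5, 1.0))
```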
Designing labeled graph classifiers by exploiting the Rényi entropy of the dissimilarity representation
Representing patterns as labeled graphs is becoming increasingly common in
the broad field of computational intelligence. Accordingly, a wide repertoire
of pattern recognition tools, such as classifiers and knowledge discovery
procedures, are nowadays available and tested for various datasets of labeled
graphs. However, the design of effective learning procedures operating in the
space of labeled graphs is still a challenging problem, especially from the
computational complexity viewpoint. In this paper, we present a major
improvement of a general-purpose classifier for graphs, which is conceived on
an interplay between dissimilarity representation, clustering,
information-theoretic techniques, and evolutionary optimization algorithms. The
improvement focuses on a specific key subroutine devised to compress the input
data. We prove different theorems which are fundamental to the setting of the
parameters controlling such a compression operation. We demonstrate the
effectiveness of the resulting classifier by benchmarking the developed
variants on well-known datasets of labeled graphs, considering as distinct
performance indicators the classification accuracy, computing time, and
parsimony in terms of structural complexity of the synthesized classification
models. The results show state-of-the-art standards in terms of test set
accuracy and a considerable speed-up in computing time.
Comment: Revised versio
A systematic review of data quality issues in knowledge discovery tasks
The volume of data keeps growing because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.
A Ranking Distance Based Diversity Measure for Multiple Classifier Systems
Multiple classifier fusion belongs to decision-level information fusion, which has been widely used in many pattern classification applications, especially when a single classifier is not competent. However, multiple classifier fusion cannot guarantee an improvement in classification accuracy. The diversity among the classifiers in a multiple classifier system (MCS) is crucial for improving the fused classification accuracy. Various diversity measures for MCS have been proposed, which are mainly based on the average sample-wise classification consistency between different member classifiers. In this paper, we propose to define the diversity between member classifiers from a different standpoint. If different member classifiers in an MCS are good at classifying different classes, i.e., there exist expert classifiers for each class of concern, an improvement in the accuracy of classifier fusion can be expected. Each classifier has a ranking of classes in terms of its classification accuracies, based on which a new diversity measure is implemented using the ranking distance. A larger average ranking distance represents a higher diversity. The newly proposed diversity measure is used together with each single classifier's performance on training samples to design and optimize the MCS. Experiments, simulations, and related analyses are provided to illustrate and validate the newly proposed diversity measure.
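The idea of ranking classes per classifier and averaging a ranking distance over classifier pairs can be sketched as follows. The abstract does not say which ranking distance the authors use, so the Spearman footrule (sum of absolute rank differences) is an assumption here, as are the function names and the tie-breaking rule.

```python
from itertools import combinations


def class_ranking(per_class_accuracy):
    """Map each class to its rank (0 = most accurate class).

    Ties are broken by class label, an arbitrary choice for this sketch.
    """
    ordered = sorted(per_class_accuracy, key=lambda c: (-per_class_accuracy[c], c))
    return {c: rank for rank, c in enumerate(ordered)}


def footrule_distance(rank_a, rank_b):
    # Spearman footrule: total displacement of classes between two rankings.
    return sum(abs(rank_a[c] - rank_b[c]) for c in rank_a)


def average_ranking_distance(accuracies):
    """Average pairwise ranking distance over all member classifiers.

    `accuracies` holds one dict per classifier, mapping class label to
    that classifier's accuracy on the class. Larger values mean the
    classifiers are strong on different classes.
    """
    rankings = [class_ranking(a) for a in accuracies]
    pairs = list(combinations(rankings, 2))
    if not pairs:
        return 0.0
    return sum(footrule_distance(a, b) for a, b in pairs) / len(pairs)


# Two classifiers that are experts on opposite classes.
experts = [
    {"a": 0.9, "b": 0.5, "c": 0.1},
    {"a": 0.1, "b": 0.5, "c": 0.9},
]
print(average_ranking_distance(experts))  # 4.0
```

Two classifiers with identical per-class rankings give a distance of 0, which matches the intuition that they add no class-level diversity to the ensemble.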