Large-Scale Online Semantic Indexing of Biomedical Articles via an Ensemble of Multi-Label Classification Models
Background: In this paper we present the approaches and methods employed to
deal with a large-scale multi-label semantic indexing task for biomedical
papers. This work was carried out mainly within the context of the BioASQ
challenge of 2014. Methods: The main contribution of this work is a
multi-label ensemble method that incorporates a McNemar statistical
significance test to validate the combination of the constituent machine
learning algorithms. Secondary contributions include a study of the temporal
aspects of the BioASQ corpus (the observations also apply to BioASQ's
superset, the PubMed article collection) and the proper adaptation of the
algorithms used to deal with this challenging classification task.
Results: The ensemble method we developed is compared to other approaches in
experimental scenarios on subsets of the BioASQ corpus, with positive
results. During the BioASQ 2014 challenge we obtained first place in the
first batch and third place in the two following batches. Our success in the
BioASQ challenge showed that a fully automated machine-learning approach,
one that does not rely on heuristics or hand-crafted rules, can be highly
competitive and outperform other approaches in similarly challenging contexts.
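The McNemar test mentioned above compares two classifiers on the same examples by looking only at the cases where they disagree. The abstract does not give the authors' exact acceptance criterion, so the `should_combine` gate below is an illustrative assumption; only the statistic itself is standard:

```python
def mcnemar_statistic(correct_a, correct_b):
    """McNemar chi-squared statistic (with continuity correction), computed
    from per-example correctness flags (1/0) of two classifiers."""
    # b: A right, B wrong; c: A wrong, B right -- only disagreements matter.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

def should_combine(correct_a, correct_b, critical=3.84):
    """Hypothetical gate: combine a pair of models only when their
    disagreement is not significant (chi-squared, 1 dof, alpha = 0.05)."""
    return mcnemar_statistic(correct_a, correct_b) < critical
```

With `correct_a = [1, 1, 1, 0, 0]` and `correct_b = [1, 0, 0, 1, 1]` the disagreement counts are b = 2, c = 2, giving a statistic of 0.25, well below the 3.84 critical value.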
Ensemble deep learning: A review
Ensemble learning combines several individual models to obtain better
generalization performance. Currently, deep learning models with multilayer
processing architectures show better performance than shallow or traditional
classification models. Deep ensemble learning models combine the advantages
of both deep learning and ensemble learning, so that the final model has
better generalization performance. This paper reviews state-of-the-art deep
ensemble models and hence serves as an extensive summary for researchers.
The ensemble models are broadly categorised into bagging, boosting and
stacking models; negative-correlation-based deep ensemble models;
explicit/implicit ensembles; homogeneous/heterogeneous ensembles; decision
fusion strategies; and unsupervised, semi-supervised, reinforcement learning,
online/incremental and multilabel deep ensemble models. The application of
deep ensemble models in different domains is also briefly discussed. Finally,
we conclude the paper with some future recommendations and research directions.
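Two of the categories named in the review, bagging and decision fusion, can be sketched minimally. The toy "stump" learner and all names below are illustrative assumptions, not part of the review; the point is only the shape of bootstrap resampling plus majority-vote fusion:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Resample len(data) points with replacement -- the 'bagging' step."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Decision fusion: the most common label across ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

def fit_bagged_stumps(points, labels, n_members=5, seed=0):
    """Toy bagging ensemble of one-dimensional threshold 'stumps': each
    member's threshold is the mean of its own bootstrap sample."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_members):
        sample = bootstrap_sample(list(zip(points, labels)), rng)
        stumps.append(sum(x for x, _ in sample) / len(sample))
    return stumps

def predict(stumps, x):
    """Fuse the members' individual decisions by majority vote."""
    return majority_vote([1 if x > t else 0 for t in stumps])
```

Boosting and stacking differ mainly in how members are trained (sequentially reweighted vs. a meta-learner fit on member outputs), but the fusion step above is the common thread.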
Towards Understanding Fairness and its Composition in Ensemble Machine Learning
Machine Learning (ML) software has been widely adopted in modern society,
with reported fairness implications for minority groups based on race, sex,
age, etc. Many recent works have proposed methods to measure and mitigate
algorithmic bias in ML models. The existing approaches focus on single
classifier-based ML models. However, real-world ML models are often composed of
multiple independent or dependent learners in an ensemble (e.g., Random
Forest), where the fairness composes in a non-trivial way. How does fairness
compose in ensembles? What are the fairness impacts of the learners on the
ultimate fairness of the ensemble? Can fair learners result in an unfair
ensemble? Furthermore, studies have shown that hyperparameters influence the
fairness of ML models. Ensemble hyperparameters are more complex since they
affect how learners are combined in different categories of ensembles.
Understanding the impact of ensemble hyperparameters on fairness will help
programmers design fair ensembles. Today, we do not understand these fully for
different ensemble algorithms. In this paper, we comprehensively study popular
real-world ensembles: bagging, boosting, stacking and voting. We have developed
a benchmark of 168 ensemble models collected from Kaggle on four popular
fairness datasets. We use existing fairness metrics to understand the
composition of fairness. Our results show that ensembles can be designed to be
fairer without using mitigation techniques. We also identify the interplay
between fairness composition and data characteristics to guide fair ensemble
design. Finally, our benchmark can be leveraged for further research on fair
ensembles. To the best of our knowledge, this is one of the first and largest
studies on fairness composition in ensembles yet presented in the literature.
Comment: Accepted at ICSE 202
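The question "can fair learners result in an unfair ensemble?" can be made concrete with one common group-fairness metric. The metric below (statistical parity difference) is a standard definition; the majority-vote fusion and all variable names are illustrative assumptions, not the paper's benchmark code:

```python
def statistical_parity_difference(y_pred, group):
    """P(pred=1 | group=0) - P(pred=1 | group=1); 0 means the two groups
    receive positive predictions at the same rate."""
    g0 = [p for p, g in zip(y_pred, group) if g == 0]
    g1 = [p for p, g in zip(y_pred, group) if g == 1]
    return sum(g0) / len(g0) - sum(g1) / len(g1)

def ensemble_vote(member_preds):
    """Majority vote over per-member binary prediction lists."""
    return [1 if sum(col) * 2 > len(col) else 0
            for col in zip(*member_preds)]
```

Computing the metric once per learner and once for `ensemble_vote(...)` is enough to observe the non-trivial composition the paper studies: the ensemble's value is not a simple average of the members' values.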
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
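For readers unfamiliar with the programming model the essay defends, the canonical "nail" is word count: a single map, shuffle, reduce pass with no iteration. This is a toy, single-process sketch of that shape (Hadoop's actual distributed execution is out of scope):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(w, 1) for w in doc.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["the hammer", "the nail the nail"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
```

Iterative algorithms do not fit this one-pass shape, which is exactly the friction the essay proposes to resolve by choosing non-iterative alternatives rather than new frameworks.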
Open-source neural architecture search with ensemble and pre-trained networks
The training and optimization of neural networks using pre-trained, super
learner and ensemble approaches are explored. Neural networks, and in
particular Convolutional Neural Networks (CNNs), are often optimized using
default parameters. Neural Architecture Search (NAS) enables multiple
architectures to be evaluated prior to selection of the optimal architecture.
Our contribution is to develop, and make available to the community, a system
that integrates open-source tools for the neural architecture search
(OpenNAS) of image classification models. OpenNAS takes any dataset of
grayscale or RGB images and generates the optimal CNN architecture. Particle
Swarm Optimization (PSO), Ant Colony Optimization (ACO) and pre-trained
models serve as base learners for ensembles. Meta-learner algorithms are then
applied to these base learners, and ensemble performance on image
classification problems is evaluated. Our results show that a stacked
generalization ensemble of heterogeneous models is the most effective
approach to image classification within OpenNAS.
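Stacked generalization, the winning configuration above, fits a meta-learner on the base learners' outputs rather than on the raw images. A minimal sketch of the data flow, with a degenerate averaging meta-rule standing in for a trained meta-learner (all names are assumptions, not OpenNAS code):

```python
def stack_features(base_probs):
    """Build meta-learner inputs by concatenating each base learner's
    per-class probability vector for every example.
    base_probs: list over learners, each a list over examples of
    per-class probability vectors."""
    return [[p for learner in example for p in learner]
            for example in zip(*base_probs)]

def average_fusion(base_probs):
    """Degenerate meta-learner: average the probabilities across learners
    and predict the argmax class."""
    preds = []
    for example in zip(*base_probs):
        avg = [sum(ps) / len(ps) for ps in zip(*example)]
        preds.append(max(range(len(avg)), key=avg.__getitem__))
    return preds
```

In a real stack the output of `stack_features` would be the training matrix for a second-level model (e.g., logistic regression) fit on held-out predictions.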
Heterogeneous ensemble models for in-Hospital Mortality Prediction
The use of Electronic Health Records (EHR) data has grown extensively as the records have become more accessible. In machine learning they serve as input for a large array of problems, as the records are rich and contain different types of variables, including structured data (e.g., demographics), free text (e.g., medical notes), and time series data. In this work, we explore the use of these different types of data for the task of in-hospital mortality prediction, which seeks to predict the outcome of death for patients admitted to the hospital, using only the window of the first 48 hours of the patient's stay. We built several machine learning models, such as LSTM, TCN, and Logistic Regression, for each data type, and combined them into a heterogeneous ensemble model using the stacking strategy. By applying state-of-the-art deep learning algorithms for classification tasks and using their predictions as a new representation of our data, we could assess whether the classifier ensemble can leverage information extracted from models trained with different data types. Our experiments on a set of 20K ICU stays from the MIMIC-III dataset show that the ensemble method brings an increase of three percentage points, achieving an AUROC of 0.853 (95% CI [0.846, 0.861]), a TP rate of 0.800, and a weighted F-score of 0.795.
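The AUROC with a 95% CI reported above is typically computed rank-wise and interval-estimated by bootstrapping the test set. A minimal sketch of both, under the assumption (not stated in the abstract) that a percentile bootstrap was used:

```python
import random

def auroc(scores, labels):
    """Rank-based AUROC: the fraction of (positive, negative) pairs the
    model orders correctly, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for AUROC."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.randrange(len(scores)) for _ in scores]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # a resample needs both classes
            stats.append(auroc([scores[i] for i in sample], ys))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

A perfectly ordered test set yields an AUROC of 1.0, a fully inverted one 0.0, and constant scores 0.5, which makes the rank-wise definition easy to sanity-check.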
Rough set based ensemble classifier for web page classification
Combining the results of a number of individually trained classification systems to obtain a more accurate classifier is a widely used technique in pattern recognition. In this article, we introduce a rough set based meta classifier to classify web pages. The proposed method consists of two parts. In the first part, the output of every individual classifier is used to construct a decision table. In the second part, rough set attribute reduction and rule generation processes are applied to the decision table to construct a meta classifier. It is shown that (1) the performance of the meta classifier is better than that of every constituent classifier, and (2) the meta classifier is optimal with respect to a quality measure defined in the article. Experimental studies show that the meta classifier improves classification accuracy uniformly over some benchmark corpora and beats other ensemble approaches in accuracy by a decisive margin, thus demonstrating the theoretical results. It also reduces CPU load compared to other ensemble classification techniques by removing redundant classifiers from the combination.
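The first part of the method, a decision table whose condition attributes are the base classifiers' outputs and whose decision attribute is the true label, is simple to sketch. The redundancy check below is only a toy stand-in for rough set attribute reduction (it drops one column at a time and tests consistency), not the article's algorithm:

```python
def decision_table(classifier_outputs, true_labels):
    """One row per example: each classifier's predicted label as a
    condition attribute, the true label as the decision attribute."""
    return [tuple(preds) + (y,)
            for preds, y in zip(zip(*classifier_outputs), true_labels)]

def redundant_classifiers(table, n_classifiers):
    """Toy reduct check: a classifier is redundant if dropping its column
    leaves the table consistent (no two rows with identical remaining
    conditions but different decisions)."""
    def consistent(cols):
        seen = {}
        for row in table:
            key = tuple(row[i] for i in cols)
            if seen.setdefault(key, row[-1]) != row[-1]:
                return False
        return True
    all_cols = list(range(n_classifiers))
    return [i for i in all_cols
            if consistent([c for c in all_cols if c != i])]
```

Removing such redundant columns is what lets the method cut CPU load: a classifier whose column never changes any decision can be dropped from the combination.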