Model and Algorithm Selection in Statistical Learning and Optimization.
Modern data-driven statistical techniques, e.g., non-linear classification and
regression machine learning methods, play an increasingly important role in applied data analysis
and quantitative research. For real-world applications we do not know
a priori which methods will work best. Furthermore, most of the available models depend on
so-called hyper- or control parameters, which can drastically influence their performance.
This leads to a vast space of potential models, which cannot be explored exhaustively.
Modern optimization techniques, often either evolutionary or model-based, are employed to speed up
this process.
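As a minimal illustration of replacing manual trial-and-error with methodical search, the sketch below tunes two hyperparameters by plain random search; the objective function and the search space are toy assumptions for illustration, and model-based or evolutionary optimizers refine this same idea.

```python
import random

def random_search(objective, space, n_iter=50, seed=0):
    """Randomly sample configurations and keep the best one.

    `objective` maps a configuration dict to a loss (lower is better);
    `space` maps each hyperparameter name to a list of candidate values.
    """
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently.
        cfg = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

# Toy example: pretend validation loss depends on two hyperparameters.
space = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 8, 16]}
objective = lambda cfg: abs(cfg["learning_rate"] - 0.01) + abs(cfg["depth"] - 8) / 8

best_cfg, best_loss = random_search(objective, space, n_iter=100)
```

In practice the objective would be an expensive cross-validated model evaluation, which is exactly why the surrogate-model-based and evolutionary techniques mentioned above are used to spend such evaluations more economically.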
A very similar problem occurs in continuous and discrete optimization and, in general,
in many other areas where problem instances are solved by algorithmic approaches: Many competing
techniques exist, some of them heavily parametrized. Again, little knowledge
exists about how, given a certain application, one makes the correct choice.
These general problems are called algorithm selection and algorithm configuration. Instead of relying on
tedious, manual trial-and-error, one should rather employ available computational power
in a methodical fashion to obtain an appropriate algorithmic choice, while supporting this
process with machine-learning techniques to discover and exploit as much of the
search space structure as possible.
In this cumulative dissertation I summarize nine papers that deal with the problem of model and
algorithm selection in the areas of machine learning and optimization. Issues in benchmarking,
resampling, efficient model tuning, feature selection and automatic algorithm selection are addressed and
solved using modern techniques. I apply these methods to tasks from engineering, music data analysis
and black-box optimization.
The dissertation concludes by summarizing my published R packages for such tasks and specifically
discusses two packages for parallelization on high performance computing clusters and parallel statistical
experiments.
Learning Interpretable Rules for Multi-label Classification
Multi-label classification (MLC) is a supervised learning problem in which,
contrary to standard multiclass classification, an instance can be associated
with several class labels simultaneously. In this chapter, we advocate a
rule-based approach to multi-label classification. Rule learning algorithms are
often employed when one is not only interested in accurate predictions, but
also requires an interpretable theory that can be understood, analyzed, and
qualitatively evaluated by domain experts. Ideally, by revealing patterns and
regularities contained in the data, a rule-based theory yields new insights in
the application domain. Recently, several authors have started to investigate
how rule-based models can be used for modeling multi-label data. Discussing
this task in detail, we highlight some of the problems that make rule learning
considerably more challenging for MLC than for conventional classification.
While mainly focusing on our own previous work, we also provide a short
overview of related work in this area.

Preprint version. To appear in: Explainable and Interpretable Models in Computer Vision and Machine Learning, The Springer Series on Challenges in Machine Learning, Springer (2018). See http://www.ke.tu-darmstadt.de/bibtex/publications/show/3077 for further information.
Improving supervised music classification by means of multi-objective evolutionary feature selection
In this work, several strategies are developed to reduce the impact of two limitations of most current studies in supervised music classification: the classification rules and music features often have low interpretability, and algorithms and feature subsets are almost always evaluated with respect to only one or a few common evaluation criteria considered separately.
Although music classification is in most cases user-centred, and it is desirable to understand the properties of the related music categories well, many current approaches are based on low-level characteristics of the audio signal. We have designed a large set of more meaningful and interpretable high-level features, which can completely replace the baseline low-level feature set and are even capable of significantly outperforming it for categorisation into three music styles. These features provide comprehensible insight into the properties of music genres and styles: instrumentation, moods, harmony, and temporal and melodic characteristics. A crucial advantage of high-level audio features is that they can be extracted from any digitally available music piece, independently of its popularity, the availability of the corresponding score, or an Internet connection for downloading metadata and community features, which are sometimes erroneous and incomplete. A subset of the high-level features that are particularly successful for classification into genres and styles has been developed with a novel approach called sliding feature selection. Here, high-level features are estimated from low-level and other high-level features during a sequence of supervised classification steps, and an integrated evolutionary feature selection helps to search for the most relevant features in each step of this sequence.
Another drawback of many related state-of-the-art studies is that algorithms and feature sets are almost always compared using only one or a few evaluation criteria considered separately. However, different evaluation criteria are often in conflict: an algorithm optimised only with respect to classification quality may be slow, have high storage demands, perform worse on imbalanced data, or require high user effort for labelling songs. The simultaneous optimisation of multiple conflicting criteria has until now remained almost unexplored in music information retrieval, and, apart from several of our own preliminary publications, it was applied to feature selection in music classification for the first time in this thesis. As an exemplary multi-objective approach to optimising feature selection, we simultaneously minimise the classification error and the number of features used for classification. Sets with more features lead to higher classification quality. On the other hand, sets with fewer features and lower classification performance may strongly decrease storage and computing-time demands and reduce the risk of overly complex, overfitted classification models. Further, we describe several groups of evaluation criteria and discuss other reasonable multi-objective optimisation scenarios for music data analysis.
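The bi-objective trade-off described in this abstract, minimising classification error against the number of selected features, produces a set of non-dominated solutions rather than a single winner. The sketch below computes such a Pareto front; the feature-subset names and error values are hypothetical illustrations, not results from the thesis.

```python
def pareto_front(candidates):
    """Return the non-dominated candidates.

    Each candidate is (feature_subset, error, n_features); a candidate is
    dominated if another is at least as good on both objectives and
    strictly better on at least one.
    """
    front = []
    for a in candidates:
        dominated = any(
            b[1] <= a[1] and b[2] <= a[2] and (b[1] < a[1] or b[2] < a[2])
            for b in candidates
        )
        if not dominated:
            front.append(a)
    return front

# Hypothetical evaluation results for four feature subsets.
candidates = [
    ({"mfcc", "tempo", "chords"}, 0.10, 3),
    ({"mfcc", "tempo"},           0.15, 2),
    ({"tempo"},                   0.30, 1),
    ({"mfcc", "chords"},          0.20, 2),  # dominated by {"mfcc", "tempo"}
]
front = pareto_front(candidates)
```

An evolutionary multi-objective algorithm such as NSGA-II applies this dominance comparison generation after generation; the final front lets the user pick a small interpretable subset or a larger, more accurate one.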
Advances in Evolutionary Algorithms
With recent trends towards massive data sets and significant computational power, combined with advances in evolutionary algorithms, evolutionary computation is becoming much more relevant in practice. The aim of this book is to present recent improvements, innovative ideas, and concepts from a part of the huge field of evolutionary algorithms.
Hybrid approaches to optimization and machine learning methods: a systematic literature review
Notably, real problems are increasingly complex and require sophisticated models and algorithms capable of quickly dealing with large data sets and finding optimal solutions. However, there is no perfect method or algorithm; all of them have limitations that can be mitigated or eliminated by combining the strengths of different methodologies. The expectation is thus to develop hybrid algorithms that take advantage of the potential and particularities of each method (optimization and machine learning), integrating the methodologies and making them more efficient. This paper presents an extensive systematic and bibliometric literature review of hybrid methods involving optimization and machine learning techniques for clustering and classification. It aims to identify the potential of methods and algorithms to overcome the difficulties of one or both methodologies when combined. After a description of optimization and machine learning methods, a numerical overview of the works published since 1970 is presented, followed by an in-depth state-of-the-art review of the last three years. Furthermore, a SWOT analysis of the ten most cited algorithms in the collected database is performed, investigating the strengths and weaknesses of the pure algorithms and highlighting the opportunities and threats that have been explored with hybrid methods. This investigation makes it possible to highlight the most notable works and discoveries involving hybrid methods for clustering and classification, and to point out the difficulties of pure methods and algorithms that can be strengthened by drawing inspiration from other methodologies, that is, by hybrid methods.
A survey of genetic algorithms for multi-label classification
In recent years, multi-label classification (MLC) has become an emerging research topic in big data analytics and machine learning. In this problem, each object of a dataset may belong to multiple class labels, and the goal is to learn a classification model that can infer the correct labels of new, previously unseen objects. This paper presents a survey of genetic algorithms (GAs) designed for MLC tasks. The study is organized in three parts. First, we propose a new taxonomy focused on GAs for MLC. In the second part, we provide an up-to-date overview of the work in this area, categorizing the approaches identified in the literature with respect to the taxonomy. In the third and last part, we discuss some new ideas for combining GAs with MLC.
Optimization of feature learning through grammar-guided genetic programming
Master's thesis, Data Science, 2022, Universidade de Lisboa, Faculdade de Ciências. Machine Learning (ML) is becoming more prominent in daily life. A key aspect of ML is Feature Engineering (FE), which can entail a long and tedious process. Therefore, the automation of FE, known as
Feature Learning (FL), can be highly rewarding. FL methods must not only achieve high prediction performance, but should also produce interpretable models. Many current high-performance ML methods
that can be considered FL methods, such as Neural Networks and PCA, lack interpretability.
A popular ML method used for FL that produces interpretable models is Genetic Programming (GP), with
multiple successful applications and methods like M3GP. In this thesis, I present two new GP-based FL
methods, namely M3GP with Domain Knowledge (DK-M3GP) and DK-M3GP with feature Aggregation
(DKA-M3GP). Both use grammars to enhance the search process of GP, an approach called Grammar-Guided GP (GGGP). DK-M3GP uses grammars to incorporate domain knowledge in the search process.
In particular, I use DK-M3GP to define what solutions are humanly valid, in this case by disallowing
operating arithmetically on categorical features. For example, the multiplication of the postal code of an
individual with their wage is not deemed sensible and thus disallowed.
In DKA-M3GP, I use grammars to include a feature aggregation method in the search space. This
method can be used for time series and panel datasets, to aggregate the target value of historic data based
on a known feature value of a new data point. For example, if I want to predict the number of bikes seen
daily in a city, it is interesting to know how many were seen on average in the last week. Furthermore,
DKA-M3GP allows for filtering the aggregation based on some other feature value. For example, we can
include the average number of bikes seen on past Sundays.
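The aggregation idea described above, averaging a target over historic rows that match a new data point, optionally filtered on another feature, can be sketched in plain Python. The function name `aggregate_feature` and the bike-count data are illustrative assumptions, not the thesis's actual grammar-based implementation.

```python
def aggregate_feature(history, new_row, key, target,
                      filter_key=None, filter_value=None):
    """Average `target` over historic rows sharing `key` with `new_row`,
    optionally keeping only rows where `filter_key` equals `filter_value`.
    Returns None when no historic row matches."""
    matching = [
        row[target]
        for row in history
        if row[key] == new_row[key]
        and (filter_key is None or row[filter_key] == filter_value)
    ]
    return sum(matching) / len(matching) if matching else None

# Hypothetical daily bike counts (the panel data of the running example).
history = [
    {"city": "Lisbon", "weekday": "Sun", "bikes": 120},
    {"city": "Lisbon", "weekday": "Mon", "bikes": 300},
    {"city": "Lisbon", "weekday": "Sun", "bikes": 100},
    {"city": "Porto",  "weekday": "Sun", "bikes": 80},
]
new_row = {"city": "Lisbon", "weekday": "Sun"}

# Average bikes seen on past Sundays in the same city.
avg_sunday = aggregate_feature(history, new_row, key="city", target="bikes",
                               filter_key="weekday", filter_value="Sun")
```

In DKA-M3GP such aggregations are not hand-written but generated by the grammar, so the evolutionary search can discover which key, target, and filter combination is predictive.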
I evaluated my FL methods on two ML problems in two environments. First, I evaluate the independent FL process, and, after that, I evaluate the FL steps within four ML pipelines. Independently,
DK-M3GP shows a twofold advantage over normal M3GP: better interpretability in general, and higher
prediction performance for one problem. DKA-M3GP has a much better prediction performance than
M3GP for one problem, and a slightly better one for the other. Furthermore, within the ML pipelines it
performed well in one of two problems. Overall, my methods show potential for FL.
Both methods are implemented in Genetic Engine, an individual-representation-independent GGGP framework created as part of this thesis. Genetic Engine is implemented entirely in Python and shows competitive performance with the mature GGGP framework PonyGE2.

Artificial Intelligence (AI) and its subfield Machine Learning (ML) are becoming more important to our lives with each passing day. Both areas are present in everyday applications such as automatic speech recognition, self-driving cars, image recognition, and object detection. ML has been successfully applied in many domains, such as healthcare, finance, and marketing.

In a supervised setting, ML models are trained on data and subsequently used to predict the behaviour of future data. The combination of steps performed to build a fully trained and evaluated ML model is called an ML pipeline, or simply a pipeline. All pipelines follow mandatory steps, namely data retrieval, cleaning, and manipulation; feature selection and construction; model selection and parameter optimization; and, finally, model evaluation. Building ML pipelines is a challenging task, with specificities that depend on the problem domain. There are challenges on the design side, in hyperparameter optimization, and on the implementation side.

When designing a pipeline, choices must be made about which components to use and in what order. Even for ML experts, designing pipelines is a tedious task. Design choices require ML experience and knowledge of the problem domain, which makes pipeline construction a resource-intensive process.

Once a pipeline has been designed, its parameters must be optimized to improve its performance. Parameter optimization generally requires running and evaluating the pipeline sequentially, which is costly. On the implementation side, programmers can introduce bugs during development. These bugs can cost time and money to fix and, if they go undetected, can compromise the robustness and correctness of the model or introduce performance problems. To address these design and implementation problems, a new line of research called AutoML (Automated Machine Learning) has emerged. AutoML aims to automate the design of ML pipelines, the optimization of their parameters, and their implementation. An important part of ML pipelines is how the features of the data are handled. Data manipulation has many aspects, gathered under the umbrella term Feature Engineering (FE). In short, FE aims to improve the quality of the solution space by selecting the most important features and constructing new, relevant ones. However, this is a resource-intensive process, so its automation is a highly rewarding subarea of AutoML. In this thesis, I define Feature Learning (FL) as the area of automated FE.

An important metric of FE, and therefore of FL, is the interpretability of the learned features. Interpretability, which falls within the area of Explainable AI (XAI), refers to how easy it is to understand the meaning of a feature. Several AI scandals, such as racist and sexist models, have led the European Union to propose legislation on models that lack interpretability. Many classical, and therefore widely used, methods lack interpretability, giving rise to renewed interest in XAI. Current FL research treats existing feature values without relating them to their semantic meaning. For example, engineering a feature that multiplies a person's postal code by their age is not a logical use of the postal code. Although postal codes can be represented as integers, they should be treated as categorical values. Preventing this kind of interaction between features improves pipeline performance, since it reduces the search space of possible features to only those that make semantic sense. Moreover, this process results in features that are intrinsically interpretable. In this way, knowledge about the problem domain prevents the engineering of meaningless features during the FE process.

Another aspect of FL usually not considered by existing methods is the aggregation of the values of a single feature across several data entities. For example, consider a credit card fraud dataset. The average amount of a card's previous transactions is potentially an interesting feature to include, as it conveys the meaning of a 'normal' transaction. However, this is generally not directly inferable by existing FL methods. I refer to this FL method as entity aggregation, or simply aggregation.

Finally, despite the unpredictable nature of real-life datasets, existing methods mostly require features with homogeneous data. This forces data scientists to preprocess the dataset, often by transforming categories into integers or applying some kind of encoding, such as one-hot encoding. However, as discussed above, this can reduce the interpretability and the performance of the pipeline.

Genetic Programming (GP), an ML method, is also used for FL and allows the creation of models that are more interpretable than those of most traditional methods. GP is a search-based method that evolves programs or, in the case of FL, mappings between feature spaces. Existing GP-based FL methods do not incorporate the three aspects mentioned above: domain knowledge, aggregation, and support for heterogeneous data types. Some approaches incorporate parts of these aspects, mainly by using grammars to guide the search process. The goal of this work is to explore whether GP can use grammars to improve the quality of FL, in terms of either predictive performance or interpretability. First, we built Genetic Engine, a Grammar-Guided GP (GGGP) framework. Genetic Engine is an easy-to-use GGGP framework that allows complex grammars to be expressed. We show that Genetic Engine performs well compared with the state-of-the-art Python framework PonyGE2.

Second, I propose two new GGGP-based FL methods implemented in Genetic Engine. Both methods extend M3GP, the state-of-the-art GP-based FL method. The first incorporates domain knowledge and is called M3GP with Domain Knowledge (DK-M3GP). It restricts the behaviour of features by allowing only sensible interactions, by means of conditions and statements. The second method extends DK-M3GP by introducing aggregation into the search space, and is called DK-M3GP with Aggregation (DKA-M3GP). DKA-M3GP takes full advantage of Genetic Engine's ease of implementation, as it requires a complex grammar.

In this work, DK-M3GP and DKA-M3GP were evaluated against traditional GP, M3GP, and numerous classical FL methods on two ML problems. The new approaches were evaluated both as standalone FL methods and as part of a larger pipeline. As standalone FL methods, both demonstrate good predictive performance on at least one of the two problems. As part of a pipeline, the methods show little advantage over the classical methods in predictive performance. Analysing the results, a possible explanation lies in the FL methods overfitting the fitness function and the training dataset.

In this work, I also discuss the improvement in interpretability obtained by incorporating domain knowledge into the search process. A preliminary evaluation of DK-M3GP indicates that, using the Expression Size (ES) complexity measure, an improvement in interpretability is achievable. However, I also found that this complexity measure may not be the most suitable, because the tree-shaped structure of the features constructed by DK-M3GP inflates the ES. I consider that a more sophisticated interpretability evaluation method should account for this.
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
We present a novel hybrid algorithm for Bayesian network structure learning,
called H2PC. It first reconstructs the skeleton of a Bayesian network and then
performs a Bayesian-scoring greedy hill-climbing search to orient the edges.
The algorithm is based on divide-and-conquer constraint-based subroutines to
learn the local structure around a target variable. We conduct two series of
experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is
currently the most powerful state-of-the-art algorithm for Bayesian network
structure learning. First, we use eight well-known Bayesian network benchmarks
with various data sizes to assess the quality of the learned structure returned
by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in
terms of goodness of fit to new data and quality of the network structure with
respect to the true dependence structure of the data. Second, we investigate
H2PC's ability to solve the multi-label learning problem. We provide
theoretical results to characterize and identify graphically the so-called
minimal label powersets that appear as irreducible factors in the joint
distribution under the faithfulness condition. The multi-label learning problem
is then decomposed into a series of multi-class classification problems, where
each multi-class variable encodes a label powerset. H2PC is shown to compare
favorably to MMHC in terms of global classification accuracy over ten
multi-label data sets covering different application domains. Overall, our
experiments support the conclusions that local structural learning with H2PC in
the form of local neighborhood induction is a theoretically well-motivated and
empirically effective learning framework that is well suited to multi-label
learning. The source code (in R) of H2PC as well as all data sets used for the
empirical tests are publicly available.
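The decomposition described above, turning multi-label learning into multi-class problems over label powersets, can be sketched as follows. This is a minimal illustration of the basic label-powerset encoding only; the paper's contribution concerns *minimal* label powersets identified from the learned structure, which this sketch does not compute.

```python
def to_label_powerset(Y):
    """Encode each multi-label row (a 0/1 vector) as a single class:
    the set of active labels. This turns a multi-label task into one
    multi-class task whose classes are the observed label combinations."""
    classes = {}   # frozenset of active label indices -> class id
    encoded = []
    for row in Y:
        key = frozenset(i for i, v in enumerate(row) if v == 1)
        if key not in classes:
            classes[key] = len(classes)
        encoded.append(classes[key])
    return encoded, classes

# Three instances over three labels.
Y = [
    [1, 0, 1],  # labels {0, 2}
    [0, 1, 0],  # label  {1}
    [1, 0, 1],  # same powerset as the first row
]
encoded, classes = to_label_powerset(Y)
```

A multi-class classifier trained on `encoded` predicts whole label combinations at once, capturing label dependencies that independent per-label classifiers miss, at the cost of one class per observed combination.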