Dynamic Feature Engineering and model selection methods for temporal tabular datasets with regime changes
The application of deep learning algorithms to temporal panel datasets is
difficult due to heavy non-stationarities which can lead to over-fitted models
that under-perform under regime changes. In this work we propose a new machine
learning pipeline for ranking predictions on temporal panel datasets which is
robust under regime changes of the data. Different machine-learning models,
including Gradient Boosting Decision Trees (GBDTs) and Neural Networks, with and
without simple feature engineering, are evaluated in the pipeline under different
settings. We find that GBDT models with dropout display high performance,
robustness and generalisability with relatively low complexity and reduced
computational cost. We then show that online learning techniques can be used in
post-prediction processing to enhance the results. In particular, dynamic
feature neutralisation, an efficient procedure that requires no retraining of
models and can be applied post-prediction to any machine learning model,
improves robustness by reducing drawdown during regime changes. Furthermore, we
demonstrate that creating model ensembles through dynamic model selection based
on recent model performance improves on the baseline, raising the Sharpe and
Calmar ratios of out-of-sample predictions. We also evaluate the robustness of
our pipeline across different data splits and random seeds, with good
reproducibility of results.
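As a point of reference, the linear feature neutralisation commonly applied to temporal tabular (Numerai-style) predictions can be written in a few lines. The sketch below is a generic illustration, not the authors' exact procedure; the function names and the `recent_corr`/`top_k` selection rule for the dynamic variant are assumptions.

```python
import numpy as np

def neutralise(predictions: np.ndarray, features: np.ndarray,
               proportion: float = 1.0) -> np.ndarray:
    """Remove the linear component of `predictions` explained by `features`.

    predictions: shape (n_samples,); features: shape (n_samples, n_features).
    """
    # Least-squares projection of the predictions onto the feature subspace.
    exposure = features @ (np.linalg.pinv(features) @ predictions)
    neutralised = predictions - proportion * exposure
    # Re-scale so the output is on a comparable scale to the input.
    return neutralised / np.std(neutralised)

def dynamic_feature_neutralisation(predictions, features, recent_corr,
                                   top_k=10, proportion=0.5):
    """Hypothetical dynamic variant: neutralise only against the top_k features
    with the largest recent absolute correlation to the target."""
    riskiest = np.argsort(-np.abs(recent_corr))[:top_k]
    return neutralise(predictions, features[:, riskiest], proportion)
```

The `proportion` parameter interpolates between the raw and fully neutralised predictions, which is the usual knob for trading off raw signal against feature exposure.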
Evolving Decision Rules with Geometric Semantic Genetic Programming
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.

Due to the ever-increasing amount of data available in today's world, a variety of
methods to harness this information are continuously being created, refined and
utilized, drawing inspiration from a multitude of sources. Relevant to this work are
Supervised Learning techniques, which attempt to discover the relationship between the
characteristics of data and a certain feature, in order to uncover the function that maps input
to output. Among these, Genetic Programming (GP) attempts to replicate the concept
of evolution as defined by Charles Darwin, mimicking natural selection and genetic
operators to generate and improve a population of solutions for a given prediction
problem.
Among the possible variants of GP, Geometric Semantic Genetic Programming
(GSGP) stands out due to its focus on the meaning of each individual it creates, rather
than its structure. It achieves this by imagining a hypothetical, perfect model and
evaluating the performance of others by measuring how much their behaviour differs
from it, using a set of genetic operators that have a specific effect on an individual's
semantics (i.e., its predictions for the training data), with the goal of getting ever closer
to this so-called perfect specimen.
This thesis conceptualizes and evaluates the performance of a GSGP implementation
made specifically to deal with multi-class classification problems, using tree-based
individuals composed of a set of rules that allow the categorization of data. This
is achieved through the careful translation of GSGP's theoretical foundation, first into
algorithms and then into an actual code library able to tackle problems of this domain.
The results demonstrate that the implementation works successfully and respects the
properties of the original technique, allowing us to obtain excellent results on
training data, although performance on unseen data is slightly worse than that of
other state-of-the-art algorithms.
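For context, the standard geometric semantic operators for real-valued semantics (Moraglio et al.) act directly on the semantics, i.e. the vector of predictions on the training cases. The Python sketch below illustrates those textbook operators only; the thesis's multi-class, rule-based variant is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric_semantic_crossover(s1: np.ndarray, s2: np.ndarray) -> np.ndarray:
    """Offspring semantics lie between the parents' semantics:
    s_o = r * s1 + (1 - r) * s2, with r drawn in [0, 1] per training case."""
    r = rng.uniform(0.0, 1.0, size=s1.shape)
    return r * s1 + (1.0 - r) * s2

def geometric_semantic_mutation(s: np.ndarray, mutation_step: float) -> np.ndarray:
    """Perturb semantics by the difference of two random bounded semantics:
    s_m = s + ms * (r1 - r2), a box mutation around the parent."""
    r1 = rng.uniform(0.0, 1.0, size=s.shape)
    r2 = rng.uniform(0.0, 1.0, size=s.shape)
    return s + mutation_step * (r1 - r2)
```

Both operators have a geometric (bounded) effect on the distance between an individual's semantics and the target semantics, which is what yields the unimodal error surface on training data that GSGP is known for.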
H-TSP: Hierarchically Solving the Large-Scale Travelling Salesman Problem
We propose an end-to-end learning framework based on hierarchical
reinforcement learning, called H-TSP, for addressing the large-scale Travelling
Salesman Problem (TSP). The proposed H-TSP constructs a solution of a TSP
instance starting from the scratch relying on two components: the upper-level
policy chooses a small subset of nodes (up to 200 in our experiment) from all
nodes that are to be traversed, while the lower-level policy takes the chosen
nodes as input and outputs a tour connecting them to the existing partial route
(initially only containing the depot). After jointly training the upper-level
and lower-level policies, our approach can directly generate solutions for the
given TSP instances without relying on any time-consuming search procedures. To
demonstrate the effectiveness of the proposed approach, we have conducted extensive
experiments on randomly generated TSP instances with different numbers of
nodes. We show that H-TSP achieves results comparable to SOTA search-based
approaches (gap 3.42% vs. 7.32%) while, more importantly, reducing the time
consumption by up to two orders of magnitude (3.32s vs. 395.85s). To the best of
our knowledge, H-TSP is the first end-to-end deep reinforcement learning
approach that can scale to TSP instances of up to 10000 nodes. Although there
are still gaps to SOTA results with respect to solution quality, we believe
that H-TSP will be useful for practical applications, particularly those that
are time-sensitive, e.g., on-call routing and ride-hailing services.
Comment: Accepted by AAAI 2023, February 202
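The construction above can be summarised in a short sketch. The policy interfaces (`upper_policy.select`, `lower_policy.connect`) and the control flow are illustrative assumptions, not the authors' implementation.

```python
def solve_tsp_hierarchically(coords, upper_policy, lower_policy, max_subset=200):
    """Grow a tour by letting the upper-level policy pick small node subsets
    and the lower-level policy stitch them into the partial route."""
    depot = 0
    partial_route = [depot]                        # initially only the depot
    unvisited = set(range(len(coords))) - {depot}
    while unvisited:
        # Upper level: choose up to `max_subset` of the remaining nodes.
        subset = upper_policy.select(coords, partial_route, unvisited, k=max_subset)
        # Lower level: order the chosen nodes and connect them to the partial route.
        partial_route = lower_policy.connect(coords, partial_route, subset)
        unvisited -= set(subset)
    return partial_route
```

Because both levels are learned policies rather than search procedures, a full solution is produced in a single pass over this loop, which is where the reported speed advantage comes from.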
Machine Learning Research Trends in Africa: A 30 Years Overview with Bibliometric Analysis Review
In this paper, a critical bibliometric analysis study is conducted, coupled
with an extensive literature survey on recent developments and associated
applications in machine learning research with a perspective on Africa. The
presented bibliometric analysis study consists of 2761 machine learning-related
documents, of which 98% were articles with at least 482 citations published in
903 journals during the past 30 years. Furthermore, the collated documents were
retrieved from the Science Citation Index EXPANDED, comprising research
publications from 54 African countries between 1993 and 2021. The bibliometric
study visualizes the current landscape and future trends in machine learning
research and its applications, with the aim of facilitating future collaborative
research and knowledge exchange among authors from research institutions
scattered across the African continent.
Unsupervised inference methods for protein sequence data
The abstract is in the attachment.
Runtime Analysis of Success-Based Parameter Control Mechanisms for Evolutionary Algorithms on Multimodal Problems
Evolutionary algorithms are simple general-purpose optimisers often used to solve complex engineering and design problems. They mimic the process of natural evolution: they use a population of possible solutions to a problem that evolves by mutating and recombining solutions, identifying increasingly better solutions over time. Evolutionary algorithms have been applied to a broad range of problems in various disciplines with remarkable success. However, the reasons behind their success are often elusive: their performance often depends crucially, and unpredictably, on their parameter settings. It is, furthermore, well known that there are no globally good parameters, that is, the correct parameters for one problem may differ substantially from the parameters needed for another, making it hard to carry over previously successful parameter settings to new problems. Therefore, understanding how to properly select the parameters is an important but challenging task. This is commonly known as the parameter selection problem.
A promising solution to this problem is the use of automated dynamic parameter selection schemes (parameter control) that allow evolutionary algorithms to identify and continuously track optimal parameters throughout the course of evolution without human intervention. In recent years the study of parameter control mechanisms in evolutionary algorithms has emerged as a very fruitful research area. However, most existing runtime analyses focus on simple problems with benign characteristics, for which fixed parameter settings already run efficiently and only moderate performance gains were shown. The aim of this thesis is to
understand how parameter control mechanisms can be used on more complex and challenging problems with many local optima (multimodal problems) to speed up optimisation.
We use advanced methods from the analysis of algorithms and probability theory to evaluate the performance of evolutionary algorithms, estimating the expected time until an algorithm finds satisfactory solutions for illustrative and relevant optimisation problems as a vital stepping stone towards designing more efficient evolutionary algorithms. We first analyse current parameter control mechanisms on multimodal problems to understand their strengths and weaknesses. Subsequently, we use this knowledge to design parameter control mechanisms that mitigate the weaknesses of current mechanisms while maintaining their strengths. Finally, we show with theoretical and empirical analyses that these enhanced parameter control mechanisms are able to outperform the best fixed parameter settings on multimodal optimisation problems.
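As one concrete example of the kind of success-based mechanism analysed here, a (1+1) evolutionary algorithm can adjust its mutation rate with a generalised one-fifth success rule. The sketch below is a standard textbook variant on bit strings (the update factors `F` and `s` are illustrative constants), not the specific mechanisms designed in the thesis.

```python
import random

def one_plus_one_ea_self_adjusting(fitness, n, budget, F=1.5, s=4):
    """(1+1) EA with success-based parameter control: raise the mutation rate
    after an improving step, lower it slowly otherwise."""
    x = [random.randint(0, 1) for _ in range(n)]
    fx = fitness(x)
    rate = 1.0 / n
    for _ in range(budget):
        y = [bit ^ (random.random() < rate) for bit in x]   # flip each bit w.p. rate
        fy = fitness(y)
        if fy > fx:                         # success: search more aggressively
            x, fx = y, fy
            rate = min(0.5, rate * F)
        else:                               # failure: back off by F**(1/s)
            rate = max(1.0 / n, rate / F ** (1.0 / s))
    return x, fx
```

With `fitness=sum` this optimises the OneMax benchmark; the thesis asks how rules of this flavour behave on multimodal problems, where, as noted above, existing mechanisms have weaknesses to be mitigated.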
Introduction to Online Nonstochastic Control
This text presents an introduction to an emerging paradigm in control of
dynamical systems and differentiable reinforcement learning called online
nonstochastic control. The new approach applies techniques from online convex
optimization and convex relaxations to obtain new methods with provable
guarantees for classical settings in optimal and robust control.
The primary distinction between online nonstochastic control and other
frameworks is the objective. In optimal control, robust control, and other
control methodologies that assume stochastic noise, the goal is to perform
comparably to an offline optimal strategy. In online nonstochastic control,
both the cost functions and the perturbations from the assumed dynamical
model are chosen by an adversary. Thus the optimal policy is not defined a
priori. Rather, the target is to attain low regret against the best policy in
hindsight from a benchmark class of policies.
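In symbols, the benchmark is policy regret against the best policy in hindsight from a comparator class $\Pi$; this is a standard formulation, and the notation below is ours rather than a quotation from the text:

\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} c_t(x_t, u_t) \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t\!\big(x_t^{\pi}, u_t^{\pi}\big),
\qquad x_{t+1} = f(x_t, u_t) + w_t,
\]

where the costs $c_t$ and perturbations $w_t$ may be chosen adversarially, and $(x_t^{\pi}, u_t^{\pi})$ is the trajectory obtained by running policy $\pi$ against the same perturbation sequence.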
This objective suggests the use of the decision making framework of online
convex optimization as an algorithmic methodology. The resulting methods are
based on iterative mathematical optimization algorithms, and are accompanied by
finite-time regret and computational complexity guarantees.
Comment: Draft; comments/suggestions welcome at [email protected]
Efficient Learning of Mesh-Based Physical Simulation with BSMS-GNN
Learning the physical simulation on large-scale meshes with flat Graph Neural
Networks (GNNs) and stacking Message Passings (MPs) is challenging due to the
scaling complexity w.r.t. the number of nodes and over-smoothing. There has
been growing interest in the community to introduce multi-scale
structures to GNNs for physical simulation. However, current state-of-the-art
methods are limited by their reliance on the labor-intensive drawing of coarser
meshes or building coarser levels based on spatial proximity, which can
introduce wrong edges across geometry boundaries. Inspired by bipartite graph
determination, we propose a novel pooling strategy, bi-stride, to tackle the
aforementioned limitations. Bi-stride pools nodes on every other frontier of the
breadth-first search (BFS), without the need for manually drawing coarser
meshes, and avoids the wrong edges introduced by spatial proximity.
Additionally, it enables a one-MP scheme per level and non-parametrized pooling
and unpooling by interpolations, resembling U-Nets, which significantly reduces
computational costs. Experiments show that the proposed framework, BSMS-GNN,
significantly outperforms existing methods in terms of both accuracy and
computational efficiency in representative physical simulations.
Comment: Updates summary: * update to the accepted version ICM
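The bi-stride idea can be illustrated with a small sketch: starting from a seed node, the nodes on every other breadth-first-search frontier are retained as the coarser level. The seed choice, the handling of multiple connected components, and the construction of coarse-level edges are simplified assumptions here, not the paper's implementation.

```python
from collections import deque

def bi_stride_pool(adjacency, seed=0):
    """Return the node ids kept at the next coarser level: nodes whose BFS
    depth from `seed` is even, i.e. every other BFS frontier is retained."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return sorted(node for node, d in depth.items() if d % 2 == 0)

# Example: on a path graph 0-1-2-3-4 the pooled level keeps nodes 0, 2 and 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(bi_stride_pool(path))  # [0, 2, 4]
```

Because every dropped node is adjacent to a kept node (its BFS parent), a single message-passing step per level can move information between levels, which is the basis of the one-MP scheme mentioned above.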
Acoustic modelling, data augmentation and feature extraction for in-pipe machine learning applications
Gathering measurements from infrastructure, private premises, and harsh environments can be difficult and expensive. From this perspective, the development of
new machine learning algorithms is strongly affected by the availability of training
and test data. We focus on audio archives for in-pipe events. Although several
examples of pipe-related applications can be found in the literature, datasets of
audio/vibration recordings are much scarcer, and the only references found relate
to leakage detection and characterisation. Therefore, this work proposes a methodology to relieve the burden of data collection for acoustic events in deployed pipes.
The aim is to maximise the yield of small sets of real recordings and demonstrate
how to extract effective features for machine learning. The methodology developed
requires the preliminary creation of a soundbank of audio samples gathered with
simple weak annotations. For practical reasons, the case study is given by a range
of appliances, fittings, and fixtures connected to pipes in domestic environments.
The source recordings are low-reverberated audio signals enhanced through a
bespoke spectral filter and containing the desired audio fingerprints. The soundbank is then processed to create an arbitrary number of synthetic augmented
observations. The data augmentation improves the quality and the quantity of
the metadata and automatically creates strong and accurate annotations that
are both machine and human-readable. Besides, the implemented processing
chain allows precise control of properties such as signal-to-noise ratio, duration
of the events, and the number of overlapping events. The inter-class variability
is expanded by recombining source audio blocks and adding simulated artificial
reverberation obtained through an acoustic model developed for the purpose.
Finally, the dataset is synthesised to guarantee separability and balance. A few
signal representations are optimised to maximise the classification performance,
and the results are reported as a benchmark for future developments. The contribution to the existing knowledge concerns several aspects of the processing chain
implemented. A novel quasi-analytic acoustic model is introduced to simulate
in-pipe reverberations, adopting a three-layer architecture particularly convenient
for batch processing. The first layer includes two algorithms: one for the numerical
calculation of the axial wavenumbers and one for the separation of the modes. The
latter, in particular, provides a workaround for a problem not explicitly treated in the
literature and related to the modal non-orthogonality given by the solid-liquid interface in the analysed domain. A set of results for different waveguides is reported
to compare the dispersive behaviour against different mechanical configurations.
Two more novel solutions are also included in the second layer of the model and
concern the integration of the acoustic sources. Specifically, the amplitudes of the
non-orthogonal modal potentials are obtained using either a distance minimisation
objective function or by solving an analytical decoupling problem. In both cases,
the results show that sufficiently smooth sources can be approximated with a limited
number of modes while keeping the error below 1%. The last layer proposes a bespoke
approach for the integration of the acoustic model into the synthesiser as a reverberation simulator. Additional elements of novelty relate to the other blocks of the
audio synthesiser. The statistical spectral filter, for instance, is a batch-processing
solution for the attenuation of the background noise of the source recordings. The
signal-to-noise ratio analysis for both moderate and high noise levels indicates
a clear improvement of several decibels against the closest filter example in the
literature. The recombination of the audio blocks and the system of fully tracked
annotations are also novel extensions of similar approaches recently adopted in
other contexts. Moreover, a bespoke synthesis strategy is proposed to guarantee
separable and balanced datasets. The last contribution concerns the extraction
of convenient sets of audio features. Elements of novelty are introduced for the
optimisation of the filter banks of the mel-frequency cepstral coefficients and the
scattering wavelet transform. In particular, compared to the respective standard
definitions, the average F-score performance of the optimised features is roughly
6% higher in the first case and 2.5% higher for the latter. Finally, the soundbank,
the synthetic dataset, and the fundamental blocks of the software library developed
are publicly available for further research.
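The signal-to-noise-ratio control mentioned above can be illustrated with a generic mixing routine under the usual power-ratio definition of SNR. The function below is a sketch for intuition, not the synthesiser developed in the thesis.

```python
import numpy as np

def mix_at_snr(event: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `event` so that 10*log10(P_event / P_background) equals `snr_db`,
    then add it to the background (mono signals at the same sample rate)."""
    n = min(len(event), len(background))
    event, background = event[:n], background[:n]
    p_event = np.mean(event ** 2)
    p_background = np.mean(background ** 2)
    gain = np.sqrt(p_background * 10 ** (snr_db / 10.0) / max(p_event, 1e-12))
    return background + gain * event
```

The other properties the chain controls, event duration and the number of overlapping events, amount to choosing where in the background each scaled event is added and how many events share the same time span.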
The Future of Work and Digital Skills
The theme for the events was "The Future of Work and Digital Skills". The Fourth Industrial Revolution (4IR) caused a hollowing out of middle-income jobs (Frey & Osborne, 2017), but COVID-19 exposed the digital gap, as survival depended mainly on digital infrastructure and connectivity. Almost overnight, organizations that had not invested in a digital strategy suddenly realized the need for such a strategy and the associated digital skills. The effects have been profound for those who struggled to adapt, while those who stepped up have reaped quite the reward.

Therefore, there are no longer certainties about what the world will look like a few years from now. However, there are ways to anticipate the changes that are occurring and to plan how to continually adapt to an increasingly changing world. Certain jobs will soon be lost and will not come back; other new jobs will, however, be created. Using data science and other predictive sciences, it is possible to anticipate, to the extent possible, the rate at which certain jobs will be replaced and new jobs created in different industries. Accordingly, the collocated events sought to bring together government, international organizations, academia, industry, organized labour and civil society to deliberate on how these changes are occurring in South Africa, how fast they are occurring, and what needs to change in order to prepare society for the changes.

Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ)
British High Commission (BHC)
School of Computing