102 research outputs found
Almost optimal exact distance oracles for planar graphs
We consider the problem of preprocessing a weighted directed planar graph in order to quickly answer exact distance queries. The main tension in this problem is between space S and query time Q, and since the mid-1990s all results had polynomial time-space tradeoffs, e.g., Q = ~Θ(n/√S) or Q = ~Θ(n^{5/2}/S^{3/2}).
In this article we show that there is no polynomial tradeoff between time and space and that it is possible to simultaneously achieve almost optimal space n^{1+o(1)} and almost optimal query time n^{o(1)}. More precisely, we achieve the following space-time tradeoffs:
n^{1+o(1)} space and log^{2+o(1)} n query time,
n log^{2+o(1)} n space and n^{o(1)} query time,
n^{4/3+o(1)} space and log^{1+o(1)} n query time.
We reduce a distance query to a variety of point location problems in additively weighted Voronoi diagrams and develop new algorithms for the point location problem itself using several partially persistent dynamic tree data structures.
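To make the classic polynomial tradeoff concrete, here is a minimal sketch (not the article's new method) of a portal-based oracle: exact distances are precomputed to and from a small set of portal vertices, and a query returns the best portal detour. The graph, portal choice, and all names are illustrative; the answer is exact only when some shortest u-v path passes through a portal, which separator-based constructions guarantee for planar graphs.

```python
import heapq

def dijkstra(adj, src):
    """Single-source shortest paths; adj[u] = list of (v, w) with w >= 0."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

class PortalOracle:
    """Toy portal-based oracle (the classic tradeoff the abstract improves on):
    store d(p, .) and d(., p) for each portal p; a query returns
    min_p d(u, p) + d(p, v), exact only if some shortest u-v path hits a portal."""
    def __init__(self, adj, portals):
        radj = {}                                   # reversed graph for d(., p)
        for u, edges in adj.items():
            for v, w in edges:
                radj.setdefault(v, []).append((u, w))
        self.from_p = {p: dijkstra(adj, p) for p in portals}   # d(p, .)
        self.to_p = {p: dijkstra(radj, p) for p in portals}    # d(., p)

    def query(self, u, v):
        return min(self.to_p[p].get(u, float("inf")) +
                   self.from_p[p].get(v, float("inf"))
                   for p in self.from_p)
```

With k portals the space is O(kn) and the query time O(k), which is exactly the kind of polynomial space-time tradeoff the article shows is avoidable.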
Data Aggregation for Hierarchical Clustering
Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most
flexible clustering method, because it can be used with many distances,
similarities, and various linkage strategies. It is often used when the number
of clusters the data set forms is unknown and some sort of hierarchy in the
data is plausible. Most algorithms for HAC operate on a full distance matrix,
and therefore require quadratic memory. The standard algorithm also has cubic
runtime to produce a full hierarchy. Both memory and runtime are especially
problematic in the context of embedded or otherwise very resource-constrained
systems. In this section, we present how data aggregation with BETULA, a
numerically stable version of the well-known BIRCH data aggregation algorithm,
can be used to make HAC viable on systems with constrained resources with only
small losses in clustering quality, and hence allow exploratory data analysis
of very large data sets.
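As a rough illustration of the idea (not the BETULA algorithm itself), the sketch below aggregates points into per-cell cluster features — count, linear sum, and sum of squares, the classic BIRCH triple — using a plain grid in place of a CF-tree, then runs a naive single-linkage HAC on the resulting centroids. The grid-based aggregation and all names are simplifications; the point is that memory now scales with the number of occupied cells rather than with a full n-by-n distance matrix.

```python
import math
from collections import defaultdict

def aggregate(points, cell=1.0):
    """Grid-based stand-in for a BIRCH/BETULA CF-tree: each cell keeps
    (count, linear sum, sum of squares) -- enough to recover centroids."""
    cells = defaultdict(lambda: [0, [0.0, 0.0], 0.0])
    for x, y in points:
        cf = cells[(math.floor(x / cell), math.floor(y / cell))]
        cf[0] += 1
        cf[1][0] += x
        cf[1][1] += y
        cf[2] += x * x + y * y
    return [([ls[0] / n, ls[1] / n], n) for n, ls, _ in cells.values()]

def hac(centroids, k):
    """Naive single-linkage HAC on the aggregated centroids (O(m^3), but m is
    the number of occupied cells, not the number of points)."""
    clusters = [[c] for c, _ in centroids]
    def dist(a, b):
        return min(math.dist(p, q) for p in a for q in b)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```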
Pre-Distribution of Entanglements in Quantum Networks
Quantum network communication is challenging, as the No-Cloning theorem in the
quantum regime makes many classical techniques inapplicable. For long-distance
communication, the only viable approach is teleportation of quantum states,
which requires a prior distribution of entangled pairs (EPs) of qubits.
Establishment of EPs across remote nodes can incur significant latency due to
the low probability of success of the underlying physical processes. To reduce
EP generation latency, prior works have looked at selection of efficient
entanglement-routing paths and simultaneous use of multiple such paths for EP
generation. In this paper, we propose and investigate a complementary technique
to reduce EP generation latency: pre-distributing EPs over certain
(pre-determined) pairs of network nodes; these pre-distributed EPs can then be
used to generate EPs for the requested pairs, when needed, with lower
generation latency. For such a pre-distribution approach to be most effective,
we need to address an optimization problem of selection of node-pairs where the
EPs should be pre-distributed to minimize the generation latency of expected EP
requests, under a given cost constraint. In this paper, we appropriately
formulate the above optimization problem and design two efficient algorithms,
one of which is a greedy approach based on an approximation algorithm for a
special case. Via extensive evaluations on the NetSquid simulator, we
demonstrate the effectiveness of our approach and the developed techniques; we
show that our algorithms outperform a naive approach by up to an order of
magnitude.
Comment: 11 pages, 9 figures.
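A minimal sketch of the greedy flavor described above (hypothetical interface, not the paper's actual algorithm): given an estimated expected latency benefit and a pre-distribution cost for each candidate node pair, repeatedly pick the pair with the best benefit-to-cost ratio that still fits the budget.

```python
def greedy_predistribution(benefit, cost, budget):
    """Greedy sketch: pre-distribute EPs on the node pair with the best
    expected-latency-reduction per unit cost that still fits the budget.
    `benefit` and `cost` map node pairs to numbers; both are hypothetical
    inputs that a real system would estimate from the request workload."""
    chosen, spent = [], 0.0
    remaining = dict(benefit)
    while remaining:
        pair = max(remaining, key=lambda p: remaining[p] / cost[p])
        if spent + cost[pair] <= budget:
            chosen.append(pair)
            spent += cost[pair]
        del remaining[pair]          # considered once, affordable or not
    return chosen, spent
```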
Optimization of feature learning through grammar-guided genetic programming
Master's thesis, Ciência de Dados, 2022, Universidade de Lisboa, Faculdade de Ciências.
Machine Learning (ML) is becoming more prominent in daily life. A key aspect of ML is Feature Engineering (FE), which can entail a long and tedious process. Therefore, the automation of FE, known as
Feature Learning (FL), can be highly rewarding. FL methods should not only have high prediction performance, but should also produce interpretable models. Many current high-performance ML methods
that can be considered FL methods, such as Neural Networks and PCA, lack interpretability.
A popular ML method used for FL that produces interpretable models is Genetic Programming (GP), with
multiple successful applications and methods like M3GP. In this thesis, I present two new GP-based FL
methods, namely M3GP with Domain Knowledge (DK-M3GP) and DK-M3GP with Feature Aggregation
(DKA-M3GP). Both use grammars to enhance the search process of GP, in a method called Grammar-Guided GP (GGGP). DK-M3GP uses grammars to incorporate domain knowledge in the search process.
In particular, I use DK-M3GP to define which solutions are humanly valid, in this case by disallowing
arithmetic operations on categorical features. For example, multiplying an individual's postal code by
their wage is not deemed sensible and is thus disallowed.
In DKA-M3GP, I use grammars to include a feature aggregation method in the search space. This
method can be used for time series and panel datasets, to aggregate the target value of historic data based
on a known feature value of a new data point. For example, if I want to predict the number of bikes seen
daily in a city, it is interesting to know how many were seen on average in the last week. Furthermore,
DKA-M3GP allows for filtering the aggregation based on some other feature value. For example, we can
include the average number of bikes seen on past Sundays.
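The aggregation idea can be sketched in a few lines (a hypothetical helper, not DKA-M3GP's actual grammar): aggregate the target value of historic records that share a feature value with the new data point — here, the mean bike count on past days with the same weekday.

```python
from statistics import mean

def weekday_average(history, weekday):
    """DKA-M3GP-style aggregation sketch (illustrative names): given historic
    (weekday, bike_count) records, aggregate the target over records matching
    a feature value of the new data point -- the mean count on that weekday."""
    matches = [count for day, count in history if day == weekday]
    return mean(matches) if matches else None
```

A learned feature of this shape is directly interpretable ("average bikes on past Sundays"), which is exactly the property the grammar is meant to preserve.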
I evaluated my FL methods on two ML problems in two settings: first as independent FL processes, and then as FL steps within four ML pipelines. Independently,
DK-M3GP shows a two-fold advantage over standard M3GP: better interpretability in general, and higher
prediction performance on one problem. DKA-M3GP has much better prediction performance than
M3GP on one problem, and slightly better on the other. Within the ML pipelines, it
performed well on one of the two problems. Overall, my methods show potential for FL.
Both methods are implemented in Genetic Engine, an individual-representation-independent GGGP
framework created as part of this thesis. Genetic Engine is implemented entirely in Python and shows
competing performance with the mature GGGP framework PonyGE2.
Artificial Intelligence (AI) and its subset, Machine Learning (ML), are becoming more important to our lives with every passing day. Both fields are present in our daily routine in applications such as automatic speech recognition, self-driving cars, and image recognition and object detection. ML has been successfully applied in many areas, such as healthcare, finance, and marketing.
In a supervised context, ML models are trained on data and subsequently used to predict the behaviour of future data. The combination of steps carried out to build a fully trained and evaluated ML model is called an ML pipeline, or simply a pipeline. All pipelines follow mandatory steps, namely data retrieval, cleaning, and manipulation; feature selection and construction; model selection and parameter optimization; and, finally, model evaluation. Building ML pipelines is a challenging task, with specifics that depend on the problem domain. There are challenges on the design side and in hyperparameter optimization, as well as on the implementation side.
When designing pipelines, choices must be made about which components to use and in what order. Even for ML experts, designing pipelines is a tedious task. Design choices require ML experience and knowledge of the problem domain, which makes pipeline construction a resource-intensive process.
After the pipeline is designed, its parameters must be optimized to improve its performance. Parameter optimization usually requires sequentially running and evaluating the pipeline, which involves high costs. On the implementation side, programmers may introduce bugs during the development process. These bugs can cost time and money to fix and, if undetected, can compromise the robustness and correctness of the model or introduce performance problems. To address these design and implementation problems, a new line of research emerged, called AutoML (Automated Machine Learning). AutoML aims to automate the design of ML pipelines, their parameter optimization, and their implementation. An important part of ML pipelines is the way the features of the data are manipulated. Data manipulation has many aspects, gathered under the umbrella term Feature Engineering (FE). In short, FE aims to improve the quality of the solution space by selecting the most important features and constructing new, relevant ones. However, this is a resource-intensive process, so its automation is a highly rewarding sub-area of AutoML. In this thesis, I define Feature Learning (FL) as the area of automated FE.
An important metric of FE, and therefore of FL, is the interpretability of the learned features. Interpretability, which falls within the area of Explainable AI (XAI), refers to how easy it is to understand the meaning of a feature. Several scandals in AI, such as racist and sexist models, have led the European Union to propose legislation on models without interpretability. Many classical, and therefore widely used, methods lack interpretability, giving rise to renewed interest in XAI. Current FL research treats existing feature values without relating them to their semantic meaning. For example, engineering a feature that represents the multiplication of a person's postal code by their age is not a logical use of the postal code. Although postal codes can be represented as integers, they should be treated as categorical values. Preventing this kind of interaction between features improves pipeline performance, since it reduces the search space of possible features to those that make semantic sense. Moreover, this process results in features that are intrinsically interpretable. In this way, knowledge about the problem domain prevents the engineering of meaningless features during the FE process.
Another aspect of FL that existing methods usually do not consider is the aggregation of the values of a single feature across multiple data entities. For example, consider a credit card fraud dataset. The average amount of a card's previous transactions is potentially an interesting feature to include, as it conveys what a 'normal' transaction looks like. However, this is generally not directly inferable with existing FL methods. I refer to this FL method as entity aggregation, or simply aggregation.
Finally, despite the unpredictable nature of real-world datasets, existing methods mostly require features with homogeneous data. This forces data scientists to preprocess the dataset, often by transforming categories into integers or applying some kind of encoding, such as one-hot encoding. However, as discussed above, this can reduce the interpretability and performance of the pipeline.
Genetic Programming (GP), an ML method, is also used for FL and allows the creation of models that are more interpretable than those of most traditional methods. GP is a search-based method that evolves programs or, in the case of FL, mappings between feature spaces. Existing GP-based FL methods do not incorporate the three aspects mentioned above: domain knowledge, aggregation, and support for heterogeneous data types. Some approaches incorporate parts of these aspects, mainly by using grammars to guide the search process. The goal of this work is to explore whether GP can use grammars to improve the quality of FL, in terms of either predictive performance or interpretability. First, we built Genetic Engine, a Grammar-Guided GP (GGGP) framework. Genetic Engine is an easy-to-use GGGP framework that can express complex grammars. We show that Genetic Engine performs well compared with the state-of-the-art Python framework PonyGE2.
Second, I propose two new GGGP-based FL methods implemented in Genetic Engine. Both methods extend M3GP, the state-of-the-art GP-based FL method. The first incorporates domain knowledge and is called M3GP with Domain Knowledge (DK-M3GP). It restricts feature behaviour by allowing only sensible interactions, by means of conditions and statements. The second method extends DK-M3GP by introducing aggregation into the search space, and is called DK-M3GP with Aggregation (DKA-M3GP). DKA-M3GP makes full use of Genetic Engine's ease of implementation, as it requires implementing a complex grammar.
In this work, DK-M3GP and DKA-M3GP were evaluated against traditional GP, M3GP, and numerous classical FL methods on two ML problems. The new approaches were evaluated both as standalone FL methods and as part of a larger pipeline. As standalone FL methods, both demonstrate good prediction performance on at least one of the two problems. As part of a pipeline, the methods show little advantage over classical methods in prediction performance. After analysing the results, a possible explanation lies in the FL methods overfitting to the fitness function and the training dataset.
In this work, I also discuss the improvement in interpretability after incorporating domain knowledge into the search process. A preliminary evaluation of DK-M3GP indicates that, using the Expression Size (ES) complexity measure, an improvement in interpretability is achievable. However, I also found that the complexity measure used may not be the most suitable, because the tree-like structure of the features constructed by DK-M3GP inflates their ES. I consider that a more sophisticated interpretability evaluation method should account for this.
ProGReST: Prototypical Graph Regression Soft Trees for Molecular Property Prediction
In this work, we propose the novel Prototypical Graph Regression
Self-explainable Trees (ProGReST) model, which combines prototype learning,
soft decision trees, and Graph Neural Networks. In contrast to other works, our
model can be used to address various challenging tasks, including compound
property prediction. In ProGReST, the rationale is obtained along with
prediction due to the model's built-in interpretability. Additionally, we
introduce a new graph prototype projection to accelerate model training.
Finally, we evaluate ProGReST on a wide range of chemical datasets for
molecular property prediction and perform in-depth analysis with chemical
experts to evaluate the obtained interpretations. Our method achieves competitive
results against state-of-the-art methods.
Comment: In the review process.
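The soft-tree component can be illustrated as follows (a generic sketch under the usual soft decision tree formulation, not ProGReST's exact model; the GNN embedding and prototype projection are omitted): each internal node routes the input right with probability sigmoid(w·x + b), and the regression output is the sum of leaf values weighted by root-to-leaf path probabilities, which is what makes every prediction traceable to a soft path through the tree.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_tree_predict(x, weights, biases, leaf_values):
    """Soft decision tree regression sketch: a complete binary tree with
    len(weights) internal nodes (array layout: children of node i are
    2i+1 and 2i+2). Each internal node sends the input right with
    probability sigmoid(w . x + b); the output is the sum of leaf values
    weighted by root-to-leaf path probabilities."""
    n_internal = len(weights)
    prob = {0: 1.0}                 # node index -> probability of reaching it
    for i in range(n_internal):
        p_right = sigmoid(sum(w * xi for w, xi in zip(weights[i], x)) + biases[i])
        prob[2 * i + 1] = prob[i] * (1.0 - p_right)   # left child
        prob[2 * i + 2] = prob[i] * p_right           # right child
    leaves = range(n_internal, 2 * n_internal + 1)
    return sum(prob[l] * leaf_values[l - n_internal] for l in leaves)
```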
Adaptive Algorithms For Classification On High-Frequency Data Streams: Application To Finance
Mención Internacional en el título de doctor.
In recent years, the problem of concept drift has gained importance in the financial
domain. The succession of manias, panics, and crashes has stressed the non-stationary
nature and the likelihood of drastic structural changes in financial markets.
The most recent literature suggests the use of conventional machine learning and statistical
approaches for this. However, these techniques are unable or slow to adapt
to non-stationarities and may require re-training over time, which is computationally
expensive and brings financial risks.
This thesis proposes a set of adaptive algorithms to deal with high-frequency data
streams and applies these to the financial domain. We present approaches to handle
different types of concept drifts and perform predictions using up-to-date models.
These mechanisms are designed to provide fast reaction times and are thus applicable
to high-frequency data. The core experiments of this thesis are based on the prediction
of the price movement direction at different intraday resolutions in the SPDR S&P 500
exchange-traded fund. The proposed algorithms are benchmarked against other popular
methods from the data stream mining literature and achieve competitive results.
We believe that this thesis opens good research prospects for financial forecasting
during market instability and structural breaks. Results have shown that our proposed
methods can improve prediction accuracy in many of these scenarios. Indeed, the
results obtained are compatible with ideas against the efficient market hypothesis.
However, we cannot claim that we can consistently beat buy-and-hold; therefore, we
cannot reject it.
Programa de Doctorado en Ciencia y Tecnología Informática, Universidad Carlos III de Madrid. Committee chair: Gustavo Recio Isasi. Secretary: Pedro Isasi Viñuela. Member: Sandra García Rodrígue
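To make the "fast reaction" idea concrete, here is a minimal DDM-style drift monitor (a textbook sketch, not one of the thesis's proposed algorithms): track the streaming error rate and its standard deviation, and signal drift once the current level rises well above the best level seen so far, at which point the model would be retrained or replaced.

```python
import math

class DriftDetector:
    """Minimal DDM-style drift monitor: track the running error rate p and its
    binomial std s, remember the best (lowest) p + s seen, and signal drift
    when the current p + s exceeds p_min + 3 * s_min."""
    def __init__(self):
        self.n = 0
        self.p = 0.0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed one outcome (1 = misclassified); return True when drift fires."""
        self.n += 1
        self.p += (error - self.p) / self.n               # running error rate
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)   # binomial std estimate
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        return self.n > 30 and self.p + s > self.p_min + 3.0 * self.s_min
```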
Advances in Streaming Novelty Detection
153 p.
First, this thesis addresses a confusion between terms and problems, in which the same term is used to refer to different problems and, similarly, the same problem is called by different terms interchangeably. This hinders progress in the field, since related literature is hard to find and work tends to be repeated. The first contribution proposes a one-to-one assignment of terms to problems and a formalization of the learning scenarios, in an attempt to standardize the field.
Second, the thesis addresses the problem of Streaming Novelty Detection. In this problem, a model is learned from a supervised dataset. The model then receives new unlabeled instances and predicts their class online, i.e., in a stream. The model must be updated to cope with concept drift. In this classification scenario, it is assumed that new classes can emerge dynamically, so the model must be able to discover new classes automatically and without supervision. In this context, this thesis makes two contributions: first, a solution based on Gaussian mixtures, where each class is modeled by one of the mixture components; second, the use of neural networks, such as Autoencoder networks and Deep Support Vector Data Description networks, to work with time series.
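The Gaussian-mixture contribution can be caricatured in a few lines (an illustrative sketch with one univariate Gaussian per class; the thesis's models are richer and also update online): classify an instance by its best z-score over the known classes, and flag it as a candidate new class when even the best score is too far out.

```python
from statistics import mean, pstdev

class NoveltyDetector:
    """Sketch of the mixture-style idea: one Gaussian per known class; an
    instance whose best z-score exceeds `threshold` is flagged as a candidate
    new class. Assumes each class has nonzero variance in the labeled data."""
    def __init__(self, labeled, threshold=3.0):
        # labeled: {class_label: [values]} from the supervised warm-up set
        self.classes = {c: (mean(xs), pstdev(xs)) for c, xs in labeled.items()}
        self.threshold = threshold

    def predict(self, x):
        z = {c: abs(x - m) / s for c, (m, s) in self.classes.items()}
        best = min(z, key=z.get)
        return best if z[best] <= self.threshold else "novel"
```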
DIVERSIFICATION TRENDS INFERRED FROM THE FOSSIL RECORD
Macroevolution focuses on patterns and processes occurring above the level of species and over geological timescales (Raia, 2016). Investigating diversification processes, both morphological and taxonomical, gives a chance to answer important questions in evolutionary biology. Why do some clades have more species than others? Why do some groups undergo striking adaptive radiations, while others persist for millions of years as living fossils? Why do some groups have much more ecological or morphological diversity than others? Is there anything that limits the number of species on Earth and, if so, what? These complex questions share a common underlying feature: all, to some degree, concern rates of macroevolutionary change that occur across geological timescales (Rabosky & Slater, 2014).
The aim of my project was to produce a coherent array of new methods to investigate phenotypic and taxonomic diversification by using phylogenies that include extinct species. I started by developing RRphylo, a new phylogenetic comparative method based on phylogenetic ridge regression, which works with a phylogenetic tree and phenotypic data (either univariate or multivariate) to estimate branch-wise rates of phenotypic evolution and ancestral characters simultaneously. The main innovations, which translate into advantages of RRphylo over existing methods, lie in the absence of any a priori hypothesis about the mode of phenotypic evolution and in its ability to deal with fossil phylogenies. Both factors make RRphylo well suited to studying phenotypic evolution in its different facets.
I further extended RRphylo to locate clade- or state-related shifts in absolute rates of phenotypic evolution, to integrate the effect of additional (to the phylogeny) predictors on rate estimation, to identify temporal trends in phenotypic mean and evolutionary rates occurring on the entire tree or pertaining to individual clades, to identify instances of morphological convergence, to include ancestral character information derived from the fossil record, and to work with discrete variables.
All these tools are collected in the RRphylo R package, online since April 2018 and counting > 14000 downloads on CRAN to date. I have been handling the maintenance and updates of the RRphylo package for both the release (https://cran.r-project.org/web/packages/RRphylo/index.html) and development (https://github.com/pasraia/RRphylo) versions, and creating/updating explanatory vignettes to facilitate its usage.
Analysis of Students' Programming Knowledge and Error Development
Learning to program is a major challenge for many, as it requires a range of skills. One must not only know the programming language and its concepts; specific domain knowledge and a certain problem-solving competence are also required. Knowing how students' programming skills develop and where their difficulties lie can help in developing suitable teaching strategies. With ever-growing student numbers, however, it becomes increasingly difficult for teachers to recognize students' needs, problems, and difficulties.
The goal of this thesis is to gain insight into the development of students' programming skills based on their solutions to programming exercises. Knowledge is composed of so-called knowledge components. In this thesis we focus on syntactic knowledge components, which can be derived from abstract syntax trees, and semantic knowledge components, which are represented by so-called roles of variables.
Since knowledge itself cannot be measured directly, skill models are frequently used to estimate the state of knowledge. The programming domain, however, has its own particular characteristics that must be taken into account when choosing a suitable skill model. One of its main characteristics is that the knowledge components are not independent of one another. For this reason, we propose a dynamic Bayesian network (DBN) as a skill model, since it allows these dependencies to be modeled explicitly. Besides the choice of a suitable skill model, certain meta-parameters, such as the granularity of the knowledge components, must also be set. We therefore evaluate how the choice of meta-parameters affects the predictive quality of skill models and how these meta-parameters should be chosen. We use the DBN to derive learning curves for each knowledge component and to draw implications for teaching from them.
Not only students' knowledge but also their "mal-knowledge" matters. We therefore first manually examine all of the students' programming errors and determine their frequency, duration, and recurrence rate. We distinguish between the error categories syntactic, conceptual, strategic, sloppiness, misinterpretation, and domain, and examine how the errors develop over time. We also use k-means clustering to find potential patterns in the development of errors.
The results of our case studies are promising. We can show that the choice of meta-parameters has a large influence on the predictive quality of models. Moreover, our DBN performs comparably to other skill models while being easier to interpret. The learning curves of the knowledge components and the analysis of programming errors provide valuable insights that can help improve courses, e.g., that students need more practice exercises, or which concepts they struggle with.
Learning to program is a hard task since it involves different types of specialized knowledge. You not only need knowledge about the programming language and its concepts, but also knowledge from the problem domain and general problem-solving abilities. Knowing how students develop programming knowledge and where they struggle may help in the development of suitable teaching strategies. However, the ever-increasing number of students makes it more and more difficult for educators to identify students' needs, problems, and deficiencies.
The goal of this thesis is to gain insights into students' programming knowledge development based on their solutions to programming exercises. Knowledge is composed of so-called knowledge components (KCs). In this thesis, we focus on KCs on a syntactic level, which can be derived from abstract syntax trees, e.g., loops, comparison, etc., and on a semantic level, represented by so-called roles of variables.
Since knowledge is not directly measurable, skill models are often used to estimate knowledge. But the programming domain has its own characteristics which have to be considered when selecting an appropriate skill model. One of the main characteristics of the programming domain is the dependencies between KCs. Hence, we propose and evaluate a Dynamic Bayesian Network (DBN) for skill modeling, which allows these dependencies to be modeled explicitly. Besides the choice of a concrete model, certain meta-parameters, e.g., the granularity level of KCs, have to be set when designing a skill model. Therefore, we evaluate how meta-parameterization affects the prediction performance of skill models and which meta-parameters to choose. We use the DBN to create learning curves for each KC and deduce implications for teaching from them.
Not only students' knowledge but also their "mal-knowledge" is of importance. Therefore, we manually inspect students' programming errors and determine each error's frequency, duration, and re-occurrence. We distinguish between the error categories syntactic, conceptual, strategic, sloppiness, misinterpretation, and domain, and analyze how the errors change over time. Moreover, we use k-means clustering to identify different patterns in the development of programming errors.
The results of our case studies are promising. We show that the correct meta-parameterization has a huge effect on the prediction performance of skill models. In addition, our DBN performs as well as the other skill models while providing better interpretability. The learning curves of KCs and the analysis of programming errors provide valuable information which can be used for course improvement, e.g., that students require more practice opportunities or are struggling with certain concepts.
2022-02-0
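For intuition, the simplest skill model that the proposed DBN generalizes is Bayesian Knowledge Tracing, which tracks a single KC with no inter-KC dependencies; a one-step update looks like this (parameter values are illustrative, not taken from the thesis):

```python
def bkt_update(p_know, correct, guess=0.2, slip=0.1, learn=0.15):
    """One step of Bayesian Knowledge Tracing: Bayes-update P(skill known)
    from one observed answer, then apply the chance of learning at this step.
    guess = P(correct | not known), slip = P(wrong | known)."""
    if correct:
        evidence = (p_know * (1 - slip) /
                    (p_know * (1 - slip) + (1 - p_know) * guess))
    else:
        evidence = (p_know * slip /
                    (p_know * slip + (1 - p_know) * (1 - guess)))
    return evidence + (1 - evidence) * learn
```

Chaining this update over a student's answer sequence yields exactly the kind of per-KC learning curve discussed above; the DBN extends the picture by letting one KC's mastery condition another's.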