
    Almost optimal exact distance oracles for planar graphs

    Get PDF
    We consider the problem of preprocessing a weighted directed planar graph in order to quickly answer exact distance queries. The main tension in this problem is between space S and query time Q, and since the mid-1990s all results had polynomial time-space tradeoffs, e.g., Q = Θ̃(n/√S) or Q = Θ̃(n^{5/2}/S^{3/2}). In this article we show that there is no polynomial tradeoff between time and space and that it is possible to simultaneously achieve almost optimal space n^{1+o(1)} and almost optimal query time n^{o(1)}. More precisely, we achieve the following space-time tradeoffs: n^{1+o(1)} space and log^{2+o(1)} n query time, n log^{2+o(1)} n space and n^{o(1)} query time, and n^{4/3+o(1)} space and log^{1+o(1)} n query time. We reduce a distance query to a variety of point location problems in additively weighted Voronoi diagrams and develop new algorithms for the point location problem itself using several partially persistent dynamic tree data structures.
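The reduction above hinges on additively weighted Voronoi diagrams. As a minimal, hedged illustration of the definition only (not the paper's n^{o(1)}-time point-location structure), the sketch below assigns each vertex of a graph to the site minimizing additive weight plus shortest-path distance, using a brute-force Dijkstra per site:

```python
# Additively weighted Voronoi diagram on a graph: each site s carries an
# additive weight w(s), and a vertex u belongs to the cell of the site
# minimizing w(s) + d(s, u). The paper answers this point-location query in
# n^{o(1)} time; the brute-force scan below only pins down the definition.
import heapq

def dijkstra(adj, src):
    """Single-source shortest paths; adj[u] = list of (v, edge_length)."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def voronoi_cells(adj, sites):
    """sites: dict site -> additive weight. Returns vertex -> owning site."""
    dist_from = {s: dijkstra(adj, s) for s in sites}
    return {
        u: min(sites, key=lambda s: sites[s] + dist_from[s].get(u, float("inf")))
        for u in adj
    }
```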

    Data Aggregation for Hierarchical Clustering

    Full text link
    Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most flexible clustering method, because it can be used with many distances, similarities, and various linkage strategies. It is often used when the number of clusters in the data set is unknown and some sort of hierarchy in the data is plausible. Most algorithms for HAC operate on a full distance matrix and therefore require quadratic memory. The standard algorithm also has cubic runtime to produce a full hierarchy. Both memory and runtime are especially problematic in the context of embedded or otherwise very resource-constrained systems. In this section, we present how data aggregation with BETULA, a numerically stable version of the well-known BIRCH data aggregation algorithm, can be used to make HAC viable on systems with constrained resources, with only small losses in clustering quality, and hence allow exploratory data analysis of very large data sets.
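As a sketch of the aggregation idea, and under the assumption that scikit-learn's BIRCH is an acceptable stand-in for BETULA (whose cluster features are mean-based and numerically more stable), one can compress n points into k ≪ n subcluster centers and run HAC on those, so the quadratic distance matrix is k × k rather than n × n:

```python
# Aggregate-then-cluster: BIRCH compresses the data into subcluster centers,
# and HAC runs on the k representatives instead of the n raw points.
import numpy as np
from sklearn.cluster import Birch
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 4))      # too large for an n x n distance matrix

brc = Birch(threshold=0.5, n_clusters=None).fit(X)
reps = brc.subcluster_centers_          # k << n aggregated representatives

Z = linkage(reps, method="ward")        # HAC on k points only
labels = fcluster(Z, t=10, criterion="maxclust")
print(reps.shape, labels.shape)
```

A more faithful variant would also weight each representative by the number of points it aggregates when computing the linkage, as the BETULA-based HAC does.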

    Pre-Distribution of Entanglements in Quantum Networks

    Full text link
    Quantum network communication is challenging, as the No-Cloning theorem in the quantum regime makes many classical techniques inapplicable. For long-distance communication, the only viable approach is teleportation of quantum states, which requires a prior distribution of entangled pairs (EPs) of qubits. Establishing EPs across remote nodes can incur significant latency due to the low probability of success of the underlying physical processes. To reduce EP generation latency, prior works have looked at selecting efficient entanglement-routing paths and using multiple such paths simultaneously for EP generation. In this paper, we propose and investigate a complementary technique to reduce EP generation latency: pre-distributing EPs over certain (pre-determined) pairs of network nodes; these pre-distributed EPs can then be used to generate EPs for the requested pairs, when needed, with lower generation latency. For such a pre-distribution approach to be most effective, we need to address an optimization problem: selecting the node pairs where EPs should be pre-distributed so as to minimize the generation latency of expected EP requests, under a given cost constraint. In this paper, we formulate the above optimization problem appropriately and design two efficient algorithms, one of which is a greedy approach based on an approximation algorithm for a special case. Via extensive evaluations over the NetSquid simulator, we demonstrate the effectiveness of our approach and developed techniques; we show that our algorithms outperform a naive approach by up to an order of magnitude. Comment: 11 pages, 9 figures.
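The abstract does not spell out the algorithms; the sketch below is a generic budgeted greedy of the kind alluded to, with placeholder benefit and cost models (both hypothetical, not the paper's formulation), choosing the pair with the best expected-latency-reduction-to-cost ratio until the budget is exhausted:

```python
# Budgeted greedy selection of node pairs for EP pre-distribution.
# benefit(p) and cost(p) are placeholder callables; a faithful version would
# recompute marginal benefit given the pairs already chosen, since
# pre-distributed EPs on different pairs interact.
def greedy_predistribution(candidate_pairs, benefit, cost, budget):
    """candidate_pairs: iterable of node pairs; returns the chosen pairs."""
    chosen, spent = [], 0.0
    remaining = set(candidate_pairs)
    while remaining:
        affordable = [p for p in remaining if spent + cost(p) <= budget]
        if not affordable:
            break
        best = max(affordable, key=lambda p: benefit(p) / cost(p))
        chosen.append(best)
        spent += cost(best)
        remaining.remove(best)
    return chosen
```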

    Optimization of feature learning through grammar-guided genetic programming

    Get PDF
    Master's thesis, Data Science, 2022, Universidade de Lisboa, Faculdade de Ciências. Machine Learning (ML) is becoming more prominent in daily life. A key aspect of ML is Feature Engineering (FE), which can entail a long and tedious process. Therefore, the automation of FE, known as Feature Learning (FL), can be highly rewarding. FL methods should not only have high prediction performance, but should also produce interpretable models. Many current high-performance ML methods that can be considered FL methods, such as Neural Networks and PCA, lack interpretability. A popular ML method used for FL that produces interpretable models is Genetic Programming (GP), with multiple successful applications and methods like M3GP. In this thesis, I present two new GP-based FL methods, namely M3GP with Domain Knowledge (DK-M3GP) and DK-M3GP with feature Aggregation (DKA-M3GP). Both use grammars to enhance the search process of GP, in an approach called Grammar-Guided GP (GGGP). DK-M3GP uses grammars to incorporate domain knowledge in the search process. In particular, I use DK-M3GP to define which solutions are humanly valid, in this case by disallowing arithmetic operations on categorical features. For example, multiplying an individual's postal code by their wage is not deemed sensible and is thus disallowed. In DKA-M3GP, I use grammars to include a feature aggregation method in the search space. This method can be used for time series and panel datasets to aggregate the target value of historic data based on a known feature value of a new data point. For example, if I want to predict the number of bikes seen daily in a city, it is interesting to know how many were seen on average in the last week. Furthermore, DKA-M3GP allows filtering the aggregation based on some other feature value. For example, we can include the average number of bikes seen on past Sundays. I evaluated my FL methods on two ML problems in two environments. First, I evaluate the independent FL process, and after that, I evaluate the FL steps within four ML pipelines. Independently, DK-M3GP shows a two-fold advantage over normal M3GP: better interpretability in general, and higher prediction performance for one problem. DKA-M3GP has a much better prediction performance than M3GP for one problem, and a slightly better one for the other. Furthermore, within the ML pipelines, my methods performed well in one of the two problems. Overall, my methods show potential for FL. Both methods are implemented in Genetic Engine, an individual-representation-independent GGGP framework created as part of this thesis. Genetic Engine is implemented entirely in Python and shows competitive performance with the mature GGGP framework PonyGE2.
Artificial Intelligence (AI) and its subset, Machine Learning (ML), are becoming more important to our lives with every passing day. Both fields are present in our daily routine in applications such as automatic speech recognition, self-driving cars, and image recognition and object detection. ML has been successfully applied in many areas, such as healthcare, finance, and marketing. In a supervised context, ML models are trained on data and subsequently used to predict the behaviour of future data. The combination of steps carried out to build a fully trained and evaluated ML model is called an ML pipeline, or simply a pipeline.
All pipelines follow mandatory steps: retrieving, cleaning, and manipulating the data; selecting and constructing features; selecting the model and optimizing its parameters; and, finally, evaluating the model. Building ML pipelines is a challenging task, with specifics that depend on the problem domain. There are challenges on the design side and in hyperparameter optimization, as well as on the implementation side. When designing pipelines, choices must be made regarding which components to use and in which order. Even for ML experts, designing pipelines is a tedious task. Design choices require ML expertise and knowledge of the problem domain, which makes pipeline construction a resource-intensive process. After the pipeline is designed, its parameters must be optimized to improve its performance. Parameter optimization generally requires running and evaluating the pipeline sequentially, which incurs high costs. On the implementation side, programmers may introduce bugs during the development process. These bugs can cost time and money to fix and, if undetected, can compromise the robustness and correctness of the model or introduce performance problems. To get around these design and implementation problems, a new line of research called AutoML (Automated Machine Learning) has emerged. AutoML aims to automate the design of ML pipelines, their parameter optimization, and their implementation. An important part of ML pipelines is the way the features of the data are manipulated. Data manipulation has many aspects, gathered under the umbrella term Feature Engineering (FE). In short, FE aims to improve the quality of the solution space by selecting the most important features and constructing new, relevant ones. However, this is a resource-intensive process, so its automation is a highly rewarding sub-area of AutoML. In this thesis, I define Feature Learning (FL) as the area of automated FE. An important metric of FE, and therefore of FL, is the interpretability of the learned features. Interpretability, which falls under the area of Explainable AI (XAI), refers to how easy it is to understand the meaning of a feature. The occurrence of several AI scandals, such as racist and sexist models, has led the European Union to propose legislation on models that lack interpretability. Many classical, and therefore widely used, methods lack interpretability, giving rise to the newfound interest in XAI. Current FL research treats existing feature values without relating them to their semantic meaning. For example, engineering a feature that represents the multiplication of a person's postal code by their age is not a logical use of the postal code. Although postal codes can be represented as integers, they should be treated as categorical values. Preventing this kind of interaction between features improves pipeline performance, since it reduces the search space of possible features to those that make semantic sense. Moreover, this process results in features that are intrinsically interpretable. In this way, knowledge about the problem domain prevents the engineering of meaningless features during the FE process.
Another aspect of FL not usually considered by existing methods is the aggregation of the values of a single feature across several data entities. For example, consider a credit card fraud dataset. The average amount of a card's previous transactions is potentially an interesting feature to include, as it conveys the meaning of a 'normal' transaction. However, this is generally not directly inferable with existing FL methods. I refer to this FL method as entity aggregation, or simply aggregation. Finally, despite the unpredictable nature of real-life datasets, existing methods mostly require features with homogeneous data. This forces data scientists to pre-process the dataset, often by transforming categories into integers or applying some kind of encoding, such as one-hot encoding. However, as discussed above, this can reduce the interpretability and performance of the pipeline. Genetic Programming (GP), an ML method, is also used for FL and allows the creation of models that are more interpretable than those of most traditional methods. GP is a search-based method that evolves programs or, in the case of FL, mappings between feature spaces. Existing GP-based FL methods do not incorporate the three aspects mentioned above: domain knowledge, aggregation, and support for heterogeneous data types. Some approaches incorporate parts of these aspects, mainly by using grammars to guide the search process. The goal of this work is to explore whether GP can use grammars to improve the quality of FL, either in terms of predictive performance or of interpretability. First, we built Genetic Engine, a Grammar-Guided GP (GGGP) framework. Genetic Engine is an easy-to-use GGGP framework that allows complex grammars to be expressed. We show that Genetic Engine performs well compared with the state-of-the-art Python framework PonyGE2. Second, I propose two new GGGP-based FL methods implemented in Genetic Engine. Both methods extend M3GP, the state-of-the-art GP-based FL method. The first incorporates domain knowledge and is called M3GP with Domain Knowledge (DK-M3GP). It restricts how features may be combined, allowing only sensible interactions by means of conditions and statements. The second method extends DK-M3GP by introducing aggregation into the search space, and is called DK-M3GP with Aggregation (DKA-M3GP). DKA-M3GP makes full use of Genetic Engine's ease of implementation, as it requires a complex grammar. In this work, DK-M3GP and DKA-M3GP were evaluated against traditional GP, M3GP, and numerous classical FL methods on two ML problems. The new approaches were evaluated both as standalone FL methods and as part of a larger pipeline. As standalone FL methods, both demonstrate good predictive performance on at least one of the two problems. As part of a pipeline, the methods show little advantage over classical methods in predictive performance.
After analyzing the results, a possible explanation lies in the FL methods overfitting the fitness function and the training dataset. In this work, I also discuss the improvement in interpretability after incorporating domain knowledge into the search process. A preliminary evaluation of DK-M3GP indicates that, using the Expression Size (ES) complexity measure, an improvement in interpretability is achievable. However, I also found that this complexity measure may not be the most suitable one, because the tree-shaped structure of the features constructed by DK-M3GP inflates their ES. I believe a more sophisticated interpretability evaluation method should make this apparent.
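As a hedged illustration of the DK-M3GP idea described above, the toy grammar below uses a type system in which arithmetic operators close only over numeric expressions, so a tree multiplying a postal code by a wage can never be generated, while categorical features may enter only through comparisons. The class names are hypothetical and this is not Genetic Engine's actual API:

```python
# Toy typed grammar in the spirit of grammar-guided GP: the nonterminals are
# Python classes, and each dataclass is a production. Arithmetic (Mul) only
# composes NumericExpr subtrees, so arithmetic on categorical features is
# unrepresentable by construction.
from abc import ABC
from dataclasses import dataclass

class NumericExpr(ABC): ...
class CategoricalExpr(ABC): ...

@dataclass
class NumericFeature(NumericExpr):
    name: str                     # e.g. "wage"

@dataclass
class CategoricalFeature(CategoricalExpr):
    name: str                     # e.g. "postal_code"

@dataclass
class Mul(NumericExpr):           # multiplication: numeric children only
    left: NumericExpr
    right: NumericExpr

@dataclass
class IfEqual(NumericExpr):
    # categorical features may appear in comparisons, not in arithmetic
    cat: CategoricalExpr
    value: str
    then: NumericExpr
    otherwise: NumericExpr
```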

    ProGReST: Prototypical Graph Regression Soft Trees for Molecular Property Prediction

    Full text link
    In this work, we propose the novel Prototypical Graph Regression Self-explainable Trees (ProGReST) model, which combines prototype learning, soft decision trees, and Graph Neural Networks. In contrast to other works, our model can be used to address various challenging tasks, including compound property prediction. In ProGReST, the rationale is obtained along with the prediction due to the model's built-in interpretability. Additionally, we introduce a new graph prototype projection to accelerate model training. Finally, we evaluate ProGReST on a wide range of chemical datasets for molecular property prediction and perform an in-depth analysis with chemical experts to evaluate the obtained interpretations. Our method achieves competitive results against state-of-the-art methods. Comment: In the review process.
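As a hedged sketch of the soft-tree-with-prototypes ingredient (shapes, similarity function, and tree layout are assumptions here; the actual ProGReST architecture, including its GNN encoder, is in the paper), each internal node routes an embedding left or right with a probability derived from its prototype, and the prediction is the probability-weighted sum of leaf values:

```python
# Complete binary soft decision tree over an input embedding z.
# prototypes: (2**depth - 1, d) internal nodes in breadth-first order;
# leaf_values: (2**depth,) regression values at the leaves.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_tree_predict(z, prototypes, leaf_values, depth):
    n_leaves = 2 ** depth
    pred = 0.0
    for leaf in range(n_leaves):
        node, prob = 0, 1.0
        for level in range(depth):
            go_right = (leaf >> (depth - 1 - level)) & 1
            p_right = sigmoid(z @ prototypes[node])  # prototype-based routing
            prob *= p_right if go_right else (1.0 - p_right)
            node = 2 * node + 1 + go_right           # breadth-first child
        pred += prob * leaf_values[leaf]             # weight leaf by path prob
    return pred

rng = np.random.default_rng(0)
z = rng.normal(size=8)
print(soft_tree_predict(z, rng.normal(size=(3, 8)), rng.normal(size=4), depth=2))
```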

    Adaptive Algorithms For Classification On High-Frequency Data Streams: Application To Finance

    Get PDF
    International Doctorate mention. In recent years, the problem of concept drift has gained importance in the financial domain. The succession of manias, panics, and crashes has stressed the non-stationary nature of financial markets and the likelihood of drastic structural changes in them. The most recent literature suggests the use of conventional machine learning and statistical approaches for this. However, these techniques are unable or slow to adapt to non-stationarities and may require re-training over time, which is computationally expensive and brings financial risks. This thesis proposes a set of adaptive algorithms to deal with high-frequency data streams and applies them to the financial domain. We present approaches to handle different types of concept drift and to perform predictions using up-to-date models. These mechanisms are designed to provide fast reaction times and are thus applicable to high-frequency data. The core experiments of this thesis are based on the prediction of the price movement direction at different intraday resolutions in the SPDR S&P 500 exchange-traded fund. The proposed algorithms are benchmarked against other popular methods from the data stream mining literature and achieve competitive results. We believe that this thesis opens good research prospects for financial forecasting during market instability and structural breaks. Results have shown that our proposed methods can improve prediction accuracy in many of these scenarios. Indeed, the results obtained are compatible with ideas against the efficient market hypothesis. However, we cannot claim to consistently beat buy-and-hold; therefore, we cannot reject it. Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid. Committee: President: Gustavo Recio Isasi; Secretary: Pedro Isasi Viñuela; Member: Sandra García Rodrígue
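The thesis's specific mechanisms are not detailed in this abstract; a generic sliding-window drift monitor in the same spirit (window size and margin are illustrative assumptions) flags when the recent error rate of the deployed model degrades past a reference level, signalling that a model refresh is needed:

```python
# Sliding-window concept-drift monitor: track the error rate of the current
# model on recent labeled outcomes and signal drift when it exceeds the best
# error seen so far by a margin.
from collections import deque

class WindowDriftMonitor:
    def __init__(self, window=200, margin=0.1):
        self.recent = deque(maxlen=window)
        self.reference_error = None   # best full-window error observed so far
        self.margin = margin

    def update(self, correct: bool) -> bool:
        """Feed one prediction outcome; returns True if drift is signalled."""
        self.recent.append(0.0 if correct else 1.0)
        if len(self.recent) < self.recent.maxlen:
            return False              # wait until the window is full
        err = sum(self.recent) / len(self.recent)
        if self.reference_error is None or err < self.reference_error:
            self.reference_error = err
            return False
        return err > self.reference_error + self.margin
```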

    Advances in Streaming Novelty Detection

    Get PDF
    153 p. First, this thesis addresses a confusion between terms and problems, in which the same term is used to refer to different problems and, likewise, the same problem is called by different terms interchangeably. This hinders progress in the field, since it makes related literature hard to find and encourages duplicated work. The first contribution proposes a one-to-one assignment of terms to problems and a formalization of the learning scenarios, in an attempt to standardize the field. Second, the thesis addresses the Streaming Novelty Detection problem. In this problem, a model is learned from a supervised dataset. The model then receives new unlabeled instances and must predict their class online, i.e., over a stream. The model must be updated to cope with concept drift. In this classification scenario, it is assumed that new classes can emerge dynamically. The model must therefore be able to discover new classes automatically and without supervision. In this context, this thesis makes two contributions: first, a solution based on Gaussian mixtures, in which each class is modeled by one of the mixture components; second, the use of neural networks, such as autoencoders and Deep Support Vector Data Description networks, to work with time series.
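A minimal sketch of the first contribution's idea, with an illustrative log-likelihood threshold (the thesis's exact update and class-discovery procedure is not given here): model each known class with one Gaussian component and flag instances that no component explains well as candidates for a new class:

```python
# One Gaussian component per known class; an instance whose best component
# log-likelihood falls below a threshold is flagged as potentially novel
# (and could seed a new component in a full streaming implementation).
import numpy as np
from scipy.stats import multivariate_normal

def fit_components(X, y):
    comps = {}
    for c in np.unique(y):
        Xc = X[y == c]
        comps[c] = multivariate_normal(Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return comps

def classify_or_novel(x, comps, log_threshold=-25.0):
    scores = {c: comp.logpdf(x) for c, comp in comps.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= log_threshold else "novel"
```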

    DIVERSIFICATION TRENDS INFERRED FROM THE FOSSIL RECORD

    Get PDF
    Macroevolution focuses on patterns and processes occurring above the level of species and over geological timescales (Raia, 2016). Investigating diversification processes, both morphological and taxonomic, gives a chance to answer important questions in evolutionary biology. Why do some clades have more species than others? Why do some groups undergo striking adaptive radiations, while others persist for millions of years as living fossils? Why do some groups have much more ecological or morphological diversity than others? Does anything limit the number of species on Earth, and if so, what? These complex questions share a common underlying feature: all, to some degree, concern rates of macroevolutionary change that occur across geological timescales (Rabosky & Slater, 2014). The aim of my project was to produce a coherent array of new methods to investigate phenotypic and taxonomic diversification by using phylogenies that include extinct species. I started by developing RRphylo, a new phylogenetic comparative method based on phylogenetic ridge regression, which works with a phylogenetic tree and phenotypic data (either univariate or multivariate) to estimate branch-wise rates of phenotypic evolution and ancestral characters simultaneously. The main innovations, which translate into advantages of RRphylo over existing methods, lie in the absence of any a priori hypothesis about the mode of phenotypic evolution and in its ability to deal with fossil phylogenies. Both factors make RRphylo well suited to studying phenotypic evolution in its different facets. I further extended RRphylo to locate clade- or state-related shifts in absolute rates of phenotypic evolution, to integrate the effect of predictors additional to the phylogeny in rate estimation, to identify temporal trends in phenotypic mean and evolutionary rates occurring over the entire tree or pertaining to individual clades, to identify instances of morphological convergence, to include ancestral character information derived from the fossil record, and to work with discrete variables. All these tools are collected in the RRphylo R package, online since April 2018 and counting > 14000 downloads on CRAN to date. I have been handling the maintenance and updates of the RRphylo package for both the release (https://cran.r-project.org/web/packages/RRphylo/index.html) and development (https://github.com/pasraia/RRphylo) versions, and creating/updating explanatory vignettes to facilitate its usage.
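A minimal numpy sketch of the ridge-regression core described above (λ selection, ancestral state estimation, and the multivariate and shift-detection machinery of the real package are omitted, and the design-matrix construction is an illustrative reading of the method): with L the tips × branches matrix of branch lengths along each root-to-tip path and y the tip phenotypes, branch-wise rates are the penalized least-squares solution:

```python
# Phenotypic rates via ridge regression: L[i, j] is the length of branch j if
# it lies on the root-to-tip path of tip i, else 0; y holds tip phenotypes.
import numpy as np

def ridge_rates(L, y, lam=1.0):
    """Branch-wise evolutionary rates: argmin ||y - L b||^2 + lam ||b||^2."""
    k = L.shape[1]
    return np.linalg.solve(L.T @ L + lam * np.eye(k), L.T @ y)

# toy tree: two tips sharing a root branch, plus one private branch each
L = np.array([[1.0, 2.0, 0.0],
              [1.0, 0.0, 3.0]])
y = np.array([4.0, -1.0])
print(ridge_rates(L, y))
```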

    Automated Machine Learning for Multi-Label Classification

    Get PDF

    Analysis of Students' Programming Knowledge and Error Development

    Get PDF
    Learning to program is a hard task, since it involves different types of specialized knowledge. You need not only knowledge about the programming language and its concepts, but also knowledge from the problem domain and general problem-solving abilities. Knowing how students develop programming knowledge and where they struggle may help in the development of suitable teaching strategies. However, the ever-increasing number of students makes it more and more difficult for educators to identify students' needs, problems, and deficiencies. The goal of this thesis is to gain insights into students' programming knowledge development based on their solutions to programming exercises. Knowledge is composed of so-called knowledge components (KCs). In this thesis, we focus on KCs on a syntactic level, which can be derived from abstract syntax trees (e.g., loops, comparisons), and on a semantic level, represented by so-called roles of variables. Since knowledge is not directly measurable, skill models are often used to estimate it. However, the programming domain has its own characteristics, which have to be considered when selecting an appropriate skill model. One of the main characteristics of the programming domain is the dependencies between KCs. Hence, we propose and evaluate a Dynamic Bayesian Network (DBN) for skill modeling, which allows these dependencies to be modeled explicitly. Besides the choice of a concrete model, certain meta-parameters, such as the granularity level of KCs, have to be set when designing a skill model. Therefore, we evaluate how meta-parameterization affects the prediction performance of skill models and which meta-parameters to choose. We use the DBN to create learning curves for each KC and deduce implications for teaching from them. Not only students' knowledge but also their "mal-knowledge" is of importance. Therefore, we manually inspect students' programming errors and determine each error's frequency, duration, and re-occurrence. We distinguish between the error categories syntactic, conceptual, strategic, sloppiness, misinterpretation, and domain, and analyze how the errors change over time. Moreover, we use k-means clustering to identify different patterns in the development of programming errors. The results of our case studies are promising. We show that the correct meta-parameterization has a huge effect on the prediction performance of skill models. In addition, our DBN performs as well as the other skill models while providing better interpretability. The learning curves of KCs and the analysis of programming errors provide valuable information which can be used for course improvement, e.g., showing that students require more practice opportunities or are struggling with certain concepts.
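As a hedged sketch of the learning-curve step (the thesis derives its curves from the DBN; the power-law fit below is a conventional stand-in), one can track the error rate of a knowledge component over successive practice opportunities and fit error(t) ≈ a·t^(−b), reading a large b as evidence of learning and a flat curve as persistent struggle:

```python
# Fit a power-law learning curve to per-opportunity error rates of one KC.
import numpy as np

def fit_power_law(error_per_opportunity):
    t = np.arange(1, len(error_per_opportunity) + 1, dtype=float)
    e = np.clip(np.asarray(error_per_opportunity, dtype=float), 1e-6, None)
    # linear fit in log-log space: log e = log a - b log t
    slope, intercept = np.polyfit(np.log(t), np.log(e), 1)
    return np.exp(intercept), -slope          # (a, learning rate b)

a, b = fit_power_law([0.62, 0.48, 0.40, 0.33, 0.30, 0.26])
print(f"a={a:.2f}, learning rate b={b:.2f}")
```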