Forecasting day-ahead electricity prices in Europe: the importance of considering market integration
Motivated by the increasing integration among electricity markets, in this
paper we propose two different methods to incorporate market integration in
electricity price forecasting and to improve the predictive performance. First,
we propose a deep neural network that considers features from connected markets
to improve the predictive accuracy in a local market. To measure the importance
of these features, we propose a novel feature selection algorithm that, by
using Bayesian optimization and functional analysis of variance, evaluates the
effect of the features on the algorithm performance. In addition, using market
integration, we propose a second model that, by simultaneously predicting
prices from two markets, improves the forecasting accuracy even further. As a
case study, we consider the electricity market in Belgium and the improvements
in forecasting accuracy when using various French electricity features. We show
that the two proposed models lead to improvements that are statistically
significant. Particularly, due to market integration, the predictive accuracy
is improved from 15.7% to 12.5% sMAPE (symmetric mean absolute percentage
error). In addition, we show that the proposed feature selection algorithm is
able to perform a correct assessment, i.e. to discard the irrelevant features
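For reference, the sMAPE figures quoted above can be reproduced in spirit with a few lines of code. This is a minimal sketch assuming one common definition of sMAPE (absolute error divided by the mean of the absolute actual and forecast values, averaged and expressed in percent); the abstract does not state which variant the authors used, and the function name `smape` and the sample prices are hypothetical.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent.

    Assumes the common definition
        100 / N * sum(|y - f| / ((|y| + |f|) / 2)).
    Terms where both values are zero are counted as zero error.
    """
    if len(actual) != len(forecast):
        raise ValueError("series must have the same length")
    if not actual:
        raise ValueError("series must be non-empty")
    total = 0.0
    for y, f in zip(actual, forecast):
        denom = (abs(y) + abs(f)) / 2.0
        if denom > 0:
            total += abs(y - f) / denom
    return 100.0 * total / len(actual)


# Hypothetical prices (EUR/MWh) and forecasts, for illustration only.
prices = [40.0, 50.0, 60.0]
predicted = [44.0, 50.0, 54.0]
print(round(smape(prices, predicted), 2))
```

Because the denominator uses both the actual and the forecast value, the metric stays bounded even when actual prices are close to zero, which is one reason it is often preferred to plain MAPE for electricity prices.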
Uncertainty Intervals for Prediction Errors in Time Series Forecasting
Inference for prediction errors is critical in time series forecasting
pipelines. However, providing statistically meaningful uncertainty intervals
for prediction errors remains relatively under-explored. Practitioners often
resort to forward cross-validation (FCV) for obtaining point estimators and
constructing confidence intervals based on the Central Limit Theorem (CLT). The
naive version assumes independence, a condition that is usually invalid due to
time correlation. These approaches lack statistical interpretations and
theoretical justifications even under stationarity.
This paper systematically investigates uncertainty intervals for prediction
errors in time series forecasting. We first distinguish two key inferential
targets: the stochastic test error over near future data points, and the
expected test error as the expectation of the former. The stochastic test error
is often more relevant in applications needing to quantify uncertainty over
individual time series instances. To construct prediction intervals for the
stochastic test error, we propose the quantile-based forward cross-validation
(QFCV) method. Under an ergodicity assumption, QFCV intervals have
asymptotically valid coverage and are shorter than marginal empirical
quantiles. In addition, we also illustrate why naive CLT-based FCV intervals
fail to provide valid uncertainty intervals, even with certain corrections. For
non-stationary time series, we further provide rolling intervals by combining
QFCV with adaptive conformal prediction to give time-average coverage
guarantees. Overall, we advocate the use of QFCV procedures and demonstrate
their coverage and efficiency through simulations and real data examples.
Comment: 35 pages, 17 figures
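The two baselines named in this abstract, the naive CLT interval and marginal empirical quantiles, can be illustrated on forward cross-validation errors. The sketch below does not implement QFCV itself; the last-value forecaster, the simulated AR(1) series, and all function names are assumptions chosen only to keep the example self-contained.

```python
import random
import statistics


def forward_cv_errors(series, train_size, test_size):
    """Mean squared error of a last-value forecast on successive forward windows."""
    errors = []
    start = 0
    while start + train_size + test_size <= len(series):
        train = series[start:start + train_size]
        test = series[start + train_size:start + train_size + test_size]
        forecast = train[-1]  # placeholder model: repeat the last training value
        errors.append(sum((y - forecast) ** 2 for y in test) / test_size)
        start += test_size  # roll forward by one test block
    return errors


def naive_clt_interval(errors, z=1.96):
    """Mean +/- z * s / sqrt(K); treats the K window errors as independent."""
    mean = statistics.fmean(errors)
    half = z * statistics.stdev(errors) / len(errors) ** 0.5
    return mean - half, mean + half


def empirical_quantile_interval(errors, alpha=0.1):
    """Marginal empirical (alpha/2, 1 - alpha/2) quantiles of the window errors."""
    cuts = statistics.quantiles(errors, n=100, method="inclusive")
    lo_idx = round(100 * alpha / 2) - 1
    hi_idx = round(100 * (1 - alpha / 2)) - 1
    return cuts[lo_idx], cuts[hi_idx]


# Simulated AR(1) series (hypothetical data).
random.seed(0)
series = [0.0]
for _ in range(299):
    series.append(0.8 * series[-1] + random.gauss(0.0, 1.0))

errors = forward_cv_errors(series, train_size=50, test_size=10)
print(len(errors))
print(naive_clt_interval(errors))
print(empirical_quantile_interval(errors))
```

The CLT interval targets the average error and relies on independent window errors, while the quantile interval describes the spread of individual window errors. That difference loosely mirrors the abstract's distinction between the expected and the stochastic test error.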
Domain Adaptation for Time Series Forecasting via Attention Sharing
Recent years have witnessed deep neural networks gaining increasing
popularity in the field of time series forecasting. A primary reason for their
success is their ability to effectively capture complex temporal dynamics
across multiple related time series. However, the advantages of these deep
forecasters only start to emerge in the presence of a sufficient amount of
data. This poses a challenge for typical forecasting problems in practice,
where one either has a small number of time series, or limited observations per
time series, or both. To cope with the issue of data scarcity, we propose a
novel domain adaptation framework, Domain Adaptation Forecaster (DAF), that
leverages the statistical strengths from another relevant domain with abundant
data samples (source) to improve the performance on the domain of interest with
limited data (target). In particular, we propose an attention-based shared
module with a domain discriminator across domains as well as private modules
for individual domains. This allows us to jointly train the source and target
domains by generating domain-invariant latent features while retaining
domain-specific features. Extensive experiments on various domains demonstrate
that our proposed method outperforms state-of-the-art baselines on synthetic
and real-world datasets.
Comment: 19 pages, 9 figures
Hierarchy-guided Model Selection for Time Series Forecasting
Generalizability of time series forecasting models depends on the quality of
model selection. Temporal cross validation (TCV) is a standard technique to
perform model selection in forecasting tasks. TCV sequentially partitions the
training time series into train and validation windows, and performs
hyperparameter optimization (HPO) of the forecast model to select the model with
the best validation performance. Model selection with TCV often leads to poor
test performance when the test data distribution differs from that of the
validation data. We propose a novel model selection method, H-Pro, that exploits
the data hierarchy often associated with a time series dataset. Generally, the
aggregated data at the higher levels of the hierarchy show better
predictability and more consistency compared to the bottom-level data, which is
more sparse and (sometimes) intermittent. H-Pro performs the HPO of the
lowest-level student model based on the test proxy forecasts obtained from a
set of teacher models at higher levels in the hierarchy. The consistency of the
teachers' proxy forecasts helps select better student models at the
lowest level. We perform extensive empirical studies on multiple datasets to
validate the efficacy of the proposed method. H-Pro along with off-the-shelf
forecasting models outperforms existing state-of-the-art forecasting methods,
including the winning models of the M5 point-forecasting competition
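The temporal cross-validation (TCV) baseline described in this abstract can be sketched as follows. This is not H-Pro; it only illustrates the standard procedure of scoring each hyperparameter candidate on sequential train/validation splits and keeping the best one. The moving-average forecaster, the function names, and the default fold settings are assumptions made for illustration.

```python
def moving_average_forecast(train, window, horizon):
    """Forecast every future step with the mean of the last `window` values."""
    level = sum(train[-window:]) / window
    return [level] * horizon


def temporal_cv_score(series, window, n_folds, horizon):
    """Average validation MAE over `n_folds` expanding train/validation splits."""
    scores = []
    for fold in range(n_folds):
        # Each fold trains on everything before `split` and validates on the
        # next `horizon` points, so validation data always follows training data.
        split = len(series) - (n_folds - fold) * horizon
        train, valid = series[:split], series[split:split + horizon]
        if len(train) < window:
            raise ValueError("not enough training data for this window")
        forecast = moving_average_forecast(train, window, horizon)
        scores.append(sum(abs(y - f) for y, f in zip(valid, forecast)) / horizon)
    return sum(scores) / n_folds


def select_window(series, candidates, n_folds=3, horizon=2):
    """Pick the candidate window with the lowest average validation MAE."""
    return min(candidates, key=lambda w: temporal_cv_score(series, w, n_folds, horizon))


# On a steadily increasing series the shortest window tracks the trend best.
trend = list(range(12))
print(select_window(trend, candidates=[1, 3, 5]))
```

The selected candidate is only as good as the validation windows are representative of the test period, which is the weakness the abstract attributes to TCV.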
Classifying distinct data types: textual streams, protein sequences and genomic variants
Artificial Intelligence (AI) is an interdisciplinary field that combines different research areas with the end goal of automating processes in everyday life and industry. The fundamental components of AI models are an “intelligent” model and a functional component defined by the end application. An intelligent model can be, for instance, a statistical model that recognizes patterns in data instances in order to distinguish between these instances.
For example, if AI is applied in car manufacturing, the model can use an image of a car part to categorize whether the part belongs in the front, middle or rear compartment of the car, as a human brain would do. In the same example, the statistical model informs a mechanical arm, the functional component, about the current car compartment, and the arm in turn assembles this compartment based on predefined instructions, much as a human hand would follow neural signals from the human brain. A crucial step of AI applications is the classification of input instances by the intelligent model.
The classification step in the intelligent model pipeline allows the subsequent steps to act in a similar fashion for instances belonging to the same category. We define classification as the module of the intelligent model that categorizes input instances based on patterns that are either predefined by human experts or derived from data. Irrespective of the method used to find patterns in data, classification is composed of four distinct steps: (i) input representation, (ii) model building, (iii) model prediction and (iv) model assessment. Based on these classification steps, we argue that applying classification to distinct data types poses different challenges.
In this thesis, I focus on challenges for three distinct classification scenarios: (i) Textual Streams: how can the model building step, commonly designed for a static data distribution, be advanced to classify textual posts with a transient data distribution? (ii) Protein Prediction: which biologically meaningful information can be used in the input representation step to overcome the challenge of limited training data? (iii) Human Variant Pathogenicity Prediction: how can a classification system for the functional impact of human variants be developed that provides standardized and well-accepted evidence for the classification outcome and thus enables the model assessment step?
To answer these research questions, I present my contributions in classifying these different types of data. temporalMNB: I adapt the sequential prediction with expert advice paradigm to optimally aggregate complementary distributions, enhancing a Naive Bayes model so that it adapts to the drifting distribution of the characteristics of textual posts. dom2vec: our proposal to learn embedding vectors for protein domains using self-supervision. Based on the high performance achieved by the dom2vec embeddings in a quantitative intrinsic assessment of the captured biological information, I provide example evidence for an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Last, I describe GenOtoScope, a bioinformatics software tool that automates standardized evidence-based criteria for the pathogenicity impact of variants associated with hearing loss. Finally, to increase the practical use of this last contribution, I develop easy-to-use software interfaces to be used, in research settings, by clinical diagnostics personnel.
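The "sequential prediction with expert advice" paradigm mentioned in this abstract is commonly illustrated with the exponentially weighted forecaster. The sketch below is a generic textbook version, not the temporalMNB method; the learning rate `eta`, the function name, and the toy losses are assumptions made for illustration.

```python
import math


def exponential_weights(expert_losses, eta=0.5):
    """Run the exponentially weighted forecaster over rounds of expert losses.

    expert_losses[t][i] is the loss in [0, 1] of expert i at round t.
    Returns the normalized weight vector after the final round.
    """
    n_experts = len(expert_losses[0])
    weights = [1.0 / n_experts] * n_experts
    for losses in expert_losses:
        # Penalize each expert multiplicatively by its loss, then renormalize.
        weights = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


# Toy example: expert 0 is always right, expert 1 is always wrong.
rounds = [[0.0, 1.0]] * 10
print(exponential_weights(rounds))
```

Experts that accumulate less loss receive exponentially more weight, so an aggregate built from these weights can follow whichever expert currently fits the data, which is the property that makes the paradigm attractive under drifting distributions.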