4,111 research outputs found
Punctuality Predictions in Public Transportation: Quantifying the Effect of External Factors
Increasing availability of large-scale datasets for automatic vehicle location (AVL) in public transportation (PT) has encouraged researchers to investigate data-driven punctuality prediction models (PPMs). PPMs promise to accelerate the mobility transition through more accurate delay predictions, increased customer service levels, and more efficient and forward-looking planning by mobility providers. While several PPMs show promising results for buses and long-distance trains, a comprehensive study of the effect of external factors on tram services is missing. Therefore, we implement four machine learning (ML) models to predict departure delays and elaborate on the performance increase obtained by adding real-world weather and holiday data for three consecutive years. For our best model (XGBoost), the average MAE performance increased by 17.33 % compared to the average model performance when only trained on AVL data enriched by timetable characteristics. The results provide strong evidence that adding information-bearing features improves the forecast quality of PPMs.
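As a reading aid for the reported figure, the following minimal sketch shows how a mean absolute error (MAE) and a relative improvement between two models could be computed. It assumes that "performance increased" means a relative reduction in MAE against the AVL-plus-timetable baseline; the function names and the example numbers are hypothetical and are not taken from the paper.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted delays."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_mae, enriched_mae):
    """Percentage reduction in MAE relative to the baseline model (assumed definition)."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return 100.0 * (baseline_mae - enriched_mae) / baseline_mae


# Hypothetical delays in seconds, for illustration only.
observed = [30.0, 90.0, 0.0, 120.0]
baseline_pred = [60.0, 60.0, 60.0, 60.0]   # e.g. a model trained on AVL data only
enriched_pred = [45.0, 75.0, 30.0, 90.0]   # e.g. the same model with weather features

baseline_mae = mean_absolute_error(observed, baseline_pred)
enriched_mae = mean_absolute_error(observed, enriched_pred)
improvement = relative_improvement(baseline_mae, enriched_mae)
```

Under this definition, a positive value means the enriched model has the lower error.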
Generating public transport data based on population distributions for RDF benchmarking
When benchmarking RDF data management systems such as public transport route planners, system evaluation needs to happen under various realistic circumstances, which requires a wide range of datasets with different properties. Real-world datasets are almost ideal, as they offer these realistic circumstances, but they are often hard to obtain and inflexible for testing. For these reasons, synthetic dataset generators are typically preferred over real-world datasets due to their intrinsic flexibility. Unfortunately, many synthetic datasets that are generated within benchmarks are insufficiently realistic, raising questions about the generalizability of benchmark results to real-world scenarios. In order to benchmark geospatial and temporal RDF data management systems such as route planners with sufficient external validity and depth, we designed PODiGG, a highly configurable generation algorithm for synthetic public transport datasets with realistic geospatial and temporal characteristics comparable to those of their real-world variants. The algorithm is inspired by real-world public transit network design and scheduling methodologies. This article discusses the design and implementation of PODiGG and validates the properties of its generated datasets. Our findings show that the generator achieves a sufficient level of realism, based on the existing coherence metric and new metrics we introduce specifically for the public transport domain. Thereby, PODiGG provides a flexible foundation for benchmarking RDF data management systems with geospatial and temporal data.
Delay prediction system for large-scale railway networks based on big data analytics
State-of-the-art train delay prediction systems do not exploit historical train movement data collected by railway information systems; instead, they rely on static rules built by experts of the railway infrastructure based on classical univariate statistics. The purpose of this paper is to build a data-driven train delay prediction system for large-scale railway networks which exploits the most recent Big Data technologies and learning algorithms. In particular, we propose a fast learning algorithm for predicting train delays based on the Extreme Learning Machine that fully exploits the recent in-memory large-scale data processing technologies. Our system is able to rapidly extract nontrivial information from the large amount of data available in order to make accurate predictions about different future states of the railway network. Results on real-world data coming from the Italian railway network show that our proposal is able to improve the current state-of-the-art train delay prediction systems.
Statistical and machine learning models for critical infrastructure resilience
This thesis presents a data-driven approach to improving predictions of critical infrastructure behaviors. In our first approach, we explore novel data sources and time series modeling techniques to model disaster impacts on power systems through the case study of Hurricane Sandy as it impacted the state of New York. We find a correlation between Twitter data and load forecast errors, suggesting that Twitter data may provide value towards predicting impacts of disasters on infrastructure systems. Based on these findings, we then develop time series forecasting methods to predict the NYISO power system behaviors at the zonal level, utilizing Twitter and load forecast data as model inputs.
In our second approach, we develop a novel, graph-based formulation of the British rail network to model the nonlinear cascading delays on the rail network. Using this formulation, we then develop machine learning approaches to predict delays in the rail network. Through experiments on real-world rail data, we find that the selected architecture provides more accurate predictions than other models due to its ability to capture both spatial and temporal dimensions of the data.
Semi-decentralized Inference in Heterogeneous Graph Neural Networks for Traffic Demand Forecasting: An Edge-Computing Approach
Prediction of taxi service demand and supply is essential for improving
customer's experience and provider's profit. Recently, graph neural networks
(GNNs) have shown promise for this application. This approach models
city regions as nodes in a transportation graph and their relations as edges.
GNNs utilize local node features and the graph structure in the prediction.
However, more efficient forecasting can still be achieved by following two main
routes: enlarging the scale of the transportation graph, and simultaneously
exploiting different types of nodes and edges in the graphs. However, both
approaches are challenged by the scalability of GNNs. An immediate remedy to
the scalability challenge is to decentralize the GNN operation. However, this
creates excessive node-to-node communication. In this paper, we first
characterize the excessive communication needs for the decentralized GNN
approach. Then, we propose a semi-decentralized approach utilizing multiple
cloudlets, moderately sized storage and computation devices, that can be
integrated with the cellular base stations. This approach minimizes
inter-cloudlet communication thereby alleviating the communication overhead of
the decentralized approach while promoting scalability due to cloudlet-level
decentralization. Also, we propose a heterogeneous GNN-LSTM algorithm for
improved taxi-level demand and supply forecasting for handling dynamic taxi
graphs where nodes are taxis. Extensive experiments over real data show the
advantage of the semi-decentralized approach as tested over our heterogeneous
GNN-LSTM algorithm. Also, the proposed semi-decentralized GNN approach is shown
to reduce the overall inference time by about an order of magnitude compared to
centralized and decentralized inference schemes.
Comment: 13 pages, 10 figures, LaTeX; typos corrected, references added, mathematical analysis added
Statistical Inference for Propagation Processes on Complex Networks
Network-theoretic methods are growing in popularity because they allow complex systems to be represented as networks. These are described solely by a set of nodes connected by edges. Currently available methods are largely limited to the descriptive analysis of network structure. This thesis presents several approaches for inference about processes on complex networks. These processes influence measurable quantities at network nodes and are described by a set of random numbers. All of the methods presented are motivated by practical applications, such as the transmission of food-borne infections, the spread of train delays, or the regulation of genetic effects. First, a general dynamic metapopulation model for the spread of food-borne infections is presented, which combines local infection dynamics with the network-based transport routes of contaminated food. This model enables efficient simulation of various realistic food-borne disease epidemics. Second, an exploratory approach for identifying the origin of spreading processes is developed. Based on a network-based redefinition of geodesic distance, complex spreading patterns can be projected onto a systematic, circular spreading scheme. This holds exactly when the origin node of the network is chosen as the reference point. The method is successfully applied to the 2011 EHEC/HUS epidemic in Germany. The results suggest that the method can usefully complement the laborious standard investigations of food-borne disease epidemics. In addition, this exploratory approach can be applied to identify the origin of delays in transport networks.
The results of extensive simulation studies with a wide variety of transmission mechanisms give hope that the approach is generally applicable to identifying the origin of spreading processes in many domains. Finally, it is shown that kernel-based methods can be an alternative for the statistical analysis of processes on networks. A network-based kernel was developed for the logistic kernel machine test, which allows the seamless integration of biological knowledge into the analysis of data from genome-wide association studies. The method is successfully tested in the analysis of genetic causes of rheumatoid arthritis and lung cancer. In summary, the results of the methods presented show that the network-theoretic analysis of spreading processes can make a substantial contribution to answering a wide range of questions in different applications.
Complex Correlation Approach for High Frequency Financial Data
We propose a novel approach for calculating Hilbert-transform-based
complex correlation for unevenly spaced data. This method is especially
suitable for high-frequency trading data, which are of particular interest in
finance. Its most important feature is the ability to take into account
lead-lag relations on different scales, without knowing them in advance. We
also present results obtained with this approach while working on Tokyo Stock
Exchange intraday quotations. We show that individual sectors and subsectors
tend to form important market components which may follow each other with small
but significant delays. These components may be recognized by analysing
eigenvectors of the complex correlation matrix for Nikkei 225 stocks.
Interestingly, sectorial components are also found in eigenvectors
corresponding to the bulk eigenvalues, traditionally treated as noise.
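To illustrate the general idea of a Hilbert-transform-based complex correlation, the sketch below builds the analytic signal of two series and correlates them; the phase of the result indicates a lead-lag relation. This is a textbook construction for evenly spaced samples only, not the authors' method for unevenly spaced data, and the function names and sign convention are assumptions chosen for this example.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of an evenly sampled real series via a plain DFT."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    # Keep the zero frequency, double positive frequencies, drop negative ones.
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for f in positive:
        weights[f] = 2.0
    return [
        sum(weights[f] * spectrum[f] * cmath.exp(2j * math.pi * f * t / n)
            for f in range(n)) / n
        for t in range(n)
    ]


def complex_correlation(x, y):
    """Normalized correlation of the analytic signals of x and y.

    With this (assumed) convention, a positive phase means y lags x.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    zx = analytic_signal([v - mean_x for v in x])
    zy = analytic_signal([v - mean_y for v in y])
    numerator = sum(a * b.conjugate() for a, b in zip(zx, zy))
    denominator = math.sqrt(sum(abs(a) ** 2 for a in zx) * sum(abs(b) ** 2 for b in zy))
    return numerator / denominator


# Two synthetic sinusoids; the second lags the first by a quarter of pi.
n, cycles, lag = 64, 4, math.pi / 4
x = [math.cos(2 * math.pi * cycles * t / n) for t in range(n)]
y = [math.cos(2 * math.pi * cycles * t / n - lag) for t in range(n)]
c = complex_correlation(x, y)
```

For these synthetic inputs, `abs(c)` is close to 1 and `cmath.phase(c)` is close to the imposed lag. The direct DFT used here costs quadratic time, which is acceptable for a short illustration but not for real tick data.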
Mining topological dependencies of recurrent congestion in road networks
The discovery of spatio-temporal dependencies within urban road networks that cause Recurrent Congestion (RC) patterns is crucial for numerous real-world applications, including urban planning and the scheduling of public transportation services. While most existing studies investigate temporal patterns of RC phenomena, the influence of the road network topology on RC is often overlooked. This article proposes the ST-DISCOVERY algorithm, a novel unsupervised spatio-temporal data mining algorithm that facilitates effective data-driven discovery of RC dependencies induced by the road network topology using real-world traffic data. We factor out regularly reoccurring traffic phenomena, such as rush hours, mainly induced by the daytime, by modelling and systematically exploiting temporal traffic load outliers. We present an algorithm that first constructs connected subgraphs of the road network based on the traffic speed outliers. Second, the algorithm identifies pairs of subgraphs that indicate spatio-temporal correlations in their traffic load behaviour to identify topological dependencies within the road network. Finally, we rank the identified subgraph pairs based on the dependency score determined by our algorithm. Our experimental results demonstrate that ST-DISCOVERY can effectively reveal topological dependencies in urban road networks.
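The first step described in the abstract above, grouping outlier road segments into connected subgraphs, can be illustrated with a plain breadth-first search. This is a minimal sketch under stated assumptions, not the ST-DISCOVERY implementation: the adjacency-dictionary representation, the function name, and the toy network are all hypothetical, and outlier detection itself is taken as given.

```python
from collections import deque


def outlier_components(adjacency, outliers):
    """Group outlier segments into connected components of the road graph.

    adjacency: dict mapping a segment id to the ids of its neighbouring segments.
    outliers: set of segment ids flagged as traffic-speed outliers in one time window.
    """
    seen = set()
    components = []
    for start in sorted(outliers):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            segment = queue.popleft()
            component.add(segment)
            # Only follow edges that stay inside the outlier set.
            for neighbour in adjacency.get(segment, ()):
                if neighbour in outliers and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical four-segment road: a - b - c - d, with c behaving normally.
roads = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
groups = outlier_components(roads, {"a", "b", "d"})
```

Here the segments `a` and `b` form one subgraph and `d` forms another, because the non-outlier segment `c` separates them. The later steps, correlating subgraph pairs and ranking them by a dependency score, depend on definitions that the abstract does not give.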