    Punctuality Predictions in Public Transportation: Quantifying the Effect of External Factors

    Increasing availability of large-scale datasets for automatic vehicle location (AVL) in public transportation (PT) has encouraged researchers to investigate data-driven punctuality prediction models (PPMs). PPMs promise to accelerate the mobility transition through more accurate delay predictions, increased customer service levels, and more efficient, forward-looking planning by mobility providers. While several PPMs show promising results for buses and long-distance trains, a comprehensive study of the effect of external factors on tram services is missing. Therefore, we implement four machine learning (ML) models to predict departure delays and examine the performance gain from adding real-world weather and holiday data for three consecutive years. For our best model (XGBoost), the average MAE improved by 17.33% compared to the same model trained only on AVL data enriched with timetable characteristics. The results provide strong evidence that adding information-bearing features improves the forecast quality of PPMs.
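
    A minimal sketch of the comparison described above, not the authors' pipeline: train an XGBoost regressor on AVL/timetable features alone and again with weather and holiday indicators added, then compare hold-out MAE. All column names and the synthetic data are illustrative assumptions.

        import numpy as np
        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import mean_absolute_error
        from xgboost import XGBRegressor

        rng = np.random.default_rng(0)
        n = 5000
        df = pd.DataFrame({
            "scheduled_headway_s": rng.integers(300, 900, n),   # timetable feature
            "stop_sequence": rng.integers(1, 40, n),             # position along the line
            "hour_of_day": rng.integers(5, 24, n),
            "temperature_c": rng.normal(10, 8, n),               # external: weather
            "precipitation_mm": rng.exponential(0.5, n),         # external: weather
            "is_public_holiday": rng.integers(0, 2, n),          # external: calendar
        })
        # Synthetic target: departure delay in seconds, loosely driven by the features.
        df["delay_s"] = (
            20 + 0.5 * df["precipitation_mm"] * df["stop_sequence"]
            - 5 * df["is_public_holiday"] + rng.normal(0, 15, n)
        )

        base_cols = ["scheduled_headway_s", "stop_sequence", "hour_of_day"]
        ext_cols = base_cols + ["temperature_c", "precipitation_mm", "is_public_holiday"]

        X_tr, X_te, y_tr, y_te = train_test_split(df, df["delay_s"], random_state=0)
        for name, cols in [("AVL+timetable only", base_cols), ("with external factors", ext_cols)]:
            model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
            model.fit(X_tr[cols], y_tr)
            mae = mean_absolute_error(y_te, model.predict(X_te[cols]))
            print(f"{name}: MAE = {mae:.2f} s")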

    Generating public transport data based on population distributions for RDF benchmarking

    When benchmarking RDF data management systems such as public transport route planners, system evaluation needs to happen under various realistic circumstances, which requires a wide range of datasets with different properties. Real-world datasets are almost ideal, as they offer these realistic circumstances, but they are often hard to obtain and inflexible for testing. For these reasons, synthetic dataset generators are typically preferred over real-world datasets due to their intrinsic flexibility. Unfortunately, many synthetic datasets generated within benchmarks are insufficiently realistic, raising questions about the generalizability of benchmark results to real-world scenarios. In order to benchmark geospatial and temporal RDF data management systems such as route planners with sufficient external validity and depth, we designed PODiGG, a highly configurable generation algorithm for synthetic public transport datasets with realistic geospatial and temporal characteristics comparable to those of their real-world variants. The algorithm is inspired by real-world public transit network design and scheduling methodologies. This article discusses the design and implementation of PODiGG and validates the properties of its generated datasets. Our findings show that the generator achieves a sufficient level of realism, based on the existing coherence metric and new metrics we introduce specifically for the public transport domain. Thereby, PODiGG provides a flexible foundation for benchmarking RDF data management systems with geospatial and temporal data.
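
    A rough, illustrative sketch of one idea behind population-based generation (this is not the PODiGG implementation): place synthetic transit stops by sampling a population-density grid, so denser regions receive more stops. The grid values and parameters below are made-up assumptions.

        import numpy as np

        rng = np.random.default_rng(42)
        # Hypothetical 50x50 population-density grid (people per cell).
        density = rng.gamma(shape=2.0, scale=200.0, size=(50, 50))
        prob = density / density.sum()

        n_stops = 120
        flat_idx = rng.choice(density.size, size=n_stops, replace=False, p=prob.ravel())
        rows, cols = np.unravel_index(flat_idx, density.shape)
        # Jitter stops inside their grid cell to obtain continuous coordinates.
        stops = np.column_stack([rows + rng.random(n_stops), cols + rng.random(n_stops)])

        print("first five stop coordinates (grid units):")
        print(np.round(stops[:5], 2))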

    Delay prediction system for large-scale railway networks based on big data analytics

    State-of-the-art train delay prediction systems do not exploit the historical train movement data collected by railway information systems; instead, they rely on static rules built by experts of the railway infrastructure based on classical univariate statistics. The purpose of this paper is to build a data-driven train delay prediction system for large-scale railway networks which exploits the most recent Big Data technologies and learning algorithms. In particular, we propose a fast learning algorithm for predicting train delays based on the Extreme Learning Machine that fully exploits recent in-memory large-scale data processing technologies. Our system is able to rapidly extract nontrivial information from the large amount of data available in order to make accurate predictions about different future states of the railway network. Results on real-world data from the Italian railway network show that our proposal improves on current state-of-the-art train delay prediction systems.
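
    A minimal Extreme Learning Machine (ELM) regressor sketch in NumPy, standing in for the paper's large-scale, in-memory implementation: the hidden layer is random and fixed, and only the output weights are solved in closed form (ridge regression), which is what makes training fast. The data and feature meanings are purely illustrative.

        import numpy as np

        rng = np.random.default_rng(1)

        def elm_fit(X, y, n_hidden=200, ridge=1e-2):
            """Return (W, b, beta): random hidden layer and least-squares output weights."""
            W = rng.normal(size=(X.shape[1], n_hidden))
            b = rng.normal(size=n_hidden)
            H = np.tanh(X @ W + b)                       # hidden-layer activations
            beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
            return W, b, beta

        def elm_predict(X, W, b, beta):
            return np.tanh(X @ W + b) @ beta

        # Toy delay-prediction data: features could stand for, e.g., recent delays
        # at upstream stations (illustrative only).
        X = rng.normal(size=(2000, 8))
        y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=2000)
        W, b, beta = elm_fit(X[:1500], y[:1500])
        pred = elm_predict(X[1500:], W, b, beta)
        print("test MAE:", np.mean(np.abs(pred - y[1500:])))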

    Statistical and machine learning models for critical infrastructure resilience

    This thesis presents a data-driven approach to improving predictions of critical infrastructure behaviors. In our first approach, we explore novel data sources and time series modeling techniques to model disaster impacts on power systems through the case study of Hurricane Sandy as it impacted the state of New York. We find a correlation between Twitter data and load forecast errors, suggesting that Twitter data may provide value towards predicting impacts of disasters on infrastructure systems. Based on these findings, we then develop time series forecasting methods to predict NYISO power system behaviors at the zonal level, utilizing Twitter and load forecast data as model inputs. In our second approach, we develop a novel, graph-based formulation of the British rail network to model nonlinear cascading delays across the network. Using this formulation, we then develop machine learning approaches to predict delays in the rail network. Through experiments on real-world rail data, we find that the selected architecture provides more accurate predictions than other models due to its ability to capture both spatial and temporal dimensions of the data.
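
    A hedged sketch of the first approach's general idea: forecast a zonal load series with an autoregressive model that also takes exogenous inputs (a Twitter activity signal and the operator's own load forecast). The series, lag choice, and feature names are illustrative assumptions, not the thesis's NYISO data or model.

        import numpy as np
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(7)
        T = 500
        load = 100 + 10 * np.sin(np.arange(T) * 2 * np.pi / 24) + rng.normal(0, 2, T)
        tweets = np.clip(rng.poisson(5, T) + (load > 108) * 10, 0, None)  # proxy for disruption chatter
        official_forecast = load + rng.normal(0, 3, T)                    # imperfect baseline forecast

        lags = 3
        rows, targets = [], []
        for t in range(lags, T):
            rows.append(np.r_[load[t - lags:t], tweets[t - 1], official_forecast[t]])
            targets.append(load[t])
        X, y = np.array(rows), np.array(targets)

        split = int(0.8 * len(X))
        model = LinearRegression().fit(X[:split], y[:split])
        pred = model.predict(X[split:])
        print("MAE with exogenous features:", np.mean(np.abs(pred - y[split:])))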

    Semi-decentralized Inference in Heterogeneous Graph Neural Networks for Traffic Demand Forecasting: An Edge-Computing Approach

    Full text link
    Prediction of taxi service demand and supply is essential for improving customers' experience and providers' profit. Recently, graph neural networks (GNNs) have shown promise for this application. This approach models city regions as nodes in a transportation graph and their relations as edges, and GNNs utilize local node features and the graph structure in the prediction. More efficient forecasting can still be achieved by following two main routes: enlarging the scale of the transportation graph, and simultaneously exploiting different types of nodes and edges in the graph. However, both approaches are challenged by the scalability of GNNs. An immediate remedy to the scalability challenge is to decentralize the GNN operation, but this creates excessive node-to-node communication. In this paper, we first characterize the excessive communication needs of the decentralized GNN approach. Then, we propose a semi-decentralized approach utilizing multiple cloudlets (moderately sized storage and computation devices that can be integrated with cellular base stations). This approach minimizes inter-cloudlet communication, thereby alleviating the communication overhead of the decentralized approach while promoting scalability through cloudlet-level decentralization. We also propose a heterogeneous GNN-LSTM algorithm for improved taxi-level demand and supply forecasting that handles dynamic taxi graphs in which nodes are taxis. Extensive experiments over real data show the advantage of the semi-decentralized approach as tested with our heterogeneous GNN-LSTM algorithm. The proposed semi-decentralized GNN approach is also shown to reduce the overall inference time by about an order of magnitude compared to centralized and decentralized inference schemes.
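
    A rough sketch of the communication argument only (not the paper's system or its GNN-LSTM): in a fully decentralized 1-hop GNN layer every graph edge implies a node-to-node message, whereas grouping nodes under cloudlets means only edges that cross cloudlet boundaries require inter-device communication. Node positions, cloudlet locations, and the graph itself are synthetic assumptions.

        import numpy as np
        import networkx as nx

        rng = np.random.default_rng(3)
        n_nodes, n_cloudlets = 300, 6
        pos = rng.random((n_nodes, 2))
        G = nx.random_geometric_graph(n_nodes, radius=0.08,
                                      pos={i: tuple(p) for i, p in enumerate(pos)})

        # Assign each node to its nearest cloudlet (e.g., a base station).
        cloudlets = rng.random((n_cloudlets, 2))
        assignment = np.argmin(((pos[:, None, :] - cloudlets[None, :, :]) ** 2).sum(-1), axis=1)

        total_edges = G.number_of_edges()
        cross_edges = sum(1 for u, v in G.edges if assignment[u] != assignment[v])
        print(f"decentralized messages per layer ~ {2 * total_edges}")   # one per direction
        print(f"inter-cloudlet messages per layer ~ {2 * cross_edges}")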

    Statistical Inference for Propagation Processes on Complex Networks

    Network-theoretic methods are becoming increasingly popular because they allow complex systems to be represented as networks, captured simply as a set of nodes connected by edges. Currently available methods are largely limited to the descriptive analysis of network structure. This thesis presents several approaches for inference on propagation processes in complex networks. Such processes affect measurable quantities at the network nodes and are described by a set of random variables. All presented methods are motivated by practical applications, such as the transmission of foodborne infections, the spread of train delays, and the regulation of genetic effects. First, a general dynamic metapopulation model for the spread of foodborne infections is introduced, which combines local infection dynamics with the network-based transport routes of contaminated food. This model enables efficient simulation of various realistic foodborne disease epidemics. Second, an exploratory approach for determining the origin of propagation processes is developed. Based on a network-based redefinition of geodesic distance, complex propagation patterns can be projected onto a systematic, circular spreading scheme precisely when the origin node is chosen as the reference point. The method is successfully applied to the 2011 EHEC/HUS epidemic in Germany, and the results suggest that it can usefully complement the laborious standard investigations of foodborne disease outbreaks. The same exploratory approach can also be applied to identify the origins of delays in transportation networks. Results from extensive simulation studies with a wide range of transmission mechanisms suggest that the approach is broadly applicable to source detection of propagation processes in many domains. Finally, it is shown that kernel-based methods offer an alternative for the statistical analysis of processes on networks. A network-based kernel for the logistic kernel machine test is developed, allowing the seamless integration of biological knowledge into the analysis of data from genome-wide association studies. The method is successfully tested on the analysis of genetic causes of rheumatoid arthritis and lung cancer. In summary, the results of the presented methods demonstrate that the network-theoretic analysis of propagation processes can make a substantial contribution to answering a wide variety of questions across different applications.
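
    An exploratory sketch of the concentricity idea, simplified from the thesis (it uses plain shortest-path distance rather than the redefined geodesic distance): if arrival times are viewed from the true origin, they should line up with a network-based distance, so each candidate source is scored by the correlation between its distances and the observed arrival times. The graph and the simulated spread are illustrative.

        import numpy as np
        import networkx as nx

        rng = np.random.default_rng(11)
        G = nx.barabasi_albert_graph(80, 2, seed=11)
        true_source = 0

        # Simulate arrival times as noisy shortest-path distances from the true source.
        dist_true = nx.single_source_shortest_path_length(G, true_source)
        arrival = {v: d + rng.normal(0, 0.3) for v, d in dist_true.items()}

        def score(candidate):
            d = nx.single_source_shortest_path_length(G, candidate)
            common = [v for v in arrival if v in d]
            return np.corrcoef([d[v] for v in common], [arrival[v] for v in common])[0, 1]

        best = max(G.nodes, key=score)
        print("true source:", true_source, "| estimated source:", best)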

    Complex Correlation Approach for High Frequency Financial Data

    We propose a novel approach for calculating Hilbert-transform-based complex correlations for unevenly spaced data. This method is especially suitable for high-frequency trading data, which are of particular interest in finance. Its most important feature is the ability to take into account lead-lag relations on different scales without knowing them in advance. We also present results obtained by applying this approach to Tokyo Stock Exchange intraday quotations. We show that individual sectors and subsectors tend to form important market components which may follow each other with small but significant delays. These components may be recognized by analysing the eigenvectors of the complex correlation matrix for Nikkei 225 stocks. Interestingly, sectorial components are also found in eigenvectors corresponding to the bulk eigenvalues, traditionally treated as noise.
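
    A simplified sketch of a Hilbert-transform-based complex correlation for two evenly sampled series (the paper's contribution extends this to unevenly spaced data, which this sketch does not reproduce): the phase of the complex coefficient indicates the lead-lag relation. The series are synthetic.

        import numpy as np
        from scipy.signal import hilbert

        rng = np.random.default_rng(5)
        t = np.arange(2000)
        x = np.sin(2 * np.pi * t / 50) + 0.2 * rng.normal(size=t.size)
        y = np.sin(2 * np.pi * (t - 5) / 50) + 0.2 * rng.normal(size=t.size)  # y lags x by 5 steps

        ax, ay = hilbert(x - x.mean()), hilbert(y - y.mean())   # analytic signals
        rho = np.vdot(ax, ay) / np.sqrt(np.vdot(ax, ax).real * np.vdot(ay, ay).real)
        print(f"|rho| = {abs(rho):.3f}, phase = {np.angle(rho):.3f} rad (lead-lag indicator)")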

    Mining topological dependencies of recurrent congestion in road networks

    The discovery of spatio-temporal dependencies within urban road networks that cause Recurrent Congestion (RC) patterns is crucial for numerous real-world applications, including urban planning and the scheduling of public transportation services. While most existing studies investigate temporal patterns of RC phenomena, the influence of the road network topology on RC is often overlooked. This article proposes the ST-DISCOVERY algorithm, a novel unsupervised spatio-temporal data mining algorithm that facilitates effective data-driven discovery of RC dependencies induced by the road network topology using real-world traffic data. We factor out regularly recurring traffic phenomena mainly induced by the time of day, such as rush hours, by modelling and systematically exploiting temporal traffic load outliers. We present an algorithm that first constructs connected subgraphs of the road network based on the traffic speed outliers. Second, the algorithm identifies pairs of subgraphs that indicate spatio-temporal correlations in their traffic load behaviour to identify topological dependencies within the road network. Finally, we rank the identified subgraph pairs based on the dependency score determined by our algorithm. Our experimental results demonstrate that ST-DISCOVERY can effectively reveal topological dependencies in urban road networks.
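
    A loose sketch of the three steps described above (not the ST-DISCOVERY code): mark road segments whose speed is an outlier against their own typical profile, group outlier segments into connected subgraphs, and rank subgraph pairs by the correlation of their outlier time series. The road graph, speed data, and outlier threshold are synthetic assumptions.

        import itertools
        import numpy as np
        import networkx as nx

        rng = np.random.default_rng(9)
        G = nx.grid_2d_graph(8, 8)                        # toy road network
        T = 200
        speed = {v: 50 + rng.normal(0, 5, T) for v in G}  # km/h time series per segment
        # Inject a correlated slowdown in two separate areas to create a dependency.
        for v in [(0, 0), (0, 1), (7, 7), (7, 6)]:
            speed[v][100:140] -= 35

        # Step 1: per-segment outlier mask (speed drops below half of its own median).
        outlier = {v: s < 0.5 * np.median(s) for v, s in speed.items()}
        hot_nodes = [v for v, m in outlier.items() if m.any()]

        # Step 2: connected subgraphs induced by the outlier segments.
        subgraphs = [list(c) for c in nx.connected_components(G.subgraph(hot_nodes))]

        # Step 3: correlate aggregated outlier signals of each subgraph pair and rank.
        signals = [np.mean([outlier[v] for v in sg], axis=0) for sg in subgraphs]
        pairs = []
        for (i, a), (j, b) in itertools.combinations(enumerate(signals), 2):
            pairs.append((np.corrcoef(a, b)[0, 1], subgraphs[i], subgraphs[j]))
        for score, sg_a, sg_b in sorted(pairs, reverse=True)[:3]:
            print(f"dependency score {score:.2f}: {sg_a} <-> {sg_b}")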