817 research outputs found

    Using graph structural information about flows to enhance short-term demand prediction in bike-sharing systems

    Short-term demand prediction is important for managing transportation infrastructure, particularly in times of disruption or around new developments. Many bike-sharing schemes face the challenges of managing service provision and bike fleet rebalancing due to the “tidal flows” of travel and use. For them, it is crucial to have precise predictions of travel demand at fine spatiotemporal granularity. Despite recent advances in machine learning approaches (e.g. deep neural networks) and in short-term traffic demand prediction, relatively few studies have examined this issue using a feature engineering approach to inform model selection. This research extracts novel time-lagged variables describing graph structures and flow interactions from real-world bike usage datasets, including graph node Out-strength, In-strength, Out-degree, In-degree and PageRank. These are used as inputs to different machine learning algorithms to predict short-term bike demand. The results of the experiments indicate that the graph-based attributes are more important in demand prediction than the more commonly used meteorological information. The results from the different machine learning approaches (XGBoost, MLP, LSTM) improve when time-lagged graph information is included. Deep neural networks were found to handle the sequences of time-lagged graph variables better than the other approaches, resulting in more accurate forecasting. Thus, incorporating graph-based features can improve the understanding and modelling of demand patterns in urban areas, supporting bike-sharing schemes and promoting sustainable transport. The proposed approach can be extended to many existing models using spatial data and can be readily transferred to other applications for predicting dynamics in mass transit systems. A number of limitations and areas of further work are discussed.
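    The graph features named in the abstract (in/out degree, in/out strength, PageRank) can all be derived from a snapshot of station-to-station flow counts. The following is a minimal sketch of that derivation; the station IDs and trip counts are illustrative, not from the paper's dataset, and the paper's own feature pipeline may differ in detail.

```python
# Sketch: deriving graph features (degree, strength, PageRank) from a
# snapshot of station-to-station bike flows. Station IDs and counts are
# illustrative, not taken from the paper's data.

def graph_features(flows, damping=0.85, iters=50):
    """flows: list of (origin, destination, trip_count) tuples."""
    nodes = sorted({s for f in flows for s in f[:2]})
    out_deg = {n: 0 for n in nodes}
    in_deg = {n: 0 for n in nodes}
    out_str = {n: 0.0 for n in nodes}
    in_str = {n: 0.0 for n in nodes}
    for o, d, w in flows:
        out_deg[o] += 1          # number of distinct outgoing links
        in_deg[d] += 1           # number of distinct incoming links
        out_str[o] += w          # total outgoing trips
        in_str[d] += w           # total incoming trips
    # PageRank by power iteration on the weight-normalised flow graph.
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for o, d, w in flows:
            if out_str[o] > 0:
                nxt[d] += damping * pr[o] * w / out_str[o]
        # dangling nodes (no outgoing flow) spread their rank uniformly
        dangling = sum(pr[n] for n in nodes if out_str[n] == 0)
        for n in nodes:
            nxt[n] += damping * dangling / len(nodes)
        pr = nxt
    return out_deg, in_deg, out_str, in_str, pr

flows = [("A", "B", 12), ("A", "C", 5), ("B", "A", 9), ("C", "A", 3)]
out_deg, in_deg, out_str, in_str, pr = graph_features(flows)
```

    Computing these features on each hourly flow snapshot and lagging them in time yields the kind of time-lagged graph variables the study feeds into its models.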

    Data Mining Techniques for Complex User-Generated Data

    Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments to effectively analyze these large data collections and extract hidden and useful knowledge. A vast amount of User-Generated Data (UGD) is created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise in the UGD analysis process, such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterized by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., the health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches in discovering useful knowledge from different domains.

    Predicting the Most Tractable Protein Surfaces in the Human Proteome for Developing New Therapeutics

    A critical step in the target identification phase of drug discovery is evaluating druggability, i.e., whether a protein can be targeted with high affinity using drug-like ligands. The overarching goal of my PhD thesis is to build a machine learning model that predicts the binding affinity that can be attained when addressing a given protein surface. I begin by examining the lead optimization phase of drug development, where I find that in a test set of 297 examples, 41 of these (14%) change binding mode when a ligand is elaborated. My analysis shows that while certain ligand physicochemical properties predispose changes in binding mode, particularly those properties that define fragments, simple structure-based modeling proves far more effective for identifying substitutions that alter the binding mode. My proposed measure of RMAC (RMSD after minimization of the aligned complex) can help determine whether a given ligand can be reliably elaborated without changing binding mode, thus enabling straightforward interpretation of the resulting structure-activity relationships. Next, I noted that a very popular machine learning algorithm for regression tasks, random forest, has a systematic bias in the predictions it generates; this bias is present in both real-world and synthetic datasets. To address this, I define a numerical transformation that can be applied to the output of random forest models. This transformation fully removes the bias in the resulting predictions and yields improved predictions across all datasets. Finally, taking advantage of this improved machine learning approach, I describe a model that predicts the “attainable binding affinity” for a given binding pocket on a protein surface. This model uses 13 physicochemical and structural features calculated from the protein structure, without any information about the ligand.
    While details of the ligand must (of course) contribute somewhat to the binding affinity, I find that this model still recapitulates the binding affinity for 848 different protein-ligand complexes (across 230 different proteins) with a correlation coefficient of 0.57. I further find that this model is not limited to “traditional” drug targets; rather, it works just as well for emerging “non-traditional” drug targets such as inhibitors of protein-protein interactions. Collectively, I anticipate that the tools and insights generated in the course of my PhD research will play an important role in facilitating the key target selection phase of drug discovery projects.
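    The abstract does not spell out the thesis's numerical transformation for removing random forest bias, but the general idea behind such corrections can be illustrated: ensemble averaging tends to compress predictions toward the training mean, and a recalibration fit on held-out predictions can undo that compression. The sketch below uses a simple linear recalibration on toy data; it is an illustration of the concept only, not the thesis's actual transformation.

```python
# Illustrative sketch only: random forest regressors often compress
# predictions toward the training mean. A generic remedy is to fit a
# recalibration  truth ≈ a + b * prediction  on a held-out set and apply
# it to new predictions. (The thesis's actual transformation may differ.)

def fit_recalibration(preds, truths):
    """Least-squares fit of truth = a + b * pred."""
    n = len(preds)
    mp = sum(preds) / n
    mt = sum(truths) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(preds, truths))
    var = sum((p - mp) ** 2 for p in preds)
    b = cov / var
    a = mt - b * mp
    return a, b

# Toy biased predictions: compressed toward the mean (5.0) by a factor 0.5.
truths = [1.0, 3.0, 5.0, 7.0, 9.0]
preds = [5.0 + 0.5 * (t - 5.0) for t in truths]   # [3.0, 4.0, 5.0, 6.0, 7.0]

a, b = fit_recalibration(preds, truths)
corrected = [a + b * p for p in preds]            # bias removed
```

    On this toy example the fitted slope doubles the spread of the predictions back to that of the targets, which is exactly the compression-toward-the-mean pattern the correction targets.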

    Systems Engineering: Availability and Reliability

    Current trends in Industry 4.0 are largely related to issues of reliability and availability. As a result of these trends and the complexity of engineering systems, research and development in this area needs to focus on new solutions for the integration of intelligent machines or systems, with an emphasis on changes in production processes aimed at increasing production efficiency or equipment reliability. The emergence of innovative technologies and new business models based on innovation, cooperation networks, and the enhancement of endogenous resources is expected to contribute strongly to the development of competitive economies around the world. Innovation and engineering, focused on sustainability, reliability, and availability of resources, have a key role in this context. The scope of this Special Issue is closely associated with that of the ICIE’2020 conference. The aim of this conference and the journal’s Special Issue is to present current innovations and engineering achievements of leading scientists and industrial practitioners in thematic areas related to reliability and risk assessment, innovations in maintenance strategies, production process scheduling, management and maintenance, systems analysis, simulation, design and modelling.
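    For readers unfamiliar with the availability/reliability terminology, the standard steady-state relation for a repairable system is A = MTBF / (MTBF + MTTR), i.e. the long-run fraction of time the system is operational. This is textbook reliability engineering, not a result specific to this Special Issue, and the example figures below are made up for illustration.

```python
# Standard steady-state availability of a repairable system (textbook
# relation, not specific to this Special Issue):
#   A = MTBF / (MTBF + MTTR)
# MTBF: mean time between failures; MTTR: mean time to repair.

def availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time the system is operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative figures: a machine failing on average every 490 h of
# operation and taking 10 h to repair is available 98% of the time.
a = availability(490.0, 10.0)
```

    The relation makes the two maintenance levers explicit: availability improves either by making failures rarer (raising MTBF) or by repairing faster (lowering MTTR).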

    Statistical Learning for Structured Models: Tree Based Methods and Neural Networks

    In this thesis, estimation in regression and classification problems which include low dimensional structures is considered. The underlying question is the following: how well do statistical learning methods perform for models with low dimensional structures? We approach this question using various algorithms in various settings. For our first main contribution, we prove optimal convergence rates in a classification setting using neural networks. While non-optimal rates existed for this problem, we are the first to prove optimal ones. Secondly, we introduce a new tree-based algorithm we named the random planted forest. It adapts particularly well to models which consist of low dimensional structures. We examine its performance in simulation studies and include some theoretical backing by proving optimal convergence rates in certain settings for a modification of the algorithm. Additionally, a generalized version of the algorithm is included, which can be used in classification settings. In a further contribution, we prove optimal convergence rates for the local linear smooth backfitting algorithm. While such rates have already been established, we bring a new, simpler perspective to the problem which leads to better understanding and easier interpretation. Additionally, given an estimator in a regression setting, we propose a constraint which leads to a unique decomposition. This decomposition is useful for visualising and interpreting the estimator, in particular if it consists of low dimensional structures.

    Big Data and Climate Change

    Climate science, as a data-intensive subject, has been profoundly affected by the era of big data and the associated technological revolutions. The great successes of big data analytics in diverse areas over the past decade have also raised expectations about the efficacy of big data for the big problem of climate change. As an emerging topic, climate change has been at the forefront of big climate data analytics implementations, and extensive research has been carried out covering a variety of topics. This paper aims to present an outlook on big data in climate change studies over recent years by investigating and summarising the current status of big data applications in climate change related studies. It is also expected to serve as a one-stop reference directory for researchers and stakeholders, offering an overview of this trending subject at a glance, which can be useful in guiding future research and improvements in the exploitation of big climate data.

    Comprehensive Safety Analysis of Vulnerable Road User Involved Motor Vehicle Crashes

    This dissertation explores, identifies, and evaluates a multitude of factors significantly affecting motor vehicle crashes involving pedestrians and bicyclists, commonly defined as vulnerable road users (VRUs). The methodologies are guided by the concept of safe behavior of the different parties primarily responsible for a crash, whether a pedestrian, a bicyclist or a driver, in relation to roadway design, traffic conditions, and land use and built environment variables; the findings are beneficial for recommending targeted and effective safety interventions. The topic is motivated by the fact that human factors contribute to over ninety percent of crashes, especially those involving VRUs. Studying the effect of road users’ behavior, their responses to the dynamics of the traveling environment, and their rate of compliance with traffic rules is instrumental to precisely measuring and evaluating how each of the investigated variables changes the crash risk. To achieve this goal, an extensive database is established based on data collected from sources such as the linework from Topologically Integrated Geographic Encoding and Referencing (TIGER), Google Maps, motor vehicle accident reports, the Wisconsin Information System for Local Roads, and the Smart Location Database from the Environmental Protection Agency. The crosscutting datasets represent various aspects of motorist and non-motorist travel decisions and behaviors, as well as their safety status. With this comprehensive database, intrinsic relationships between pedestrian-vehicle crashes and a broad range of socioeconomic and demographic factors, land use and built environment, crime rate and traffic violations, road design, traffic control, and pedestrian-oriented design features are identified, analyzed, and evaluated. The comprehensive safety analysis begins with a structural equation model (SEM) that is employed to discover a possible underlying factor structure connecting exogenous variables and crashes involving pedestrians.
    Informed by the SEM output, the analysis continues with the development of crash count models and responsible party choice models to address, respectively, the factors relating to the roles of pedestrians and drivers in a crash. As a result, factors contributing to crashes where a pedestrian is responsible, a driver is responsible, or both parties are responsible can be specified, categorized, and quantified. Moreover, targeted and appropriate safety countermeasures can be designed, recommended, and prioritized by engineers, planners, or enforcement agencies to jointly create a pedestrian-friendly environment. The second aspect of the analysis is to specify the crash party at fault, which provides evidence about whether pedestrians, bicyclists or drivers are more likely to be involved in severe crashes, and to identify the contributing factors that affect the fault of a specific road user group. An extensive investigation of the available information regarding the crash (i.e., issued citations, actions/circumstances that may have played a role in the crash occurrence, and the crash scenario completed by the police officer) is conducted. The goal is to recognize and measure the factors affecting a specific party at fault. This provides information that is vital for proactive crisis management: to decrease and to prevent future crashes. As a part of the result, a guideline is proposed to assign the party at fault using crash data fields and narratives. Statistical methods such as the extreme gradient boosting (XGBoost) decision tree and the multinomial logit (MNL) model are used. Informative conclusions have been reached, and suggestions are made for law enforcement, education, and roadway management to enhance safety countermeasures. The third aspect is to evaluate the enhancements of the crash report form and their effectiveness in reporting VRU-involved motor vehicle crashes.
    One of the State of Wisconsin's projects aimed at improving crash reporting was to redesign the old MV4000 crash report form into the new DT4000 crash report form. The modification was applied statewide from January 1, 2017. The reason behind this switch was to resolve several issues with the old MV4000 crash report form, including insufficient reporting in roadway-related data fields; a lack of data fields describing driver distraction, intersection type, and the exact type of traffic barrier; insufficient information regarding safety equipment usage by motorists and non-motorists; unclear information about the crash location; and inadequate evidence concerning non-motorists' actions, circumstances and condition prior to the crash. Hence, the new DT4000 crash form modified some existing data fields and incorporated new crash elements and more detailed attributes. The modified and new data fields and their associated attribute values have been thoroughly studied, and the effectiveness of the improved data collection, in terms of a better understanding of factors associated with and contributing to VRU crashes, has been comprehensively evaluated. The evaluation has confirmed that the DT4000 crash form provides more specific, detailed, and useful information about the crash circumstances.
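    The multinomial logit model mentioned above assigns each possible at-fault outcome k a probability P(k | x) = exp(x·β_k) / Σ_j exp(x·β_j) given crash features x. The sketch below evaluates those probabilities for a hypothetical crash; the features, outcome labels and coefficients are invented for illustration, not estimated from the dissertation's database.

```python
# Sketch of the multinomial logit (MNL) form for party-at-fault choice:
#   P(k | x) = exp(x . beta_k) / sum_j exp(x . beta_j)
# The feature values and coefficients below are purely illustrative,
# not estimated from the crash database described in the dissertation.
import math

def mnl_probs(x, betas):
    """x: feature vector; betas: dict outcome -> coefficient vector."""
    scores = {k: sum(xi * bi for xi, bi in zip(x, b))
              for k, b in betas.items()}
    mx = max(scores.values())              # stabilise the exponentials
    exps = {k: math.exp(s - mx) for k, s in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Hypothetical crash: two features (intercept term plus one covariate),
# three possible at-fault outcomes.
x = [1.0, 0.5]
betas = {"driver": [0.2, 0.4],
         "pedestrian": [0.0, 0.0],        # reference outcome
         "both": [-0.3, 0.1]}
p = mnl_probs(x, betas)
```

    Fitting the β vectors by maximum likelihood over the crash records, and comparing variable importance against the XGBoost classifier, is the kind of analysis the dissertation describes.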

    Urban Informatics

    This open access book is the first to systematically introduce the principles of urban informatics and its application to every aspect of the city that involves its functioning, control, management, and future planning. It introduces new models and tools being developed to understand and implement these technologies that enable cities to function more efficiently – to become ‘smart’ and ‘sustainable’. The smart city has quickly emerged as computers have become ever smaller to the point where they can be embedded into the very fabric of the city, as well as being central to new ways in which the population can communicate and act. When cities are wired in this way, they have the potential to become sentient and responsive, generating massive streams of ‘big’ data in real time as well as providing immense opportunities for extracting new forms of urban data through crowdsourcing. This book offers a comprehensive review of the methods that form the core of urban informatics, from various kinds of urban remote sensing to new approaches to machine learning and statistical modelling. It provides a detailed technical introduction to the wide array of tools information scientists need to develop the key urban analytics that are fundamental to learning about the smart city, and it outlines ways in which these tools can be used to inform design and policy so that cities can become more efficient, with a greater concern for environment and equity.
