
    An investigation into machine learning approaches for forecasting spatio-temporal demand in ride-hailing service

    In this paper, we present machine learning approaches for characterizing and forecasting the short-term demand for on-demand ride-hailing services. We propose a spatio-temporal estimation of the demand as a function of variable effects related to traffic, pricing and weather conditions. With respect to the methodology, a single decision tree, bootstrap-aggregated (bagged) decision trees, random forest, boosted decision trees, and an artificial neural network for regression have been adapted and systematically compared using various statistics, e.g. R-square, Root Mean Square Error (RMSE), and slope. To better assess the quality of the models, they have been tested on a real case study using the data of DiDi Chuxing, the main on-demand ride-hailing service provider in China. In the current study, 199,584 time-slots describing the spatio-temporal ride-hailing demand have been extracted with an aggregated time interval of 10 mins. All the methods are trained and validated on the basis of two independent samples from this dataset. The results revealed that boosted decision trees provide the best prediction accuracy (RMSE=16.41), while avoiding the risk of over-fitting, followed by artificial neural network (20.09), random forest (23.50), bagged decision trees (24.29) and single decision tree (33.55). Comment: Currently under review for journal publication.
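    The three comparison statistics named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and "slope" is assumed here to mean the least-squares slope of predicted values regressed on observed values, which the abstract does not define.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def slope(observed, predicted):
    """Least-squares slope of predicted regressed on observed.

    Assumed definition: the abstract names "slope" without defining it.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var = sum((o - mean_obs) ** 2 for o in observed)
    return cov / var
```

    Under these definitions a perfect model gives an RMSE of 0 and both R-square and slope equal to 1, so lower RMSE and values of the other two closer to 1 indicate a better fit.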

    Grand Challenge: Real-time Destination and ETA Prediction for Maritime Traffic

    In this paper, we present our approach for solving the DEBS Grand Challenge 2018. The challenge asks participants to predict (i) the destination and (ii) the arrival time of ships in a streaming fashion using geo-spatial data in the maritime context. Novel aspects of our approach include the use of ensemble learning based on Random Forest, Gradient Boosting Decision Trees (GBDT), XGBoost Trees and Extremely Randomized Trees (ERT) to predict the destination, while for the arrival time we propose the use of Feed-forward Neural Networks. In our evaluation, we were able to achieve an accuracy of 97% for the port destination classification problem and 90% (in mins) for the ETA prediction.

    High-Resolution Road Vehicle Collision Prediction for the City of Montreal

    Road accidents are an important issue in modern societies, responsible for millions of deaths and injuries every year worldwide. In Quebec alone, road accidents were responsible for 359 deaths and 33 thousand injuries in 2018. In this paper, we show how one can leverage open datasets of a city like Montreal, Canada, to create high-resolution accident prediction models, using big data analytics. Compared to other studies in road accident prediction, we have a much higher prediction resolution, i.e., our models predict the occurrence of an accident within an hour, on road segments defined by intersections. Such models could be used in the context of road accident prevention, but also to identify key factors that can lead to a road accident and, consequently, help elaborate new policies. We tested various machine learning methods to deal with the severe class imbalance inherent to accident prediction problems. In particular, we implemented the Balanced Random Forest algorithm, a variant of the Random Forest machine learning algorithm, in Apache Spark. Interestingly, we found that in our case, Balanced Random Forest does not perform significantly better than Random Forest. Experimental results show that 85% of road vehicle collisions are detected by our model with a false positive rate of 13%. The examples identified as positive are likely to correspond to high-risk situations. In addition, we identify the most important predictors of vehicle collisions for the area of Montreal: the count of accidents on the same road segment during previous years, the temperature, the day of the year, the hour and the visibility.
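    The two figures reported in this abstract, the share of collisions detected and the false positive rate, are both read off a binary confusion matrix. The sketch below shows one way to compute them under the usual definitions; the function name and the 0/1 label encoding are assumptions for illustration and do not come from the paper.

```python
def detection_and_false_positive_rates(actual, predicted):
    """Return (detection rate, false positive rate) for binary labels.

    Labels are assumed to be 1 for "collision" and 0 for "no collision".
    Detection rate = TP / (TP + FN); false positive rate = FP / (FP + TN).
    """
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1  # collision correctly flagged
        elif a and not p:
            fn += 1  # collision missed
        elif p:
            fp += 1  # false alarm on a non-collision example
        else:
            tn += 1  # non-collision correctly ignored
    return tp / (tp + fn), fp / (fp + tn)
```

    Reporting both rates matters with severe class imbalance: plain accuracy would be dominated by the very large number of non-collision examples.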

    Proactive Assessment of Accident Risk to Improve Safety on a System of Freeways, Research Report 11-15

    This report describes the development and evaluation of real-time crash risk-assessment models for four freeway corridors: U.S. Route 101 NB (northbound) and SB (southbound) and Interstate 880 NB and SB. Crash data for these freeway segments for the 16-month period from January 2010 through April 2011 are used to link historical crash occurrences with real-time traffic patterns observed through loop-detector data. The crash risk-assessment models are based on a binary classification approach (crash and non-crash outcomes), with traffic parameters measured at surrounding vehicle detection station (VDS) locations as the independent variables. The analysis techniques used in this study are logistic regression and classification trees. Prior to developing the models, some data-related issues such as data cleaning and aggregation were addressed. The modeling efforts revealed that the turbulence resulting from speed variation is significantly associated with crash risk on the U.S. 101 NB corridor. The models estimated with data from U.S. 101 NB were evaluated on the basis of their classification performance, not only on U.S. 101 NB, but also on the other three freeway segments for transferability assessment. It was found that the predictive model derived from one freeway can be readily applied to other freeways, although the classification performance decreases. The models that transfer best to other roadways were determined to be those that use the fewest VDSs, that is, those that use one upstream or downstream station rather than two or three. The classification accuracy of the models is discussed in terms of how the models can be used for real-time crash risk assessment. The models can be applied to developing and testing variable speed limits (VSLs) and ramp-metering strategies that proactively attempt to reduce crash risk.

    Data analytics 2016: proceedings of the fifth international conference on data analytics


    Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers

    Machine Learning (ML) algorithms are used to train computers to perform a variety of complex tasks and improve with experience. Computers learn how to recognize patterns, make unintended decisions, or react to a dynamic environment. Certain trained machines may be more effective than others because they are based on more suitable ML algorithms or because they were trained through superior training sets. Although ML algorithms are known and publicly released, training sets may not be reasonably ascertainable and, indeed, may be guarded as trade secrets. While much research has been performed about the privacy of the elements of training sets, in this paper we focus our attention on ML classifiers and on the statistical information that can be unconsciously or maliciously revealed from them. We show that it is possible to infer unexpected but useful information from ML classifiers. In particular, we build a novel meta-classifier and train it to hack other classifiers, obtaining meaningful information about their training sets. This kind of information leakage can be exploited, for example, by a vendor to build more effective classifiers or to simply acquire trade secrets from a competitor's apparatus, potentially violating its intellectual property rights

    Short-Term Travel Time Prediction on Freeways

    Short-term travel time prediction supports the implementation of proactive traffic management and control strategies to alleviate, if not prevent, congestion and to enable rational route choices and traffic mode selections that enhance travel mobility and safety. Over the last decade, Bluetooth technology has been increasingly used in collecting travel time data due to the technology’s advantages over conventional detection techniques in terms of direct travel time measurement, anonymous detection, and cost-effectiveness. However, similar to many other Automatic Vehicle Identification (AVI) technologies, Bluetooth technology has some limitations in measuring travel time information: 1) Bluetooth technology cannot associate travel time measurements with different traffic streams or facilities, so facility-specific travel time information is not directly available from Bluetooth measurements; 2) Bluetooth travel time measurements are influenced by measurement lag, because the travel time associated with vehicles that have not reached the downstream Bluetooth detector location cannot be taken at the instant of analysis. Freeway sections may include multiple distinct traffic streams (i.e., facilities) moving in the same direction of travel under a number of scenarios, including: (1) a freeway section that contains both a High Occupancy Vehicle (HOV) or High Occupancy Toll (HOT) lane and several general purpose lanes (GPL); (2) a freeway section with a nearby parallel service roadway; (3) a freeway section in which there exist physically separated lanes (e.g. express versus collector lanes); or (4) a freeway section in which a fraction of the lanes are used by vehicles to access an off ramp. In this research, two different methods were proposed for estimating facility-specific travel times from Bluetooth measurements.
Method 1 applies the Anderson-Darling test to match the distribution of real-time Bluetooth travel time measurements with reference measurements. Method 2 first clusters the travel time measurements using the K-means algorithm, and then associates the clusters with facilities using a traffic flow model. The performance of these two proposed methods has been evaluated against a Benchmark method using simulation data. A sensitivity analysis was also performed to understand the impacts of traffic conditions on the performance of the different models. Based on the results, Method 2 is recommended when physical barriers or law enforcement prevent drivers from freely switching between the underlying facilities; however, when the roadway functions as a self-correcting system that allows vehicles to freely switch between the underlying facilities, the Benchmark method, which assumes that one facility always operates faster than the other, is recommended for application. The Bluetooth travel time measurement lag leads to delayed detection of traffic condition variations and travel time changes, especially during congestion and transition periods or when consecutive Bluetooth detectors are placed far apart. In order to alleviate the travel time measurement lag, this research proposed to use non-lagged Bluetooth measurements (e.g., the number of repetitive detections for each vehicle and the time a vehicle spent in the detection zone) for inferring traffic stream states in the vicinity of the Bluetooth detectors. Two model structures, an analytical model and a statistical model, have been proposed to estimate the traffic conditions based on non-lagged Bluetooth measurements. The results showed that the proposed RUSBoost classification tree achieved over 94% overall accuracy in predicting traffic conditions as congested or uncongested.
When modeling traffic conditions as three traffic states (i.e., the free-flow state, the transition state, and the congested state) using the RUSBoost classification tree, the overall accuracy was 67.2%; however, the accuracy in predicting the congested traffic state improved from 84.7% in the two-state model to 87.7%. Because traffic state information enables the travel time prediction model to detect changes in traffic conditions in a more timely manner, both the two-state model and the three-state model have been evaluated in developing travel time prediction models in this research. The Random Forest model was the main algorithm adopted in training travel time prediction models using both travel time measurements and inferred traffic states. Using historical Bluetooth data as inputs, the model results showed that the inclusion of traffic state information consistently led to better travel time prediction results in terms of lower root mean square errors (improved by over 11%), lower 90th percentile absolute relative error (ARE) (improved by over 12%), and lower standard deviations of ARE (improved by over 15%) compared to other model structures without traffic states as inputs. In addition, the impact of traffic state inclusion on travel time prediction accuracy as a function of Bluetooth detector spacing was also examined using simulation data. The results showed that a segment length of 4 to 8 km is optimal in terms of the improvement from using traffic state information in travel time prediction models.
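The clustering step of Method 2 in this abstract can be illustrated with a small one-dimensional K-means. This is a sketch under stated assumptions, not the thesis implementation: the function name, the evenly spaced initial centroids, and the use of plain Lloyd iterations are all illustrative choices, and the subsequent step that associates clusters with facilities through a traffic flow model is not shown.

```python
def kmeans_1d(values, k=2, iterations=50):
    """Cluster scalar travel times with Lloyd's algorithm.

    Centroids start evenly spaced between the minimum and maximum value
    (an assumed initialisation; requires k >= 2). Returns the final
    centroids and one cluster label per input value.
    """
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members,
        # keeping the old position if a cluster is empty.
        updated = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged
        centroids = updated
    labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
    return centroids, labels
```

With two clearly separated groups of travel times, for example values near 60 s and values near 120 s on the same section, the two centroids settle on the group means, and the labels split the measurements into a faster and a slower stream.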