Outlier Detection and Missing Value Estimation in Time Series Traffic Count Data: Final Report of SERC Project GR/G23180.

Abstract

A serious problem in analysing traffic count data is what to do when missing or extreme values occur, perhaps as a result of a breakdown in automatic counting equipment. The objectives of this work were to address this problem by: 1) establishing the applicability of time series and influence function techniques for estimating missing values and detecting outliers in time series traffic data; and 2) making a comparative assessment of the new techniques against those used in practice by traffic engineers for local, regional or national traffic count systems. Two alternative approaches were identified as potentially useful, and these were evaluated and compared with methods currently employed for 'cleaning' traffic count series. The two approaches were based on evaluating the effect of individual observations, or groups of observations, on the estimate of the autocorrelation structure, and on events influencing a parametric (ARIMA) model. They were compared with existing methods, which included visual inspection and smoothing techniques such as the exponentially weighted moving average, in which means and variances are updated using observations from the same time and day of week. The results showed advantages and disadvantages for each method. The exponentially weighted moving average method tended to detect unreasonable outliers and also suggested replacements that were consistently larger than could reasonably be expected. Methods based on the autocorrelation structure were reasonably successful in detecting events, but the replacement values were suspect, particularly when groups of values needed replacement. These methods also had problems in the presence of non-stationarity, often detecting outliers that were really a result of the changing level of the data rather than extreme values.
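The exponentially weighted moving average screening mentioned above, in which a mean and variance are kept for each time and day of week, might look roughly like the following. This is a minimal sketch under stated assumptions, not the procedure evaluated in the report: the function name `ewma_screen`, the smoothing constant `alpha`, the `threshold` multiplier, and the `warmup` count are all hypothetical choices made for illustration.

```python
def ewma_screen(counts, slots_per_week=7, alpha=0.2, threshold=3.0, warmup=3):
    """Screen a count series with one EWMA mean/variance per weekly slot.

    counts: counts in time order, with None marking a missing value.
    slots_per_week: observations per week (7 for daily data, 168 for hourly).
    alpha, threshold, warmup: illustrative tuning values (assumptions).
    Returns (outlier_flags, cleaned_series).
    """
    mean, var, seen = {}, {}, {}
    flags, cleaned = [], []
    for t, y in enumerate(counts):
        slot = t % slots_per_week          # same time and day of week
        n = seen.get(slot, 0)

        if y is None:
            # Missing value: substitute the slot mean (None if no history yet).
            flags.append(False)
            cleaned.append(mean.get(slot))
            continue

        if n >= warmup:
            dev = y - mean[slot]
            # Flag values more than `threshold` standard deviations away.
            # Note: a slot with zero variance flags any deviation at all.
            if dev * dev > threshold ** 2 * var[slot]:
                flags.append(True)
                cleaned.append(mean[slot])
                continue                   # suspect value does not update the statistics

        if n == 0:
            mean[slot], var[slot] = float(y), 0.0
        else:
            dev = y - mean[slot]
            # Incremental exponentially weighted mean and variance update.
            mean[slot] += alpha * dev
            var[slot] = (1 - alpha) * (var[slot] + alpha * dev * dev)
        seen[slot] = n + 1
        flags.append(False)
        cleaned.append(y)
    return flags, cleaned
```

Because each slot is compared only with its own history, a scheme of this kind adapts slowly to a change in level, which is one plausible reason such a method could flag or replace values poorly on non-stationary series.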
In the presence of other events, such as a change in level or seasonality, both the influence function and the change in autocorrelation present problems of interpretation, since there is no way of distinguishing these events from outliers. It is clear that the outlier problem cannot be separated from that of identifying structural changes, as many of the statistics used to identify outliers also respond to structural changes. The ARIMA (1,0,0)(0,1,1)7 model was found to describe the vast majority of traffic count series, which means that the problem of identifying a starting model can largely be avoided with a high degree of assurance. A black-box approach to data validation is clearly prone to error, but methods such as those described above lend themselves to an interactive graphical data-validation technique in which outliers and other events are highlighted and must then be accepted or rejected manually. An adaptive approach to fitting the model may yield a more automatic procedure, and would allow changes in the underlying model to be accommodated. In conclusion, methods based on the autocorrelation structure were found to be the most computationally efficient, but they lead to problems of interpretation, both between different types of event and in the presence of non-stationarity. Using the residuals from a fitted ARIMA model was the most successful method for finding outliers and distinguishing them from other events, and was less expensive than case deletion. The replacement values derived from the ARIMA model were found to be the most accurate.
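For reference, the ARIMA (1,0,0)(0,1,1)7 model named above can be written out as follows. The symbols are assumptions introduced here, not notation taken from the report: B is the backshift operator, phi the non-seasonal autoregressive parameter, Theta the seasonal moving average parameter, and epsilon_t a white noise term. The usual Box-Jenkins sign convention is assumed and any constant term is omitted.

```latex
% Operator form: AR(1), one seasonal difference at lag 7, seasonal MA(1)
(1 - \phi B)\,(1 - B^{7})\, y_t = (1 - \Theta B^{7})\, \varepsilon_t

% Expanded form
y_t = y_{t-7} + \phi\,(y_{t-1} - y_{t-8}) + \varepsilon_t - \Theta\, \varepsilon_{t-7}

% One-step-ahead forecast and residual, with e_t the estimated innovations
\hat{y}_t = y_{t-7} + \phi\,(y_{t-1} - y_{t-8}) - \Theta\, e_{t-7},
\qquad e_t = y_t - \hat{y}_t
```

Under these assumptions, a residual-based check would examine e_t for unusually large values, and the forecast could serve as a candidate replacement for a missing or rejected count.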

This paper was published in White Rose Research Online.
