A Probabilistic Geocoding System based on a National Address File

Abstract

It is estimated that between 80% and 90% of governmental and business data collections contain address information. Geocoding, the process of assigning geographic coordinates to addresses, is becoming increasingly important in many application areas that involve the analysis and mining of such data. In many cases, address records are captured and/or stored in a free-form or inconsistent manner, which complicates the task of robustly matching such addresses to spatially annotated reference data. In this paper we describe a geocoding system that is based on a comprehensive, high-quality geocoded national address database. It uses a learning address parser based on hidden Markov models to separate free-form addresses into components, and a rule-based matching engine to determine the best set of candidate matches to a reference file. The geocoding software modules are implemented (as part of the Febrl open source data linkage system) in the object-oriented language Python, which allows rapid prototype development and testing.
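To make the idea of a hidden Markov model address parser concrete, the following is a minimal illustrative sketch, not the Febrl implementation. It assumes five hypothetical address components as hidden states, four coarse token classes as observations, and hand-picked probabilities that are invented for this example (a learning parser would estimate them from annotated training addresses). Viterbi decoding then assigns the most likely component to each token.

```python
import math

# Hidden states: hypothetical address components chosen for illustration.
STATES = ["number", "street_name", "street_type", "locality", "postcode"]

# All probabilities below are invented for this sketch; they are not
# taken from the paper or from Febrl.
START = {"number": 0.8, "street_name": 0.2}
TRANS = {
    "number": {"street_name": 1.0},
    "street_name": {"street_name": 0.2, "street_type": 0.8},
    "street_type": {"locality": 1.0},
    "locality": {"locality": 0.3, "postcode": 0.7},
    "postcode": {},
}
EMIT = {
    "number": {"NUM": 1.0},
    "street_name": {"WORD": 1.0},
    "street_type": {"TYPE": 0.9, "WORD": 0.1},
    "locality": {"WORD": 1.0},
    "postcode": {"NUM4": 1.0},
}

# Tiny hypothetical lookup table of street-type words.
STREET_TYPES = {"street", "st", "road", "rd", "avenue", "ave"}


def observe(token):
    """Map a raw token to a coarse observation class."""
    t = token.lower()
    if t.isdigit():
        return "NUM4" if len(t) == 4 else "NUM"
    return "TYPE" if t in STREET_TYPES else "WORD"


def _log(p):
    """Log probability, with zero mapped to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations):
    """Return the most likely hidden-state sequence for the observations."""
    if not observations:
        return []
    # best[i][s] = (log prob of the best path ending in s at step i, previous state)
    best = [{
        s: (_log(START.get(s, 0)) + _log(EMIT[s].get(observations[0], 0)), None)
        for s in STATES
    }]
    for obs in observations[1:]:
        column = {}
        for s in STATES:
            emit = _log(EMIT[s].get(obs, 0))
            column[s] = max(
                (best[-1][p][0] + _log(TRANS[p].get(s, 0)) + emit, p)
                for p in STATES
            )
        best.append(column)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


def parse_address(text):
    """Split a free-form address into (token, component) pairs."""
    tokens = text.replace(",", " ").split()
    return list(zip(tokens, viterbi([observe(t) for t in tokens])))
```

For instance, parse_address("42 Main Street, Sydney 2000") labels the five tokens as number, street_name, street_type, locality and postcode under these toy parameters. Inputs that the toy model cannot generate (for example, a four-digit house number) receive no meaningful labelling, which is one reason a practical parser learns its probabilities from data and applies smoothing.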
