Address Data Quality and Geocoding Standards




Keywords: Address, Data Quality, Address Standards, Geocode, GIS, Geographic, Warehouse




Based on a public US Gov't document. Authors: R. Orli, L. Blake, F. Santos, A. Ippilito, this version © 1996 richard j. orli

Summary

Data quality is a fundamental data management objective, and location information is essential to many key applications. Location is also key to many statistical and management interests, and is indispensable for geographic information systems. This document helps lay the groundwork necessary to achieve address standardization and improve related data quality. Topics include address format standards, editing standards, and the application of geocoding.

The problem faced by anyone comparing address information across systems is that addresses are inconsistent and often of poor quality. The inconsistency also means that all of the information cannot be fully compared, and in some cases cannot be trusted to be accurate. In addition, the enterprise pays the cost of redundant efforts. In many organizations, the same address is geocoded multiple times, and multiple independent efforts must be maintained to manage data extracts and transforms.

This document, derived from one prepared originally for the U.S. Department of Housing and Urban Development, covers the following issues:

Section 1 introduces the paper's purpose and scope.

Section 2 highlights address quality problems and issues.

Section 3 describes standard address formats that are recommended for every source system (original source of data). This standard includes optional geocode data, as well as recommended edits. This section targets correcting systems to improve future data collection.

The address format standard includes two options: in the simplest possible format for normal use, and parsed for special purposes, such as to accommodate data received in parsed format. The geocode standard format will meet most general purposes.

Section 4 describes on-going data management practices that will be required, and explains the rationale behind the proposed standards.

Section 5 specifies "generic" data clean up procedures, routines, and suggests software modules that would target improving the quality of existing data.

Appendix 1 lists important abbreviations and USPS codes

Appendix 2 summarizes USPS recommendations

Address Quality Standards



1.0 Introduction to Problem and Scope

Address data quality is essential to almost every Enterprise business. This document helps lay the groundwork necessary to achieve Enterprise-wide address standardization and improve related data quality. Sections describe address format standards, editing standards, and the application of geocoding.

Problem Statement. Address information across Enterprise systems is inconsistent and often of poor quality. The inconsistency also means that all of the information cannot be fully compared, and in some cases cannot be trusted to be accurate. Problems include:

- Inconsistent format (address and geocoding fields with different scope and detail level)

- Low quality (defined as not useable or geocodable)

- Null data (addresses that should have been collected were not)

- Non-collection (addresses are not collected as a matter of practice).

In addition, most organizations pay the cost of redundant efforts. Often, the same address is geocoded multiple times. Several source systems and several "warehouse" systems geocode data independently. One system will geocode data and discard a piece of information that it may not specifically need (e.g. map coordinates), and another system will geocode the same data again to capture previously discarded information. The fact that information is retained in several formats means that a unique up-loading and interpretation/translation system must be designed for each source.

Standards must recognize that address requirements may vary depending on business purpose. If the enterprise has an interest in a specific property - as owner, mortgage insurer, subsidizer or similar - a specific place address (not just a Post Office box number) must exist. For many purposes, however, address is correctly a business or mailing address, which may well be a PO Box.

Scope. The proposals presented here describe standards for all address and geocode information collected and stored in the enterprise's automated systems.

2.0 Address Formats

2.1 Current Address Formats

A range of address formats is typically in use, and a substantial range can be found even within one system. For instance, one system with close to ten addresses in various files contains the following address format:

Street address number suffix

Street Pre-compass point

Street name

Street type

Street post compass point

City, State, Zip+4

The most typical format in that system is:

Address Line 1

Address Line 2

Address Line 3

Location, State, Zip+4

But one record-type contains only:

Address Line 1

Address Line 2

The fact that these formats differ is not necessarily incorrect - there may be valid business or technical reasons behind the selection of these formats. However, standard formats should be adhered to in the absence of compelling reasons for exceptions. The reasons behind any exceptions should be well documented.

Other sample address formats are listed in Appendix A.



2.2 Measuring Address Data Quality

Several terms, which are defined further in the Glossary, are used to describe address data quality:

o PRESENT or COMPLETE (an address has been entered)

o DELIVERABLE (by the US Postal Service, reliably)

o CORRECT or fully USEABLE (the address is a validly formatted combination of information such as city, state, Zip; using proper spellings and abbreviations)

o GEOCODABLE (by a commercial geocoding service or software)

o ACCURATE (the address is to the right place).

The focus of Enterprise address data quality is useability or correctness; however, this quality is difficult to measure in its entirety. Aspects of correctness, such as correct combinations of city and Zip codes, can be checked against Geographic Code tables and measured. Another standard of address data quality, "geocodability", has often been used. It is easy to measure, but it is an imperfect gauge: an address may be correct and not geocodable, or it may be geocodable and not accurate.
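The measures above can be reported as simple rates over a batch of records. The following is a minimal sketch, not a prescribed method: the `AddressCheck` flags and the `quality_rates` function are hypothetical names, and it assumes that each record has already been run through the relevant table checks and a geocoder.

```python
from dataclasses import dataclass


@dataclass
class AddressCheck:
    """Outcome of the checks for one address record (hypothetical flags)."""
    present: bool      # an address was entered
    correct: bool      # passed format and city/state/Zip table checks
    geocodable: bool   # a geocoder returned a match


def quality_rates(checks):
    """Return the share of records that meet each quality measure."""
    total = len(checks)
    if total == 0:
        return {}

    def share(predicate):
        return sum(1 for c in checks if predicate(c)) / total

    return {
        "present": share(lambda c: c.present),
        "correct": share(lambda c: c.correct),
        "geocodable": share(lambda c: c.geocodable),
        # Correct addresses that geocodability alone would count as failures.
        "correct_not_geocodable": share(lambda c: c.correct and not c.geocodable),
    }


batch = [
    AddressCheck(present=True, correct=True, geocodable=True),
    AddressCheck(present=True, correct=True, geocodable=False),
    AddressCheck(present=True, correct=False, geocodable=True),
    AddressCheck(present=False, correct=False, geocodable=False),
]
rates = quality_rates(batch)
```

Reporting the "correct but not geocodable" share separately keeps the two measures from being confused with one another.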



3.0 Standards - Suggested handling of address data & storage

This section proposes address and editing standards. The standards suggested are not mandatory for existing systems. However, these standards may need to be applied if the existing system must feed or otherwise interact with other systems, and should be applied if the system is substantially revised for other purposes. Standards applicability to existing systems will be determined on a case by case basis.

These standards apply both to addresses intended primarily for business contact (mailing address) and those intended to locate a specific property of interest (e.g. street address of a house owned). It is strongly recommended that systems explicitly collect both types as applicable.

3.1 Proposed Address Formats and Edits

An address contains four parts:
Addressee: Individual name; Organization name; Department name; Title, etc.

Geographic Street Address: Street Number; Street/Route/PO Box, APO, URB name(1), etc.; Pre direction; Post direction

Secondary Unit Designator: Unit #, Bldg. #, Apt. #, etc.; Secondary Unit Number

Geographic Location Address: Location (City)(2); State; Zip+4





Addressee

Since addressee information is not an essential part of the place-identification portion of an address, it is outside the focus of these standards. Addressee requirements are not further considered in these standards beyond asserting that addressee information should always be placed in the top line(s) of an address block.

Secondary Unit Designator

Secondary units include Unit, Bldg., Apt., Mail Stop, Suite, and so on. (A complete valid list is included in Appendix B.)

We separate Secondary Unit Designator as a distinct item to encourage correct data entry. (Some address-data and geocoding experts recommend presenting this field before street address. They argue that the factor most likely to cause geocoding failure is incorrectly placed secondary unit designator information, and that drawing the attention of the person entering the address to this field first will reduce errors.)

It should be understood that the USPS recommends placing the Secondary Unit Designator on the end of the same line as the street address information, and that is our recommendation for printing mailing labels, where practical. Mailing label standards are discussed further below.

Geographic Street and Location Address Standard Format

The street address identifies specific location or property. A parsed (not recommended) standard format for data entry and storage is shown below.

Geographic Address Standards - Option 1: Parsed

Field                      Format  Description
Street Number              A10     Street, box, or route number or letter
Street PreDirection        A2      e.g. N, E, S, W, NE, SE, NW, SW
Street Name                A55     Street name or "Route", "PO Box", APO/FPO, URB code etc.
Street Type                A4      e.g. Aly, Ln, Rd, Blvd, Pkwy etc.
Street Post Direction      A2      e.g. N, E, S, W, NE, SE, NW, SW
Secondary Unit Designator  A4      e.g. Unit, Apt., Bldg., etc.
Secondary Unit Number      A10     Number or letter.
Place Name                 A28     City or other locality name.
State Code                 A2      USPS alpha state code
Zip Code                   N5      5-digit Zip code
Zip4 Code                  N4      4-digit Zip code extension
International Postal Code  A14     Space for postal codes that may vary in length (optional)
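The parsed layout above can be expressed as a record with per-field length checks. This is a sketch under stated assumptions: the attribute names, `MAX_LENGTHS`, and `validate` are hypothetical, the field widths are taken from the Option 1 table, and the Zip checks test format only (validity against a code table is a separate edit).

```python
from dataclasses import dataclass, fields

# Maximum widths copied from the Option 1 (parsed) table.
MAX_LENGTHS = {
    "street_number": 10, "pre_direction": 2, "street_name": 55,
    "street_type": 4, "post_direction": 2, "unit_designator": 4,
    "unit_number": 10, "place_name": 28, "state_code": 2,
    "zip_code": 5, "zip4_code": 4, "international_postal_code": 14,
}


@dataclass
class ParsedAddress:
    street_number: str = ""
    pre_direction: str = ""
    street_name: str = ""
    street_type: str = ""
    post_direction: str = ""
    unit_designator: str = ""
    unit_number: str = ""
    place_name: str = ""
    state_code: str = ""
    zip_code: str = ""
    zip4_code: str = ""
    international_postal_code: str = ""


def _is_digits(value, length):
    return len(value) == length and value.isascii() and value.isdigit()


def validate(address):
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    for f in fields(address):
        limit = MAX_LENGTHS[f.name]
        if len(getattr(address, f.name)) > limit:
            problems.append(f"{f.name} longer than {limit} characters")
    if address.zip_code and not _is_digits(address.zip_code, 5):
        problems.append("zip_code must be 5 digits")
    if address.zip4_code and not _is_digits(address.zip4_code, 4):
        problems.append("zip4_code must be 4 digits")
    return problems


example = ParsedAddress(
    street_number="1234", pre_direction="N", street_name="FALATY PLATEAU",
    street_type="HTS", post_direction="NE", unit_designator="UPPR",
    unit_number="1000", place_name="UPPER RIVER INDIAN VILLAGE",
    state_code="AZ", zip_code="12345", zip4_code="1004",
)
```

Zip fields are held as strings here so that leading zeros are not lost; that is a design choice of the sketch, not a requirement stated in the table.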




"Parsed" data may be acceptable or preferred in several instances, for example when data is already available from sources in formats in which "address line" is parsed into several components. Data received parsed and preformatted may be retained without changes, using Enterprise-standard data names and USPS-suggested field lengths. The parsed approach is also useful when data must be specially consistent by street address, for example to support address comparison searches as part of a fraud detection program.

Unparsed street address data, shown in Option 2 below, are preferred whenever the principal purpose for an address is to provide a mail contact point or to provide an accurate basis for geocoding. The format is deliberately simple to decrease the chance of interpretation errors during data entry. A single line for street address, if practical, will reduce errors. Parsing data for database storage should be done as a routine practice only for the specific business reasons discussed above. The unparsed approach is STRONGLY RECOMMENDED.

Geographic Address Standard Format - Option 2: Unparsed

Field                      Format  Description
Street Address             A90     Includes (in preferred order): Street Number, Pre-Direction, Street Name/Rural Route/P.O. Box Number, Street Type, Post Direction.
Secondary Unit Designator  A15     Also known as "locator". Includes (in preferred order): Secondary Unit Type (Unit, Bldg., Apt., etc.), Secondary Unit Number.
Place Name                 A28     City or other place/locality name
State Code                 A2      USPS alpha state code
Zip Code                   N5      5-digit Zip code
Zip4 Code                  N4      4-digit Zip code extension
International Postal Code  A14     Space for postal codes that may vary in length (optional)(3)



Legend: N = Numeric, A = Alphanumeric

The following examples illustrate this format:

1234 PRIMACY PKWY

STE 201

MEMPHIS TN 12345 0001

1234 N FALATY PLATEAU HTS NE

UPPR 1000

UPPER RIVER INDIAN VILLAGE AZ 12345 1004
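Where parsed components are needed from an unparsed street line (for example, to support address comparison), the split can be left to software. The sketch below is deliberately naive and is not a substitute for commercial address software: `split_street_line` is a hypothetical name, and `STREET_TYPES` is a small illustrative subset rather than the full USPS list.

```python
DIRECTIONS = {"N", "E", "S", "W", "NE", "SE", "NW", "SW"}
# Illustrative subset only; a real list would come from the USPS abbreviations.
STREET_TYPES = {"ALY", "AVE", "BLVD", "HTS", "LN", "PKWY", "RD", "ST"}


def split_street_line(line):
    """Split an unparsed street line into the Option 1 street components."""
    tokens = line.upper().split()
    parts = {
        "number": "", "pre_direction": "", "name": "",
        "street_type": "", "post_direction": "",
    }
    # A leading token that starts with a digit is treated as the street number.
    if tokens and tokens[0][0].isdigit():
        parts["number"] = tokens.pop(0)
    # Directions and street type are only removed while at least one token
    # remains to serve as the street name.
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        parts["pre_direction"] = tokens.pop(0)
    if len(tokens) > 1 and tokens[-1] in DIRECTIONS:
        parts["post_direction"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        parts["street_type"] = tokens.pop()
    parts["name"] = " ".join(tokens)
    return parts


first = split_street_line("1234 PRIMACY PKWY")
second = split_street_line("1234 N FALATY PLATEAU HTS NE")
```

Lines that do not fit the pattern, such as a PO Box, fall through with the whole text kept in the name component, which leaves them available for manual review.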



Recommended Edits

Several types and levels of edits may be practical, depending on circumstances and business purpose.

1) Check entered data for valid abbreviations. (Abbreviation standards used by the USPS are included in Appendix B.)

2) Compare entered location (City) and State to Zip code (based on GCS or equivalent table information).

3) Check Zipcode for validity (based on GCS or equivalent table information).

4) Compare entered address against valid addresses:

o Against an existing database containing addresses (within the enterprise)

o Using COTS software modules, against a postal-service database of 140 million valid addresses.

5) Verify and correct the use of the standard state code, the standard spelling of the city, and the presence of a standard street type.

6) Inspect Street numbers that seem to represent ranges of addresses, such as street numbers in a range or the use of terms such as "scattered sites". (This only applies for those applications that receive addresses representing, for example, blocks of apartments).

7) Identify and correct building name substitutions for street addresses to the extent possible.

8) If County Code is missing, generate County Code.

9) Identify records where the range of latitude or longitude is more than 5 miles; inspect and correct them. (This is a way to detect whether the geocoded point is the center of a Zip code rather than a specific street address. It is unnecessary if the geocoding level is specified in a code, as is recommended.)

10) Identify and delete official verbiage. For example: "Township of", "The Commonwealth of", "The Great State of".

11) Comma Check. The USPS recommends not using commas or other dividers within addresses, except the hyphen in Zip+4. The USPS further recommends all capital letters, to aid machine readability.

12) Enforce Business Rules. For example, it may be a rule that P.O. Box numbers (and equivalent) may not substitute for Street names (and equivalent) if the address is for a property in which the enterprise holds an interest (as opposed to the mailing address of an individual or organization).
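Edits 10, 11, and 12 above lend themselves to small text routines. The following sketch is one possible arrangement, not the method used by any particular system: all function names are hypothetical, the verbiage list contains only the three phrases quoted in edit 10, and the PO Box pattern covers only a few common spellings.

```python
import re

# Phrases quoted in edit 10; a production list would be longer.
OFFICIAL_VERBIAGE = ("TOWNSHIP OF", "THE COMMONWEALTH OF", "THE GREAT STATE OF")

# Matches "PO BOX", "P.O. BOX", "P O BOX", and "POST OFFICE BOX" at line start.
PO_BOX = re.compile(r"^(P\.?\s*O\.?\s*BOX|POST OFFICE BOX)\b")


def normalize_text(value):
    """Edit 11: replace commas with spaces, upper-case, collapse whitespace."""
    return " ".join(value.replace(",", " ").upper().split())


def strip_official_verbiage(place):
    """Edit 10: drop a leading official phrase from a place name."""
    cleaned = normalize_text(place)
    for phrase in OFFICIAL_VERBIAGE:
        if cleaned.startswith(phrase + " "):
            cleaned = cleaned[len(phrase) + 1:]
    return cleaned


def is_po_box(street):
    """Return True when the street line appears to be a PO Box."""
    return bool(PO_BOX.match(normalize_text(street)))


def check_property_address(street):
    """Edit 12 (example rule): a property address may not be a PO Box."""
    if is_po_box(street):
        return ["PO Box is not allowed for a property address"]
    return []
```

For example, `strip_official_verbiage("Township of Upper Darby")` yields `UPPER DARBY`, and `check_property_address("P.O. Box 12")` reports one problem while a mailing address with the same value would pass.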



3.2 Mailing Address Standards (For printing mailing labels)

Regardless of the storage format used, we recommend adherence to all USPS suggestions for correctly formatting mailing address labels. As mentioned above, these include placing the Secondary Unit Designator on the end of the same line as the street address information (or above the street address), not using commas or other dividers within addresses, and use of all capital letters. Other USPS suggestions include use of Zip+4, and for bulk mailings, barcoding.
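The label conventions just listed can be applied at print time, whatever the storage format. A minimal sketch, assuming the unparsed (Option 2) fields as input; `format_mailing_label` and its parameter names are hypothetical, and barcoding is not covered.

```python
def format_mailing_label(addressee_lines, street, unit, place, state,
                         zip5, zip4=""):
    """Build label text: capitals, no commas, unit at end of street line."""

    def clean(text):
        # Replace commas with spaces, upper-case, and collapse whitespace.
        return " ".join(text.replace(",", " ").upper().split())

    # Addressee information goes on the top line(s) of the address block.
    lines = [clean(line) for line in addressee_lines if line.strip()]
    # Secondary unit designator follows the street address on the same line.
    lines.append(clean(f"{street} {unit}"))
    # The hyphen in Zip+4 is the one divider that is kept.
    zip_part = f"{zip5}-{zip4}" if zip4 else zip5
    lines.append(clean(f"{place} {state} {zip_part}"))
    return "\n".join(lines)


label = format_mailing_label(
    ["Jane Doe"], "1234 Primacy Pkwy", "Ste 201", "Memphis", "TN",
    "12345", "0001",
)
```

Here `label` holds three lines: `JANE DOE`, `1234 PRIMACY PKWY STE 201`, and `MEMPHIS TN 12345-0001`.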

USPS mailing address recommendations are reproduced in Appendix B.

3.3 Geocode Data Format and Editing Standards

The following information should be added or related to applicable address databases. Geocoding is ideally sponsored by the business organization served by the source system. However, we recommend establishment and use of a central geocoding service that would help ensure consistent standards and information use, help control quality, and standardize the level to which geocoding is conducted.(4)

Geocode Data Format

Field                        Format  Description
FIPS State Numeric Code      N2      FIPS State numeric code
FIPS County Code             N3      FIPS County code
MCD/CCD Code                 N5      Census location code (MCD applies only to some locations)
FIPS Place Code              N5      FIPS place code(5). Includes city, town, village, rural area, and other locality names.
Congressional District Code  N3      Congressional District for the current Census period.
MSA Code                     N4      Census Metropolitan Statistical Area code(6)
Census Tract Code            N6.2    Census Tract - a well-defined geographical area (finer identification than place code). In some areas the equivalent is called Block Numbering Area (BNA)
Block Code                   A5      Census Block - a set of street addresses. (The first digit of Block is Block Group.)
Latitude                     N7.4    Map coordinate.
Longitude                    N8.4    Map coordinate.
Geocode Basis Code           A1      What unit (e.g. street address, census tract, 5-digit Zip code) is represented by the latitude and longitude centroid.
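One practical consequence of storing the FIPS codes above as numeric fields is that leading zeros are dropped, so they need to be restored whenever the state and county codes are combined into a single identifier. A small sketch, with `county_fips` as a hypothetical helper name:

```python
def county_fips(state_code, county_code):
    """Combine numeric FIPS state (N2) and county (N3) codes into 5 digits."""
    if not 0 < state_code <= 99:
        raise ValueError("state code must fit in two digits")
    if not 0 < county_code <= 999:
        raise ValueError("county code must fit in three digits")
    # Zero-pad each part so that state 1, county 1 becomes "01001".
    return f"{state_code:02d}{county_code:03d}"


autauga = county_fips(1, 1)  # Autauga County, Alabama
```

Without the padding, state 1 with county 1 would read as "11" and could not be told apart from other combinations.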


This list is intended to represent the "lowest common denominator" of geocoding data that all systems should maintain. In addition, systems may optionally use additional geocode data, including but not limited to the following examples:

Sample Optional Geocode Data
Census Place Code N4

State Name A30

County Name A30

Place Name A30

MSA Name A30

Central City Code N5

State Population Code N1

County Population Code N1

Place Population Code N1

Population Count N8

Type Place Code N1

Geocode Level Code N1

Units Count N5





Most of these examples are attributes of key geocodes; for example, COUNTY POPULATION CODE describes a county uniquely identified by the County Code. Others, such as CENSUS PLACE CODE, are alternative coding schemes that may be needed for specific business purposes.

Geocode Editing and Management

Several geocode features may change over time. Timely update is a key issue, as is the need for maintaining a stable basis for statistical comparison over time.

1) Congressional District. A significant number change after every 10-year census. A very few may change at irregular intervals (in response to court challenges to district composition).

2) Census Blocks, Sites and Tracts. Change rarely, but may be adjusted or added to after every 10-year census.

3) MSA (Metropolitan Statistical Area), Place codes, Counties. May change or be added at irregular intervals, and should be reviewed and updated at least yearly.

4) Latitude and Longitude. Ensure that data provided is in correct order, and that "-" marks are properly placed.
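The latitude and longitude check in item 4 can be partly automated by testing coordinates against the ranges expected for the area served. The sketch below assumes rough bounding ranges for the 50 states and Puerto Rico; these ranges are our assumption, not part of the standard, and locations such as Guam or the westernmost Aleutian Islands fall outside them. The name `diagnose_coordinates` is hypothetical.

```python
# Rough, assumed bounds in decimal degrees (50 states and Puerto Rico).
LAT_RANGE = (17.5, 72.0)
LON_RANGE = (-180.0, -64.5)


def _within(value, bounds):
    return bounds[0] <= value <= bounds[1]


def diagnose_coordinates(latitude, longitude):
    """Classify a coordinate pair as ok, swapped, wrong sign, or out of range."""
    if _within(latitude, LAT_RANGE) and _within(longitude, LON_RANGE):
        return "ok"
    # Values are plausible only when the two fields are exchanged.
    if _within(longitude, LAT_RANGE) and _within(latitude, LON_RANGE):
        return "swapped"
    # Longitude is plausible only when its "-" mark is restored.
    if _within(latitude, LAT_RANGE) and _within(-longitude, LON_RANGE):
        return "longitude sign"
    return "out of range"
```

A record flagged as anything other than "ok" would be routed for inspection rather than corrected automatically, since the check cannot tell which source field is at fault.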





4.0 Managing Data Quality

Address data quality relies on a chain of events: what is asked for and how, how it is entered, what is edited and when, and finally how data is reviewed, stored, and used. Data quality is the shared responsibility of information providers and data entry staff as well as the data stewards. The following paragraphs highlight techniques for ensuring good sources for the address data, correct data entry, and maintaining data quality through management practices.

4.1 Data Source

Since the data source is often outside of the enterprise, control is complicated. Source data quality can be aided by clear instructions and statements of expectations to the information providers (such as applicants). The information request must be clear and the form must be uncluttered and well-blocked.

A strategy that has been used successfully is to provide free data entry software to important data sources, with optional on-line updates to systems. This approach allows greater control over the pre-editing of submitted data.

4.2 Data Entry

Correct data entry can be managed by:

1) Well designed information collection processes. For data entry performed by someone who is not personally familiar with the address, experts (e.g. U.S. Postal Service, Census Bureau, and commercial Geocoding services) agree that the simplest format is best. The rationale is that a person should be allowed to enter the address as they receive it, since forcing data entry staff to parse data leads to more errors. In other words, software performs the parsing task more reliably than humans.

A second recommendation is that, wherever applicable, both a mailing/business address and a place address should be collected. Where the two happen to be the same, the mailing address may also serve as the place address as a data entry convention, but the place address should still be retained separately.

2) Incorporated data edits. (Using GCS, Commercial off-the-shelf software, etc.) For example, compare the entered state/city to Zipcode.

3) Data entry operator training. For example, what to look for, standards, recommended data sequence, and dealing with edits.

4) Effective quality assurance procedures. For example, when supervisors should check, statistical/quality sampling, incentives as applicable. Data that has been entered can be reviewed and edited in batch mode. Failed items are returned for correction or further review.

Specific data entry recommendations include:

1) Zip code entry first, with automatic fill of State and (optionally) locality data.

2) Support on-line entry with help screens, pop-up valid values access, and immediate edits.

3) Secondary unit data entry separate from street address (optionally before street for emphasis)

4) Addresses entered with manual overrides of edits should be flagged for future review.

5) Allow search for Zip code given City and State (optional).
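Recommendations 1 and 5, together with the city/state-to-Zip comparison, all rest on the same lookup table. The sketch below uses a three-row `ZIP_TABLE` as a stand-in for GCS or an equivalent table; the table contents and the function names are illustrative assumptions.

```python
# Tiny stand-in for a GCS-style Zip code table (illustrative rows only).
ZIP_TABLE = {
    "38103": ("MEMPHIS", "TN"),
    "20410": ("WASHINGTON", "DC"),
    "90210": ("BEVERLY HILLS", "CA"),
}


def autofill_from_zip(zip5):
    """Return (place, state) for a Zip code, or None when it is not on file."""
    return ZIP_TABLE.get(zip5)


def zip_matches(zip5, place, state):
    """Compare an entered place and state against the table entry for the Zip."""
    entry = ZIP_TABLE.get(zip5)
    if entry is None:
        return False
    return entry == (place.strip().upper(), state.strip().upper())


def zips_for(place, state):
    """Search for Zip codes given a place and state."""
    key = (place.strip().upper(), state.strip().upper())
    return sorted(z for z, location in ZIP_TABLE.items() if location == key)
```

In a data entry screen, `autofill_from_zip` would run as soon as the Zip code is keyed, and a `None` result or a failed `zip_matches` comparison would trigger an immediate edit message, with any manual override flagged for later review.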




Address Standards Appendix






Last Updated September 5, 1996 by info@kismeta.com

1. APO/FPO: Defense Department Addresses; URB: Urbanization Projects in Puerto Rico.

2. The "Location" information may be city, village, rural area or route, tribal reservation or other place type. The Location name may be any commonly used local name; for geocoding or statistical assessment purposes either the Census Place Code, or other measure such as county or Zipcode shall be used.

3. This field is recommended whenever there is a possibility that a foreign postal code will be needed.

4. IT has made some progress toward creating a central geocoding assistance function by sponsoring a contract with a geocoding service that may be used by any organization. Geocoding software is also under investigation.

5. The FIPS and CENSUS Place codes are in different formats.

6. MSA as used here includes CMSA, PMSA, NECMA, and MSA as defined in FIPS 8-6.