Keywords: Address, Data Quality, Address Standards, Geocode, GIS, Geographic, Warehouse
Based on a public US Gov't document. Authors: R. Orli, L. Blake, F. Santos, A. Ippilito, this version © 1996 richard j. orli
Summary
Data quality is a fundamental data management objective, and location information is
essential to many key applications. Location is also key to many statistical and management
interests, and is indispensable for geographic information systems. This document helps
lay the groundwork necessary to achieve address standardization and improve related data
quality. Topics include address format standards, editing standards, and the application
of geocoding.
The problem faced by anyone comparing address information across systems is that addresses are inconsistent and often of poor quality. The inconsistency also means that all of the information cannot be fully compared, and in some cases cannot be trusted to be accurate. In addition, the enterprise pays the cost of redundant efforts. In many organizations, the same address is geocoded multiple times, and multiple independent efforts must be maintained to manage data extracts and transforms.
This document, derived from one prepared originally for the U.S. Department of Housing
and Urban Development, covers the following issues:
Section 1 Sections 2 Section 3
The address format standard includes two options: the simplest possible (unparsed) format for normal use, and a parsed format for special purposes, such as accommodating data received in parsed format. The geocode standard format will meet most general purposes.
Section 4 describes on-going data management practices that will be required, and explains the rationale behind the proposed standards.
Section 5 specifies "generic" data clean-up procedures and routines, and suggests software modules that would target improving the quality of existing data.
Appendix 1 lists important abbreviations and USPS codes
Appendix 2 summarizes USPS recommendations
Address Quality Standards
1.0
Address data quality is essential to almost every Enterprise business. This document helps
lay the groundwork necessary to achieve Enterprise-wide address standardization and
improve related data quality. Sections describe address format standards, editing
standards, and the application of geocoding.
Problem Statement. Address information across Enterprise systems is inconsistent and often of poor quality. The inconsistency also means that all of the information cannot be fully compared, and in some cases cannot be trusted to be accurate. Problems include:
- Inconsistent format (address and geocoding fields with different scope and detail level)
- Low quality (defined as not useable or geocodable)
- Null data (addresses that should have been collected were not)
- Non-collection (addresses are not collected as a matter of practice).
In addition, most organizations pay the cost of redundant efforts. Often, the same address
is geocoded multiple times. Several source systems and several "warehouse" systems
geocode data independently. One system will geocode data and discard a piece of
information that it may not specifically need (e.g. map coordinates), and another system
will geocode the same data again to capture previously discarded information. The fact
that information is retained in several formats means that a unique up-loading and
interpretation/translation system must be designed for each source.
Standards must recognize that address requirements may vary depending on business
purpose. If the enterprise has an interest in a specific property - as owner, mortgage
insurer, subsidizer or similar - a specific place address (not just a Post Office box number)
must exist. For many purposes, however, address is correctly a business or mailing
address, which may well be a PO Box.
Scope. The proposals presented here describe standards for all address and geocode
information collected and stored in the enterprise's automated systems.
2.0
2.1 Current Address Formats
A range of address formats is typically in use, and a substantial range can be found even
within one system. For instance, one system with close to ten addresses in various files
contains the following address format:
Street address number suffix
Street Pre-compass point
Street name
Street type
Street post compass point
City, State, Zip+4
The most typical format in that system is:
Address Line 1
Address Line 2
Address Line 3
Location, State, Zip+4
But one record-type contains only:
Address Line 1
Address Line 2
The fact that these formats differ is not necessarily incorrect - there may be valid business
or technical reasons behind the selection of these formats. However, standard formats
should be adhered to in the absence of compelling reasons for exceptions. The reasons
behind any exceptions should be well documented.
Other sample address formats are listed in Appendix A.
2.2 Measuring Address Data Quality
Several terms, which are defined further in the Glossary, are used to describe address data quality:
o PRESENT or COMPLETE (an address has been entered)
o DELIVERABLE (by the US Postal Service, reliably)
o CORRECT or fully USEABLE (the address is a validly formatted combination of information such as city, state, Zip; using proper spellings and abbreviations)
o GEOCODABLE (by a commercial geocoding service or software)
o ACCURATE (the address is to the right place).
The focus of Enterprise address data quality is useability or correctness; however, this
quality is difficult to measure in entirety. Aspects of correctness, such as correct
combinations of city and Zip codes, can be checked against Geographic Code tables and
measured. Another standard of address data quality, "geocodability", has often been
used. While it is easy to measure, the drawback of using geocodability as a gauge is that an
address may be correct and not geocodable; or it may be geocodable and not accurate.
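The distinction between these measures can be made concrete with a small sketch. It assumes a hypothetical reference table mapping each Zip code to its valid city/state pairs (standing in for the Geographic Code tables mentioned above) and a hypothetical `geocoded` flag recorded for each address; the record layout and function names are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

# Hypothetical reference table: Zip code -> valid (city, state) pairs.
ZipTable = Dict[str, Set[Tuple[str, str]]]


@dataclass
class AddressRecord:
    street: str
    city: str
    state: str
    zip5: str
    geocoded: bool  # True if a geocoding service returned a match


def is_present(rec: AddressRecord) -> bool:
    """PRESENT: some street address has been entered."""
    return bool(rec.street.strip())


def is_consistent(rec: AddressRecord, zip_table: ZipTable) -> bool:
    """One measurable aspect of CORRECT: city/state is valid for the Zip code."""
    return (rec.city.upper(), rec.state.upper()) in zip_table.get(rec.zip5, set())


def quality_summary(records: Iterable[AddressRecord], zip_table: ZipTable) -> Dict[str, float]:
    """Return each quality measure as a fraction of all records."""
    records = list(records)
    total = len(records)
    if total == 0:
        return {}
    counts = {
        "present": sum(is_present(r) for r in records),
        "city_state_zip_consistent": sum(is_consistent(r, zip_table) for r in records),
        "geocodable": sum(r.geocoded for r in records),
        "consistent_but_not_geocodable": sum(
            is_consistent(r, zip_table) and not r.geocoded for r in records
        ),
    }
    return {name: count / total for name, count in counts.items()}
```

Reporting the last measure separately keeps visible the addresses that pass the table check yet fail geocoding, which a single geocodability rate would hide.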
3.0
This section proposes address and editing standards. The standards suggested are not mandatory for existing systems. However, these standards may need to be applied if the existing system must feed or otherwise interact with other systems, and should be applied if the system is substantially revised for other purposes. Standards applicability to existing systems will be determined on a case by case basis.
These standards apply both to addresses intended primarily for business contact (mailing address) and those intended to locate a specific property of interest (e.g. street address of a house owned). It is strongly recommended that systems explicitly collect both types as applicable.
3.1 Proposed Address Formats and Edits
An address contains four parts:
| Addressee | Individual name, Organization name, Department name, Title, etc. |
| Geographic Street Address | Street Number, Street/Route/PO Box, APO, URB name(1), etc., Pre direction, Post direction |
| Secondary Unit Designator | Unit#, Bldg. #, Apt. #, etc., Secondary Unit Number |
| Geographic Location Address | Location (City)(2), State, Zip+4. |
Addressee
Since addressee information is not an essential part of the
place-identification portion of an address it is outside of the focus of these
standards. Addressee requirements are not further considered in these
standards beyond asserting that addressee information should always be
placed in the top line(s) of an address block.
Secondary Unit Designator
Secondary units include Unit, Bldg., Apt., Mail Stop, Suite, and so on. (A complete valid list is included in Appendix B.)
We separate Secondary Unit Designator as a distinct item to encourage correct data entry. (Some address-data and geocoding experts recommend presenting this field before the street address. They argue that the factor most likely to cause geocoding failure is incorrectly placed secondary unit designator information, and that drawing the attention of the person entering the address to this field first will reduce errors.)
It should be understood that the USPS recommends placing the Secondary
Unit Designator on the end of the same line as the street address
information, and that is our recommendation for printing mailing labels,
where practical. Mailing label standards are discussed further below.
Geographic Street and Location Address Standard Format
The street address identifies a specific location or property. A parsed (not
recommended) standard format for data entry and storage is shown below.
Geographic Address Standards - Option 1: Parsed
"Parsed" data may be acceptable or preferred in several instances, for
example when data is already available from sources in formats in which
"address line" is parsed into several components. Data received parsed and
preformatted may be retained without changes, using Enterprise-standard
data names and USPS-suggested field lengths. The parsed approach is also
useful when data must be specially consistent by street address, for example
to support address comparison searches as part of a fraud detection
program. Unparsed street address data, shown in Option 2 below, are preferred
whenever the principal purpose for an address is to provide a mail contact
point or to provide an accurate basis for geocoding. The format is
deliberately simple to decrease the chance of interpretation errors during
data entry. A single line for street address, if practical, will reduce errors.
Parsing data for database storage should be done as a routine practice only
for the specific business reasons discussed above. The unparsed approach is
STRONGLY RECOMMENDED.
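Where parsed components are needed for one of the business reasons above, they can be derived by software from the unparsed street line rather than keyed in separately. The following is a deliberately naive sketch of that idea, not a production parser: the function name, the result type, and the short list of street types are assumptions made for illustration, and real address parsing needs the full USPS abbreviation tables and many more special cases.

```python
from typing import NamedTuple

DIRECTIONS = {"N", "E", "S", "W", "NE", "SE", "NW", "SW"}
# Small illustrative subset only; a real list would come from the USPS tables.
STREET_TYPES = {"ALY", "LN", "RD", "BLVD", "PKWY", "HTS", "ST", "AVE"}


class ParsedStreet(NamedTuple):
    number: str
    pre_direction: str
    name: str
    street_type: str
    post_direction: str


def parse_street_line(line: str) -> ParsedStreet:
    """Split an unparsed street line into the parsed (Option 1) components."""
    tokens = line.upper().replace(",", " ").split()
    number = pre = street_type = post = ""
    # A leading token that starts with a digit is taken as the street number.
    if tokens and tokens[0][0].isdigit():
        number = tokens.pop(0)
    # Work inward from the end, always leaving at least one token for the name.
    if len(tokens) > 1 and tokens[-1] in DIRECTIONS:
        post = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        street_type = tokens.pop()
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        pre = tokens.pop(0)
    return ParsedStreet(number, pre, " ".join(tokens), street_type, post)
```

For example, `parse_street_line("1234 N FALATY PLATEAU HTS NE")` yields number `1234`, pre-direction `N`, name `FALATY PLATEAU`, type `HTS`, and post-direction `NE`, while a line such as `PO BOX 123` is left whole in the name component.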
Geographic Address Standard Format - Option 2: Unparsed
Street Number,
Pre-Direction,
Street Name/Rural Route/P.O. Box Number,
Street type,
Post Direction. (also known as "locator")
Secondary Unit Type (Unit, Bldg., Apt., etc.)
Secondary Unit Number
Legend: N = Numeric, A = Alphanumeric
The following examples illustrate this format:
1234 PRIMACY PKWY
STE 201
MEMPHIS TN 12345 0001
1234 N FALATY PLATEAU HTS NE
UPPR 1000
UPPER RIVER INDIAN VILLAGE AZ 12345 1004
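As a rough illustration of how the unparsed format might be held and checked in software, the sketch below models one record using the field lengths given for Option 2 (Street Address A90, Secondary Unit Designator A15, Place Name A28, State Code A2, Zip Code N5, Zip4 Code N4). The class and attribute names are hypothetical, and Zip codes are kept as text here so that leading zeros survive, which is an implementation choice rather than part of the standard.

```python
from dataclasses import dataclass
from typing import List

# field name -> (maximum length, digits only?), per the Option 2 formats
FIELD_FORMATS = {
    "street_address": (90, False),
    "secondary_unit": (15, False),
    "place_name": (28, False),
    "state_code": (2, False),
    "zip_code": (5, True),
    "zip4_code": (4, True),
}


@dataclass
class UnparsedAddress:
    street_address: str
    secondary_unit: str = ""
    place_name: str = ""
    state_code: str = ""
    zip_code: str = ""
    zip4_code: str = ""

    def format_errors(self) -> List[str]:
        """Return a message for every field that violates its length or type."""
        errors = []
        for name, (max_len, numeric) in FIELD_FORMATS.items():
            value = getattr(self, name)
            if len(value) > max_len:
                errors.append(f"{name}: longer than {max_len} characters")
            elif numeric and value and not (
                value.isascii() and value.isdigit() and len(value) == max_len
            ):
                errors.append(f"{name}: expected exactly {max_len} digits")
        return errors

    def as_lines(self) -> List[str]:
        """Lay the record out in the three-line form used in the examples."""
        last = " ".join(
            part
            for part in (self.place_name, self.state_code, self.zip_code, self.zip4_code)
            if part
        )
        return [line for line in (self.street_address, self.secondary_unit, last) if line]
```

Built from the first example above, `UnparsedAddress("1234 PRIMACY PKWY", "STE 201", "MEMPHIS", "TN", "12345", "0001").as_lines()` reproduces the same three lines.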
Recommended Edits
Several types and levels of edits may be practical, depending on
circumstances and business purpose.
1) Check entered data for valid abbreviations. (Abbreviation standards used
by the USPS are included in Appendix B.)
2) Compare entered Location (City) and State to Zipcode (based on GCS or
equivalent table information).
3) Check Zipcode for validity (based on GCS or equivalent table
information).
4) Compare entered address against valid addresses:
o Against an existing database containing addresses (within the enterprise)
o Using COTS software modules, against a postal-service database of 140
million valid addresses.
5) Verify and correct the standard use of state code, standard spelling for
city; and presence of standard street type.
6) Inspect Street numbers that seem to represent ranges of addresses, such
as street numbers in a range or the use of terms such as "scattered sites".
(This only applies for those applications that receive addresses representing,
for example, blocks of apartments).
7) Identify and correct building name substitutions for street addresses to
the extent possible.
8) If County Code is missing, generate County Code.
9) Identify where range of latitude or longitude is more than 5 miles.
Inspect and correct. (This is a way to detect whether the geocoded point is the center of a
Zip code rather than a specific street address. It is unnecessary if the
geocoding level is specified in a code, as is recommended.)
10) Identify and delete official verbiage. For example: "Township of", "The
Commonwealth of", "The Great State of".
11) Comma Check. The USPS recommends not using commas or other
dividers within addresses, except the hyphen in Zip+4. The USPS further
recommends all capital letters, to aid machine readability.
12) Enforce Business Rules. For example, it may be a rule that P.O. Box
numbers (and equivalent) may not substitute for Street names (and
equivalent) if the address is for a property in which the enterprise holds an
interest (as opposed to the mailing address of an individual or organization).
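A few of the edits above lend themselves to simple automation. The sketch below illustrates edits 1, 10, 11 and 12 only, under stated assumptions: the function names are hypothetical, the abbreviation table is a small sample rather than the full USPS list, semicolons are treated as "other dividers" alongside commas, and the token-by-token abbreviation replacement is naive (it would also rewrite a street whose name happens to be one of the listed words).

```python
import re
from typing import List

# Edit 10: official verbiage to strip (the examples given in the text).
OFFICIAL_VERBIAGE = ("TOWNSHIP OF ", "THE COMMONWEALTH OF ", "THE GREAT STATE OF ")

# Edit 1: small sample of spelled-out words and their USPS abbreviations.
USPS_ABBREVIATIONS = {
    "STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "BOULEVARD": "BLVD",
    "PARKWAY": "PKWY", "LANE": "LN", "ALLEY": "ALY", "HEIGHTS": "HTS",
    "SUITE": "STE", "APARTMENT": "APT", "BUILDING": "BLDG",
}

# Edit 12: recognise a P.O. Box written in a few common ways.
PO_BOX = re.compile(r"^\s*(P\.?\s*O\.?\s*BOX|POST OFFICE BOX)\b", re.IGNORECASE)


def normalize_line(text: str) -> str:
    """Edits 10 and 11: capitals, no commas or semicolons, no official verbiage."""
    cleaned = re.sub(r"[,;]", " ", text.upper())
    cleaned = " ".join(cleaned.split())  # collapse repeated spaces
    for phrase in OFFICIAL_VERBIAGE:
        if cleaned.startswith(phrase):
            cleaned = cleaned[len(phrase):]
    return cleaned


def standardize_abbreviations(line: str) -> str:
    """Edit 1: replace spelled-out words with their standard abbreviations."""
    return " ".join(USPS_ABBREVIATIONS.get(token, token) for token in line.split())


def business_rule_errors(street_line: str, is_property_address: bool) -> List[str]:
    """Edit 12: a P.O. Box may not stand in for the street address of a property."""
    if is_property_address and PO_BOX.match(street_line):
        return ["P.O. Box is not acceptable as a property (place) address"]
    return []
```

Applied in sequence, `standardize_abbreviations(normalize_line("1234 Primacy Parkway, Suite 201"))` gives `1234 PRIMACY PKWY STE 201`. The business rule takes an explicit flag because, as noted earlier, a P.O. Box remains perfectly valid for a mailing address.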
3.2 Mailing Address Standards (For printing mailing labels)
Regardless of the storage format used, we recommend adherence to all
USPS suggestions for correctly formatting mailing address labels. As
mentioned above, these include placing the Secondary Unit Designator on
the end of the same line as the street address information (or above the
street address), not using commas or other dividers within addresses, and
use of all capital letters. Other USPS suggestions include use of Zip+4, and
for bulk mailings, barcoding.
USPS mailing address recommendations are reproduced in Appendix B.
3.3 Geocode Data Format and Editing Standards
The following information should be added or related to applicable address
databases. Geocoding is ideally sponsored by the business organization
served by the source system. However, we recommend establishment and
use of a central geocoding service that would help ensure consistent
standards and information use, help control quality, and standardize the
level to which geocoding is conducted.(4)
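One way to picture such a central service is as a thin layer in front of whatever geocoding software or vendor is used: it geocodes each distinct address once and retains the complete result, so that no system needs to geocode the same address again to recover a field another system discarded. The sketch below assumes a hypothetical `backend` callable that takes an address string and returns a dictionary of geocode fields; it is not the interface of any particular product.

```python
from typing import Callable, Dict

GeocodeResult = Dict[str, object]


class CentralGeocoder:
    """Geocode each distinct address once and keep the full result."""

    def __init__(self, backend: Callable[[str], GeocodeResult]) -> None:
        self._backend = backend  # hypothetical geocoding function or vendor call
        self._cache: Dict[str, GeocodeResult] = {}
        self.backend_calls = 0

    @staticmethod
    def _key(address: str) -> str:
        # Upper-case, drop commas, and collapse spaces so that trivially
        # different spellings of one address share a single cache entry.
        return " ".join(address.upper().replace(",", " ").split())

    def geocode(self, address: str) -> GeocodeResult:
        key = self._key(address)
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._backend(key)
        # Return a copy so that callers cannot alter the stored result.
        return dict(self._cache[key])
```

Each consuming system then picks the fields it needs from the returned result, while the service keeps the rest available for others.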
Geocode Data Format
This list is intended to represent the "lowest common denominator" of
geocoding data that all systems should maintain. In addition, systems may
optionally use additional geocode data, including but not limited to the
following examples:
Sample Optional Geocode Data
State Name A30
County Name A30
Place Name A30
MSA Name A30
Central City Code N5
State Population Code N1
County Population Code N1
Place Population Code N1
Population Count N8
Type Place Code N1
Geocode Level Code N1
Units Count N5
Most of these examples are attributes of key geocodes; for example,
COUNTY POPULATION CODE describes a county uniquely identified by
the County Code. Others, such as CENSUS PLACE CODE, are alternative
coding schemes that may be needed for specific business purposes.
Geocode Editing and Management
Several geocode features may change over time. Timely update is a key
issue, as is the need for maintaining a stable basis for statistical comparison
over time.
1) Congressional District. A significant number change after every 10 year
census. A very few may change at irregular intervals (in response to court
challenges to district composition).
2) Census Blocks, Sites and Tracts. Change rarely, but may be adjusted or
added to after every 10 year census.
3) MSA (Metropolitan Statistical Area), Place codes, Counties. May change
or be added at irregular intervals, and should be reviewed and updated at
least yearly.
4) Latitude and Longitude. Ensure that data provided is in correct order,
and that "-" marks are properly placed.
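The latitude and longitude check in item 4 can be sketched as a simple range test. The function name and messages are hypothetical. The sign test rests on an assumption that should be confirmed for the data at hand: that addresses lie in the Western Hemisphere, so longitude should be negative. That does not hold for every US location (some territories and the far western Aleutian Islands have positive longitudes), so it is exposed as a switch.

```python
from typing import List


def check_coordinates(
    latitude: float,
    longitude: float,
    expect_negative_longitude: bool = True,
) -> List[str]:
    """Return a list of suspected problems with a latitude/longitude pair."""
    problems = []
    lat_ok = -90.0 <= latitude <= 90.0
    lon_ok = -180.0 <= longitude <= 180.0
    if not lat_ok:
        # If the pair becomes valid once swapped, the order is the likely culprit.
        if -90.0 <= longitude <= 90.0 and -180.0 <= latitude <= 180.0:
            problems.append("latitude and longitude appear to be in reversed order")
        else:
            problems.append("latitude is out of range")
    if not lon_ok:
        problems.append("longitude is out of range")
    if lat_ok and lon_ok and expect_negative_longitude and longitude > 0:
        problems.append("longitude is positive; check the '-' mark")
    return problems
```

A pair such as (35.1495, -90.0490) passes, the same values entered in the opposite order are reported as reversed, and (35.1495, 90.0490) is reported for its missing minus sign.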
4.0
Address data quality relies on a chain of events: what is asked for and how,
how it is entered, what is edited and when, and finally how data is reviewed,
stored, and used. Data quality is the shared responsibility of information
providers and data entry staff as well as the data stewards. The following
paragraphs highlight techniques for ensuring good sources for the address
data, correct data entry, and maintaining data quality through management
practices.
4.1 Data Source
Since the data source is often outside of the enterprise, control is
complicated. Source data quality can be aided by clear instructions and
statements of expectations to the information providers (such as applicants).
The information request must be clear and the form must be uncluttered
and well-blocked.
A strategy that has been successfully used is provision of free data entry
software to important data sources, with optional on-line updates to systems.
This approach allows greater control over the pre-editing of submitted data.
4.2 Data Entry
Correct data entry can be managed by:
1) Well designed information collection processes. For data entry
performed by someone who is not personally familiar with the address,
experts (e.g. U.S. Postal Service, Census Bureau, and commercial Geocoding
services) agree that the simplest format is best. The rationale is that a
person should be allowed to enter the address as they receive it, since
forcing data entry staff to parse data leads to more errors. In other words,
software performs the parsing task more reliably than humans.
A second recommendation is that, wherever applicable, both a
mailing/business address and a place address should be collected. In cases
where both are the same by coincidence, the mailing address may also be the
place address as a data entry convention, but the place address should be
retained separately as well.
2) Incorporated data edits. (Using GCS, Commercial off-the-shelf
software, etc.) For example, compare the entered state/city to Zipcode.
3) Data entry operator training. For example, what to look for, standards,
recommended data sequence, and dealing with edits.
4) Effective quality assurance procedures. For example, when supervisors
should check, statistical/quality sampling, incentives as applicable. Data
that has been entered can be reviewed and edited in batch mode. Failed
items are returned for correction or further review.
Specific data entry recommendations include:
1) Zip code entry first, with automatic fill of State and (optionally) locality
data.
2) Support on-line entry with help screens, pop-up valid values access, and
immediate edits.
3) Secondary unit data entry separate from street address (optionally before
street for emphasis)
4) Addresses entered with manual overrides of edits should be flagged for
future review.
5) Allow search for Zip code given City and State (optional).
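Recommendations 1, 4 and 5 above can be sketched together. The example assumes a hypothetical reference table mapping each Zip code to a single city and state (a real table, such as GCS, may list several localities per Zip code), and the function and field names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical reference table: Zip code -> (city, state).
ZipTable = Dict[str, Tuple[str, str]]


@dataclass
class EntryResult:
    city: str
    state: str
    zip5: str
    needs_review: bool = False  # set when an edit was manually overridden
    messages: List[str] = field(default_factory=list)


def enter_location(
    zip5: str,
    zip_table: ZipTable,
    city: Optional[str] = None,
    state: Optional[str] = None,
    override: bool = False,
) -> EntryResult:
    """Zip code first: fill city and state from the table, or check what was typed."""
    known = zip_table.get(zip5)
    if known is None:
        if not override:
            raise ValueError(f"Zip code {zip5} not found in reference table")
        return EntryResult((city or "").upper(), (state or "").upper(), zip5, True,
                           ["Zip code not in reference table; entered with override"])
    ref_city, ref_state = known
    entered = ((city or ref_city).upper(), (state or ref_state).upper())
    if entered == (ref_city, ref_state):
        return EntryResult(ref_city, ref_state, zip5)
    if not override:
        raise ValueError(f"City/State does not match Zip code {zip5}")
    return EntryResult(entered[0], entered[1], zip5, True,
                       ["City/State differs from reference table; entered with override"])


def find_zips(city: str, state: str, zip_table: ZipTable) -> List[str]:
    """Recommendation 5: look up candidate Zip codes for a city and state."""
    wanted = (city.upper(), state.upper())
    return sorted(z for z, place in zip_table.items() if place == wanted)
```

Raising an error by default and accepting the entry only with an explicit override mirrors the recommendation that overridden edits be flagged: the record is still saved, but `needs_review` marks it for later inspection.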
Address Standards Appendix
Geographic Address Standards - Option 1: Parsed
| Field | Format | Description |
| Street Number | A10 | Street, box, or route number or letter |
| Street PreDirection | A2 | e.g. N, E, S, W, NE, SE, NW, SW |
| Street Name | A55 | Street name or "Route", "PO Box", APO/FPO, URB code etc. |
| Street Type | A4 | e.g. Aly, Ln, Rd, Blvd, Pkwy etc. |
| Street Post Direction | A2 | e.g. N, E, S, W, NE, SE, NW, SW |
| Secondary Unit Designator | A4 | e.g. Unit, Apt., Bldg., etc. |
| Secondary Unit Number | A10 | Number or letter. |
| Place Name | A28 | City or other locality Name. |
| State Code | A2 | USPS alpha state code |
| Zip Code | N5 | 5-digit Zip code |
| Zip4 Code | N4 | 4-digit Zipcode extension |
| International Postal Code | A14 | Space for postal codes that may vary in length (optional) |
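Geographic Address Standard Format - Option 2: Unparsed
| Field | Format | Description |
| Street Address | A90 | Includes (in preferred order): Street Number, Pre-Direction, Street Name/Rural Route/P.O. Box Number, Street Type, Post Direction |
| Secondary Unit Designator | A15 | Includes (in preferred order): Secondary Unit Type (Unit, Bldg., Apt., etc.), Secondary Unit Number |
| Place Name | A28 | City or other place/locality Name |
| State Code | A2 | USPS alpha state code |
| Zip Code | N5 | 5-digit Zip code |
| Zip4 Code | N4 | 4-digit Zipcode extension |
| International Postal Code | A14 | Space for postal codes that may vary in length (optional)(3) |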
Geocode Data Format
| Field | Format | Description |
| FIPS State Numeric Code | N2 | FIPS State numeric code |
| FIPS County Code | N3 | FIPS County code |
| MCD/CCD Code | N5 | Census location code (MCD applies only to some locations) |
| FIPS Place Code | N5 | FIPS place code(5). Includes city, town, village, rural area, and other locality names. |
| Congressional District Code | N3 | Congressional District for the current Census period. |
| MSA Code | N4 | Census Metropolitan Statistical Area code(6) |
| Census Tract Code | N6.2 | Census Tract - a well-defined geographical area (finer identification than place code). In some areas the equivalent is called Block Numbering Area (BNA) |
| Block Code | A5 | Census Block - a set of street addresses. (The first digit of Block is Block Group.) |
| Latitude | N7.4 | Map coordinate. |
| Longitude | N8.4 | Map coordinate. |
| Geocode Basis Code | A1 | What unit (e.g. street address, census tract, 5-digit Zip code) is represented by the latitude and longitude centroid. |
Census Place Code N4
Last Updated September 5, 1996 by info@kismeta.com