
Keywords: Data quality, data cleansing, data scrubbing, warehousing, legacy data, total
quality management.
1.0 Introduction to Data Quality and Clean-up
1.1 Purpose and Scope 1.2 Data Quality Defined 2.0 Data Quality Methods 2.1 Methods Strategy Outline 2.2 Data Quality Project Phases
Preliminary Problem Identification (Task 1.0)
Define Data Issues (Task 2.0)
Analyze (Task 3.0)
Improve (Task 4.0)
2.3 Roles and Responsibilities
3.0 Methods Suggestions by Project Type
3.1 Projects: Future Data Quality
Improve the Request for Data from Outside Organizations
Improve Data Provided by Outside Organizations
Data Entry
3.2 Projects: Legacy Data Cleaning
Generic Current Data Cleanup
Clean Legacy Data Within One System
Clean Legacy Data in a Set of Systems
Clean Legacy Data Export to a Warehouse
Clean Current Warehouse Data
Richard J. Orli
Based on a Public Document prepared for the US Gov't.,
this version © 1996 richard j. orli
1.0 Introduction to Data Quality and Clean-up
Data quality relies on a chain of events: what is asked for and how, how it is entered, what is edited and when, and finally how data is reviewed, stored, and used. Since data quality is affected by virtually every part of the business process, data quality is an integral part of what must be a Total Quality Management concept.
Clean data is the result of some combination of two efforts: making sure that data entered into the system are clean, and cleaning up problems once the data has been accepted. The first part is best interactive and immediate, although it may include some "close of business day" review and rework processes. The second type of effort is, fundamentally, a batch data cleanup exercise.
Clean data is also the result of a shared sense of responsibility and never-ceasing vigilance. Every time data is entered to the system, the person doing the entry should feel responsibility for and pride in the quality of the data and success of the mission. Managers responsible for a business area must be aware of their responsibilities for the data around which their work is structured. Business and technical experts must realize that anytime the business changes, the data might change. Responsibility is typically defined as a combination of data stewardship and ownership.
Data quality is one part of a larger data management process, which is concerned not only
with the quality but the accessibility of data. At many organizations, the data
administration function is the chief instrument for administrating data standards and
recommending data methodologies. However, every organization must participate in data
stewardship and ownership responsibilities.
1.1 Purpose and Scope
This paper briefly summarizes a basic strategy for proactive data cleansing. The main intent is to provide information that will be helpful in planning a data cleanup project. To that end, suggestions are provided in a "cookbook" format. It must be understood that each project will be unique, and the specifics of each plan will vary.
This data quality task is structured as one of problem identification and resolution. However, the same steps can be followed as part of a comprehensive data quality assurance program that identifies and corrects data issues before they become problems.
Data quality is defined, and the general methodology is outlined. The methodology stages
are supported with the various "aids" to help define the problem, assist with analysis steps,
and place the project. In addition, steps for each type of data quality project are
suggested.
1.2 Data Quality Defined
Quality data is, simply put, data that meets business needs. According to Andrew Ippilito, data has a number of quality characteristics:
Accuracy
1) The measure or degree of agreement between a data value (or set of values) and a source assumed to be correct.
2) A qualitative assessment of freedom from error.
Completeness
1) The degree to which values are present in the attributes that require them.
Consistency
1) Data are maintained so they are free from variation or contradiction.
2) The measure of the degree to which a set of data satisfies a set of constraints.
Timeliness
1) The extent to which a data item or multiple items are provided at the time required or specified
2) A synonym for currency, the degree to which specified values are up to date.
Uniqueness
1) The ability to establish the uniqueness of a data record (and data key values).
Validity
1) The quality of the maintained data is rigorous enough to satisfy the acceptance requirements of the classification criteria.
2) A condition where the data values pass all edits for acceptability, producing desired
results.
Data must be properly labeled and defined to be meaningful. (That is to say, the data
describing the data - metadata - must be accurate.) For example, if a field is labeled as
"Address," it ought not have a telephone number in it. It has been said that metadata is a
description of information that planners and designers wish would be put into the database
- it often happens that data fields are used for a variety of unintended purposes.
Quality data does not necessarily mean perfect data. It is essential to set quality expectations, especially in a warehouse setting where deliberate tradeoffs must often be made among speed, convenience and accuracy. For example, data may be updated weekly, and so results of a query may be some days out-of-date. Quality standards may be communicated in several ways, for example:
o Publish data transformation rules for each area (via metadata)
o Warn of potential missing data sources
o Clearly establish update schedules
o Publish accuracy and deviation results from controlled tests.
| Data Cleaning Issues by Major Data Categories
In order of cleanliness:
1) Quantities. Quantity in a database can be expressed as either integers or decimals. If
one data source has more decimal places than another, rounding, truncation, and
promotion rules will be needed. If units differ in databases (occupants as persons in one,
and heads of households (representing families) in another), translation can be difficult
or error-prone.
2) Encoded data. If the encoding scheme has changed over time, the early values may
have lost their original meaning. If the change is an expansion of the original encoding
scheme, this may not be a problem, but codes can be reused. An industry or
organization-standard code is beneficial, since it can be looked up. However standard
codes also change, and can be misused. The next most common change in codes is total
replacement with a new scheme. This problem can often be handled with a look-up
table.
3) Structured text. For example, addresses. Often of poor quality unless carefully
edited before and/or after data entry.
4) Free text. Requires interpretation to be usable in a database or warehouse context.
Source: Joe Chelko, Datamation, Oct. 95 |
Typical objectives of a data quality program include:
o Eliminate redundant data
o Validate data at input
o Eliminate false null values
o Ensure data values fall within defined domains
o Resolve conflicts in data
o Ensure proper definition and use of data values
o Establish and apply standards.
2.0 Data Quality Methods
and Approach
The strategy is basic: preliminary Problem Definition, followed by Data Issues
Identification, Analysis, and Improvement steps for each problem. This is illustrated in
Figure 2.1.
For each problem area, the data issues are identified in detail. An important part of the
process identifies exactly where in the data life cycle the problem originates. Often a data
quality problem requires two separate efforts, a project to correct data that already exists,
and a project to correct the cause behind the data problems.
The illustration of a typical data life cycle in Section 3, may be useful as a template against
which the problem origination point can be identified (described in detail in Section 3).
Once identified, each problem area can become a project or subproject.
Each area has a set of common problems, to which a specialized set of methods can be applied.
Each of the areas of the typical life cycle share common analysis needs and can be
approached in a similar way. Some methods for each area are suggested in Section 3.
2.2 Data Quality Project Phases
A systems life cycle view of the work typically required in a data quality project is presented in
Figure 2.3.
2.2.1 Preliminary Problem Identification (Task 1.0)
Task 1.1 Define preliminary issues and objectives
Task 1.2 Assign problem assessment resources.
2.2.2 Define Data Issues (Task 2.0)
Scope and Problem Identification
Task 2.1 Research data quality issues.
Task 2.2 Define objectives and scope
Formal statement based on management goals and committed resources, given the
knowledge of the system built through research conducted in the tasks above.
Task 2.3 Assess cost/benefits
Task 2.4 Assign Project Resources
Task 2.5 Finalize project plan
Prepare formal project plan with objectives, measures of success, and a schedule of
deliverables.
2.2.3 Analyze (Task 3.0)
Each of the areas of the typical life cycle share common analysis needs and can be
approached in a similar way. Methods specific to each area are summarized in Section 3.
Task 3.1 Analyze Data Quality
Task 3.2 Describe solution options, high level design
Task 3.3 Recommend solution
Based on the assessment of the options, recommend a solution.
2.2.4. Improve (Task 4.0)
Task 4.1 Design Detailed Technical Solution
Task 4.2 Acquire Tools If applicable.
Task 4.3 Implement Automated System Improvements. Implement the solution.
Task 4.4 Implement manual process improvements
Task 4.5 Test and verify
2.3 Roles and Responsibilities
Figure 2.5 explains the role of the
various participants in a data
quality project that has been
initiated (problem definition has
been completed).
Figure 2.5 Roles and
Responsibilities
Several types of participants must cooperate in a successful data cleanup project:
Functional and Business Users or Experts
These include originators and stewards of the data and the business processes that create
and use the data. These individuals also sponsor and represent management sponsors of
the data quality improvement project.
Quality Team
Staff delegated to accomplish the data quality effort. This may consist of individuals with
functional and systems expertise, as well as representatives from other organizations, such
as IT and other business areas with an interest in the data. The quality team may be
formally designated, or in a smaller project, may be simply the set of those individuals who
have been assigned the data quality task.
Systems Organizations
IT or technical staff within the business areas.
Continue to Section 3 (page under construction, please check again after Sept 20)
Return to KISMET Home Page.
Click to request more information!