Data Quality Methods






Keywords: Data quality, data cleansing, data scrubbing, warehousing, legacy data, total quality management.






Table of Contents

1.0 Introduction to Data Quality and Clean-up

1.1 Purpose and Scope

1.2 Data Quality Defined

2.0 Data Quality Methods

2.1 Methods Strategy Outline

2.2 Data Quality Project Phases

Preliminary Problem Identification (Task 1.0)

Define Data Issues (Task 2.0)

Analyze (Task 3.0)

Improve (Task 4.0)

2.3 Roles and Responsibilities

3.0 Methods Suggestions by Project Type

3.1 Projects: Future Data Quality

Improve the Request for Data from Outside Organizations

Improve Data Provided by Outside Organizations

Data Entry

3.2 Projects: Legacy Data Cleaning

Generic Current Data Cleanup

Clean Legacy Data Within One System

Clean Legacy Data in a Set of Systems

Clean Legacy Data Export to a Warehouse

Clean Current Warehouse Data

Data Quality Methods



Richard J. Orli

Based on a Public Document prepared for the US Gov't.,

this version © 1996 richard j. orli

1.0 Introduction to Data Quality and Clean-up



Data quality relies on a chain of events: what is asked for and how, how it is entered, what is edited and when, and finally how data is reviewed, stored, and used. Since data quality is affected by virtually every part of the business process, data quality is an integral part of what must be a Total Quality Management concept.

Clean data is the result of some combination of two efforts: making sure that data entered into the system are clean, and cleaning up problems once the data has been accepted. The first part is best interactive and immediate, although it may include some "close of business day" review and rework processes. The second type of effort is, fundamentally, a batch data cleanup exercise.
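The two efforts can be sketched as an interactive edit at entry time plus a batch pass over already-accepted data. A minimal illustration in Python (the field names and rules are hypothetical examples, not part of the original document):

```python
import re

def validate_on_entry(record):
    """Interactive check: list the problems that should block a record at entry."""
    errors = []
    if not record.get("name", "").strip():
        errors.append("name is required")
    if not re.fullmatch(r"\d{5}", record.get("zip", "")):
        errors.append("zip must be 5 digits")
    return errors  # an empty list means the record is clean

def batch_cleanup(records):
    """Batch pass over accepted data: repair what is mechanical, flag the rest."""
    repaired, flagged = [], []
    for rec in records:
        # Mechanical repair: trim and normalize the name field.
        rec = dict(rec, name=rec.get("name", "").strip().title())
        (flagged if validate_on_entry(rec) else repaired).append(rec)
    return repaired, flagged
```

The interactive check is reusable inside the batch pass, which keeps the two efforts consistent with a single set of rules.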

Clean data is also the result of a shared sense of responsibility and never-ceasing vigilance. Every time data is entered into the system, the person doing the entry should feel responsibility for, and pride in, the quality of the data and the success of the mission. Managers responsible for a business area must be aware of their responsibilities for the data around which their work is structured. Business and technical experts must realize that any time the business changes, the data might change. Responsibility is typically defined as a combination of data stewardship and ownership.

Data quality is one part of a larger data management process, which is concerned not only with the quality but also the accessibility of data. At many organizations, the data administration function is the chief instrument for administering data standards and recommending data methodologies. However, every organization must participate in data stewardship and ownership responsibilities.

1.1 Purpose and Scope

This paper briefly summarizes a basic strategy for proactive data cleansing. The main intent is to provide information that will be helpful in planning a data cleanup project. To that end, suggestions are provided in a "cookbook" format. It must be understood that each project will be unique, and the specifics of each plan will vary.

This data quality task is structured as one of problem identification and resolution. However, the same steps can be followed as part of a comprehensive data quality assurance program that identifies and corrects data issues before they become problems.

Data quality is defined, and the general methodology is outlined. The methodology stages are supported with various "aids" that help define the problem, support the analysis steps, and scope the project. In addition, steps for each type of data quality project are suggested.



1.2 Data Quality Defined

Quality data is, simply put, data that meets business needs. According to Andrew Ippolito, data has a number of quality characteristics:

Accuracy

1) The measure or degree of agreement between a data value (or set of values) and a source assumed to be correct.

2) A qualitative assessment of freedom from error.

Completeness

1) The degree to which values are present in the attributes that require them.

Consistency

1) Data are maintained so they are free from variation or contradiction.

2) The measure of the degree to which a set of data satisfies a set of constraints.

Timeliness

1) The extent to which a data item or multiple items are provided at the time required or specified.

2) A synonym for currency, the degree to which specified values are up to date.

Uniqueness

1) The ability to establish the uniqueness of a data record (and data key values).

Validity

1) The degree to which maintained data satisfy the acceptance requirements of the classification criteria.

2) A condition where the data values pass all edits for acceptability, producing desired results.

Data must be properly labeled and defined to be meaningful. (That is to say, the data describing the data - metadata - must be accurate.) For example, if a field is labeled as "Address," it ought not have a telephone number in it. It has been said that metadata is a description of information that planners and designers wish would be put into the database - it often happens that data fields are used for a variety of unintended purposes.
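The "telephone number in an Address field" problem lends itself to a mechanical audit: scan a field for values that match the pattern of some other data type. A minimal sketch (the phone-number pattern is a simplifying assumption covering common North American formats):

```python
import re

# Heuristic: a value that looks like a North American phone number.
PHONE_PATTERN = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def suspect_mislabeled_addresses(addresses):
    """Return values in an 'Address' field that look like phone numbers instead."""
    return [a for a in addresses if PHONE_PATTERN.match(a.strip())]
```

Flagged values are candidates for review, not automatic deletion, since a pattern match is only circumstantial evidence of misuse.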

Quality data does not necessarily mean perfect data. It is essential to set quality expectations, especially in a warehouse setting where deliberate tradeoffs must often be made among speed, convenience and accuracy. For example, data may be updated weekly, and so results of a query may be some days out-of-date. Quality standards may be communicated in several ways, for example:

o Publish data transformation rules for each area (via metadata)

o Warn of potential missing data sources

o Clearly establish update schedules

o Publish accuracy and deviation results from controlled tests.

Data Cleaning Issues by Major Data Categories

In order of cleanliness:

1) Quantities. Quantity in a database can be expressed as either integers or decimals. If one data source carries more decimal places than another, rounding, truncation, and promotion rules will be needed. If units differ between databases (occupancy counted as persons in one and as heads of households, representing families, in another), translation can be difficult or error-prone.

2) Encoded data. If the encoding scheme has changed over time, the early values may have lost their original meaning. If the change is an expansion of the original scheme this may not be a problem, but codes can also be reused. An industry- or organization-standard code is beneficial, since it can be looked up. However, standard codes also change, and can be misused. The other common change is total replacement of the codes with a new scheme; this problem can often be handled with a look-up table.

3) Structured text. For example, addresses. Often of poor quality unless carefully edited before and/or after data entry.

4) Free text. Requires interpretation to be usable in a database or warehouse context.

Source: Joe Celko, Datamation, Oct. 95
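The first three categories above are tractable mechanically. A minimal Python sketch of each (the precision, code table, and abbreviation table are hypothetical examples; a real project would publish its rules as metadata and use a postal addressing standard):

```python
from decimal import Decimal, ROUND_HALF_UP

# 1) Quantities: bring values from sources with different precision to a
#    common scale. The rounding rule chosen should be published as metadata.
def reconcile_quantity(value, places=2):
    return Decimal(str(value)).quantize(Decimal(10) ** -places,
                                        rounding=ROUND_HALF_UP)

# 2) Encoded data: translate legacy codes via a look-up table; codes the
#    table cannot map are flagged for review, not guessed.
LEGACY_TO_CURRENT = {"01": "ACTIVE", "02": "INACTIVE", "03": "PENDING"}

def translate_codes(codes):
    translated = [LEGACY_TO_CURRENT[c] for c in codes if c in LEGACY_TO_CURRENT]
    unmapped = [c for c in codes if c not in LEGACY_TO_CURRENT]
    return translated, unmapped

# 3) Structured text: normalize an address (case, whitespace, suffix
#    abbreviations) before comparing or merging records.
STREET_ABBREV = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def normalize_address(raw):
    words = raw.upper().replace(",", " ").split()
    return " ".join(STREET_ABBREV.get(w, w) for w in words)
```

Free text, the fourth category, resists this kind of rule-based treatment and generally requires human interpretation.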







Typical objectives of a data quality program include:

o Eliminate redundant data

o Validate data at input

o Eliminate false null values

o Ensure data values fall within defined domains

o Resolve conflicts in data

o Ensure proper definition and use of data values

o Establish and apply standards.
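Several of these objectives are mechanically checkable. A sketch of three of them together (the state domain, false-null strings, and record fields are hypothetical examples):

```python
VALID_STATES = {"VA", "MD", "DC"}          # hypothetical domain for a state field
FALSE_NULLS = {"", "N/A", "NONE", "TBD"}   # placeholder strings posing as data

def audit(records):
    """Apply three objectives: domain checks, false-null detection,
    and duplicate elimination by record key."""
    seen, unique, problems = set(), [], []
    for rec in records:
        if rec["state"] not in VALID_STATES:
            problems.append((rec["id"], "state outside domain"))
        if rec["city"].strip().upper() in FALSE_NULLS:
            problems.append((rec["id"], "false null in city"))
        if rec["id"] in seen:
            problems.append((rec["id"], "duplicate key"))
        else:
            seen.add(rec["id"])
            unique.append(rec)
    return unique, problems
```

The problem list doubles as input for the analysis phase: each tuple names the record and the objective it violates.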









2.0 Data Quality Methods and Approach



2.1 Methods Strategy Outline

The strategy is basic: preliminary Problem Definition, followed by Data Issues Identification, Analysis, and Improvement steps for each problem. This is illustrated in Figure 2.1.

[Figure 2.1: Methods Strategy (not reproduced)]
For each problem area, the data issues are identified in detail. An important part of the process is identifying exactly where in the data life cycle the problem originates. Often a data quality problem requires two separate efforts: a project to correct the data that already exist, and a project to correct the cause behind the data problems.

The illustration of a typical data life cycle in Section 3 may be useful as a template against which the problem origination point can be identified. Once identified, each problem area can become a project or subproject.

Each life-cycle area has a set of common problems, to which a specialized set of methods can be applied, and the areas share common analysis needs that can be approached in a similar way. Methods for each area are suggested in Section 3.





2.2 Data Quality Project Phases

A systems life cycle view of the work typically required in a data quality project is presented in Figure 2.3.

2.2.1 Preliminary Problem Identification (Task 1.0)

Task 1.1 Define preliminary issues and objectives

Task 1.2 Assign problem assessment resources.

2.2.2 Define Data Issues (Task 2.0)

Scope and Problem Identification

Task 2.1 Research data quality issues.

Task 2.2 Define objectives and scope

A formal statement based on management goals and committed resources, informed by the knowledge of the system gained through the research conducted in the tasks above.

Task 2.3 Assess cost/benefits

Task 2.4 Assign Project Resources

Task 2.5 Finalize project plan

Prepare formal project plan with objectives, measures of success, and a schedule of deliverables.

2.2.3 Analyze (Task 3.0)

The areas of the typical life cycle share common analysis needs and can be approached in a similar way. Methods specific to each area are summarized in Section 3.

Task 3.1 Analyze Data Quality

Task 3.2 Describe solution options, high level design

Task 3.3 Recommend solution

Based on the assessment of the options, recommend a solution.

2.2.4. Improve (Task 4.0)

Task 4.1 Design Detailed Technical Solution

Task 4.2 Acquire Tools, if applicable.

Task 4.3 Implement Automated System Improvements.

Task 4.4 Implement manual process improvements

Task 4.5 Test and verify
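One way to make Task 4.5 concrete is to re-run the quality checks used during analysis against the cleaned data and compare failure rates before and after the improvement. A minimal sketch (the `has_zip` check and the sample records are hypothetical):

```python
def error_rate(records, check):
    """Fraction of records failing a quality check; used to verify a cleanup."""
    if not records:
        return 0.0
    return sum(1 for r in records if not check(r)) / len(records)
```

A cleanup that is working should show the rate falling toward the quality target set in the project plan; a rate that does not move points back to the analysis phase.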

2.3 Roles and Responsibilities

Figure 2.5 explains the role of the various participants in a data quality project that has been initiated (problem definition has been completed).



Figure 2.5 Roles and Responsibilities



Several types of participants must cooperate in a successful data cleanup project:

Functional and Business Users or Experts

These include originators and stewards of the data and of the business processes that create and use the data. These individuals also sponsor the data quality improvement project and represent its management sponsors.

Quality Team

Staff delegated to accomplish the data quality effort. This may consist of individuals with functional and systems expertise, as well as representatives from other organizations, such as IT and other business areas with an interest in the data. The quality team may be formally designated, or in a smaller project, may be simply the set of those individuals who have been assigned the data quality task.

Systems Organizations

IT or technical staff within the business areas.












Last Updated on September 3, 1996. By: info@kismeta.com