Key words: data extraction, data scrubbing, data migration, data warehousing, database modeling
Data Extraction, Cleansing and Migration Tools
to support
Data Warehouse, Database Consolidation, and Systems Reengineering Projects
Based on a public US Gov't document. Authors: R. Orli, F. Santos; this version © 1996 richard j.
orli
Summary. This information is intended to facilitate data extraction, cleansing and migration methods selection, with particular emphasis on tools. The main body of the document briefly summarizes issues and makes recommendations. The appendix presents an overview of applicable commercial tools. Our recommendation is to select a suite of tools, each one of which meet specific needs.
The Issue. The process of developing several data warehouses and other database projects has highlighted the need to effectively and efficiently manage the extraction, cleansing, transformation and migration of data from legacy systems. Effectiveness is necessary whenever the data is of great value (which is usually). Efficiency is necessary, because the long term investment of resources in these activities can be high.
Why tools? Tools can help achieve technical efficiency, and can perform data quality tasks that would otherwise be impractical. Just as important, tools such as Prism also provide a structure to the migration process that can be leveraged as an effective management tool.
The Question. What tools and techniques are most appropriate to assist with data extraction, Cleansing and Migration to support Data Warehouse, Database Consolidation, and Systems Reengineering Projects, given the business problem and environment?
The Decision Process. The purpose is not to document a decision that has been made but to lay out the issues to facilitate discussion, and to make recommendations in support of the decision process.
There is not one decision to make, but several in a series. Most of the decisions are not about technical choices, but about management choices. On the face of it, the key decision seems to be: do we need a
But in fact, both are needed. In an emergency situation, the "quick solution" is necessary. Also,
if it is not clear if the finished warehouse or database as initially conceived will meet needs, the
"quick solution" functions as a test or prototype, that can be followed by a more systematic
solution. However, systematic data extraction and cleaning solution of some sort will be needed
for most problems (problems discussed in the *Data Quality Methods* and Procedures white
paper). The only real question is when, and how capable and complicated it must be. A cost
benefits analysis can help address how automated and capable the eventual solution should be.
The essential decision will address the
question: given a business need and set of
environmental conditions, what is the best
technical solution that will yield the best
costs to benefits ratio? The source of
information to that decision is the set of
answers to the following questions about
business needs and conditions:
Figures 1 and 2 model a decision flow that incorporates the above questions. The purpose of this model is to demonstrate the rational behind the selection of a set of tools with varying strengths and capabilities. Figure 1 highlights the differences in tool applicability among three different tasks: on-line data entry, on-line querying, and batch transport. On-line data entry and querying are not the focus of this paper. Batch transport is expanded upon in Figure 2, featuring decisions relevant to batch data extraction, cleaning, transformation and transport.
Each branch in the diagram represents a decision branch, and each box represents a tool category.
A specific recommended tool is listed in the diagram for each tool category. The sections below
list other tools that fit within that category. Figure 2 also features three tool "super categories":
Analysis, Data Quality and Scrubbing, and Extract Transport.
Taking Action. Only a set of tools will meet all requirements. Unfortunately, there is no one tool package that addresses that extremely large number of issues pertaining to data extraction, cleaning, and transport. As illustrated in Figure 2, different tools specialize in addressing different issues. The following tools each fit a specific need that must be addressed.
Glossary
Extraction, Transformation and Migration Tools, PART II
Return to KISMET Home Page