Data Extraction, Transformation, and Migration Tools


Key words: data extraction, data scrubbing, data migration, data warehousing, database modeling




Data Extraction, Cleansing and Migration Tools

to support Data Warehouse, Database Consolidation, and Systems Reengineering Projects

Part 2

© 1996 Richard J. Orli




Return to Part 1

Commercial Tools Review

Tool Categories

Tool Characteristics

Life Cycle niches

Brief Tool Descriptions




Commercial Tools Review

Most of the tools help to automate a process that can be done manually or with other, less specialized tools such as COBOL, so there are several make-or-buy decisions that must be resolved through a cost-benefit assessment. However, the specialized tools on the market seem to meet most, if not all, of their marketing promises.

The paragraphs below review first the classes of tools available, and then the features of several representative tools.

The biggest challenge in data migration is usually not technical but managerial. For a one-time extraction, such as one needed to support a database redesign, the biggest challenge may be the analysis involved, or perhaps the technical aspects of the extract and transport. For routine extractions, such as monthly uploads to a warehouse, managing the extraction process is the biggest problem and expense. Figure 1 illustrates this point.

Several types of tool are specifically designed to aid in the management of routine extractions, filtering, and transport.

Other tools support these objectives by helping with the analysis process that necessarily precedes extraction and migration system design. Still other tools specialize in various aspects of the migration process life cycle.

Table 1 briefly describes tool categories and functions, and lists some representative tools within each category. In addition, the final column is a subjective assessment of how successful the current tool technology in each category is in achieving its promise. (For example, it can be argued that computer chess play is very good, but that the current technology for computer "artificial intelligence" poetry writing has a long way to go).



Table 1. Tool Categories

Data Reverse Engineering - Metadata based
Example tools: Bachman; LogicWorks ERWIN/ERX; Embarcadero ER/1; Kismet KisMeta
Function: Process metadata to document systems and abstract business rules and relationships. (High-end tools also derive some logical and conceptual schema information.)
Achieves promise? Fair (1)

Data Reverse Engineering - Data content based
Example tools: Vality Integrity; QDB Analyze; DB Star; WizRule
Function: Process data content in conjunction with metadata to abstract business rules and relationships automatically.
Achieves promise? Fair (2)

Batch Export/Transport - Parameter-based extraction code generators
Example tools: Carleton Passport; ETI Extract; Prism Warehouse Manager
Function: Extraction is centrally controlled by parameters, and extraction code programs are generated automatically. The tools accommodate data conversion and abstraction as well as transport; conversion and abstraction assume a known and consistent format for source data. They offer sophisticated tracking and management of replication, a Prism strong point.
Achieves promise? Very Good

Batch Export/Transport - You-write extraction code
Example tools: 3GL/4GL (COBOL et al.); Platinum InfoRefiner; Platinum InfoPump; Praxis OmniReplicator
Function: Container for extraction/conversion code; interfaces with databases. Modest replication-tracking capabilities. (Transformation and the range of import/export targets are more capable than in the replication tools below.)
Achieves promise? Good to Very Good

Replication
Example tools: IBM DataPropagator; Sybase Replication Server
Function: Designed specifically to manage replicated data (distributed database or warehouse). May include export/transport functions, or may use other tools. (Limited transformation and import/export capabilities.)
Achieves promise? Very Good

Middleware - On-line/interactive extract/transport (parameter or you-write)
Example tools: CA Enterprise Access; Platinum InfoHub; Praxis OmniReplicator; Sybase Enterprise Connect; IBM DataJoiner; Intersolv Sequelink
Function: Similar in concept to batch export/transport tools, but supports queries, including on-line queries, by creating a "virtual database" and automating the interface between several source system platforms and the user's query tools.
Achieves promise? Very Good

Data Content Quality - Filter based
Example tools: Apertus; Trillium
Function: Positioned between export and import, these tools support parameter-based data filtering. They specialize in, and are more capable at, managing relationships and transformation than tools such as Prism. They can be used to keep data with different keys in alignment, to avoid misinterpretation during queries.
Achieves promise? Good

Data Content Quality - Relationship based
Example tools: Vality; DB Star; WizRule; IDI
Function: Data quality is evaluated based on data content. The data patterns, rules, and relationships discovered help analysts determine data quality problem areas.
Achieves promise? Fair (3)

Special Purpose Data Quality
Example tools: PostalSoft ACE; Group 1 Nadis; SSA
Function: Data quality for special purposes, such as name/address correction, pharmaceutical naming conformity, etc.
Achieves promise? Very Good

Data Quality - Special purpose data entry support
Example tools: PostalSoft Library; Mailers +4
Function: Automatic editing of address data as it is entered on-line (including street name, address range, ZIP, etc.). The tools are incorporated into data entry screens as class libraries, as either client/server or client solutions.
Achieves promise? Very Good

Data Translation
Example tools: Data Junction; Cambio
Function: Data format translation aids, for use in conjunction with other processes.
Achieves promise? Very Good










Methods for Processing Data. Data can be processed for quality, or distributed, either in batch mode or for individual transactions. The two approaches are quite different in concept and are handled separately in the balance of this discussion. Most tools do one or the other; some, such as PostalSoft ACE, have separate components to handle each.

Batch movement should not be confused with what might be termed bulk movement. In the extraction and transport process, data is moved either in bulk, in which the entire database is refreshed periodically, or selectively, in which only changed data is refreshed. In general, the selective option is preferred if the tool kit is sophisticated enough to handle the management and data-integrity issues involved. Almost all extract and transport tools do bulk moves, and all of our example tools manage selective updates as well.
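
The difference is easy to see in a small sketch. The Python fragment below contrasts a bulk reload with a selective (changed-data) refresh, written against generic DB-API cursors; the table names, the updated_at column, and the SQLite-style upsert are assumptions made for illustration, not the mechanism of any particular tool.

    from datetime import datetime

    def bulk_refresh(src, tgt):
        """Bulk move: empty the warehouse table and reload every source row."""
        tgt.execute("DELETE FROM customer_dim")
        rows = src.execute("SELECT cust_id, name, city FROM customer").fetchall()
        tgt.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)", rows)

    def selective_refresh(src, tgt, last_extract_time):
        """Selective refresh: move only rows changed since the previous extract."""
        rows = src.execute(
            "SELECT cust_id, name, city FROM customer WHERE updated_at > ?",
            (last_extract_time,),
        ).fetchall()
        for cust_id, name, city in rows:
            # A SQLite-style upsert keeps the target consistent without a full
            # reload, but the load process must now track keys and extract times,
            # which is exactly the management burden described above.
            tgt.execute(
                "INSERT INTO customer_dim (cust_id, name, city) VALUES (?, ?, ?) "
                "ON CONFLICT(cust_id) DO UPDATE SET name=excluded.name, city=excluded.city",
                (cust_id, name, city),
            )
        return datetime.now()  # recorded as last_extract_time for the next run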

Table 2 compares selected characteristics of representative tools.

Another consideration in overall environment management is whether the metadata catalog used is open or proprietary. None of the illustrated tools is entirely "closed", but some use proprietary structures that can share information only with difficulty.

Table 2. Characteristics of Representative Tools (draft; some data under review)

Batch-Oriented Data Extraction, Cleansing and Migration Tools
Columns: Tool Name | Subjective Rank (4) | Best for Exceptional or Repetitive | User Interface/Control Type | Mainframe/Client Server | Migration Management Capability | Installation Pain

Data Reverse Engineering - Metadata based:
Bachman | 2 | E | P | C | n/a | ok
ERWIN/ERX | 1 | E | P | C | n/a | ok
KisMeta (the author's own product) | - | E | P | C | n/a | ok
PowerSoft | 3 | E | P | C | n/a | ok
Embarcadero ER/1 | 4 | E | P | C | n/a | ok

Data Reverse Engineering - Data content based:
Vality Integrity | 1 | E | P/M | M | n/a | Pain
QDB Analyze | 2 | E | M | M | n/a | Pain
WizRule | 3 | E | P | C | n/a | ok

Batch Export/Transport - Parameter-based extraction code generators:
Carleton Passport | 2 | R | P | M | Active | so-so
ETI Extract | 3 | R | P | M | Passive | so-so
Prism | 1 | R | P | C | Active | so-so

Batch Export/Transport - You-write extraction code:
3GL/4GL (COBOL et al.) | 3 | E/R | M | M/C | None | so-so
Platinum InfoRefiner | 2 | R | P/M | M | Passive | so-so
Platinum InfoPump | 1 | R | P/M | C | Passive | so-so

Replication:
Praxis OmniReplicator | 1 | R | P | C | Active | so-so
IBM DataPropagator | 3 | R | P/M | M | Passive | so-so
Oracle 7 Symmetric Replication | 2 | R | P | C | Active | so-so

Data Content Quality - Filter based:
Apertus | 1 | E/R | P | C | Passive | so-so
Trillium | 2 | E/R | P | M | Passive | so-so

Data Content Quality - Relationship based:
Vality Integrity | 1 | E | P/M | M | n/a | Pain
WizRule | 3 | E | P | C | n/a | ok
IDI | 2 | E | P | C | n/a | ok

Special Purpose Data Quality:
PostalSoft ACE | 1 | E/R | P | M/C | n/a | so-so
SSA | 1 | E/R | P/M | M | n/a | so-so
Group 1 Nadis | 2 | E/R | P | M | n/a | so-so

On-Line/Real-Time Data Extraction, Cleansing and Migration Tools
Columns: Tool Name | Subjective Rank | Best for Exceptional or Repetitive | User Interface/Control Type | Mainframe/Client Server | Distribution Management Capability | Installation Pain

Middleware - On-line/interactive extract/transport:
Platinum InfoHub | 1 | R | P | C | Active | so-so
Praxis OmniReplicator | 2 | R | P | C | Active | so-so
Sybase Enterprise Connect | 3 | R | P | C | Active | so-so
IBM DataJoiner | 4 | R | P | C | Active | so-so
Intersolv Sequelink | 3 | R | P | C | Active | so-so

Data Quality - Data entry support:
PostalSoft Library | 2 | R | P | C | n/a | so-so
Mailers +4 | 1 | R | P | C | n/a | ok






LEGEND

Subjective Rank 1= best

Best for Exceptional or Repetitive (tasks) E= Exceptional, R = repetitive, E/R = both

User Interface/Control Type P= Parameter/Windows object driven, M=Manual, P/M= Combination

Mainframe/Client Server M=Mainframe, C= Client Server

Migration Management Capability Active= Active to most typical data, Passive=metadata is documentation

Installation Pain Ok=Good, so-so= Average, Pain=Major investment.











Table 3 illustrates the data migration life cycle niche that each representative tool occupies:

Table 3. Life Cycle Niche of Representative Tools
(Life cycle phases: Modeling/Research, Extract, Clean, Abstract, Transport/Load)

Carleton Passport: !!! !! !! !!!
Prism: !!! !! !! !!!
3GL/4GL (COBOL, etc.): ! ! ! !
Platinum InfoPump: !! ! !!!
Platinum InfoRefiner: !!! ! ! !!!
Platinum InfoHub: !!! ! !! !!!
Vality Integrity: !!! !
PostalSoft ACE: !!! !
Apertus: !! !!! !!! !
KisMeta: !!! !
ERWin ERX: !!!

Legend: ! = facilitates, !! = some automation, !!! = automated assistance; blank = no features.




The highest level of perfection in a given extraction can be achieved only with custom programs. But custom programs are expensive, and dozens of custom programs in different environments are almost impossible to manage at the enterprise level. It is fair to say that a collection of 50 technically excellent individual extraction programs can collectively be a technical disaster because of management complexity. Most commercial tools offer not only technical assistance but also a management structure that is critical for overall success.


Commercial data tools available include replicators and data pumps. Replicators are similar to data pumps but differ primarily in their inability to deal with heterogeneous environments. Replicators are most useful for duplicating data from one location into another location's database, usually with the same vendor's database product; for this reason, their data transformation capabilities are also usually very limited. Data pumps are specifically designed to be platform independent and provide substantial transformation services, and all include some type of scripting or programming capability. (These terms are not used consistently - some tools marketed as data replicators have data pump capabilities.)

Some tools, such as Prism, Apertus and Passport, allow the user to control most export and transformation activities through parameters and visual scripts - "advanced filtering". Other tools, such as InfoPump, act as shells into which custom extraction and filtering programs written in C or COBOL are inserted. Parameter-driven tools, while often requiring a larger up-front investment, offer a substantial management convenience in a large heterogeneous environment. This management advantage is of great potential value.
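
The contrast is easiest to see side by side. In the Python sketch below, the first half describes an extraction as a parameter set of the kind a Prism- or Passport-style tool would interpret or generate code from, and the second half is the equivalent hand-written routine an InfoPump-style shell would host. The field names and the apply_mapping helper are invented for illustration and are not the actual syntax of any of these products.

    # Parameter-driven style: the extraction is described as data, and the tool
    # interprets it (or generates code from it). Analysts maintain the mapping,
    # not the program.
    CUSTOMER_MAPPING = {
        "source": "LEGACY.CUSTMAST",
        "target": "WAREHOUSE.CUSTOMER_DIM",
        "fields": [
            {"from": "CM-CUST-NO", "to": "cust_id", "transform": "int"},
            {"from": "CM-NAME",    "to": "name",    "transform": "strip"},
            {"from": "CM-ST-CODE", "to": "state",   "transform": "upper"},
        ],
    }

    TRANSFORMS = {"int": int, "strip": str.strip, "upper": str.upper}

    def apply_mapping(record, mapping):
        """Interpret the mapping parameters for one source record."""
        return {
            f["to"]: TRANSFORMS[f["transform"]](record[f["from"]])
            for f in mapping["fields"]
        }

    # "You-write" style: the same logic coded by hand. Fine for one extract,
    # but fifty of these across platforms become a management problem.
    def extract_customer(record):
        return {
            "cust_id": int(record["CM-CUST-NO"]),
            "name": record["CM-NAME"].strip(),
            "state": record["CM-ST-CODE"].upper(),
        }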

Most of these commercial tools offer some way in which data can be filtered for data quality during the extraction and transformation process. However, several specialty data cleaning or scrubbing tools (Apertus and Trillium) have more sophisticated means toward this end.
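
What "filtering against domains and ranges of legal values" means in practice can be suggested with a small sketch. The rule set and field names below are invented for the example; real scrubbing tools such as Apertus or Trillium express equivalent rules through their own parameter interfaces and handle far more cases.

    # Hypothetical domain and range rules of the kind a filter-based
    # cleansing tool applies between export and import.
    RULES = {
        "state":  {"domain": {"VA", "MD", "DC"}},   # legal code list
        "age":    {"range": (0, 120)},              # legal numeric range
        "status": {"domain": {"A", "I"}},           # Active / Inactive
    }

    def check_record(record):
        """Return the rule violations for one record (an empty list means clean)."""
        problems = []
        for field, rule in RULES.items():
            value = record.get(field)
            if "domain" in rule and value not in rule["domain"]:
                problems.append(f"{field}: {value!r} not in legal domain")
            if "range" in rule:
                low, high = rule["range"]
                if value is None or not (low <= value <= high):
                    problems.append(f"{field}: {value!r} outside {low}-{high}")
        return problems

    # A record with a bad state code and an out-of-range age is flagged,
    # not silently loaded into the warehouse.
    print(check_record({"state": "ZZ", "age": 240, "status": "A"}))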

The best data quality tools are very targeted in scope. For example, SSA does names very well, PostalSoft and Group 1 do addresses very well, and none of them does much else. Since names and addresses represent a significant data quality concern at many organizations, these tools are of wide value. Other specialized tools exist for specific applications, such as pharmaceuticals and civil engineering.

Other tools can do an analysis of a batch of data, but the information gleaned usually requires substantial manual analysis to interpret and custom programming to correct. Vality and QDB do data content analysis, using pattern matching, lexical analysis and user-specified rules.
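
A toy version of that pattern-matching analysis gives the flavor: each value is reduced to a character-class pattern, and the distribution of patterns points the analyst at the outliers worth investigating. The data and the pattern function below are invented for illustration and only hint at what Vality or QDB actually do.

    from collections import Counter

    def pattern(value):
        """Reduce a value to a character-class pattern (9 = digit, A = letter)."""
        return "".join(
            "9" if ch.isdigit() else "A" if ch.isalpha() else ch
            for ch in value
        )

    # Profiling one column of a legacy extract (invented phone-number data).
    values = ["703-555-0142", "202-555-0199", "5550123", "n/a", "301-555-0111"]
    profile = Counter(pattern(v) for v in values)
    print(profile.most_common())
    # The dominant pattern is 999-999-9999; the minority patterns are the records
    # an analyst would investigate and a custom program would correct.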




1. "Logicals" are generated more as documentation of physical structures than as a conceptual model of the business in the top-down Information Engineering sense.

2. While the tools can do unique and valuable analysis, the product is not a finished model or even refined information, but usually just additional raw material that an analyst can use to help create a model.

3. The product is generally not clean data, but a list of possible problems that must be manually investigated and corrected.

4. Subjective Rank: 1 = best. The remaining column codes (Exceptional/Repetitive, interface type, platform, management capability, and installation pain) are defined in the legend following Table 2.



DATA MIGRATION TOOLS - Brief Descriptions of Representative Tools

Apertus Technologies, Enterprise Integrator Tool

Users write transformation rules. Data is filtered against domains and ranges of legal values and compared to other data structures.

Bachman

Modeling and reverse engineering tool. The reverse engineering component derives a consistent set of metadata from several potential source systems' metadata.

Carleton, Passport

Users enter extraction and transformation parameters, and data is filtered against domains and ranges of legal values.

DBStar, Migration Architect

High-End rules and relationships discovery.

Embarcadero, ER/1

Departmental modeling and reverse engineering tool. The reverse engineering component derives a consistent set of metadata from several potential source systems' metadata.

Evolutionary Technology (ETI), EXTRACT

Users write transformation rules. Data is filtered against domains and ranges of legal values and compared to other data structures.

Information Discovery, IDI

Low-end rule discovery.

Kismet Analytic, KisMeta Workbench

Enterprise data modeling and reverse engineering tool. The tool derives a unified conceptual schema, and rationalizes and cross-indexes all of the enterprise's metadata. It evaluates metadata with lexical analysis, natural language interpretation, and statistical techniques.

LogicWorks, ERWIN ERX

Departmental modeling and reverse engineering tool. The reverse engineering component derives a consistent set of metadata from several potential source systems' metadata.

Oracle, Symmetric Replicator

A data replication product designed to extract data from several platforms, perform some filtering and transformation, and distribute and load to another database or databases.

Platinum, InfoRefiner

A data pump product designed to extract data from several mainframe platforms, perform some filtering and transformation, and distribute and load to another mainframe platform database. The extraction uses custom code modules written in COBOL. This is a Mainframe/MVS based tool.

Platinum, InfoPump

A data pump product designed to extract data from several mainframe and client server platforms, perform some filtering and transformation, and distribute and load to another mainframe platform database. Requires InfoHub for most services. The extraction uses custom code modules. This is a client/server based tool.

Platinum, InfoHub

A middleware product designed to establish a permanent relationship (including filtering and transformation) between source systems and a logical model. The logical model is then available as a virtual database to end-user query tools or a data migration product such as InfoPump.

Praxis, Omni Replicator

A data replication product designed to extract data from several platforms, perform some filtering and transformation, and distribute and load to another database or databases.

Prism Solutions, Warehouse Manager

Users enter extraction and transformation parameters, and data is filtered against domains and ranges of legal values.

PostalSoft, ACE

Specialty address correction tool. Uses a "brute force" strategy of comparing each address against a USPS database of over one hundred million valid US addresses.

PostalSoft, Library

A class library of callable routines that developers use to create "address-smart" data entry screens that automatically edit most address characteristics.

Group 1

Specialty address correction tool. Uses a "brute force" strategy of comparing each address against a USPS database of over one hundred million valid US addresses.

Mailers +4

A class library of callable routines that developers use to create "address-smart" data entry screens that automatically edit most address characteristics.

QDB Solutions, QDB/Analyze

High-End rules and relationships discovery.

Search Software America (SSA)

Name and address data quality tool. Similar in technology to Vality Integrity, SSA evaluates data with lexical analysis tied to a database of known name frequency distribution in a given population.

Vality Technology, Integrity

Integrity evaluates data with lexical analysis rather than with rules or look-up tables. Best for parsing unstructured text fields.

WizSoft, WizRule

Low-end rule-discovery.



Glossary of Terms




Return to Table of Contents


Last Updated September 14, 1996 by info@kismeta.com