Key words: data extraction, data scrubbing, data migration, data warehousing, database modeling
Data Extraction, Cleansing and Migration Tools
to support Data Warehouse, Database Consolidation, and Systems Reengineering Projects
© 1996 Richard J. Orli
Return to Part 1
Commercial Tools Review
Tool Categories
Tool Characteristics
Life Cycle Niches
Brief Tool Descriptions
Commercial Tools Review
Most of the tools help automate a process that can otherwise be done manually or with less specialized tools, such as COBOL. There are several possible make-or-buy decisions that must be resolved through a cost-benefit assessment. However, the specialized tools on the market seem to meet most, if not all, of their marketing promises.
In the paragraphs below, we review first the classes of tools available and then the features of several representative tools.
The biggest challenge in data migration is not technical but managerial. For one-time
data extraction, such as that needed to support a database redesign, the biggest challenge may be the
analysis involved, or perhaps the technical aspects of the extract and transport. For routine
extractions, such as monthly uploads to a warehouse, managing the extraction process is the
biggest problem and expense. Figure 1 illustrates this point.
Several types of tools are specifically designed to aid in the management of routine extractions, filtering, and transport.
Other tools support these objectives by helping with the analysis process that necessarily precedes extraction and migration system design. Still other tools specialize in various aspects of the migration process life cycle.
Table 1 briefly describes tool categories and
functions, and lists some representative tools
within each category. In addition, the final
column is a subjective assessment of how
successful the current tool technology in each category is in achieving its promise. (For example,
it can be argued that computer chess play is very good, but that the current technology for
computer "artificial intelligence" poetry writing has a long way to go).
Table 1. Tool Categories
Tool Category | Example Tools | Tool Function | Achieves Promise?
Data Reverse Engineering - Metadata based | Bachman, LogicWorks ERWIN/ERX, Embarcadero ER/1, Kismet KisMeta | Process metadata to document systems and abstract business rules and relationships. (High-end tools also derive some logical and conceptual schema information.) | Fair (1)
Data Reverse Engineering - Data content based | Vality Integrity, QDB Analyze, DB Star, WizRule | Process data content in conjunction with metadata to abstract business rules and relationships automatically. | Fair (2)
Batch Export/Transport - Parameter-based extraction code generators | Carleton Passport, ETI Extract, Prism Warehouse Manager | Extraction is centrally controlled by parameters; extraction code programs are automatically generated. The tools accommodate data conversion and abstraction, as well as transport. Conversion and abstraction assume a known and consistent format for source data. Tools offer sophisticated tracking and management of replication, a Prism strong point. | Very Good
Batch Export/Transport - You-write extraction code | 3GL/4GL (COBOL et al.), Platinum InfoRefiner, Platinum InfoPump, Praxis OmniReplicator | Container for extraction/conversion code; interfaces with databases. Modest replication tracking capabilities. (Transformation and range of import/export targets are more capable than those of the replication tools below.) | Good to Very Good
Replication | IBM DataPropagator, Sybase Replication Server | Designed specifically to manage replicated data (distributed database or warehouse). May include export/transport functions, or may use other tools. (Limited transformation and import/export capabilities.) | Very Good
Middleware - On-line/interactive extract/transport (Parameter vs. U-write?) | CA Enterprise Access, Platinum InfoHub, Praxis OmniReplicator, Sybase Enterprise Connect, IBM DataJoiner, Intersolv SequeLink | Similar to batch export/transport tools in concept, but supports queries, including on-line queries, by creating a "virtual database" and automating the interface between several source system platforms and the user's query tools. | Very Good
Data Content Quality - Filter based | Apertus, Trillium | Positioned between export and import, these tools support parameter-based data filtering. They specialize in and are more capable at managing relationships and transformation than tools such as Prism. They can be used to keep data with different keys in alignment, to avoid misinterpretation during queries. | Good
Data Content Quality - Relationship based | Vality, DB Star, WizRule, IDI | Data quality is evaluated based on data content. The data patterns, rules, and relationships discovered help analysts determine data quality problem areas. | Fair (3)
Special Purpose Data Quality | PostalSoft ACE, Group 1 Nadis, SSA | Data quality for special purposes, such as name/address correction, pharmaceutical naming conformity, etc. | Very Good
Data Quality - Special purpose data entry support | PostalSoft Library, Mailers +4 | Automatic editing of address data as it is entered on-line (including street name, address range, ZIP, etc.). Tools are incorporated into data entry screens as class libraries, as either client/server or client solutions. | Very Good
Data translation | Data Junction, Cambio | Data format translation aids (for use in conjunction with other processes). | Very Good
Methods for Processing Data. Data can be processed for quality or distributed either in batch
mode or for individual transactions. These are quite different in concept, and are handled
separately in the balance of this discussion. Most tools do one or the other. Some tools, such as
PostalSoft ACE, have separate components to handle each of these functions.
Batch movement should not be confused with what might be termed bulk movement. In the
extraction and transport process, data is moved either in bulk, in which the entire database is
refreshed periodically, or selectively, in which only changed data is refreshed. In general, the second option is
preferred if the tool kit is sophisticated enough to handle the management and data integrity issues
involved. Almost all extract and transport tools do bulk moves, and all of our example tools
manage selective updates as well.
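To make the distinction concrete, the sketch below contrasts a bulk refresh with a selective refresh driven by a last-updated timestamp. It is illustrative only and does not represent any particular tool's approach; the table names, column names, and timestamp column are invented, and the connections are assumed to be sqlite3-style objects.

```python
# Illustrative sketch only: bulk refresh vs. selective (change-based) refresh.
# Table names, column names, and the last_updated timestamp are invented;
# "src" and "tgt" are assumed to be sqlite3-style connection objects.

def bulk_refresh(src, tgt):
    """Replace the entire warehouse table with a fresh copy of the source."""
    tgt.execute("DELETE FROM customer_dim")
    for row in src.execute("SELECT cust_id, name, city FROM customer"):
        tgt.execute("INSERT INTO customer_dim (cust_id, name, city) VALUES (?, ?, ?)", row)
    tgt.commit()

def selective_refresh(src, tgt, last_load_time):
    """Move only rows changed since the previous load.  Cheaper, but the process
    must track state (last_load_time) and manage keys and data integrity."""
    changed = src.execute(
        "SELECT cust_id, name, city FROM customer WHERE last_updated > ?",
        (last_load_time,),
    )
    for cust_id, name, city in changed:
        tgt.execute("DELETE FROM customer_dim WHERE cust_id = ?", (cust_id,))
        tgt.execute(
            "INSERT INTO customer_dim (cust_id, name, city) VALUES (?, ?, ?)",
            (cust_id, name, city),
        )
    tgt.commit()
```

The second routine is the one that needs the management machinery the tools provide: tracking the last load time, handling deletes, and keeping keys consistent.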
Table 2 compares selected characteristics of representative tools.
Another consideration in overall environment management is whether the metadata catalog used is open
or proprietary. None of the illustrated tools is entirely "closed," but some use proprietary
structures that can share information only with difficulty.
Table 2. Characteristics of Representative Tools (draft; some data under review)

Batch-Oriented Data Extraction, Cleansing and Migration Tools

Tool Category | Tool Name | Subjective Rank (4) | Best for Exceptional or Repetitive | User Interface/Control Type | Mainframe/Client Server | Migration Management Capability | Installation Pain
Data Reverse Engineering - Metadata based | Bachman | 2 | E | P | C | n/a | ok
Data Reverse Engineering - Metadata based | ERWIN/ERX | 1 | E | P | C | n/a | ok
Data Reverse Engineering - Metadata based | KisMeta | (author's #1 baby) | E | P | C | n/a | ok
Data Reverse Engineering - Metadata based | PowerSoft | 3 | E | P | C | n/a | ok
Data Reverse Engineering - Metadata based | Embarcadero ER/1 | 4 | E | P | C | n/a | ok
Data Reverse Engineering - Data content based | Vality Integrity | 1 | E | P/M | M | n/a | Pain
Data Reverse Engineering - Data content based | QDB Analyze | 2 | E | M | M | n/a | Pain
Data Reverse Engineering - Data content based | WizRule | 3 | E | P | C | n/a | ok
Batch Export/Transport - Parameter-based extraction code generators | Carleton Passport | 2 | R | P | M | Active | so-so
Batch Export/Transport - Parameter-based extraction code generators | ETI Extract | 3 | R | P | M | Passive | so-so
Batch Export/Transport - Parameter-based extraction code generators | Prism | 1 | R | P | C | Active | so-so
Batch Export/Transport - You-write extraction code | 3GL/4GL (COBOL et al.) | 3 | E/R | M | M/C | None | so-so
Batch Export/Transport - You-write extraction code | Platinum InfoRefiner | 2 | R | P/M | M | Passive | so-so
Batch Export/Transport - You-write extraction code | Platinum InfoPump | 1 | R | P/M | C | Passive | so-so
Replication | Praxis OmniReplicator | 1 | R | P | C | Active | so-so
Replication | IBM DataPropagator | 3 | R | P/M | M | Passive | so-so
Replication | Oracle 7 Symmetric Replication | 2 | R | P | C | Active | so-so
Data Content Quality - Filter based | Apertus | 1 | E/R | P | C | Passive | so-so
Data Content Quality - Filter based | Trillium | 2 | E/R | P | M | Passive | so-so
Data Content Quality - Relationship based | Vality Integrity | 1 | E | P/M | M | n/a | Pain
Data Content Quality - Relationship based | WizRule | 3 | E | P | C | n/a | ok
Data Content Quality - Relationship based | IDI | 2 | E | P | C | n/a | ok
Special Purpose Data Quality | PostalSoft ACE | 1 | E/R | P | M/C | n/a | so-so
Special Purpose Data Quality | SSA | 1 | E/R | P/M | M | n/a | so-so
Special Purpose Data Quality | Group 1 Nadis | 2 | E/R | P | M | n/a | so-so

On-Line/Real-Time Data Extraction, Cleansing and Migration Tools

Tool Category | Tool Name | Subjective Rank | Best for Exceptional or Repetitive | User Interface/Control Type | Mainframe/Client Server | Distribution Management Capability | Installation Pain
Middleware - On-line/interactive extract/transport | Platinum InfoHub | 1 | R | P | C | Active | so-so
Middleware - On-line/interactive extract/transport | Praxis OmniReplicator | 2 | R | P | C | Active | so-so
Middleware - On-line/interactive extract/transport | Sybase Enterprise Connect | 3 | R | P | C | Active | so-so
Middleware - On-line/interactive extract/transport | IBM DataJoiner | 4 | R | P | C | Active | so-so
Middleware - On-line/interactive extract/transport | Intersolv SequeLink | 3 | R | P | C | Active | so-so
Data Quality - Data entry support | PostalSoft Library | 2 | R | P | C | n/a | so-so
Data Quality - Data entry support | Mailers +4 | 1 | R | P | C | n/a | ok

LEGEND
Subjective Rank: 1 = best. Best for Exceptional or Repetitive (tasks): E = exceptional, R = repetitive, E/R = both. User Interface/Control Type: P = parameter/Windows object driven, M = manual, P/M = combination. Mainframe/Client Server: M = mainframe, C = client/server. Migration Management Capability: Active = active management for most typical data, Passive = metadata serves as documentation only. Installation Pain: ok = good, so-so = average, Pain = major investment.
Data migration life cycle niches covered by representative tools are illustrated in Table 3.
Table 3. Life Cycle Niche of Representative Tools
Tool | Modeling/Research | Extract | Clean | Abstract | Transport/Load
Carleton Passport | !!! | !! | !! | !!!
Prism | !!! | !! | !! | !!!
3GL/4GL (COBOL etc.) | ! | ! | ! | !
Platinum InfoPump | !! | ! | !!!
Platinum InfoRefiner | !!! | ! | ! | !!!
Platinum InfoHub | !!! | ! | !! | !!!
Vality Integrity | !!! | !
PostalSoft ACE | !!! | !
Apertus | !! | !!! | !!! | !
KisMeta | !!! | !
ERWin ERX | !!!
Legend: (blank) = no features, ! = facilitates, !! = some automation, !!! = automated.

Commercial data tools available include replicators and data pumps. Replicators are
similar to data pumps, but differ primarily in their inability to deal with heterogeneous
environments. Replicators are most useful when used to duplicate data in one location in
another location's database, and usually with the same vendor's database product. For this
reason, data transformation capabilities are also usually very limited. Data pumps are
specifically designed to be platform independent and provide substantial transformation
services. Data pumps all include some type of scripting or programming capability. (These
terms are not used consistently - some tools marketed as data replicators have data pump
capabilities)
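As a rough illustration of the kind of scripted transformation a data pump hosts, the sketch below pulls rows from a legacy table, remaps codes and units, and loads the result into a warehouse table. The source and target cursors, table names, column names, and code mapping are all hypothetical and stand in for whatever scripting language a given product provides.

```python
# Hypothetical sketch of a data-pump style transformation script: extract from a
# legacy source, remap codes and units, load into a warehouse target.  All table,
# column, and code names are invented; cursors are assumed to follow the Python DB-API.

GENDER_MAP = {"1": "M", "2": "F"}   # legacy codes -> warehouse codes

def pump_customers(source_cur, target_cur):
    source_cur.execute("SELECT cust_no, gender_cd, bal_cents FROM legacy_customer")
    for cust_no, gender_cd, bal_cents in source_cur.fetchall():
        row = (
            int(cust_no),                    # normalize the key to an integer
            GENDER_MAP.get(gender_cd, "U"),  # translate the code; default to Unknown
            bal_cents / 100.0,               # convert cents to dollars
        )
        target_cur.execute(
            "INSERT INTO dw_customer (cust_id, gender, balance) VALUES (?, ?, ?)",
            row,
        )
```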
Some tools, such as Prism, Apertus and Passport, allow the user to control most export and
transformation activities through parameters and visual scripts - "advanced filtering".
Other tools, such as InfoPump, act as shells into which custom extraction and filtering
programs written in C or COBOL are inserted. Parameter-driven tools, while often
requiring a larger up-front investment, offer a substantial management convenience in a
large heterogeneous environment. This management advantage is of great potential value.
Most of these commercial tools offer some way in which data can be filtered for data
quality during the extraction and transformation process. However, several specialty data
cleaning or scrubbing tools (Apertus and Trillium) have more sophisticated means toward
this end.
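The sketch below suggests what parameter-driven quality filtering looks like in principle: each field is checked against a declared domain or range of legal values, and failing records are diverted to an exception set rather than loaded. The rules and field names are invented for illustration and do not reflect any particular product's rule syntax.

```python
# Illustrative parameter-driven quality filter: records failing a declared domain
# or range check are routed to an exception list for review.  Rules are invented.

RULES = {
    "state":  {"domain": {"VA", "MD", "DC"}},
    "age":    {"range": (0, 120)},
    "status": {"domain": {"A", "I", "P"}},
}

def passes(record):
    for field, rule in RULES.items():
        value = record.get(field)
        if "domain" in rule and value not in rule["domain"]:
            return False
        if "range" in rule:
            low, high = rule["range"]
            if not isinstance(value, (int, float)) or not (low <= value <= high):
                return False
    return True

def filter_records(records):
    clean, exceptions = [], []
    for rec in records:
        (clean if passes(rec) else exceptions).append(rec)
    return clean, exceptions

clean, bad = filter_records([
    {"state": "VA", "age": 34, "status": "A"},
    {"state": "XX", "age": 230, "status": "A"},   # fails both domain and range checks
])
```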
The best data quality tools are very targeted in scope. For example, SSA does names,
and PostalSoft and Group 1 do addresses very well, and very little else. Since names and
addresses represent a significant data quality concern at many organizations, these
tools are of wide value. Other specialized tools exist for specific applications, such as
pharmaceuticals and civil engineering.
Other tools can do an analysis of a batch of data, but the information gleaned usually
requires substantial manual analysis to interpret and custom programming to correct.
Vality and QDB do data content analysis, using pattern matching, lexical analysis and
user-specified rules.
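To give a feel for what such content analysis produces, the sketch below reduces each value in a column to a character pattern and counts pattern frequencies; the rare patterns are the "possible problems" an analyst must then investigate. This is only a toy illustration of the general idea, not how any of the named products work internally.

```python
# Toy illustration of pattern-based content profiling: map each value to a
# character-class pattern (9 = digit, A = letter, punctuation kept as-is), then
# count pattern frequencies.  Rare patterns flag values needing investigation.
from collections import Counter

def pattern(value):
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def profile(values):
    return Counter(pattern(v) for v in values).most_common()

# A phone-number column: '999-999-9999' dominates, while the stray 'AAAA-AAAAA'
# pattern exposes a value ('CALL-LATER') that must be reviewed by hand.
print(profile(["703-555-0100", "301-555-0199", "CALL-LATER"]))
```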
Notes
1. "Logicals" are generated more as documentation of physical structures than as a conceptual model of the business in the top-down Information Engineering sense.
2. While the tools can do unique and valuable analysis, the product is not a finished model or even refined information, but usually just additional raw material that an analyst can use to help create a model.
3. The product is generally not clean data, but a list of possible problems that must be manually investigated and corrected.
4. Subjective Rank: 1 = best. Best for Exceptional or Repetitive (tasks): E = exceptional, R = repetitive. User Interface/Control Type: P = parameter/Windows object driven, M = manual, P/M = combination. Mainframe/Client Server: M = mainframe, C = client/server. Migration Management Capability: Active = active management for most typical data, Passive = metadata serves as documentation only. Installation Pain: ok = good, so-so = average, Pain = major investment.
DATA MIGRATION TOOLS - Brief Descriptions of Representative Tools
Apertus Technologies, Enterprise Integrator Tool
Users write transformation rules. Data is filtered against domains and ranges of legal
values and compared to other data structures.
Bachman
Modeling and reverse engineering tool. The reverse engineering tool derives a consistent
set of metadata from several potential source systems' metadata.
Carleton, Passport
Users enter extraction and transformation parameters, and data is filtered against domains
and ranges of legal values.
DBStar, Migration Architect
High-End rules and relationships discovery.
Embarcadero, ER/1
Departmental modeling and reverse engineering tool. The reverse engineering tool derives
a consistent set of metadata from several potential source systems' metadata.
Evolutionary Technology (ETI), EXTRACT
Users write transformation rules. Data is filtered against domains and ranges of legal
values and compared to other data structures.
Information Discovery, IDI
Low-end rule discovery.
Kismet Analytic, KisMeta Workbench
Enterprise data modeling and reverse engineering tool. The tool derives a unified
conceptual schema, and rationalizes and cross-indexes all of the enterprise's metadata.
Evaluates metadata with lexical analysis, natural language interpretation, and statistical
techniques.
LogicWorks, ERWIN ERX
Departmental modeling and reverse engineering tool. The reverse engineering tool derives
a consistent set of metadata from several potential source systems' metadata.
Oracle, Symmetric Replicator
A data replication product designed to extract data from several platforms, perform some
filtering and transformation, and distribute and load to another database or databases.
Platinum, InfoRefiner
A data pump product designed to extract data from several mainframe platforms, perform
some filtering and transformation, and distribute and load to another mainframe platform
database. The extraction uses custom code modules written in COBOL. This is a
Mainframe/MVS based tool.
Platinum, InfoPump
A data pump product designed to extract data from several mainframe and client server
platforms, perform some filtering and transformation, and distribute and load to another
mainframe platform database. Requires InfoHub for most services. The extraction uses
custom code modules. This is a client/server based tool.
Platinum, InfoHub
A middleware product designed to establish a permanent relationship (including filtering
and transformation) between source systems and a logical model. The logical model is then
available as a virtual database to end-user query tools or a data migration product such as
InfoPump.
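The sketch below gestures at the "virtual database" idea behind middleware of this kind: a catalog maps a logical table to several physical sources, and a single request is fanned out and merged so the caller never sees the underlying platforms. The catalog entries, connection names, and SQL are hypothetical and do not describe InfoHub's actual interface.

```python
# Hypothetical sketch of the "virtual database" idea: a catalog maps one logical
# table to queries against several physical sources; results are fanned out and
# merged.  Source names, connections, and SQL are invented for illustration.

CATALOG = {
    "customer": [
        ("mvs_db2",     "SELECT cust_id, name FROM PROD_CUSTOMER"),
        ("unix_oracle", "SELECT cust_id, name FROM crm_customers"),
    ],
}

def query_virtual_table(connections, logical_table):
    """connections maps a source name to an open DB-API connection."""
    rows = []
    for source_name, sql in CATALOG[logical_table]:
        cur = connections[source_name].cursor()
        cur.execute(sql)
        rows.extend(cur.fetchall())
    return rows
```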
Praxis, Omni Replicator
A data replication product designed to extract data from several platforms, perform some
filtering and transformation, and distribute and load to another database or databases.
Prism Solutions, Warehouse Manager
Users enter extraction and transformation parameters, and data is filtered against domains
and ranges of legal values.
PostalSoft, ACE
Specialty Address correction tool. Uses a "brute force" strategy of comparing each address
with a USPS database with over one hundred million valid US addresses.
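A rough sketch of the brute-force strategy: standardize the input address, then look it up in a reference file of known-valid addresses, standing in here for the USPS database. The reference entries, abbreviation table, and ZIP codes below are invented for illustration.

```python
# Sketch of brute-force address correction: standardize the street line, then look
# the address up in a reference table of valid addresses (a stand-in for the USPS
# file).  Reference entries, abbreviations, and ZIP codes are invented.

REFERENCE = {
    ("123 MAIN ST", "SPRINGFIELD", "VA"): "22150",
    ("45 ELM AVE",  "ARLINGTON",   "VA"): "22201",
}

ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def standardize(street):
    words = street.upper().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def correct_address(street, city, state):
    key = (standardize(street), city.upper(), state.upper())
    zip_code = REFERENCE.get(key)
    if zip_code is None:
        return None                # no match: route the record to manual review
    return key + (zip_code,)       # matched: return the standardized form plus ZIP

print(correct_address("123 Main Street", "Springfield", "va"))
# -> ('123 MAIN ST', 'SPRINGFIELD', 'VA', '22150')
```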
PostalSoft, Library
A class library of callable routines that developers use to create "address-smart" data
entry screens that automatically edit most address characteristics.
Group 1
Specialty Address correction tool. Uses a "brute force" strategy of comparing each address
with a USPS database with over one hundred million valid US addresses.
Mailers +4
A class library of callable routines that developers use to create "address-smart" data
entry screens that automatically edit most address characteristics.
QDB Solutions, QDB/Analyze
High-End rules and relationships discovery.
Search Software America (SSA)
Name and address data quality tool. Similar in technology to Vality Integrity, SSA
evaluates data with lexical analysis tied to a database of known name frequency
distribution in a given population.
Vality Technology, Integrity
Integrity evaluates data with lexical analysis rather than with rules or look-up tables. Best
for parsing unstructured text fields.
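The sketch below hints at the lexical-analysis approach: a free-text name field is split into tokens, each token is classified from small word lists and its position rather than a fixed layout, and the field is reassembled into structured parts. The word lists and logic are illustrative only and are far simpler than what Integrity actually does.

```python
# Toy lexical-analysis parse of an unstructured name field: tokens are classified
# against small word lists (titles, suffixes) and by position.  Word lists are
# illustrative; a real tool uses much richer lexicons and rules.

TITLES   = {"MR", "MRS", "MS", "DR"}
SUFFIXES = {"JR", "SR", "II", "III"}

def parse_name(raw):
    tokens = raw.upper().replace(".", " ").replace(",", " ").split()
    parts = {"title": None, "first": None, "last": None, "suffix": None}
    for tok in tokens:
        if tok in TITLES and parts["title"] is None:
            parts["title"] = tok
        elif tok in SUFFIXES:
            parts["suffix"] = tok
        elif parts["first"] is None:
            parts["first"] = tok
        else:
            parts["last"] = tok       # later tokens overwrite, so the final one wins
    return parts

print(parse_name("Dr. John Q. Public, Jr."))
# -> {'title': 'DR', 'first': 'JOHN', 'last': 'PUBLIC', 'suffix': 'JR'}
# (the middle initial is simply dropped by this toy version)
```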
WizSoft, WizRule
Low-end rule-discovery.
Glossary of Terms
Return to Table of Contents