(Data Element Naming)
© 1997 Kismet Analytic Corp.
This paper provides a basic background in formal object label standards administration, and also conveys the philosophy behind Validator, the new metadata standards enforcement tool from Kismet Analytic Corporation. Validator is designed so that it may be used by all individuals responsible for system and database design and administration, as well as data management specialists. Typically an organization has a central group responsible for setting and disseminating standards, such as Data Administration (DA), and others responsible for using standards, such as Database Administration (DBA), database designers, and programmers. Both groups need to understand standards, and must be able to test metadata labels against standards.
The value of information as business capital is now widely understood. The function of data administration - and object label standards - is to make information available, comprehensible and functional. We use the phrase "object label standards" rather than data standards because each object from element to entity to code module to system has a metadata label that benefits from naming standards.
Object/Data "administration" means classification and assembly into structures which can be
maintained and from which information can be derived. A clerk setting up a filing system who
clearly labels and organizes the folders according to a meaningful scheme, and then makes each
project team member aware of the existence and organization of this scheme, is a genuine data
administrator. In fact, every system development professional has some data administration
responsibilities.
If we carry the filing metaphor further, we can see that this system will be of the most use if the
pool of potential users share a common understanding of how files will be ordered and labeled.
Even a visitor from a sister office would know which drawer contained the desired document
and would also recognize the wording of the folder labels, since every office's files are
identically arranged. Also, if the files are arranged alphabetically by employee name, most users
would be surprised and probably disconcerted to find them arranged by the first name rather than
the last. This is an example of an "implicit" object label standard that need not be made explicit,
but these are in short supply in the real world. Most standards will have to be worked out as an
explicit standards "set", established by decree or by consensus agreement of the parties involved.
Unfortunately, naming conventions are typically inconsistent among
software applications. There have been several attempts to establish a
universal rules set for object label standards, most of which ride on the
same elementary structure, which we will elaborate shortly. Let us
simply preface our discussion by stating that by standardized object
label, we mean information that is described briefly, accurately, and uniquely. We strongly
believe that all object labels should be administered with both sharing and reuse in mind.
Few means exist which can have as powerful an influence on information accessibility and system
user-friendliness as good object label standards. Standards reduce the effort required of each
project by providing a base on which sound structures can be built. In addition, standards are
central to the smooth accommodation of heterogeneous technologies and facilitate object
sharing that can dramatically reduce the cost of systems development and maintenance.
The focus of Validator-supported standardization is creating meaningful and consistent metadata
object labels. This helps with a much bigger and more important problem: knowing what
information is contained in the objects labeled. This bigger problem is beyond the scope of just
object labels, which cannot be both sufficiently expressive and concise and simple enough to be
practical. A name (object label) can only symbolically stand in for TRUE MEANING.
It is useful to consider a mechanism often used to
understand an object's TRUE MEANING: a semantic
network. The general idea is that a concept does not
stand alone, but is understood only in the context of
other concepts. Consider a person, what is she? The
answer is: Employee, Spouse, Attorney, Mother,
Georgian, Veteran, Class of '75... the list can be very
long. A complete understanding is impossible without
understanding all of these other concepts, and the
concepts that they in turn relate to. Each piece of
information is attached to the real world by dozens of strands.

In light of this, we can see that a scheme whereby a
package of information is labeled with one "prime word"
such as "Employee" is rather lame if our expectation is
that one standard label will work for every function of a
small company, let alone in a huge and diverse environment such as the
US Government. Yet, as a symbol it works.
Sometimes standardization efforts chase the dream of making a name somehow capture enough
meaning that anyone looking at it, in whatever context, would fully understand the meaning
intended. These efforts are likely to fail because of their impractical scope. The first job for a
label is to be a symbolic placeholder for an idea; a possible and proper additional ambition of a
standard object label is to modestly aid the beholder's process of perception - associating a thing
with a concept.
Full metadata requirements include at least a:
Other information may include stewardship, domain details, frequency, and access.
Context Independent Whole Labels - versus -
Context Dependent Component Assembly
In the context independent approach, organizations attempt to establish standards by publishing lists of valid labels that are to be used by all applicable systems. Good examples include:
SSN    = Social Security Number
FIP    = Financial Information Pointer (Account #)
EMP_ID = Employee Identifier
The intent is that the whole label is a context independent piece of information of real value to
the organization; the examples above are highly effective tools for achieving goals of object
label standards. This is also the goal of the "Reference Data Sets" used by the US Department of
Defense SHADE program.
Unfortunately, the results are less effective when the concept is pushed further. For example,
the following is an actual example from a corporate object label standard:
"Payroll hrs + month chge amt"
This might be a useful fact for the payroll or personnel function, but it is difficult to conceive of
this information capsule being of widespread utility across multiple systems and departments. It
is precisely universal utility that is the test of the value and appropriateness of a standard. The
problem is that there are a great many useful idea combinations for which information can be
collected. If you identify thousands, that leaves millions unidentified.
The context dependent component assembly approach manages standards through four sets of
rules: lexical, semantic, syntactic, and procedure rules.
Given these rules, the user is able to construct the elements that meet his or her needs.
Context dependence implies context inheritance. If a "class" is a more general or "higher
level" concept, it is the confluence of classes that uniquely defines an object. In universal terms,
the set of data elements is the set of all class interaction combinations. A standard example of
context inheritance is as follows:
Business Area/Subject: Personnel
Entity: Employee
Element: Last Name
(Last name is understood to be of an Employee within the Personnel business area)
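The context inheritance example above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not a description of how Validator works; the `LabelNode` class and its `qualified_label` method are hypothetical names introduced here. Each node stores only its own term and inherits the rest of its context from its parents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LabelNode:
    """One level of context (hypothetical structure, for illustration only)."""
    level: str                           # e.g. "Entity" or "Element"
    term: str                            # the short, context dependent label
    parent: Optional["LabelNode"] = None

    def context(self) -> List["LabelNode"]:
        """Return the chain of nodes from the highest level down to this one."""
        chain = []
        node = self
        while node is not None:
            chain.append(node)
            node = node.parent
        return list(reversed(chain))

    def qualified_label(self, separator: str = " - ") -> str:
        """Spell out the inherited context explicitly."""
        return separator.join(node.term for node in self.context())


personnel = LabelNode("Business Area/Subject", "Personnel")
employee = LabelNode("Entity", "Employee", parent=personnel)
last_name = LabelNode("Element", "Last Name", parent=employee)

print(last_name.term)               # Last Name
print(last_name.qualified_label())  # Personnel - Employee - Last Name
```

The short label stays concise, while the full meaning can always be recovered by walking up the hierarchy.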
The final approach is one used heavily throughout the world for metadata labeling of books and
the like. The Dewey Decimal System and Library of Congress (LOC) system are arbitrary,
highly simplified, and flattened master conceptual schemas. Each piece of information is forced
into a slot in the scheme. If the fit is good in several slots or poor in any slot, a likely slot is
picked arbitrarily; one unique catalog number is always assigned.
The system partially accommodates other conceptual schemas through a virtual mechanism:
cross indexing - having multiple "cards" for every book, one in each major subject area. (e.g.
books on historic shoes in 'fashion' with a chapter on shoe repair may get a secondary reference
under 'crafts'.)
This is more-or-less the intent behind the use of prime terms. (e.g. just as the "library science"
slot might not be perfect for data element naming guides, "Employee" might not be the perfect
entity for "Employee_Home_Phone_Code" but in each case it may be the best fit available.)
Unfortunately, the required supporting conceptual schema, if it exists at all, is often feeble and
poorly understood compared to the well defined structure of the Dewey Decimal System.
A Note on Terminology
Several different words are sometimes used to describe each basic data object, a fact which may cause great and justifiable confusion.
The distinctions made are often useful to distinguish between different phases of the design process. For example, though the terms entity and record type are nearly synonymous, we will use the term entity when speaking in more abstract, "logical" frameworks, and the term "record type" when referring to a physical database table. In general, entities represent business concepts with specific semantic meanings and domains. Similarly, we have tried to be consistent in our use of the word attribute to refer to one of the descriptive characteristics of a record (i.e., a label for a column), and the word element to mean a component of an entity; we have avoided using the term field because of its frequent application in describing a single cell in a table.
The three broad semantic classifications of data are:
    Prime/Mod/Class          ISO11179
1.  (Data Type) Class        Representation term
2.  Subject/Prime/Topic      Object class term
3.  Modifier                 Qualifier term
The Prime/Modifier/Datatype Class scheme, popularized by W.R. Durrell, has been adopted in
many data standards, including DOD8320. A competing but similar standard devised recently is
ISO11179.
Datatype Class words are used to identify and describe the general purpose (or use) of a data
element. Examples include code, amount, age, ID, and name. They are roughly related to the
physical data type of the database element, e.g. alphanumeric, real, memo, etc. ISO11179 uses
the expression "representation term" to stress the form of the data being described.
Prime words represent a topic or subject area. They are often the names of entities on a logical
model. Examples include "customer," "invoice," and "employee." A prime word is the most
important identifier of the element; it anchors the semantic and conceptual meaning of the
object being defined. ISO11179 uses the expression "object class term."
Modifying words, such as ordinals and other adjectives, are used to round out an element name,
providing any further detail. ISO11179 uses the expression "qualifier term."
By this model, then, every element should consist of one class word, one prime word, and one or
two modifying words. Though hardly universal, the standard syntax for an element label is:
modifier(s) + datatype class word
or
prime + modifier(s) + datatype class word
or
prime + prime-modifier + class modifier + datatype class word
A convenient consequence of placing the prime word first is that related elements are grouped
together when in alphabetical order.
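The prime + modifier(s) + datatype class word syntax, and the grouping effect of placing the prime word first, can be illustrated with a short Python sketch. The `build_label` function, the underscore separator, and the capitalization are our assumptions for the example, not rules taken from any particular standard.

```python
from typing import List


def build_label(prime: str, modifiers: List[str], class_word: str,
                separator: str = "_") -> str:
    """Assemble a label as prime + modifier(s) + datatype class word."""
    parts = [prime, *modifiers, class_word]
    return separator.join(word.strip().capitalize() for word in parts if word.strip())


labels = [
    build_label("invoice", ["total"], "amount"),
    build_label("employee", ["last"], "name"),
    build_label("customer", ["billing"], "code"),
    build_label("employee", ["home", "phone"], "code"),
    build_label("customer", [], "name"),
]

# Because the prime word comes first, an alphabetical sort keeps the
# elements of each subject together.
for label in sorted(labels):
    print(label)
```

Sorting the labels lists both Customer elements first, then both Employee elements, then the Invoice element.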
When class modifier terms are used, they are usually:
And the datatype class is tightly defined as a
Data designers distinguish between so-called "flat-file" databases, in which most or all relevant
data for a particular purpose is kept in a single table, and "relational" databases, in which the
data is kept in multiple tables which are connectable by shared elements. The flat file designer
seeks consolidation; the relational designer seeks to avoid redundancy and inconsistency.
A "relationship" is evidence of a meaningful association between two entities. Relationships are
described by cardinality (e.g. one-to-many) and other aspects of the business rule they represent.
A relational data structure identifies a set of architectural components that reflect the
prime/modifier/datatype class scheme.
First, a hierarchy of subjects (or topics) is specified:
Function/domain/database: = High level subject
Record type/file/entity: = More detailed subject/topic
Primary key: = Specific topic
Foreign key: = Related or subordinate topic
Main term of element: = Minor subordinate topic
Secondary term of element: = Minor subordinate topic
Second, data types are classified:
Class words within elements
Data type field content
Third, elements are detailed using ordinals and other adjectives as modifiers:
Personnel - Employee - (last)name
Personnel - Employee - spouse(last)name
[All of these structures can be described as relationships within an ontology, or semantic
network, the proper construction and maintenance of which is the role of kisMeta Analyst and
Schemer.]
Syntactic Classification
Components of the Relational Data Structures
The Element Specification
The aspects of the element may be taken collectively as the "specification." Each specification
contains three types of information: conceptual, internal (physical) and external (logical).
1. Conceptual: The fundamental meaning of the element, including the element name and
the description, map to conceptual schema, business rules, and domain.
External: How the information is represented to the user, primarily the data type and
display format, and name on form.
Two types of names are typically assigned to elements: the business (long) name and the
physical (short) name. The business name is the foundation for an entity, data element and
attribute. The physical name is the abbreviated form of the business name, and is what
typically appears in a physical database table. In each case, they should consist of the minimum
number of words that adequately identify the data element.
A lexicon of standard terms is essential to any naming standard. The terms should have full
names for identification purposes, as well as standard abbreviations. Term names will be used
in the business name of an element, while abbreviations will be used in the physical
representations of that element, which must be shorter. In addition, each term should be given a
description, to avoid confusion where a single term is common to two or more business
functions but has different meanings for each. Minus the definitions, your lexicon becomes the
Valid Terms List as used by Validator.
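A lexicon of this kind is easy to picture as a simple data structure. The Python sketch below is our own illustration and does not show Validator's file format or internals; the `Term` class, the `valid_terms_list` function, and the sample entries (apart from EMP and ID, which echo the EMP_ID example earlier) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Term:
    name: str          # full term, used in business names
    abbreviation: str  # standard abbreviation, used in physical names
    description: str   # meaning of the term, to avoid confusion between functions


# Sample entries only; a real lexicon is agreed by the organization.
LEXICON: List[Term] = [
    Term("Employee", "EMP", "A person on the payroll of the organization."),
    Term("Identifier", "ID", "A value that uniquely identifies one occurrence."),
    Term("Last", "LST", "For names, the family name."),
    Term("Name", "NM", "The words by which something is known."),
]


def valid_terms_list(lexicon: List[Term]) -> Dict[str, str]:
    """Drop the descriptions, keeping each term name and its abbreviation."""
    return {term.name: term.abbreviation for term in lexicon}


valid_terms = valid_terms_list(LEXICON)
print(valid_terms["Employee"])    # EMP
print("Nickname" in valid_terms)  # False
```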
Model Business Naming Conventions
The following rules for devising business names are common to a number of standard sets. They
are offered here as a collection of "model" data element naming standards that may provide
inspiration. Where Validator specifically assists with the enforcement of a standard, the rule is
followed by "SPEC" plus the number of the specification screen page on which the option is
found. Where Validator does the process automatically, or optionally generates an issue
message, the rule is followed by "VAL."
Semantic Rules
Syntax Rules
Structuring Conventions
Physical Naming Conventions
Physical attribute names are constructed by abbreviating the business name, using standard
abbreviations. They should conform to the restrictions of the programming language and the
DBMS being used (SPEC - 1). (For example, SQL Server imposes an 18 character limit.)
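As a rough sketch of this construction, the Python function below abbreviates each word of a business name and then tests the result against a length limit. It is an illustration under our own assumptions, not Validator's algorithm: the `physical_name` function and the abbreviation table are hypothetical, and the default limit of 18 simply mirrors the example quoted above, so substitute the limit of your own DBMS.

```python
from typing import Dict

# Hypothetical abbreviation table; EMP and ID echo the EMP_ID example.
ABBREVIATIONS: Dict[str, str] = {
    "employee": "EMP",
    "identifier": "ID",
    "home": "HM",
    "phone": "PHN",
    "code": "CD",
}


def physical_name(business_name: str, abbreviations: Dict[str, str],
                  max_length: int = 18, separator: str = "_") -> str:
    """Build a physical name from the standard abbreviation of each word."""
    words = business_name.split()
    missing = [w for w in words if w.lower() not in abbreviations]
    if missing:
        raise KeyError("no standard abbreviation for: " + ", ".join(missing))
    name = separator.join(abbreviations[w.lower()] for w in words)
    if len(name) > max_length:
        raise ValueError(f"{name!r} is {len(name)} characters; the limit is {max_length}")
    return name


print(physical_name("Employee Identifier", ABBREVIATIONS))       # EMP_ID
print(physical_name("Employee Home Phone Code", ABBREVIATIONS))  # EMP_HM_PHN_CD
```

Raising an error for a missing abbreviation, rather than inventing one, keeps every physical name traceable to the lexicon.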
Guideline for Abbreviating Names
Disseminating and Enforcing Standards
Issues include:
Proper Scope. Standards must be set for an entire organization or enterprise; there is no room
for separate standards in multiple divisions of a company.
Acceptance and support. Effective standards require "buy-in" from both management and
technical development staff. Standards must be mandated by management, and supported by
policies and procedures.
Availability. Standards must be published and disseminated through paper documentation and
training, and ideally through an automated validation tool such as Validator.
Enforceability. Standards must be relevant, and kept simple and understandable. An easy-to-use testing
method must be available which can be used by any local developer who is designing a data
structure. Beyond a few generalities, all standards must be expressed as specific criteria which
can be tested: e.g., "not more than 50 characters" in contrast to "not too long".
Responsibilities for Maintenance. Standards must be monitored and maintained by a central
function such as DA, DBA, or system architecture. Individuals responsible for using standards,
such as database developers, must provide feedback that will be used to ensure that
the standards remain practical and relevant in a changing technological and business
environment.
Responsibilities for Enforcement. Standards should also be enforced centrally, in the course of
system design reviews; however, standards must be primarily enforced as part of normal
business practice by development and design staff. Roles include:
Timeliness. Standards must be applied when systems are designed for the first time, not as part
of some formal review after-the-fact when coding is half-done and databases are populated with
test data. That is too late, and is very costly in resource use and in delayed schedules, and may
even result in standards which are ignored for expediency. For this reason, standards
enforcement cannot be left solely in the hands of a central group such as DA, but must be part of
every developer's responsibility.
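The Enforceability point above asks that standards be expressed as specific criteria which can be tested. A minimal Python sketch of what that means in practice follows. It is our illustration, not Validator's rule set: the `check_label` function and the character rule are hypothetical, and only the 50 character criterion comes from the text above.

```python
import re
from typing import List, Set


def check_label(label: str, valid_terms: Set[str], max_length: int = 50) -> List[str]:
    """Test a business name against explicit criteria; return any issues found."""
    issues = []
    # "not more than 50 characters" rather than "not too long"
    if len(label) > max_length:
        issues.append(f"longer than {max_length} characters")
    # Hypothetical rule: words made of letters, separated by single spaces.
    if not re.fullmatch(r"[A-Za-z]+( [A-Za-z]+)*", label):
        issues.append("contains characters other than letters and single spaces")
    # Every word must appear in the valid terms list.
    unknown = [w for w in label.split() if w.lower() not in valid_terms]
    if unknown:
        issues.append("terms not in the valid terms list: " + ", ".join(unknown))
    return issues


terms = {"employee", "last", "name"}
print(check_label("Employee Last Name", terms))  # []
print(check_label("Employee Nickname", terms))
```

Because each criterion is a yes-or-no test, any developer can run the same check locally at design time instead of waiting for a central review.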
Conclusion
Object label design and naming is a big job with invaluable consequences. Adhering to standards
is a necessary part of the job if information is to be easily accessed and shared. We have offered
some opinions about what makes some standards better or easier to use than others, but
ultimately the best standard is one that is actually used. Validator can alleviate the burden by
testing labels and offering suggestions for modifications based on standards specified by the
user, and can enforce consistency.
Best of all, Validator can be placed in the hands of the people who need it: system designers and
database architects doing new systems or system reengineering work. Validator's Enterprise
version is designed to be centrally maintained, but to be used by every project team in an
organization.
Kismet Analytic Corp. PoBox 3218 Arlington VA 22203 Phone/fax 703-531-3845
Naming Conventions
Business names should be clear,
accurate and self-explanatory.
Last Updated on November 30, 1997. By: info@kismeta.com
