
Object Label Standards: Metadata Standards Enforcement

(Data Element Naming)

© 1997 Kismet Analytic Corp.

This paper provides a basic background in formal object label standards administration, and also conveys the philosophy behind Validator, the new metadata standards enforcement tool from Kismet Analytic Corporation. Validator is designed so that it may be used by all individuals responsible for system and database design and administration, as well as data management specialists. Typically an organization has a central group responsible for setting and disseminating standards, such as Data Administration (DA), and others responsible for using standards, such as Database Administration (DBA), database designers, and programmers. Both groups need to understand standards, and must be able to test metadata labels against standards.





Why Object Label Standards?

The value of information as business capital is now widely understood. The function of data administration - and object label standards - is to make information available, comprehensible and functional. We use the phrase "object label standards" rather than data standards because each object from element to entity to code module to system has a metadata label that benefits from naming standards.

Object/Data "administration" means classification and assembly into structures which can be maintained and from which information can be derived. A clerk setting up a filing system who clearly labels and organizes the folders according to a meaningful scheme, and then makes each project team member aware of the existence and organization of this scheme, is a genuine data administrator. In fact, every system development professional has some data administration responsibilities.

If we carry the filing metaphor further, we can see that this system will be of the most use if the pool of potential users shares a common understanding of how files will be ordered and labeled. Even a visitor from a sister office would know which drawer contained the desired document and would also recognize the wording of the folder labels, since every office's files are identically arranged. Also, if the files are arranged alphabetically by employee name, most users would be surprised and probably disconcerted to find them arranged by first name rather than last. This is an example of an "implicit" object label standard that need not be made explicit, but these are in short supply in the real world. Most standards will have to be worked out as an explicit standards "set", established by decree or by consensus agreement of the parties involved.

Unfortunately, naming conventions are typically inconsistent among software applications. There have been several attempts to establish a universal rule set for object label standards, most of which ride on the same elementary structure, which we will elaborate on shortly. Let us simply preface our discussion by stating that by standardized object label, we mean information that is described briefly, accurately, and uniquely. We strongly believe that all object labels should be administered with both sharing and reuse in mind.

Few measures have as powerful an influence on information accessibility and system user-friendliness as good object label standards. Standards reduce the effort required of each project by providing a base on which sound structures can be built. In addition, standards are central to the smooth accommodation of heterogeneous technologies and facilitate object sharing that can dramatically reduce the cost of systems development and maintenance.

What do Standards Standardize?

The focus of Validator-supported standardization is creating meaningful and consistent metadata object labels. This helps with a much bigger and more important problem: knowing what information is contained in the objects labeled. This bigger problem is beyond the scope of object labels alone: a label cannot be sufficiently expressive and still be concise and simple enough to be practical. A name (object label) can only symbolically stand in for TRUE MEANING.

It is useful to consider a mechanism often used to understand an object's TRUE MEANING: a semantic network. The general idea is that a concept does not stand alone, but is understood only in the context of other concepts. Consider a person: what is she? The answer is: Employee, Spouse, Attorney, Mother, Georgian, Veteran, Class of '75... the list can be very long. A complete understanding is impossible without understanding all of these other concepts, and the concepts that they in turn relate to. Each piece of information is attached to the real world by dozens of strands.

In light of this, we can see that a scheme whereby a package of information is labeled with one "prime word" such as "Employee" is rather lame if we expect one standard label to work for every function of a small company, let alone in a huge and diverse environment such as the US Government. Yet, as a symbol it works.











Sometimes standardization efforts chase the dream of making a name somehow capture enough meaning that anyone looking at it, in whatever context, would fully understand the meaning intended. These efforts are likely to fail because of their impractical scope. The first job for a label is to be a symbolic placeholder for an idea; a possible and proper additional ambition of a standard object label is to modestly aid the beholder's process of perception - associating a thing with a concept.

Full metadata requirements include at least a:

Other information may include stewardship, domain details, frequency, and access.



Two Philosophies of Naming Standards

Context Independent Whole Labels - versus -

Context Dependent Component Assembly

In the context independent approach, organizations attempt to establish standards by publishing lists of valid labels that are to be used by all applicable systems. Good examples include:

SSN = Social Security Number

FIP = Financial Information Pointer (Account #)

EMP_ID = Employee Identifier

The intent is that the whole label is a context independent piece of information of real value to the organization; the examples above are highly effective tools for achieving the goals of object label standards. This is also the goal of the "Reference Data Sets" used by the US Department of Defense SHADE program.

Unfortunately, the results are less effective when the concept is pushed further. For example, the following is an actual example from a corporate object label standard:

"Payroll hrs + month chge amt"

This might be a useful fact for the payroll or personnel function, but it is difficult to conceive of this information capsule being of widespread utility across multiple systems and departments. It is precisely universal utility that is the test of the value and appropriateness of a standard. The problem is that there are a great many useful idea combinations for which information can be collected. If you identify thousands, that leaves millions unidentified.

The context dependent component assembly approach manages standards through four sets of rules: lexical, semantic, syntactic, and procedure rules.

Given these rules, the user is able to construct the elements that meet his or her needs.

Context dependence implies context inheritance. If a "class" is a more general or "higher level" concept, it is the confluence of classes that uniquely defines an object. In universal terms, the set of data elements is the set of all class interaction combinations. A standard example of context inheritance is as follows:

Business Area/Subject: Personnel

Entity: Employee

Element: Last Name

(Last name is understood to be of an Employee within the Personnel business area)
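
The inheritance shown above can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how Validator works; the Context class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Context:
    """One level of context: a business area, an entity, or an element.

    Hypothetical structure for illustration only.
    """
    label: str
    parent: Optional["Context"] = None

    def path(self) -> list[str]:
        """Return the labels from the most general context down to this one."""
        inherited = self.parent.path() if self.parent else []
        return inherited + [self.label]

    def qualified_label(self, sep: str = " - ") -> str:
        """Join the inherited context into one fully qualified label."""
        return sep.join(self.path())


personnel = Context("Personnel")
employee = Context("Employee", parent=personnel)
last_name = Context("Last Name", parent=employee)

# The short label stays "Last Name"; the context supplies the rest.
print(last_name.label)              # Last Name
print(last_name.qualified_label())  # Personnel - Employee - Last Name
```

The element carries only its own short label; everything else is recovered from the classes it sits under.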

A Third Approach

The final approach is one used heavily throughout the world for metadata labeling of books and the like. The Dewey Decimal System and Library of Congress (LOC) system are arbitrary, highly simplified, and flattened master conceptual schemas. Each piece of information is forced into a slot in the scheme. If the fit is good in several slots or poor in any slot, a likely slot is picked arbitrarily; one unique catalog number is always assigned.

The system partially accommodates other conceptual schemas through a virtual mechanism: cross indexing, that is, having multiple "cards" for every book, one in each major subject area. For example, a book on historic shoes filed under 'fashion' that includes a chapter on shoe repair may get a secondary reference under 'crafts'.

This is more-or-less the intent behind the use of prime terms. (e.g. just as the "library science" slot might not be perfect for data element naming guides, "Employee" might not be the perfect entity for "Employee_Home_Phone_Code" but in each case it may be the best fit available.) Unfortunately, the required supporting conceptual schema, if it exists at all, is often feeble and poorly understood compared to the well defined structure of the Dewey Decimal System.































A Note on Terminology
Several different words are sometimes used to describe each basic data object, a fact which may cause great and justifiable confusion.

The distinctions made are often useful to distinguish between different phases of the design process. For example, though the terms entity and record type are nearly synonymous, we will use the term entity when speaking in more abstract, "logical" frameworks, and the term "record type" when referring to a physical database table. In general, entities represent business concepts with specific semantic meanings and domains. Similarly, we have tried to be consistent in our use of the word attribute to refer to one of the descriptive characteristics of a record (i.e., a label for a column), and the word element to mean a component of an entity; we have avoided using the term field because of its frequent application in describing a single cell in a table.



Semantic Classification

The three broad semantic classifications of data are:

Prime/Mod/Class            ISO11179

1. (Data Type) Class       Representation term

2. Subject/Prime/Topic     Object class term

3. Modifier                Qualifier term

The Prime/Modifier/Datatype Class scheme, popularized by W.R. Durrell, has been adopted in many data standards, including DOD8320. A competing but similar standard devised recently is ISO11179.

Datatype Class words are used to identify and describe the general purpose (or use) of a data element. Examples include code, amount, age, ID, and name. They are roughly related to the physical data type of the database element, e.g. alphanumeric, real, memo, etc. ISO11179 uses the expression "representation term" to stress the form of the data being described.

Prime words represent a topic or subject area. They are often the names of entities on a logical model. Examples include "customer," "invoice," and "employee." A prime word is the most important identifier of the element; it anchors the semantic and conceptual meaning of the object being defined. ISO11179 uses the expression "object class term."

Modifying words, such as ordinals and other adjectives, are used to round out an element name, providing any further detail. ISO11179 uses the expression "qualifier term."
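
To make the three roles concrete, here is a minimal Python sketch that tags each word of a label with its role. The TERM_ROLES table is a hypothetical lexicon built from the example words above plus a few assumed modifiers; a real lexicon would be set by the organization.

```python
# Hypothetical lexicon mapping each standard term to its semantic role.
# Prime and class words come from the examples in the text; the
# modifiers are assumptions added for illustration.
TERM_ROLES = {
    "customer": "prime",
    "invoice": "prime",
    "employee": "prime",
    "last": "modifier",
    "first": "modifier",
    "home": "modifier",
    "code": "class",
    "amount": "class",
    "age": "class",
    "id": "class",
    "name": "class",
}


def classify_terms(label: str) -> list[tuple[str, str]]:
    """Pair each word in a label with its role, or 'unknown' if unlisted."""
    words = label.replace("_", " ").lower().split()
    return [(word, TERM_ROLES.get(word, "unknown")) for word in words]


print(classify_terms("Employee Last Name"))
# [('employee', 'prime'), ('last', 'modifier'), ('name', 'class')]
```

Words that come back as 'unknown' are candidates either for rejection or for addition to the lexicon.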

Syntactic Classification

By this model, then, every element should consist of one class word, one prime word, and one or two modifying words. Though hardly universal, the standard syntax for an element label is:

modifier(s) + datatype class word

or

prime + modifier(s) + datatype class word

or

prime + prime-modifier + class modifier + datatype class word

A convenient consequence of placing the prime word first is that related elements are grouped together when in alphabetical order.
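
The grouping effect is easy to demonstrate. The Python sketch below assembles labels in prime + modifier(s) + datatype class word order and sorts them; the build_label helper and the sample labels are illustrative assumptions, not part of any standard.

```python
def build_label(prime: str, modifiers: list[str], class_word: str) -> str:
    """Assemble a label as prime + modifier(s) + datatype class word."""
    return " ".join([prime, *modifiers, class_word])


# Sample labels, entered in no particular order.
labels = [
    build_label("Invoice", ["Total"], "Amount"),
    build_label("Employee", ["Last"], "Name"),
    build_label("Customer", [], "ID"),
    build_label("Employee", ["Home", "Phone"], "Code"),
]

# Because the prime word leads, a plain alphabetical sort
# brings the elements of each subject together.
for label in sorted(labels):
    print(label)
# Customer ID
# Employee Home Phone Code
# Employee Last Name
# Invoice Total Amount
```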

When class modifier terms are used, they are usually:

And the datatype class is tightly defined as a

Components of the Relational Data Structures

Data designers distinguish between so-called "flat-file" databases, in which most or all relevant data for a particular purpose is kept in a single table, and "relational" databases, in which the data is kept in multiple tables which are connectable by shared elements. The flat file designer seeks consolidation; the relational designer seeks to avoid redundancy and inconsistency.

A "relationship" is evidence of a meaningful association between two entities. Relationships are described by cardinality (e.g. one-to-many) and other aspects of the business rule they represent.

A relational data structure identifies a set of architectural components that reflect the prime/modifier/datatype class scheme.

First, a hierarchy of subjects (or topics) is specified:

Function/domain/database = High level subject

Record type/file/entity = More detailed subject/topic

Primary key = Specific topic

Foreign key = Related or subordinate topic

Main term of element = Minor subordinate topic

Secondary term of element = Minor subordinate topic

Second, data types are classified:

Class words within elements

Data type field content

Third, elements are detailed using ordinals and other adjectives as modifiers:

Personnel - Employee - (last)name

Personnel - Employee - spouse(last)name

[All of these structures can be described as relationships within an ontology, or semantic network, the proper construction and maintenance of which is the role of kisMeta Analyst and Schemer.]

The Element Specification

The aspects of the element may be taken collectively as the "specification." Each specification contains three types of information: conceptual, internal (physical) and external (logical).

1. Conceptual: The fundamental meaning of the element, including the element name and description, mapping to the conceptual schema, business rules, and domain.

2. External (Logical): How the information is represented to the user, primarily the data type and display format, and name on form.

3. Internal (Physical): How the values are stored in a field, usually consisting of the datatype and length, the required value, the range of values, and the default value. This may be extended to include data systems characteristics.
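
One way to picture a specification is as a record with three parts. The Python sketch below is an assumption made for illustration: the class and field names are hypothetical and only a few of the attributes listed above are shown.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Conceptual:
    """Fundamental meaning of the element."""
    name: str
    description: str
    business_rules: list[str] = field(default_factory=list)


@dataclass
class External:
    """How the information is represented to the user."""
    display_type: str
    display_format: str
    form_name: str


@dataclass
class Internal:
    """How the values are stored in a field."""
    datatype: str
    length: int
    required: bool = False
    default: Optional[str] = None


@dataclass
class ElementSpecification:
    """The three views of one element, taken collectively."""
    conceptual: Conceptual
    external: External
    internal: Internal


spec = ElementSpecification(
    conceptual=Conceptual("Employee Last Name", "Family name of an employee."),
    external=External("text", "mixed case", "Last Name"),
    internal=Internal(datatype="alphanumeric", length=30, required=True),
)
print(spec.conceptual.name, "->", spec.internal.datatype, spec.internal.length)
```

Keeping the three views in separate parts lets the physical storage change without touching the conceptual definition.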


Naming Conventions

Two types of names are typically assigned to elements: the business (long) name and the physical (short) name. The business name is the foundation for an entity, data element and attribute. The physical name is the abbreviated form of the business name, and is what typically appears in a physical database table. In each case, they should consist of the minimum number of words that adequately identify the data element.

A lexicon of standard terms is essential to any naming standard. The terms should have full names for identification purposes, as well as standard abbreviations. Term names will be used in the business name of an element, while abbreviations will be used in the physical representations of that element, which must be shorter. In addition, each term should be given a description, to avoid confusion where a single term is common to two or more business functions but has different meanings for each. Minus the definitions, your lexicon becomes the Valid Terms List as used by Validator.
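
A lexicon of this kind might be sketched in Python as follows. The Term structure, the sample descriptions, and the dictionary form of the resulting list are assumptions made for illustration; the actual file format of Validator's Valid Terms List is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One lexicon entry: full name, standard abbreviation, description."""
    name: str
    abbreviation: str
    description: str


# Hypothetical sample lexicon; descriptions are illustrative.
LEXICON = [
    Term("Employee", "EMP", "A person on the organization's payroll."),
    Term("Identifier", "ID", "A value that uniquely identifies an instance."),
    Term("Amount", "AMT", "A monetary quantity."),
]


def valid_terms_list(lexicon: list[Term]) -> dict[str, str]:
    """Drop the descriptions, keeping term names and abbreviations only."""
    return {term.name: term.abbreviation for term in lexicon}


print(valid_terms_list(LEXICON))
# {'Employee': 'EMP', 'Identifier': 'ID', 'Amount': 'AMT'}
```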

Model Business Naming Conventions

The following rules for devising business names are common to a number of standard sets. They are offered here as a collection of "model" data element naming standards that may provide inspiration. Where Validator specifically assists with the enforcement of a standard, the rule is followed by "SPEC" plus the number of the specification screen page on which the option is found. Where Validator does the process automatically, or optionally generates an issue message, the rule is followed by "VAL."

Semantic Rules

Syntax Rules

Structuring Conventions

Physical Naming Conventions

Physical attribute names are constructed by abbreviating the business name, using standard abbreviations. They should conform to the restrictions of the programming language and the DBMS being used (SPEC - 1). (For example, SQL Server imposes an 18 character limit.)
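
As a rough sketch of this construction in Python: each word of the business name is replaced by its standard abbreviation, the parts are joined, and the result is checked against a length limit. The abbreviation table, the underscore separator, and the function name are hypothetical; the 18-character default simply echoes the example limit quoted above.

```python
# Hypothetical standard abbreviations, keyed by lower-case term.
ABBREVIATIONS = {
    "employee": "EMP",
    "identifier": "ID",
    "home": "HM",
    "phone": "PHN",
    "code": "CD",
}

MAX_LENGTH = 18  # example limit taken from the text


def physical_name(business_name: str,
                  abbreviations: dict[str, str] = ABBREVIATIONS,
                  max_length: int = MAX_LENGTH) -> str:
    """Abbreviate each word of a business name and join with underscores."""
    parts = []
    for word in business_name.split():
        key = word.lower()
        if key not in abbreviations:
            raise KeyError(f"no standard abbreviation for term: {word!r}")
        parts.append(abbreviations[key])
    name = "_".join(parts)
    if len(name) > max_length:
        raise ValueError(f"{name!r} exceeds {max_length} characters")
    return name


print(physical_name("Employee Identifier"))       # EMP_ID
print(physical_name("Employee Home Phone Code"))  # EMP_HM_PHN_CD
```

Raising an error on an unlisted term, rather than guessing an abbreviation, keeps non-standard words from slipping into physical names.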

Guideline for Abbreviating Names



Disseminating and Enforcing Standards

Issues include:

Proper Scope. Standards must be set for an entire organization or enterprise; there is no room for separate standards in multiple divisions of a company.

Acceptance and support. Effective standards require "buy-in" from both management and technical development staff. Standards must be mandated by management, and supported by policies and procedures.

Availability. Standards must be published and disseminated through paper documentation and training, and ideally through an automated validation tool such as Validator.

Enforceability. Standards must be relevant and kept simple and understandable. An easy-to-use testing method must be available to any local developer who is designing a data structure. Beyond a few generalities, all standards must be expressed as specific criteria which can be tested: e.g., "not more than 50 characters" in contrast to "not too long".

Responsibilities for Maintenance. Standards must be monitored and maintained by a central function such as DA, DBA, or system architecture. Individuals responsible for using standards, such as database developers, must provide feedback that will be used to ensure that the standards remain practical and relevant in a changing technological and business environment.

Responsibilities for Enforcement. Standards should also be enforced centrally, in the course of system design reviews; however, standards must be primarily enforced as part of normal business practice by development and design staff. Roles include:

Timeliness. Standards must be applied when systems are designed for the first time, not as part of some formal review after-the-fact when coding is half-done and databases are populated with test data. That is too late, and is very costly in resource use and in delayed schedules, and may even result in standards which are ignored for expediency. For this reason, standards enforcement cannot be left solely in the hands of a central group such as DA, but must be part of every developer's responsibility.
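
The Enforceability point above asks for criteria that can be tested mechanically. A minimal Python sketch of such a test follows. The 50-character limit is the example from the text; the allowed-character rule and the check_label function are assumptions added for illustration and do not describe Validator's actual checks.

```python
import re

MAX_LABEL_LENGTH = 50  # "not more than 50 characters"

# Hypothetical rule: letters, digits, and spaces only.
ALLOWED_PATTERN = re.compile(r"[A-Za-z0-9 ]+")


def check_label(label: str) -> list[str]:
    """Return a list of issues; an empty list means the label passes."""
    issues = []
    if len(label) > MAX_LABEL_LENGTH:
        issues.append(f"longer than {MAX_LABEL_LENGTH} characters")
    if not ALLOWED_PATTERN.fullmatch(label):
        issues.append("contains characters other than letters, digits, and spaces")
    return issues


print(check_label("Employee Last Name"))
# []
print(check_label("Payroll hrs + month chge amt"))
# ['contains characters other than letters, digits, and spaces']
```

A check this small can be run by a developer at design time, which is the point of the Timeliness item: the test happens when the label is created, not at a later review.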

Conclusion

Object label design and naming is a big job with invaluable consequences. Adhering to standards is a necessary part of the job if information is to be easily accessed and shared. We have offered some opinions about what makes some standards better or easier to use than others, but ultimately the best standard is one that is actually used. Validator can alleviate the burden by testing labels and offering suggestions for modifications based on standards specified by the user, and can enforce consistency.

Best of all, Validator can be placed in the hands of the people who need it: system designers and database architects doing new systems or system reengineering work. Validator's Enterprise version is designed to be centrally maintained, but to be used by every project team in an organization.








Kismet Analytic Corp., P.O. Box 3218, Arlington, VA 22203. Phone/fax 703-531-3845


Last Updated on November 30, 1997. By: info@kismeta.com

