This is one of my favourite quotes:

There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton

IT is my daily life, and this quote is so true. Lately, I’ve been thinking much more than usual about naming and about how much names really matter. This led to some refactoring activities.

Why Is Naming Important?

When a native German speaker hears the word “eagle”, she or he automatically associates it with a hedgehog, simply because the German word for hedgehog (“Igel”) is pronounced exactly the same. Of course, language skills and the concrete context play a role. The point is that a wrong association is likely. When we give a name to a thing, we basically want to avoid such false associations. In the best case, they are merely unhelpful. In the worst case, they lead to rejection, as the next example shows.

In 1982 Mitsubishi Motors launched an SUV with the name “Pajero”. This name had to be changed in some regions, because “pajero” means “wanker” in Spanish. This example also shows that what others think about a name matters more than what we think about it.

In IT we have to name many things: databases, schemas, tables, columns, views, packages, triggers, variables, fields, methods, classes, modules, components, products, etc. Using an established name with a known and accepted definition helps others understand it better.

When we use a name, it is associated with a definition and properties, whether we like it or not. When a name has a common and widely accepted meaning, it simplifies communication. Take “banana”, for example. Everybody knows what it means. Merriam-Webster’s definition is:

An elongated usually tapering tropical fruit with soft pulpy flesh enclosed in a soft usually yellow rind.

And I am sure that each of us could add a few characteristics to this definition.

Why Is Naming Difficult?

A name must meet many requirements. For example, it should be:

  • Short
  • Fitting (naturally relates to the intended meaning, and characteristics)
  • Easy to spell, pronounce, remember
  • Not associated with unwanted characteristics
  • Common and widely accepted meaning and definition, that fits the intention (for names without commercial value)
  • New, not used already (for marketable names)

Depending on the context, some of these goals conflict. However, even without a major conflict, it is difficult to name something adequately in the early stages, because we do not know enough about the thing we want to name. Hence, we use an iterative approach. We name something (e.g. an entity, package or class), and while working on it we find out that the name does not fit (anymore) and we change it. Maybe we split the thing and now have to name two things, and so on.

Finding a fitting name means doing some research. How have others named that thing? What is the definition of it? Does it fit 100 percent? This is interesting and instructive work. In any case, it takes time. And at the moment we need a new name, we want it now (e.g. when a wizard asks for a name). We can always rename it later, right? Technically, yes. And often we do. But the longer we wait, the less likely we are to rename it.

Are Some Names More Important Than Others?

Yes. The more visible a name is, the more important it is.

For example, the names behind an API are very easy to change. We do not have to ask anyone before changing them. It’s no problem as long as the API provides the same results. That’s one of the reasons we strive for tight APIs, right? To get some leeway.
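
A minimal Python sketch may make this leeway more concrete. Everything in it is hypothetical and invented for illustration: the function names `monthly_total` and `_sum_amounts` as well as the order data. The point is only that callers depend on the public name, while the helper behind it can be renamed at any time without asking anyone.

```python
# Hypothetical example: all names and data are made up for illustration.

def _sum_amounts(orders):
    """Internal helper. Callers never see this name, so it can be renamed freely."""
    return sum(order["amount"] for order in orders)


def monthly_total(orders, month):
    """Public API. This name is visible to others and expensive to change."""
    return _sum_amounts(order for order in orders if order["month"] == month)


orders = [
    {"month": "2020-06", "amount": 10},
    {"month": "2020-06", "amount": 20},
    {"month": "2020-07", "amount": 5},
]
print(monthly_total(orders, "2020-06"))  # prints 30
```

Renaming `_sum_amounts` to something more fitting changes nothing for the callers of `monthly_total`, as long as the result stays the same.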

As soon as others are involved, we are not fully in control of the change anymore. For example, when I change a name in one of my blog posts, this change is visible immediately to everyone visiting my blog. But I cannot control the caches of others, like search engines, blog mirrors and other services that copy web content to third-party storage. Remember, cache invalidation is the other hard thing in IT.

As a consequence, before we release an artefact that becomes visible to others, we should take some time to verify the used names. We cannot take back what we’ve said (at least not completely). However, we are in control of what we say in the future.

Banned Names on This Blog

Some terms (names) were discussed recently (again) due to a series of sad events. I used these terms as well. I never really thought about them as “bad”. However, I’ve changed my mind. I’m part of the problem. And I do not like it. One thing I can do is to stop using terms that a large group of people associate with slavery and racism. No big deal, right?

This is another quote I like very much:

One cannot not communicate
— Paul Watzlawick

It is difficult to draw a line for certain terms. However, I believe that “you cannot not decide”. You decide either explicitly or implicitly. Of course, very seldom is something purely black or white. It’s much more often a shade of grey. Some decisions take time. And that’s okay. But it is impossible to postpone a decision forever. At a certain point, not deciding becomes a decision.

So, I decided to decommission some terms on this blog and introduce new ones. Here’s the list:

| Current Term | Decommissioned Term | Context |
|---|---|---|
| accessible | white listed | PL/SQL accessible_by clause |
| agent | slave | Jenkins |
| exclusion list | blacklist | PL/SQL Cop, PL/SQL accessible_by clause |
| inclusion list | whitelist | PL/SQL Cop, PL/SQL accessible_by clause |
| main | master | Git branch |
| transaction structure data + enterprise structure data | master data | Data modeling |
| worker | slave | Oracle DB background process |

Finding alternative names was surprisingly easy because others had already done the work and defined alternative names. These alternatives have existed for years…

Master Data

However, finding an alternative for master data was harder. I reached out to my friends on Twitter and got some helpful feedback. Finally, Robert Marti suggested having a look at Malcolm Chisholm’s book Managing Reference Data in Enterprise Databases. The different data classes are defined and explained on page 258ff. The book is from 2000. In the meantime, Malcolm Chisholm has published revised definitions here and here.

In the following subsections, I repeat the definitions of the data groups as given by Malcolm Chisholm on slide 5 in this deck. I like these definitions and plan to use them in the future.

Metadata

The data that describes all aspects of an enterprise’s information assets, and enables the enterprise to effectively use and manage these assets.

Here it is confined to the structure of databases. Found in a database’s system catalog. Sometimes included in database tables.

Reference Data

Any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.

Codes and descriptions. Tables containing this data usually have just a few rows and columns.

Transaction Structure Data

Data that represents the direct participants in a transaction, and which must be present before a transaction fires.

The parties to the transactions of the enterprise. E.g. Customer, Product.

Enterprise Structure Data

Data that permits business activity to be reported and/or analyzed by business responsibility.

Typically, data that describes the structure of the enterprise. E.g. organizational or financial structure.

Transaction Activity Data

Data that represents the operations an enterprise carries out.

Traditional focus of IT – in many enterprises the only focus.

Transaction Audit Data

Data that tracks the life cycle of individual transactions.

Includes application logs, database logs, web server logs.

Summary

You use a name to simplify communication. A name is a proxy for a longer definition and meaning. If the meaning is badly received by others and especially by the target community, this does not simplify communication. Using a different name sounds like a simple solution. Why not, if changing a name is simple enough?

In this case, I only had to edit a few blog posts. I handled them like typos. This means that I did not add any update information. I also had to register new URL redirects. That was straightforward. However, changing the branch name in 26 GitHub repositories was a bit more work than anticipated, because I also had to change URLs in several related files. For certain GitHub pages, I had to keep a non-default master branch. I suppose that sooner or later GitHub will allow me to get rid of them as well. If I had to change more repositories, I would probably automate this task.
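
Such an automation could look roughly like the following Python sketch. It is not what I actually ran. It assumes that every repository is already cloned locally, that the remote is called `origin`, and that related files reference the branch through GitHub-style `/blob/`, `/tree/` or `/raw/` URLs. The function names and the repository paths are hypothetical. Switching the default branch on GitHub and deleting the old branch are deliberately left out, since some repositories may have to keep a `master` branch.

```python
import re
import subprocess
from typing import List


def rename_commands(repo_dir: str, old: str = "master", new: str = "main",
                    remote: str = "origin") -> List[List[str]]:
    """Build the git commands that rename a branch locally and publish it."""
    git = ["git", "-C", repo_dir]
    return [
        git + ["branch", "-m", old, new],   # rename the local branch
        git + ["push", "-u", remote, new],  # push it and set the upstream
    ]


def rewrite_branch_urls(text: str, old: str = "master", new: str = "main") -> str:
    """Replace the branch segment in /blob/, /tree/ and /raw/ URLs.

    The lookahead keeps branch names that merely start with `old`
    (e.g. "master-data") untouched.
    """
    pattern = re.compile(rf"/(blob|tree|raw)/{re.escape(old)}(?![\w.-])")
    return pattern.sub(lambda m: f"/{m.group(1)}/{new}", text)


def rename_branch(repo_dir: str, dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for cmd in rename_commands(repo_dir):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical local clones; the dry run only prints the commands.
    for repo in ["repos/project-a", "repos/project-b"]:
        rename_branch(repo, dry_run=True)
```

The dry run is the default on purpose: with 26 repositories, I would want to read the commands before executing any of them.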

I spent most of the time finding an alternative name for “master data”. In the end, I learned something new and found good names and definitions. That will help me in the future.

2 Comments

  1. Michael Milligan says:

    Thank you for this excellent, thoughtful article.

    Being raised by conservative parents who nonetheless taught us – emphatically – that “words can hurt”, I have always paid attention to ontology. Your new terms have another major advantage: clarity. The terms themselves either encompass the definition or, at the very least, indicate their meaning directly rather than euphemistically.

    An example of a self-defining term is “exclusion list”. It could not be any clearer.

    An example of a term with a direct meaning is “agent”. Why use a euphemism, when a word whose meaning is commonly understood exists? The use of euphemisms requires all to know both it and the context of its use.

    One further suggestion for the industry: let’s stop using the terms “big endian” and “little endian”. No doubt these were coined as memory mnemonics, but their use is and always was in poor taste, not to mention too cute by half. The terms “big ending” and “little ending” work just fine for me.

    Thanks again,

    Michael Milligan
    Layton, Utah

    • Many thanks for your positive feedback.

      Regarding endianness. I like the relation to Gulliver’s Travels and the fact that the Lilliputians are called Little-Endians because they open a boiled egg on the little end. I see no problem in continuing to use the terms little-endian, big-endian or bi-endian. They are well defined and not discriminating IMO.