Quoted Identifiers #JoelKallmanDay

Background and TL;DR

Connor McDonald wrote a blog post named Cleaner DDL than DBMS_METADATA. Back then he asked me if it would be possible to let the formatter remove unnecessary double quotes from quoted identifiers. Yes, of course. Actually, the current version of the PL/SQL & SQL Formatter Settings does exactly that. And no, you cannot do that with dbms_metadata in the Oracle Database versions 19c and 21c. Read on if you are interested in the details.

The Problem with DBMS_METADATA …

When you execute a DDL statement against the Oracle Database, the database stores some metadata in the data dictionary. The DDL statement itself is not stored, at least not completely. This is one reason why it is a good idea to store DDL statements in files and manage these files within a version control system. Nevertheless, the Oracle Database provides an API – dbms_metadata – to reconstruct a DDL statement based on the information available in the data dictionary.

Let’s create a view based on the famous dept and emp tables:

and retrieve the DDL like this:

to produce this DDL:

We see that the subquery part of the view has been preserved, except the first line which has a different indentation. Actually, the indentation of the first line is not stored in the data dictionary (see user_views.text). The two spaces are produced by the default pretty option of dbms_metadata. So far, so good.

In many cases, the Oracle Data Dictionary explicitly stores default values. For instance, Y for editionable or USING_NLS_COMP for default_collation. This fact alone makes it impossible to reconstruct the original DDL in a reliable way. The database simply does not know whether an optional clause such as editionable or default collation has been specified or omitted. Moreover, some optional DDL clauses such as or replace or force are simply not represented in the data dictionary.

… Especially with Quoted Identifiers

And last but not least, identifiers such as column names, table names or view names are stored without double quotes. Therefore, the database knows nothing about the use of double quotes in the original DDL. However, the database knows exactly when double quotes are required. As a result, dbms_metadata could emit only necessary double quotes. This would result in a more readable DDL and would probably also be more similar to the original DDL.

The reality is that code generators such as dbms_metadata often use double quotes for all identifiers. It’s simply easier for them, because this way the generated code works for all kinds of strange identifiers.

However, using quoted identifiers is a bad practice. It is, in fact, a very bad practice when they are used unnecessarily.

Shaping the DDL

So what can we do? We can configure dbms_metadata to produce a DDL which is more similar to our original one. In this case we can change the following:

  • remove the schema of the view (owner)
  • remove the force keyword
  • remove the default collation clause
  • add the missing SQL terminator (;)

This query

produces this result:

This looks better. However, I would like to configure dbms_metadata to omit the default editionable clause. Furthermore, I do not like the column alias list, which is unnecessary in this case. And of course I’d like to suppress unnecessary double quotes around identifiers. Is that possible with dbms_metadata?

Shaping the DDL from (S)XML

Well, we can try. The dbms_metadata API is very extensive. Besides other things, it can also represent metadata as an XML document. There are two formats.

  • XML – An extensive XML containing internals such as object number, owner number, creation date, etc.
  • SXML – A simple and terse XML that contains everything you need to produce a DDL. The SXML format is therefore very well suited for schema comparison.

It’s possible to produce a DDL from both formats. We can also change the XML beforehand.

Let’s look at both variants in the next two subchapters.

Important: I consider the changes to the XML document and configuration of dbms_metadata in the following subchapters as experimental. The purpose is to show what is doable. They are not good examples of how it should be done. Even though the unnecessary list of column aliases annoys me, I would leave them as they are. I also think that overriding the default VERSION is a very bad idea in the long run.

Convert XML to DDL

The query produces the following two rows (CLOBs):

We removed the OWNER_NAME node (on line 11) from the XML document. As a result, the schema was removed in the DDL. The result is the same as with the REMAP_SCHEMA transformation. Perfect.

We also removed the COL_LIST node (lines 48-99) from the XML document. However, the result in the DDL regarding the column alias list does not look good. The columns are gone, but the surrounding parentheses survived, which makes the DDL invalid. IMO this is a bug of the $ORACLE_HOME/rdbms/xml/xsl/kuview.xsl script. It’s handled correctly in the SXML script as we will see later. However, we can fix that by calling replace(..., '" () AS', '" AS'). Please note that a complete solution should do some checks to ensure that the COL_LIST is really not required.

When you look at line 12 in the XML document (<NAME>DEPTSAL</NAME>), you see that the view name does not contain double quotes. This is a strong indicator that there is no way to remove the double quotes by manipulating the input XML document. In fact, the double quotes are hard-coded in all XSLT scripts. There is no way to override this behavior via dbms_metadata.

Furthermore, you do not find a node named like EDITIONABLE with a value Y as in all_objects. Why? Because this information is stored in the node FLAGS. 0 means editionable and 1048576 means noneditionable. To be precise, 1048576 is 2^20, which is bit number 21 when counting from 1. If this bit is set then the view is noneditionable. You find the proof for this statement in the dba_objects view, where the expression for the editionable column looks like this:

The $ORACLE_HOME/rdbms/xml/xsl/kucommon.xsl script (see template Editionable) evaluates this flag and emits either an EDITIONABLE or a NONEDITIONABLE text. These keywords were introduced in version 12.1. Since dbms_metadata can produce version specific DDL, we set the version to 11.2 to suppress EDITIONABLE in the resulting DDL.
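The bit arithmetic behind the FLAGS node can be illustrated with a few lines of Python. This is only a sketch of the check described above, not the expression used in dba_objects or in the XSLT script. The function name is made up for this example.

```python
# Mask taken from the text: 1048576 marks a noneditionable view.
NONEDITIONABLE_MASK = 1048576


def is_editionable(flags: int) -> bool:
    """Return True when the noneditionable bit is not set in FLAGS."""
    return (flags & NONEDITIONABLE_MASK) == 0


# 1048576 is a single bit: 2 to the power of 20.
print(NONEDITIONABLE_MASK == 1 << 20)
print(is_editionable(0))
print(is_editionable(1048576))
# Other bits in FLAGS do not influence the result.
print(is_editionable(1048576 | 4))
```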

Convert SXML to DDL

The query produces the following two rows (CLOBs):

The SXML document is smaller. It contains just the nodes to produce a DDL. That makes it easier to read.

We removed the SCHEMA node (on line 2) from the SXML document. As a result, the schema was removed in the DDL. But not completely. Two double quotes and one dot survived, which makes the DDL invalid. IMO this is a bug of the $ORACLE_HOME/rdbms/xml/xsl/kusviewd.xsl script. It’s handled correctly in the XML script. We could fix that with a replace(..., 'VIEW ""."', 'VIEW "') call. As long as the search term is not ambiguous, everything should be fine.
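The caveat about ambiguous search terms can be made explicit in code. The following Python sketch mirrors the replace call mentioned above, but refuses to touch the DDL unless the search term occurs exactly once. The helper name and the shortened DDL string are hypothetical and only serve as an illustration.

```python
def replace_unambiguous(ddl: str, search: str, replacement: str) -> str:
    """Replace search in ddl, but only if it occurs exactly once."""
    count = ddl.count(search)
    if count != 1:
        raise ValueError(
            f"expected exactly one occurrence of {search!r}, found {count}"
        )
    return ddl.replace(search, replacement)


# Hypothetical, shortened DDL with the leftover '"".' in front of the view name.
broken = 'CREATE OR REPLACE VIEW ""."DEPTSAL" AS SELECT 1 AS "X" FROM "DUAL"'
fixed = replace_unambiguous(broken, 'VIEW ""."', 'VIEW "')
print(fixed)
```

Raising an error instead of silently replacing all matches keeps the string fix from damaging a subquery that happens to contain the same character sequence.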

We also removed the COL_LIST node (lines 5-21) from the SXML document. In this case the column alias list is completely removed from the DDL, including the parentheses. Nice.

Maybe you wonder how editionable is represented in the SXML document. It is represented by a NONEDITIONABLE node if the view is noneditionable.

How Can We Work Around the Limitations?

We’ve seen the limitations of the current dbms_metadata API and the necessity to use string manipulation functions to fix invalid DDL.

There is no way to remove double quotes from quoted identifiers with dbms_metadata. However, as Connor McDonald demonstrated in his blog post we can remove them with some string acrobatics. Why not use a simple replace call? Because there are some rules to follow. A globally applied replace(..., '"', null) call would produce invalid code in many real life scenarios. We need a more robust solution.
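To see why a global replace is not enough, here is a small Python sketch that applies the naive approach to a few statements. The statements are made-up examples, and Python's str.replace stands in for the SQL replace function.

```python
def naive_unquote(sql: str) -> str:
    """Remove every double quote, regardless of context."""
    return sql.replace('"', "")


samples = [
    'select "EMPNO" from "EMP"',            # safe: quotes are unnecessary
    'select "first name" from "EMP"',       # identifier contains a space
    """select 'say "hi"' from dual""",      # double quotes inside a string literal
    'select "empno" from "EMP"',            # case-sensitive column name
]

for sql in samples:
    print(naive_unquote(sql))
```

Only the first statement keeps its meaning. In the second one the column becomes first with an alias name, in the third one the content of the string literal changes, and in the fourth one the statement now refers to a column named EMPNO instead of empno.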

Applying the rules in a code formatter can be such a robust solution.

Rules for Safely Removing Double Quotes from Quoted Identifiers

What are the rules to follow?

1. Is a SQL or PL/SQL Identifier

You have to make sure that the double quotes surround a SQL or PL/SQL identifier. Sounds logical. However, it is not that simple. Here are some examples:

You can solve the first three examples easily with a lexer. A lexer groups a stream of characters. Such a group of characters is called a lexer token. A lexer token knows the start and end position in a source text and has a type. The lexer in SQL Developer and SQLcl produces the following types of tokens:

  • COMMENT (/* ... */)
  • LINE_COMMENT (-- ...)
  • QUOTED_STRING ('string' or q'[string]')
  • DQUOTED_STRING ("string")
  • WS (space, tab, new line, carriage return)
  • DIGITS (0123456789 plus some special cases)
  • OPERATION (e.g. ()[]^-|!*+.><=,;:%@?/~)
  • IDENTIFIER (words)
  • MACRO_SKIP (conditional compilation tokens such as $if, $then, etc.)

We can simply focus on tokens of type DQUOTED_STRING and ignore tokens that are located between the conditional compilation tokens $if and $end.

Finding out whether a DQUOTED_STRING is part of a Java stored procedure is more difficult. Luckily SQL Developer’s parser cannot deal with Java stored procedures and produces a parse error. As a result, we just have to keep the code “as is” in such cases.
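The idea can be sketched with a toy lexer in Python. It is not the lexer of SQL Developer or SQLcl. It only borrows the token type names from the list above and makes simplifying assumptions: q-quoted strings are not supported and nested $if blocks are tracked with a simple counter.

```python
import re

# Order matters: comments and strings must win over single-character operations.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),
    ("LINE_COMMENT", r"--[^\n]*"),
    ("QUOTED_STRING", r"'(?:[^']|'')*'"),
    ("DQUOTED_STRING", r'"[^"]*"'),
    ("WS", r"\s+"),
    ("MACRO_SKIP", r"\$\w+"),
    ("DIGITS", r"\d+"),
    ("IDENTIFIER", r"[^\W\d][\w$#]*"),
    ("OPERATION", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def candidate_dquoted_strings(source: str) -> list[tuple[int, str]]:
    """Return (position, text) of DQUOTED_STRING tokens outside $if ... $end."""
    depth = 0
    result = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "MACRO_SKIP":
            if text.lower() == "$if":
                depth += 1
            elif text.lower() == "$end":
                depth = max(depth - 1, 0)
        elif kind == "DQUOTED_STRING" and depth == 0:
            result.append((match.start(), text))
    return result


source = '''select "EMPNO" -- "not an identifier"
from "EMP" /* "also not" */ where ename = 'a "quoted" word'
$if true $then "SKIPPED" $end'''

print([text for _, text in candidate_dquoted_strings(source)])
```

Because comments and string literals are consumed as whole tokens, the double quotes inside them never show up as DQUOTED_STRING tokens.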

2. Consists of Valid Characters

According to the PL/SQL Language Reference a nonquoted identifier must comply with the following rules:

An ordinary user-defined identifier:

  • Begins with a letter
  • Can include letters, digits, and these symbols:
    • Dollar sign ($)
    • Number sign (#)
    • Underscore (_)

What is a valid letter in this context? The SQL Language Reference defines a letter as an “alphabetic character from your database character set”. Here are some examples of valid letters and therefore valid PL/SQL variable names or SQL column names:

  • Latin letters (AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz)
  • Umlauts (ÄäËëÏïÖöÜüŸÿ)
  • German Esszett (ẞß), please note that the Oracle Database does not convert the case of an Esszett, because the uppercase Esszett has officially existed only since 2017-03-29
  • C cedilla (Çç)
  • Grave accented letters (ÀàÈèÌìÒòÙù)
  • Acute accented letters (ÁáĆćÉéÍíÓóÚúÝý)
  • Circumflex accented letters (ÂâÊêÎîÔôÛû)
  • Tilde accented letters (ÃãÑñÕõ)
  • Caron accented letters (ǍǎB̌b̌ČčĚěF̌f̌ǦǧȞȟǏǐJ̌ǰǨǩM̌m̌ŇňǑǒP̌p̌Q̌q̌ŘřŠšǓǔV̌v̌W̌w̌X̌x̌Y̌y̌ŽžǮǯ)
  • Ring accented letters (ÅåŮů)
  • Greek letters (ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσΤτΥυΦφΧχΨψΩω)
  • Common Cyrillic letters (АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя)
  • Hiragana letters (ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖゝゞゟ)

The Oracle Database throws an ORA-00911: invalid character when this rule is violated.

Cause: The identifier name started with an ASCII character other than a letter or a number. After the first character of the identifier name, ASCII characters are allowed including “$”, “#” and “_”. Identifiers enclosed in double quotation marks may contain any character other than a double quotation. Alternate quotation marks (q’#…#’) cannot use spaces, tabs, or carriage returns as delimiters. For all other contexts, consult the SQL Language Reference Manual.

The cause for this error message seems to be outdated, inaccurate and wrong. Firstly, it limits letters to those contained in an ASCII character set. This limitation is not generally valid anymore. Secondly, it claims that an identifier can start with a number, which is simply wrong. Thirdly, ASCII characters and letters are used as synonyms, which is misleading.

However, there are still cases where an identifier is limited to ASCII characters or single byte characters. For example, a database name or a database link name. In the projects I know, the reduction of letters to A-Z for identifiers is not a problem. The use of accented letters in identifiers is typically an oversight. Therefore, I recommend limiting the range of letters in identifiers to A-Z.

Checking this rule is quite simple. We just have to make sure that the quoted identifier matches this regular expression: ^"[A-Z][A-Z0-9_$#]*"$. This works with any regular expression engine, unlike ^"[[:alpha:]][[:alpha:]0-9_$#]*"$.
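As a quick illustration, here is the check in Python. The function name is made up for this sketch. re.fullmatch is used instead of the ^ and $ anchors, which has the same effect here.

```python
import re

# Same character rule as above, without the ^ and $ anchors because
# fullmatch already requires the whole string to match.
VALID_CHARACTERS = re.compile(r'"[A-Z][A-Z0-9_$#]*"')


def has_only_valid_characters(quoted_identifier: str) -> bool:
    """Check rule 2 for an identifier that still carries its double quotes."""
    return VALID_CHARACTERS.fullmatch(quoted_identifier) is not None


for candidate in ['"EMP"', '"EMP_2$#"', '"2EMP"', '"first name"', '"_X"']:
    print(candidate, has_only_valid_characters(candidate))
```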

3. Is Not a Reserved Word

According to the PL/SQL Language Reference and the SQL Language Reference a nonquoted identifier must not be a reserved word.

If you are working with 3rd party parsers, the list of reserved words might not match the ones defined for the Oracle Database. In my case I also want to consider the reserved words defined by db* CODECOP. I’m using the following query to create a JSON array with currently 260 keywords:

The result can be used to populate a HashSet. This allows you to check very efficiently whether an identifier is a keyword.

Of course, such a global list of keywords is a simplification. In reality, the restrictions are context-specific. However, I consider the use of keywords for identifiers in any context a bad practice. Therefore, I can live with some unnecessarily quoted identifiers.
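The lookup itself is simple. The following Python sketch uses a frozenset in place of a Java HashSet and loads it from a JSON array. The array shown here is a tiny, hand-picked subset of reserved words for illustration only, not the result of the query mentioned above.

```python
import json

# Illustrative subset only. The real list contains many more entries.
KEYWORDS_JSON = '["SELECT", "FROM", "WHERE", "TABLE", "ORDER", "GROUP", "BY"]'
RESERVED_WORDS = frozenset(json.loads(KEYWORDS_JSON))


def is_reserved(quoted_identifier: str) -> bool:
    """Check rule 3 for an identifier that still carries its double quotes."""
    name = quoted_identifier[1:-1]  # strip the surrounding double quotes
    return name.upper() in RESERVED_WORDS


print(is_reserved('"TABLE"'))
print(is_reserved('"DEPTSAL"'))
```

A set lookup takes constant time on average, so checking every quoted identifier in a large code base stays cheap.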

4. Is in Upper Case

This means that the following condition must be true: quoted_identifier = upper(quoted_identifier).

This does not necessarily mean that the identifier is case-insensitive as the following examples show:

You can check this rule in combination with rule 2 by using a case-sensitive regular expression, which is the default. The character class [A-Z] then matches upper case letters only.
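The following Python sketch shows why case sensitivity matters when the regular expression from rule 2 is reused for this rule. The variable names are made up for this example.

```python
import re

PATTERN = r'"[A-Z][A-Z0-9_$#]*"'

# Default matching is case-sensitive: [A-Z] accepts upper case letters only.
strict = re.compile(PATTERN)
# With IGNORECASE the same pattern also accepts lower case letters.
relaxed = re.compile(PATTERN, re.IGNORECASE)

for candidate in ['"EMPNO"', '"empno"', '"EmpNo"']:
    print(
        candidate,
        strict.fullmatch(candidate) is not None,
        relaxed.fullmatch(candidate) is not None,
        candidate == candidate.upper(),
    )
```

Only the case-sensitive match agrees with the condition quoted_identifier = upper(quoted_identifier) for all three candidates.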

5. Is Not Part of a Code Section for Which the Formatter Is Disabled

When you use a formatter, there are some code sections that you do not want the formatter to change. Therefore, we want to honor the marker comments that disable and enable the formatter.

Here is an example:

After calling the formatter we expect the following output (when changing identifier case to lower is enabled):

To check this we can reuse the approach for quoted identifiers in conditional compilation text.
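A possible way to implement such a check is to compute the disabled ranges first and then test the position of each quoted identifier against them. The Python sketch below assumes the marker comments are written as -- @formatter:off and -- @formatter:on. These marker names and the function names are assumptions of this example, so adapt them to the markers your formatter settings actually support.

```python
import re

# Assumed marker comments, see the note above.
OFF = re.compile(r"--\s*@formatter:off", re.IGNORECASE)
ON = re.compile(r"--\s*@formatter:on", re.IGNORECASE)


def disabled_ranges(source: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of sections where the formatter is off."""
    ranges = []
    pos = 0
    while True:
        off = OFF.search(source, pos)
        if off is None:
            break
        on = ON.search(source, off.end())
        # Without a closing marker the section runs to the end of the source.
        end = on.end() if on else len(source)
        ranges.append((off.start(), end))
        pos = end
    return ranges


def is_disabled(position: int, ranges: list[tuple[int, int]]) -> bool:
    """Check whether a token position lies inside a disabled section."""
    return any(start <= position < end for start, end in ranges)


source = (
    'select "A" from t;\n'
    "-- @formatter:off\n"
    'select "B" from t;\n'
    "-- @formatter:on\n"
    'select "C" from t;'
)
ranges = disabled_ranges(source)
for name in ['"A"', '"B"', '"C"']:
    print(name, is_disabled(source.index(name), ranges))
```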

Removing Double Quotes from Quoted Identifiers with Arbori

As mentioned at the beginning of the post, the current version of the PL/SQL & SQL Formatter Settings can safely remove double quotes from PL/SQL and SQL code.

Here are simplified formatter settings which can be imported into SQL Developer 22.2.1. The formatter with these settings only removes the double quotes from identifiers in a safe way and leaves your code “as is”. You can download these settings from this Gist.

Firstly, the option adjustCaseOnly ensures that the Arbori program is fully applied.

Secondly, the option singleLineComments ensures that the whitespace before single line comments is kept “as is”.

Thirdly, the maxCharLineSize ensures that no line breaks are added. The value of 120000 seems to be ridiculously high. However, I’ve seen single lines of around hundred thousand characters in the wild.

Fourthly, the option idCase ensures that the case of nonquoted identifiers is not changed. This is important for JSON dot notation.

Fifthly, the option kwCase ensures that the case of keywords is also kept “as is”.

And finally, the option formatWhenSyntaxError ensures that the formatter does not change code that the formatter does not understand. This is important to keep Java strings intact.

The values of all other options are irrelevant for this Arbori program.

Firstly, the lines 1 to 8 are required by the formatter. They are not interesting in this context.

Secondly, the lines 9 to 42 are the heart of a lightweight formatter. This code ensures that all whitespace between all tokens is kept. Therefore, the existing format of the code remains untouched. Read this blog post to learn how SQL Developer’s formatter works.

Thirdly, the lines 43 to 173 remove unnecessary double quotes from identifiers. We store the position of double quotes to be removed on lines 147 and 148 in an array named delpos while processing all tokens from start to end. The removal of the double quotes happens on line 153 while processing delpos entries from end to start.

And finally, the lines 174 to 180 define an Arbori query named identifier. The formatter uses this query to divide lexer tokens of type IDENTIFIER into keywords and identifiers. This is important to ensure that the case of identifiers is left “as is” regardless of the configuration of kwCase.
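The reason for processing delpos from end to start is that deleting a character shifts all positions behind it. The following Python sketch illustrates the technique. It is not the Arbori code, and only the name delpos is borrowed from it.

```python
import re


def remove_positions(text: str, delpos: list[int]) -> str:
    """Delete the characters at the given positions, last position first."""
    chars = list(text)
    # Going backwards keeps the remaining, smaller positions valid.
    for pos in sorted(delpos, reverse=True):
        del chars[pos]
    return "".join(chars)


text = 'select "EMPNO" from "EMP"'

# Collect the positions of the opening and closing double quotes from
# start to end. A simple regular expression stands in for the lexer here.
delpos = []
for match in re.finditer(r'"[A-Z][A-Z0-9_$#]*"', text):
    delpos.append(match.start())
    delpos.append(match.end() - 1)

print(delpos)
print(remove_positions(text, delpos))
```

Deleting from start to end would require adjusting every later position by the number of characters already removed.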

Doesn’t Connor’s PL/SQL Function Do the Same?

No, when you look closely at the ddl_cleanup.sql script as of 2022-03-02, you will find out that the ddl function has the following limitations:

  • Quoted identifiers are not ignored in
    • Single and multi-line comments
    • Conditional compilation text
    • Code sections for which the formatter is disabled
  • Java Strings are treated as quoted identifiers
  • Reserved keywords are not considered
  • Nonquoted identifiers are changed to lower case, which might break code using JSON dot notation

It just shows that things become complicated when you don’t solve them in the right place, which in this case is dbms_metadata’s XSLT scripts. dbms_metadata knows what’s an identifier. It can safely skip the enquoting process if the identifier is in upper case, matches the regular expression ^[A-Z][A-Z0-9_$#]*$ and is not a reserved keyword. That’s all. The logic can be implemented in a single XSL template. We API users, on the other hand, must parse the code to somehow identify quoted identifiers and their context before we can decide how to proceed.
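To show how little logic a generator needs, here is a Python sketch of such an enquoting decision. It is not an XSL template and not part of dbms_metadata. The function name is hypothetical and the set of reserved words is a tiny illustrative subset.

```python
import re

SAFE_NAME = re.compile(r"[A-Z][A-Z0-9_$#]*")
# Illustrative subset only. A real implementation needs the complete list.
RESERVED_WORDS = frozenset({"SELECT", "FROM", "TABLE", "ORDER", "GROUP"})


def enquote_if_needed(name: str) -> str:
    """Return the identifier as stored, adding double quotes only when required."""
    if SAFE_NAME.fullmatch(name) and name not in RESERVED_WORDS:
        return name
    return '"' + name + '"'


for name in ["DEPTSAL", "deptsal", "TABLE", "FIRST NAME", "SUM_SAL$"]:
    print(enquote_if_needed(name))
```

The case-sensitive pattern covers both the upper case rule and the character rule, so the reserved word lookup is the only additional check.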

Formatting DDL Automatically

You can configure SQL Developer to automatically format DDL with your current formatter settings. For that you have to enable the option Autoformat Dictionary Objects SQL as in the screenshot below:

Autoformat DDL

Here’s the result for the deptsal view using the PL/SQL & SQL Formatter Settings:

Autoformat in Action

The identifiers in upper case were originally quoted identifiers. By default, we configure the formatter to keep the case of identifiers. This ensures that code using JSON dot notation is not affected by a formatting operation.

Processing Many Files

SQL Developer is not suited to format many files. However, you can use the SQLcl script or the standalone formatter to format files in a directory tree. The formatter settings (path to the .xml and .arbori file) can be passed as parameters. I recommend using the standalone formatter. It uses the up-to-date and much faster JavaScript engine from GraalVM. Furthermore, the standalone formatter also works with JDK 17, which no longer contains a JavaScript engine.

You can download the latest tvdformat.jar from here. Run java -jar tvdformat.jar to show all command line options.

Summary

If your code base contains generated code, then it probably also contains unnecessarily quoted identifiers. Especially if dbms_metadata was used to extract DDL statements. Removing these double quotes without breaking some code is not that easy. However, SQL Developer’s highly configurable formatter can do the job, even without actually formatting the code.

I hope that some of the shortcomings of dbms_metadata will be addressed in an upcoming release of the Oracle Database. Supporting nonquoted identifiers as an additional non-default option should be easy and not so risky to implement.

Anyway, instead of just detecting violations of G-2180: Never use quoted identifiers, it is a good idea to be able to correct them automatically.

Please open a GitHub issue if you encounter a bug in the formatter settings. Thank you.
