IslandSQL Episode 4: Expressions

2023-03-05

Introduction

In the last episode, we extended the IslandSQL grammar covering the complete lock table statement. However, the support for expressions was very limited. It was not possible to use a date literal or to_date function to determine a partition to be locked. Time to fix that. In this episode, we will have a closer look at some SQL expressions and how to deal with the complexity of the SQL language.

The full source code is available on GitHub and the binaries are on Maven Central.

Lock Table Partition

To lock a table partition we use the partition name like this:

1) Lock partition by name

lock table sales partition (sales_q2_2000) in exclusive mode nowait;

Or we let the Oracle Database determine the partition by passing the values of a partition key like this:

2) Lock partition by value

lock table sales partition for (date '2000-04-01') in exclusive mode nowait;

Both statements will lock the same partition of the table sales. The latter is IMO better since it works also with system-generated partition names, which you might use with interval partitioning.

There are a lot of possibilities to specify the date. Here is a selection of sensible and less sensible alternatives:

3) Lock partition by value – more variants

lock table sales partition for (to_date('2000-04', 'YYYY-MM')) in exclusive mode nowait;
lock table sales partition for (to_date('2000-092', 'YYYY-DDD')) in exclusive mode nowait;
lock table sales partition for (timestamp '2000-04-01 08:42:42') in exclusive mode nowait;
lock table sales partition for (add_months(trunc(to_date('2000-12-31', 'YYYY-MM-DD'), 'YYYY'), 3)) in exclusive mode nowait;
lock table sales partition for (date '2000-01-01' + interval '3' month) in exclusive mode nowait;
lock table sales partition for (timestamp '2000-12-31 18:00:00' + 1/4 + -9 * interval '1' month) in exclusive mode nowait;

All these alternatives lock the partition in the sales table where the data for 1st April 2000 is stored. Using different SQL expressions, of course.

Expressions

I mentioned in my last post that expressions are the most extensive part of the SQL grammar. However, there are ways to optimise the extent of the grammar. ANTLR 4 helps because it allows defining left recursive parts of a grammar naturally. See the expressions labelled with binaryExpression in the excerpt of the parser in version 0.4.0 of the grammar below.

4) Expressions in v0.4.0 of IslandSQL

expression:
      expr=STRING                                               # simpleExpressionStringLiteral
    | expr=NUMBER                                               # simpleExpressionNumberLiteral
    | K_DATE expr=STRING                                        # dateLiteral
    | K_TIMESTAMP expr=STRING                                   # timestampLiteral
    | expr=intervalExpression                                   # intervalLiteral
    | expr=sqlName                                              # simpleExpressionName
    | LPAR exprs+=expression (COMMA exprs+=expression)* RPAR    # expressionList
    | expr=caseExpression                                       # caseExpr
    | operator=unaryOperator expr=expression                    # unaryExpression
    | expr=functionExpression                                   # functionExpr
    | expr=AST                                                  # allColumnWildcardExpression
    | left=expression operator=(AST|SOL) right=expression       # binaryExpression
    | left=expression
        (
              operator=PLUS
            | operator=MINUS
            | operator=VERBAR VERBAR
        )
      right=expression                                          # binaryExpression
    | left=expression operator=K_COLLATE right=expression       # binaryExpression
    | left=expression operator=PERIOD right=expression          # binaryExpression
;

In line 13 we deal with multiplication and division. Both sides of the operation allow an expression. Unlike grammars based on ANTLR 3, it is no longer necessary to left-factor this left-recursion in ANTLR 4.

ANTLR 4 solves the ambiguity by prioritizing the alternatives in the order of the definition. As a result, multiplication/division has a higher priority than addition/subtraction and a lower priority than the function expression on line 11.

Left-factoring still might be helpful to optimize the runtime performance of the parser. However, it’s not necessary anymore. And I’m happy with a simpler grammar.

Function Expressions

It would be a bad idea to handle each function like to_date or to_char separately in the grammar. Why? It would be more work and we still would need a solution for custom functions. As a result, we generically implement functions. The most naïve grammar definition would look like this:

5) Naïve functionExpression

functionExpression:
    name=sqlName LPAR (params+=expression (COMMA params+=expression)*)? RPAR
;

Parameterless functions that do not allow parentheses like sysdate will be treated as simpleExpressionName. This rule handles parameterless functions that require parentheses like sys_guid(). And it can handle functions with an unbounded number of parameters.

So what’s the problem? Why do I call this definition “naïve”? Because the following cases are not covered:

named parameters, e.g. dbms_utility.get_hash_value(name => 'text1', base => 0, hash_size => 16)
parameter prefixes, e.g. distinct or all in any_value
parameter suffixes, e.g. deterministic in approx_median or partition_by_clause/order_by_clause in approx_rank
partial analytic clauses, e.g. within group in approx_percentile
analytic clauses, e.g. over in avg

In most cases, it makes probably sense to handle them generically. However, there are cases where it’s probably better to define dedicated grammar rules for very special functions such as json_table or xml_table.

You find a less naïve implementation on GitHub. However, it is still incomplete. IMO a good way to assess completeness is to write tests. For function expressions, this means tests for every single function according to the SQL Language Reference. My tests currently cover abs through cardinality. The next function on the to-do list is cast, which contains the following uncovered grammar constructs:

parameter prefix multiset for a subquery expression, which is also not yet covered
parameter suffix as type_name
parameter suffix DEFAULT return_value ON CONVERSION ERROR

Shall we define a dedicated grammar rule for the cast function? I guess yes. However, I’d probably implement multiset as a unary operator and the subquery as part of the expression rule. For that, we need to implement the grammar for the complete select statement.

Knowing the Scope and the Limitations

A typical question I get is “Why are you writing a SQL grammar? There are already some available for ANTLR 4, right?”. That’s an excellent question. The ANLTR organisation on GitHub manages a repository with example grammars. You find the ones for SQL here.

Most of the 3rd party grammars cover just a (large) subset of the underlying languages. They define what they cover in ANTLR or by EBNF. However, they often do not define which versions they cover and they do not define what they don’t cover. As a result, you have to try if the grammar is sufficient for your use case. Furthermore, you have to assess if it is sufficient for future use cases and if it will cover the changes in newer versions. And of course, you have to decide if you are ready to extend/fix the grammar in areas where it does not meet your expectations. You will have to maintain a fork. This can become painful, especially if the grammar contains a lot of stuff you are not interested in and if the existing test suites are incomplete.

Furthermore, I’m not aware of an open-sourced grammar that covers the relevant portions for the Oracle Database and PostgreSQL. Yes, the goal is that IslandSQL supports the recent versions of both dialects as if they were a single dialect.

We develop the grammar iteratively. As a result, there are a lot of interim limitations like the one mentioned regarding the cast function.

SQL*Plus Substitution Variables

However, some limitations are supposed to be permanent. Like the one for substitution variables. Substitution variables can contain arbitrary text. They are replaced before the execution of a script. The IslandSQL grammar provides limited support for substitution variables. They can be used in places where a sqlName is valid. This is basically everywhere you can use an expression.

Here’s an example of a supported usage:

6) Example of supported usage of substitution variables

lock table &table_name in exclusive mode wait &seconds;

And here’s an example of an unsupported usage:

7) Example of unsupported usage of a substitution variable

lock table dept in &lock_mode mode nowait;

The grammar expects certain keywords at the position of &lock_mode. Here’s the excerpt of the grammar that should make that clear:

8) Exerpt of IslandSqlParser.g4 – lock modes

lockTableStatementUnterminated:
    K_LOCK K_TABLE objects+=lockTableObject (COMMA objects+=lockTableObject)*
        K_IN lockmode=lockMode K_MODE waitOption=lockTableWaitOption?
;

lockMode:
      K_ROW K_SHARE                 # rowShareLockMode
    | K_ROW K_EXCLUSIVE             # rowExclusiveLockMode
    | K_SHARE K_UPDATE              # shareUpdateLockMode
    | K_SHARE                       # shareLockMode
    | K_SHARE K_ROW K_EXCLUSIVE     # shareRowExclusiveLockMode
    | K_EXCLUSIVE                   # exclusiveLockMode
;

And the next excerpt shows how substitution variables are defined in the grammar.

9) Exerpt of IslandSqlParser.g4 – substitution variables

sqlName:
      unquotedId
    | QUOTED_ID
    | substitionVariable
;

substitionVariable:
    AMP AMP? name=substitionVariableName period=PERIOD?
;

substitionVariableName:
      NUMBER
    | sqlName
;

A substitution variable starts with one or two &. The name can be either a NUMBER (like 1) or a sqlName (like tableName). And a substitution variable can optionally end on .. That’s it. This way we can provide limited support for SQL*Plus substitution variables.

Outlook

The grammar evolves quite nicely. However, expressions are still incomplete. This will be covered with the full support of the selectstatement.

As I’m sure you’ve already found out for yourself, the version of the grammar matches the episodes in this blog post series. And these are the planned versions of IslandSQL with their main features:

v0.5.0: Fully parse select statement, complete expressions and conditions
v0.6.0: Fully parse remaining DML statements (call, delete, explain plan, insert, merge, update)
v0.7.0: PostgreSQL syntax compatibility of implemented statements
v0.8.0: Fully parse PL/SQL block
v0.9.0: Fully parse create statements of the Oracle Database (function, package, procedure, trigger, type, view)
v0.10.0: pgPL/SQL syntax compatibility (sql and plpgsql language in create function, create procedure, create trigger and do)

And after episode 10 the fun begins. We can start to provide value for database developers and others. A linter is one option. But there is more. Stay tuned.

2 Comments

Friedhold Matz says:

2023-03-05 at 20:08

Super great project and my respect, Philipp!

Thanks, for the source deployment in GitHub and Maven and I also like the goals.

Good luck, Friedhold.
- Philipp Salvisberg says:
  
  2023-03-05 at 22:20
  
  Thank you, Friedhold. There is still a lot of work ahead. Cheers, Philipp