[MXS-5089] Improve SQL canonical form computation - Jira

XML

Word

Printable

Details

Type: Task
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Core, QueryClassifier
Labels:
None

Description

Problem

The query classifier cache uses the canonical form of the SQL query as the cache key. Currently the implementation does not collapse whitespace nor does it remove the leading sign from numbers. The result of this is a less optimal use of the query classifier cache when the values in the SQL may be negative. Additionally, minor changes in the SQL whitespace also are not collapsed which makes manually written SQL not match well. The SQL is also case-sensitive so minor differences in field names and SQL keywords each produce a different canonical SQL.

Improvements

Remove leading negative sign (MXS-5092 )

If last non-whitespace character is not a question mark, the leading sign character can be removed. For example, SELECT 1 - 2 > 3 should have the canonical SELECT ? - ? > ? but SELECT 2, -1, 3 should have the canonical SELECT ?, ?, ?. More generally, the problem to solve is to find the shortest possible canonical form of the SQL statement that accepts all numeric values as its input.

The problematic part is the case where it is not possible to know whether a value is a field identifier or an SQL keyword without semantic understanding of the SQL query. For example, SELECT id FROM t1 WHERE ts > NOW() + INTERVAL -1 SECOND should get converted into SELECT id FROM t1 WHERE ts > NOW() + INTERVAL ? SECOND but there's no way to know that this has to be done without tokenizing the input and detecting that INTERVAL is a SQL keyword. Thus, these cases cannot be improved anymore without tokenization or full parsing.

The SQL minus sign is also a negation of the value which means that SELECT 1, SELECT --1 and SELECT ----1 all produce the same result of 1. Similarly, there are edge cases like SELECT 100+-+-+-+-+-2 where there isn't one canonical form of the SQL that all would agree uniquely identify the SQL. However in practice, these edge cases can be ignored and replacing the first sign value would still improve the overall accuracy of the result.

The leading sign can be eliminated if the first non-whitespace character that is found before the sign is one of the arithmetic, bitwise or logical operators, an opening parenthesis or a comma. The sign cannot be eliminated for a closing parenthesis as it could be the end of a subquery or a function call. As usually the sign is attached to the numeric constant with no spaces, a further simplification could be to only eliminate the sign if it's the preceding character to a numeric constant. In pseudo-code, the logic is as follows:

if start[-1] == '-':

  ptr = start - 1

  while isspace(*ptr):

    --ptr

  if *ptr in "+-*/%&|^=<>(,":

    return start - 1

return start

With the following logic applied the example SQL SELECT 2, -1, 3 will have the canonical form of SELECT ?, ?, ? but the SQL SELECT -1, 2, 3 still has it as SELECT -?, ?, ?. Compared to current implementation where there are 8 possible canonical forms for all integer inputs, the improved version has only two: the one with a negative first question mark and the one with out a sign.

Normalize and collapse repeating whitespace characters

For any two adjacent whitespace characters, replace the first with a space and remove all following ones. This is a simple and mechanical approach to solving a part of the problem but there are still cases where this will produce multiple canonical SQLs for a single "true" canonical SQL. For example, CREATE TABLE t1 (id INT) and CREATE TABLE t1(id INT) are the same query except that theres's an extra space between the t1 and the opening parenthesis. To detect these cases, spaces before an opening parenthesis is (possibly) safe to just remove.

Collapse repeating values

For queries like SELECT name FROM t WHERE id IN (1, 2, 3, 4) and SELECT name FROM t WHERE id IN (1, 2, 3), the queries are essentially the same except the "parameter" count differs. These could be collapsed into SELECT name FROM t WHERE id IN so that all possible permutations of IN() list values are matched to the same cache entry. In practice this is probably a bit more difficult to implement correctly as a query like SELECT 1, 2 should not be collapsed as it is not an IN list but the field list that is returned.

Attachments

Sub-Tasks

Remove leading sign in SQL canonical form

Closed

markus makela

Activity

People

Assignee:: Unassigned

Reporter:: markus makela

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 2024-04-29 08:20

Updated:: 2024-12-05 15:02

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.