[MCOL-3516] Query Performance Problem with Window Functions Created: 2019-09-24  Updated: 2023-03-06  Resolved: 2023-03-06

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: 1.2.4, 1.2.5
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Samet Coban Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: performance, regression

Attachments: XML File Columnstore.xml    

 Description   

Problem:
We are having an issue when using window functions with newer Maria DB Columnstore version such as 1.2.4 and 1.2.5 versions. Same query runs over same data in MariaDB Columnstore 1.0.8 without any issue in seconds but never finishes on newer versions.
Story:
We migrated a database from version 1.0.8 to 1.2.4. After the migration, Users started to complain about non-ending queries with window functions. Processlist shows "creating sort index" as query state.
Test Results:
Below you can see test results in detail (execution time for same query based on different Columnstore versions with different data sizes.) As soon as table passes a certain number of rows(roughly 50K), execution time increases exponentially and becomes unacceptable.
1.2.4 and 1.2.5 versions are running on better hosts with more resources.

Hint:
One thing worth to note down here is that Query status is shown as "Creating Sort Index" for almost entire duration in Test 4 and Test 5 in both newer versions.
I added Columnstore.xml file for 1.2.4 version.

Test 1: Original Query Execution with Full Table Data more than 6 million

Version Table Row Count Execution Time
1.0.8 6 M 19 Seconds
1.2.4 6 M Killed after 6 hours
1.2.5 6 M Killed after 6 hours

Test 2: Limited Data (1000 rows)

Version Table Row Count Execution Time
1.0.8 1K 0.24 seconds
1.2.4 1K 0.22 seconds
1.2.5 1K 0.26 seconds

Test 3: Limited Data (10000 rows)

Version Table Row Count Execution Time
1.0.8 10K 0.30 seconds
1.2.4 10K 9 seconds
1.2.5 10K 1.5 seconds

Test 4: Limited Data (100000 rows)

Version Table Row Count Execution Time
1.0.8 100K 0.80 seconds
1.2.4 100K 10 min 19.03 sec
1.2.5 100K 1 min. 32 seconds

Test 5: Limited Data (1000000 rows)

Version Table Row Count Execution Time
1.0.8 1M 3.63 seconds
1.2.4 1M Killed after 2.5 hours
1.2.5 1M Killed after 2.5 hours

EXPLAIN Results
For 1.0.8:

{
  "query_block": {
    "select_id": 1,
    "const_condition": "1",
    "filesort": {
      "temporary_table": {
        "function": "buffer",
        "table": {
          "table_name": "<derived2>",
          "access_type": "ALL",
          "rows": 2000,
          "filtered": 100,
          "materialized": {
            "query_block": {
              "select_id": 2,
              "table": {
                "table_name": "tblnoporderadd_yesterday",
                "access_type": "ALL",
                "rows": 2000,
                "filtered": 100
              }
            }
          }
        },
        "table": {
          "table_name": "i",
          "access_type": "ALL",
          "rows": 2000,
          "filtered": 100,
          "attached_condition": "trigcond((i.ipaddress = t.dst_ip_left))"
        }
      }
    }
  }
}

1.2.5:

{
  "query_block": {
    "select_id": 1,
    "const_condition": "1",
    "filesort": {
      "sort_key": "if(i.alias is null,t.dst_ip_left,i.alias), cast(t.pct_50 as signed), cast(t.pct_80 as signed), cast(t.pct_90 as signed), cast(t.pct_95 as signed), cast(t.pct_99 as signed)",
      "temporary_table": {
        "table": {
          "table_name": "<derived2>",
          "access_type": "ALL",
          "rows": 2000,
          "filtered": 100,
          "materialized": {
            "query_block": {
              "select_id": 2,
              "window_functions_computation": {
                "sorts": {
                  "filesort": {
                    "sort_key": "tblnoporderadd_yesterday.dst_ip_left, tblnoporderadd_yesterday.latency"
                  }
                },
                "temporary_table": {
                  "table": {
                    "table_name": "tblnoporderadd_yesterday",
                    "access_type": "ALL",
                    "rows": 2000,
                    "filtered": 100
                  }
                }
              }
            }
          }
        },
        "table": {
          "table_name": "i",
          "access_type": "ALL",
          "rows": 2000,
          "filtered": 100,
          "attached_condition": "trigcond(i.ipaddress = t.dst_ip_left)"
        }
      }
    }
  }

// Some comments here
SELECT '2019-09-19' date, if(isnull(i.alias),dst_ip_left,i.alias) host, 'ORDERADD' message_type , cast(pct_50 as integer) pct50, cast(pct_80 as integer) pct80, cast(pct_90 as integer) pct90, cast(pct_95 as integer) pct95, cast(pct_99 as integer) pct99, count(*) samplesize
FROM (  
    SELECT  dst_ip_left, 
            percentile_cont(0.5) within group (order by latency) OVER (PARTITION BY dst_ip_left)  pct_50,
            percentile_cont(0.8) within group (order by latency) OVER (PARTITION BY dst_ip_left)  pct_80,
            percentile_cont(0.9) within group (order by latency) OVER (PARTITION BY dst_ip_left)  pct_90,
            percentile_cont(0.95) within group (order by latency) OVER (PARTITION BY dst_ip_left) pct_95,
            percentile_cont(0.99) within group (order by latency) OVER (PARTITION BY dst_ip_left) pct_99
    FROM Table1
    ) as t
LEFT JOIN Table2 as i on i.ipaddress = t.dst_ip_left
GROUP BY host, pct50, pct80, pct90, pct95, pct99
ORDER BY host;



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2019-09-25 ]

Please try wrapping it in a subquery as follows and let us know if that improves things:

select * from (
SELECT '2019-09-19' date, if(isnull(i.alias),dst_ip_left,i.alias) host, 'ORDERADD' message_type , cast(pct_50 as integer) pct50, cast(pct_80 as integer) pct80, cast(pct_90 as integer) pct90, cast(pct_95 as integer) pct95, cast(pct_99 as integer) pct99, count(*) samplesize
FROM (  
    SELECT  dst_ip_left, 
            percentile_cont(0.5) within group (order by latency) OVER (PARTITION BY dst_ip_left)  pct_50,
            percentile_cont(0.8) within group (order by latency) OVER (PARTITION BY dst_ip_left)  pct_80,
            percentile_cont(0.9) within group (order by latency) OVER (PARTITION BY dst_ip_left)  pct_90,
            percentile_cont(0.95) within group (order by latency) OVER (PARTITION BY dst_ip_left) pct_95,
            percentile_cont(0.99) within group (order by latency) OVER (PARTITION BY dst_ip_left) pct_99
    FROM Table1
    ) as t
LEFT JOIN Table2 as i on i.ipaddress = t.dst_ip_left
GROUP BY host, pct50, pct80, pct90, pct95, pct99
ORDER BY host
) as sq;

Comment by Samet Coban [ 2019-09-25 ]

Hello Andrew,

I tried suggested method but unfortunately it didn't help. On the contrary, it caused performance degradation. Normally it finishes in 3 seconds in 1.0.8 version but nested subquery took 32 seconds.

Comment by Todd Stoffel (Inactive) [ 2023-03-06 ]

This ticket was opened prior to convergence with the server. It may have been rendered obsolete. If this issue still exists in a modern version, please open a new request.

Generated at Thu Feb 08 02:43:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.