[MDEV-14940] Query runs forever after consecutive tpc-ds query execute Created: 2018-01-12  Updated: 2018-01-23

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Affects Version/s: 10.2.12
Fix Version/s: 10.2

Type: Bug Priority: Major
Reporter: Zdravelina Sokolovska (Inactive) Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: None
Environment:

VM CentOS 7.4 ;MemTotal: 16G ; 2CPUs


Attachments: Text File information_schema.processlist and show explain .txt     HTML File mysqld_stack_trace     Text File mysqld_stack_trace_2.txt     File tpc-ds_query_set.sql     File tpcds.sql    

 Description   

Deadlock after consecutive tpc-ds query execute

how to repeat:
1. load tpc-ds schema -attached file tpcds.sql
2 .load 1G data ; *note: a smallest data set and querying over is not to performance test
3. start one by one query execution -attached file tpc-ds_query_set.sql
it appear query hanging after ~90 queries were run

note2 : the hanging query does no cause the problem ; after killing and rerun only that query is does not stick
note3: added mysqld stack trace in the attachment

query_id        exec_time[ms]
query1  5476
query2  129025
query3  155
query4  217666
query5  Error     with rollup and order by
query6  3489
query7  33725
query8  275440
query9  26337
query10 47695
query11 154767
query12 2379
query13 5956
query14 Error
query15 14324
query16 67279
query17 4752
query18 Error  
query19 676
query20 4323
query21 38041
query22 Error   with rollup and order by
query23 792367
query24 19976
query25 4011
query26 20594
query27 Error  with rollup and order by
query28 19348
query29 3875
query30 2762
query31 106008
query32 72
query33 43687
query34 40095
query35 49031
query36 Error  with rollup and order by
query37 12902
query38 57946
query39 148487
query40 1104
query41 2006
query42 503
query43 35719
query44 6617
query45 8411
query46 46820
query47 145513
query48 11424
query49 789
query50 2656
query51 Error  full outer join
query52 554
query53 1422
query54 99968
query55 588
query56 3025
query57 46472
query58 55214
query59 77638
query60 10213
query61 8544
query62 15741
query63 1439
query64 4611
query65 54447
query66 26477
query67 Error   with rollup and order by
query68 48090
query69 49079
query70 Error   with rollup and order by
query71 49355
query72 2952775
query73 39728
query74 149570
query75 12586
query76 6001
query77 Error   with rollup and order by
query78 202969
query79 44942
query80 Error   with rollup and order by
query81 4732
query82 12757
query83 7409
query84 20888
query85 1445
query86 Error   with rollup and order by
query87 Error   except
query88 314646
query89 2832
query90 18334
query91 1446
query92 70
query93 3119

MariaDB [foo]> show processlist ;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    160
Current database: foo

+-----+-------------+-----------------------+---------+---------+-------+--------------------------+----------------------------                             --------------------------------------------------------------------------+----------+
| Id  | User        | Host                  | db      | Command | Time  | State                    | Info                                                                                                                              | Progress |
+-----+-------------+-----------------------+---------+---------+-------+--------------------------+----------------------------                             --------------------------------------------------------------------------+----------+
|   2 | system user |                       | NULL    | Daemon  |  NULL | InnoDB purge worker      | NULL                                                                                                                              |    0.000 |
|   1 | system user |                       | NULL    | Daemon  |  NULL | InnoDB purge coordinator | NULL                                                                                                                              |    0.000 |
|   3 | system user |                       | NULL    | Daemon  |  NULL | InnoDB purge worker      | NULL                                                                                                                              |    0.000 |
|   4 | system user |                       | NULL    | Daemon  |  NULL | InnoDB purge worker      | NULL                                                                                                                              |    0.000 |
|   5 | system user |                       | NULL    | Daemon  |  NULL | InnoDB shutdown handler  | NULL                                                                                                                              |    0.000 |
| 157 | root        | 192.168.107.244:57366 | tpcds_1 | Query   | 62673 | Sending data             | with ws_wh as
(select ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk wh2
 from web_ |    0.000 |
| 160 | root        | localhost             | foo     | Query   |     0 | init                     | show processlist                                                                                                                  |    0.000 |
+-----+-------------+-----------------------+---------+---------+-------+--------------------------+----------------------------                             --------------------------------------------------------------------------+----------+
7 rows in set (0.00 sec)



 Comments   
Comment by Sergei Petrunia [ 2018-01-13 ]

Looking at the stack trace, it seems like it is not hanging, but rather executing a query:

Thread 3 (Thread 0x7f0351148700 (LWP 29644)):
#0  JOIN_CACHE::read_record_field (this=this@entry=0x7f02da9ae908, copy=copy@entry=0x7f02e5bfd5b0, blob_in_rec_buff=blob_in_rec_buff@entry=false) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_join_cache.cc:1863
#1  0x00007f036a040c52 in JOIN_CACHE::read_all_record_fields (this=this@entry=0x7f02da9ae908) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_join_cache.cc:1742
#2  0x00007f036a040ce7 in JOIN_CACHE::get_record (this=0x7f02da9ae908) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_join_cache.cc:1609
#3  0x00007f036a04177e in JOIN_CACHE::join_matching_records (this=0x7f02da9ae908, skip_last=false) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_join_cache.cc:2291
#4  0x00007f036a04120a in JOIN_CACHE::join_records (this=0x7f02da9ae908, skip_last=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_join_cache.cc:2087
#5  0x00007f0369f8326b in evaluate_join_record (join=join@entry=0x7f02da19a630, join_tab=join_tab@entry=0x7f02e34cd2f8, error=error@entry=0) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18874
#6  0x00007f0369f8a773 in sub_select (join=0x7f02da19a630, join_tab=0x7f02e34cd2f8, end_of_records=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18693
#7  0x00007f036a04fff7 in join_tab_execution_startup (tab=tab@entry=0x7f02e02fe468) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/opt_subselect.cc:5294
#8  0x00007f0369f8a6f6 in sub_select (join=0x7f02da19a630, join_tab=0x7f02e02fe468, end_of_records=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18642
#9  0x00007f0369f8326b in evaluate_join_record (join=join@entry=0x7f02da19a630, join_tab=join_tab@entry=0x7f02e02fe0b8, error=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18874
#10 0x00007f0369f8a72e in sub_select (join=0x7f02da19a630, join_tab=0x7f02e02fe0b8, end_of_records=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18654
#11 0x00007f0369f8326b in evaluate_join_record (join=join@entry=0x7f02da19a630, join_tab=join_tab@entry=0x7f02e02fdd08, error=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18874
#12 0x00007f0369f8a72e in sub_select (join=0x7f02da19a630, join_tab=0x7f02e02fdd08, end_of_records=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18654
#13 0x00007f0369f8326b in evaluate_join_record (join=join@entry=0x7f02da19a630, join_tab=join_tab@entry=0x7f02e02fd958, error=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18874
#14 0x00007f0369f8a72e in sub_select (join=0x7f02da19a630, join_tab=0x7f02e02fd958, end_of_records=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18654
#15 0x00007f0369f8326b in evaluate_join_record (join=join@entry=0x7f02da19a630, join_tab=join_tab@entry=0x7f02e02fd5a8, error=error@entry=0) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18874
#16 0x00007f0369f8a773 in sub_select (join=0x7f02da19a630, join_tab=0x7f02e02fd5a8, end_of_records=<optimized out>) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18693
#17 0x00007f0369fa8662 in do_select (procedure=<optimized out>, join=0x7f02da19a630) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:18198
#18 JOIN::exec_inner (this=this@entry=0x7f02da19a630) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:3530
#19 0x00007f0369fa8869 in JOIN::exec (this=this@entry=0x7f02da19a630) at /home/zdravelina.sokolovska/mariadb-10.2.12/sql/sql_select.cc:3325

Comment by Sergei Petrunia [ 2018-01-13 ]

It could produce an impression of being hung if the query plan is inefficient and so the query examines lots of rows, which takes a lot of time.

Comment by Sergei Petrunia [ 2018-01-13 ]

One way to check this : instead of SHOW PROCESSLIST, run select * from information_schema.processlist. It will include EXAMINED_ROWS column. Do several selects from information_schema.processlist and see how many records per second it examines.

Comment by Sergei Petrunia [ 2018-01-13 ]

note2 : the hanging query does no cause the problem ; after killing and rerun only that query is does not stick

This one is interesting, something I dont have a ready explanation for...

Comment by Sergei Petrunia [ 2018-01-17 ]

winstone, can you provide details how does one get that 1G data? (I assume it's less than 1G if compressed, if so, can you upload the data dump somewhere? That would be the fastest option)

data even if compressed fails to be attached there

Comment by Sergei Petrunia [ 2018-01-17 ]

Another thing to check is SHOW EXPLAIN FOR for the running query.

note2 : the hanging query does no cause the problem ; after killing and rerun only that query is does not stick

This could be a background statistics update in MyRocks, which causes a different query plan. Does this (run query the first time - it hangs, kill it and run it again - it works) happen every time you try it?

Comment by Zdravelina Sokolovska (Inactive) [ 2018-01-19 ]

attached information_schema.processlist and show explain ,thist time hanging at query 95 ;
interesting previous time that query finished with correct result :
order count total shipping cost total net profit
73 120440.34 42133.12
but after rerun it remained more than 20h ;

Comment by Sergei Petrunia [ 2018-01-22 ]

winstone, thanks for the info!

Reading... so, the query is

with ws_wh as
(
  select 
    ws1.ws_order_number,
    ws1.ws_warehouse_sk wh1,
    ws2.ws_warehouse_sk wh2
  from 
    web_sales ws1,
    web_sales ws2
  where 
    ws1.ws_order_number = ws2.ws_order_number
    and 
    ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
)
 
select
   count(distinct ws_order_number) as "order count"
  ,sum(ws_ext_ship_cost) as "total shipping cost"
  ,sum(ws_net_profit) as "total net profit"
from
   web_sales ws1
  ,date_dim
  ,customer_address
  ,web_site
where
    d_date between '1999-2-01' and
                    ADDDATE(cast('1999-2-01' as date),60)
    and ws1.ws_ship_date_sk = d_date_sk
    and ws1.ws_ship_addr_sk = ca_address_sk
    and ca_state = 'IL'
    and ws1.ws_web_site_sk = web_site_sk
    and web_company_name = 'pri'
    and ws1.ws_order_number in (select ws_order_number
                                from ws_wh)
    and ws1.ws_order_number in (select wr_order_number
                                from web_returns,ws_wh
                                where wr_order_number = ws_wh.ws_order_number)
order by count(distinct ws_order_number)
limit 100

INFORMATION_SCHEMA.processlist.examined_rows is zero. This is actually expected - I've checked and it turns out that column is only updated by UPDATE and DELETE statements. We are running a SELECT...

Comment by Sergei Petrunia [ 2018-01-22 ]

The provided EXPLAIN is:

+------+--------------+------------------+--------+---------------+--------------+---------+-----------------------------+--------+--------------------------------------------------------+
| id   | select_type  | table            | type   | possible_keys | key          | key_len | ref                         | rows   | Extra                                                  |
+------+--------------+------------------+--------+---------------+--------------+---------+-----------------------------+--------+--------------------------------------------------------+
|    1 | PRIMARY      | ws1              | ALL    | NULL          | NULL         | NULL    | NULL                        | 719384 | Using where                                            |
|    1 | PRIMARY      | customer_address | eq_ref | PRIMARY       | PRIMARY      | 4       | tpcds_1.ws1.ws_ship_addr_sk |      1 | Using where                                            |
|    1 | PRIMARY      | date_dim         | eq_ref | PRIMARY       | PRIMARY      | 4       | tpcds_1.ws1.ws_ship_date_sk |      1 | Using where                                            |
|    1 | PRIMARY      | web_site         | eq_ref | PRIMARY       | PRIMARY      | 4       | tpcds_1.ws1.ws_web_site_sk  |      1 | Using where                                            |
|    1 | PRIMARY      | <subquery3>      | eq_ref | distinct_key  | distinct_key | 4       | func                        |      1 |                                                        |
|    1 | PRIMARY      | <subquery4>      | eq_ref | distinct_key  | distinct_key | 4       | func                        |      1 |                                                        |
|    3 | MATERIALIZED | ws1              | ALL    | NULL          | NULL         | NULL    | NULL                        | 719384 |                                                        |
|    3 | MATERIALIZED | ws2              | ALL    | NULL          | NULL         | NULL    | NULL                        | 719384 | Using where; Using join buffer (flat, BNL join)        |
|    4 | MATERIALIZED | web_returns      | index  | NULL          | PRIMARY      | 8       | NULL                        |  71763 | Using index                                            |
|    4 | MATERIALIZED | ws1              | ALL    | NULL          | NULL         | NULL    | NULL                        | 719384 | Using where; Using join buffer (flat, BNL join)        |
|    4 | MATERIALIZED | ws2              | ALL    | NULL          | NULL         | NULL    | NULL                        | 719384 | Using where; Using join buffer (incremental, BNL join) |
+------+--------------+------------------+--------+---------------+--------------+---------+-----------------------------+--------+--------------------------------------------------------+

This query plan is not fast. Take a look at this for example:

|    3 | MATERIALIZED | ws1              | ALL    | NULL          | NULL         | NULL    | NULL                        | 719384 |                                                        |
|    3 | MATERIALIZED | ws2              | ALL    | NULL          | NULL         | NULL    | NULL                        | 719384 | Using where; Using join buffer (flat, BNL join)        |

computing a cross-self-join for a table with 700K rows is not going to be fast.

Q1: The query actually has a restriction: ws1.ws_order_number = ws2.ws_order_number. Is there no index on that column?

Q2: So this is a mergeable CTE that's used from the two mergeable semi-join subqueries. Why wasn't it merged, and we see select_id=3 and 4 in the EXPLAIN? (winstone are you using non-standard @@optimizer_switch settings?)

Comment by Sergei Petrunia [ 2018-01-22 ]

winstone, more questions:

  • if you just run the query manually, does it execute fast or slow?
  • in case the answer to this is "fast", can you run {{ANALYZE SELECT .... }} for the query and post the output here?
Comment by Zdravelina Sokolovska (Inactive) [ 2018-01-23 ]

Q1:
ws_order_number is defined as primary key

create table web_sales
(
    ws_sold_date_sk           integer                       ,
    ws_sold_time_sk           integer                       ,
    ws_ship_date_sk           integer                       ,
    ws_item_sk                integer               not null,
    ws_bill_customer_sk       integer                       ,
    ws_bill_cdemo_sk          integer                       ,
    ws_bill_hdemo_sk          integer                       ,
    ws_bill_addr_sk           integer                       ,
    ws_ship_customer_sk       integer                       ,
    ws_ship_cdemo_sk          integer                       ,
    ws_ship_hdemo_sk          integer                       ,
    ws_ship_addr_sk           integer                       ,
    ws_web_page_sk            integer                       ,
    ws_web_site_sk            integer                       ,
    ws_ship_mode_sk           integer                       ,
    ws_warehouse_sk           integer                       ,
    ws_promo_sk               integer                       ,
    ws_order_number           integer               not null,
    ws_quantity               integer                       ,
    ws_wholesale_cost         decimal(7,2)                  ,
    ws_list_price             decimal(7,2)                  ,
    ws_sales_price            decimal(7,2)                  ,
    ws_ext_discount_amt       decimal(7,2)                  ,
    ws_ext_sales_price        decimal(7,2)                  ,
    ws_ext_wholesale_cost     decimal(7,2)                  ,
    ws_ext_list_price         decimal(7,2)                  ,
    ws_ext_tax                decimal(7,2)                  ,
    ws_coupon_amt             decimal(7,2)                  ,
    ws_ext_ship_cost          decimal(7,2)                  ,
    ws_net_paid               decimal(7,2)                  ,
    ws_net_paid_inc_tax       decimal(7,2)                  ,
    ws_net_paid_inc_ship      decimal(7,2)                  ,
    ws_net_paid_inc_ship_tax  decimal(7,2)                  ,
    ws_net_profit             decimal(7,2)                  ,
    primary key (ws_item_sk, ws_order_number)
);

Q2: the optimizer_switch is the default one :
select @@optimizer_switch;

index_merge=on
index_merge_union=on
index_merge_sort_union=on
index_merge_intersection=on
index_merge_sort_intersection=off
engine_condition_pushdown=off
index_condition_pushdown=on
derived_merge=on
derived_with_keys=on
firstmatch=on
loosescan=on
materialization=on
in_to_exists=on
semijoin=on
partial_match_rowid_merge=on
partial_match_table_scan=on
subquery_cache=on
mrr=off
mrr_cost_based=off
mrr_sort_keys=off
outer_join_with_cache=on
semijoin_with_cache=on
join_cache_incremental=on
join_cache_hashed=on
join_cache_bka=on
optimize_join_buffer_size=off
table_elimination=on
extended_keys=on
exists_to_in=on
orderby_uses_equalities=on
condition_pushdown_for_derived=on

have not yet exec time from the consecutive run of query 95 ; when run manually query 95 was done for 1 day 6 hours 19 min
------------------------------------------------

order count total shipping cost total net profit

------------------------------------------------

73 120440.34 42133.12

------------------------------------------------
1 row in set (1 day 6 hours 19 min 16.09 sec)

because the query is with 'with' construction analyze with returns error ; will be
analyze select * from (the_query)ff ; useful or there is some other analyze specific for 'with' select ?

Generated at Thu Feb 08 08:17:27 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.