Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-27472

ANALYZE: r_filtered=100 may look confusing when r_rows=0

Details

    • Bug
    • Status: Stalled (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5, 10.6, 10.7(EOL)
    • 10.5, 10.6
    • Optimizer
    • None

    Description

      --source include/have_sequence.inc
       
      CREATE TABLE t (a INT, b INT, KEY(a)) ENGINE=MyISAM;
      INSERT INTO t SELECT 5, 5 FROM seq_1_to_1000;
       
      ANALYZE TABLE t PERSISTENT FOR ALL;
       
      ANALYZE FORMAT=JSON
      SELECT * FROM t WHERE 100 = b AND a = 100 OR a IS NULL AND a > 100;
       
      # Cleanup
      DROP TABLE t;
      

      428b057e

      {
        "query_block": {
          "select_id": 1,
          "r_loops": 1,
          "r_total_time_ms": 0.021893114,
          "table": {
            "table_name": "t",
            "access_type": "ref_or_null",
            "possible_keys": ["a"],
            "key": "a",
            "key_length": "5",
            "used_key_parts": ["a"],
            "ref": ["const"],
            "r_loops": 1,
            "rows": 2000,
            "r_rows": 0,
            "r_table_time_ms": 0.014110678,
            "r_other_time_ms": 0.005044542,
            "filtered": 99.21875,
            "r_filtered": 100,
            "index_condition": "t.a = 100",
            "attached_condition": "t.b = 100"
          }
        }
      }
      

      So, neither the WHERE condition as a whole nor any part of it can keep any rows, and yet r_filtered is 100%.

      Also, it's unclear how the plan comes up with rows==2000 while the table has 1000 rows.

      Initially the issue popped up in the course of MDEV-26519 testing, when it turned out that filtered rows with JSON_HB (close to 0) was very far from both r_filtered (100) and from filtered with DOUBLE_PREC_HB (close to 100). However, I'm not sure it's even a JSON_HB problem, given that r_filtered=100% is hard to understand here. Please feel free to adjust the issue as needed: close if not a bug (but please with an explanation how to interpret r_filtered value in this case), or handle as a maintenance problem, or make it MDEV-26519-related if it is.

      preview-10.8-MDEV-26519-json-histograms 9a86900b

      SET histogram_type=JSON_HB;
      ANALYZE TABLE t PERSISTENT FOR ALL;
      Table	Op	Msg_type	Msg_text
      test.t	analyze	status	Engine-independent statistics collected
      test.t	analyze	status	Table is already up to date
      ANALYZE FORMAT=JSON
      SELECT * FROM t WHERE 100 = b AND a = 100 OR a IS NULL AND a > 100;
      ANALYZE
      {
        "query_block": {
          "select_id": 1,
          "r_loops": 1,
          "r_total_time_ms": 0.109424208,
          "nested_loop": [
            {
              "table": {
                "table_name": "t",
                "access_type": "ref_or_null",
                "possible_keys": ["a"],
                "key": "a",
                "key_length": "5",
                "used_key_parts": ["a"],
                "ref": ["const"],
                "r_loops": 1,
                "rows": 2000,
                "r_rows": 0,
                "r_table_time_ms": 0.072815926,
                "r_other_time_ms": 0.02403827,
                "filtered": 0.100000001,
                "r_filtered": 100,
                "index_condition": "t.a = 100",
                "attached_condition": "t.b = 100"
              }
            }
          ]
        }
      }
      

      Attachments

        Issue Links

          Activity

            psergei Sergei Petrunia added a comment - - edited

            So, neither the WHERE condition as a whole nor any part of it can keep any rows, and yet r_filtered is 100%.

            Note that r_rows=0. This means that the access method has not returned any rows. Hence the condition was never checked, and in this case, we assume filtered=100%.

            (In case we want to discuss this: the logic is in class Table_access_tracker:

              double get_filtered_after_where() const
              {
                return r_rows > 0
                  ? static_cast<double>(r_rows_after_where) /
                    static_cast<double>(r_rows)
                  : 1.0;
              }
            

            we could print e.g. null instead.

            psergei Sergei Petrunia added a comment - - edited So, neither the WHERE condition as a whole nor any part of it can keep any rows, and yet r_filtered is 100%. Note that r_rows=0. This means that the access method has not returned any rows. Hence the condition was never checked, and in this case, we assume filtered=100%. (In case we want to discuss this: the logic is in class Table_access_tracker : double get_filtered_after_where() const { return r_rows > 0 ? static_cast < double >(r_rows_after_where) / static_cast < double >(r_rows) : 1.0; } we could print e.g. null instead.

            Also, it's unclear how the plan comes up with rows==2000 while the table has 1000 rows.

            Looking at the trace

                              "chosen_range_access_summary": {
                                "range_access_plan": {
                                  "type": "range_scan",
                                  "index": "a",
                                  "rows": 1,
                                  "ranges": ["(100) <= (a) <= (100)"]
                                },
            

            Ok so far...

                            "plan_prefix": [],
                            "table": "t",
                            "best_access_path": {
                              "considered_access_paths": [
                                {
                                  "access_type": "ref_or_null",
                                  "index": "a",
                                  "rows": 2000,
                                  "cost": 14.17158897,
                                  "chosen": true
                                },
            

            We get the value of 2000 when we estimate #rows for ref_or_null.

            ref_or_null normally reuses the #rows estimate from the range optimizer.
            For this query, it didn't, because it saw that the range optimizer did not generate the two ranges.

            Indeed, as we saw, the range optimizer has generated one range:

                                  "ranges": ["(100) <= (a) <= (100)"]
            

            This is because the WHERE was (I've added the brackets):

            WHERE 
               (100 = b AND a = 100) OR 
               (a IS NULL AND a > 100)
            

            range optimizer was able to infer that the second line is always false. ref optimizer was not able to do that (as it doesn't analyze the "a>100" (also, the range optimizer actually has removed the second OR-part but ref optimizer wasn't able to take advantage of this as it runs before that is done)).

            Anyway, so ref optimizer is there and it only has index statistics to look at. index statistics says 1000 rows per key.

            then, this code is run which doubles the number due to NULL causing another value:

                          if (ref_or_null_part)
                          {
                            /* We need to do two key searches to find row */
                            records *= 2.0;
                          }
            

            This is sensible in general case, but in this case (table has rec_per_key > half_of_table), it doesn't.

            psergei Sergei Petrunia added a comment - Also, it's unclear how the plan comes up with rows==2000 while the table has 1000 rows. Looking at the trace "chosen_range_access_summary": { "range_access_plan": { "type": "range_scan", "index": "a", "rows": 1, "ranges": ["(100) <= (a) <= (100)"] }, Ok so far... "plan_prefix": [], "table": "t", "best_access_path": { "considered_access_paths": [ { "access_type": "ref_or_null", "index": "a", "rows": 2000, "cost": 14.17158897, "chosen": true }, We get the value of 2000 when we estimate #rows for ref_or_null. ref_or_null normally reuses the #rows estimate from the range optimizer. For this query, it didn't, because it saw that the range optimizer did not generate the two ranges. Indeed, as we saw, the range optimizer has generated one range: "ranges": ["(100) <= (a) <= (100)"] This is because the WHERE was (I've added the brackets): WHERE (100 = b AND a = 100) OR (a IS NULL AND a > 100) range optimizer was able to infer that the second line is always false. ref optimizer was not able to do that (as it doesn't analyze the "a>100" (also, the range optimizer actually has removed the second OR-part but ref optimizer wasn't able to take advantage of this as it runs before that is done)). Anyway, so ref optimizer is there and it only has index statistics to look at. index statistics says 1000 rows per key. then, this code is run which doubles the number due to NULL causing another value: if (ref_or_null_part) { /* We need to do two key searches to find row */ records *= 2.0; } This is sensible in general case, but in this case (table has rec_per_key > half_of_table), it doesn't.

            We should probably fix this, but I don't see this as urgent case.

            psergei Sergei Petrunia added a comment - We should probably fix this, but I don't see this as urgent case.

            Automated message:
            ----------------------------
            Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

            julien.fritsch Julien Fritsch added a comment - Automated message: ---------------------------- Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.
            JIraAutomate JiraAutomate added a comment -

            Automated message:
            ----------------------------
            Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

            JIraAutomate JiraAutomate added a comment - Automated message: ---------------------------- Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

            People

              psergei Sergei Petrunia
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.