Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25449

Add MY_COLLATION_HANDLER::strnncollsp_nchars()

Details

    Description

      2022-01-21 Update

      This problem was solved under terms of MDEV-25904

      Old description

      Field_string::cmp() seems to do unnecessary work trimming trailing spaces:

      int Field_string::cmp(const uchar *a_ptr, const uchar *b_ptr) const
      {
        size_t a_len, b_len;
       
        if (mbmaxlen() != 1)
        {
          size_t char_len= Field_string::char_length();
          a_len= field_charset()->charpos(a_ptr, a_ptr + field_length, char_len);
          b_len= field_charset()->charpos(b_ptr, b_ptr + field_length, char_len);
        }
        else
          a_len= b_len= field_length;
        /*
          We have to remove end space to be able to compare multi-byte-characters
          like in latin_de 'ae' and 0xe4
        */
        return field_charset()->strnncollsp(a_ptr, a_len,
                                            b_ptr, b_len);
      }
      

      In absolute majority cases, the difference between strings is found in the very beginning of the compared strings. So doing charpos() on the two arguments, before passing them to the actual comparison function, looks like an inefficient waste of CPU.

      A better approach would be to implement a new comparison function with this tentative API:

      int strnncollsp_nchars(CHARSET_INFO *cs,
                             const char *s1, size_t len1,
                             const char *s2, size_t len2,
                             size_t nchars);
      

      Internally, the exact virtial implementations of strnncollsp_nchars() would do the same with what strnncollsp() do in the same collation, but with an extra limit on "nchars".

      This new function should also help to fix a bug in the similar code in InnoDB: see MDEV-25440 for details.

      Attachments

        Issue Links

          Activity

            bar Alexander Barkov created issue -
            bar Alexander Barkov made changes -
            Field Original Value New Value
            bar Alexander Barkov made changes -
            Assignee Alexander Barkov [ bar ]
            bar Alexander Barkov made changes -
            Description Field_string::cmp() seems to do unnecessary work trimming trailing spaces:

            {code:cpp}
            int Field_string::cmp(const uchar *a_ptr, const uchar *b_ptr) const
            {
              size_t a_len, b_len;

              if (mbmaxlen() != 1)
              {
                size_t char_len= Field_string::char_length();
                a_len= field_charset()->charpos(a_ptr, a_ptr + field_length, char_len);
                b_len= field_charset()->charpos(b_ptr, b_ptr + field_length, char_len);
              }
              else
                a_len= b_len= field_length;
              /*
                We have to remove end space to be able to compare multi-byte-characters
                like in latin_de 'ae' and 0xe4
              */
              return field_charset()->strnncollsp(a_ptr, a_len,
                                                  b_ptr, b_len);
            }
            {code}

            In absolute majority cases, the difference between strings is found in the very beginning of the compared strings. So doing charpos() on the two arguments, before passing them to the actual comparison function, looks like a waste of CPU.

            A better approach would be to implement a new comparison function with this tentative API:

            {code:cpp}
            int strnncollsp_nchars(CHARSET_INFO *cs,
                                   const char *s1, size_t len1,
                                   const char *s2, size_t len2,
                                   size_t nchars);
            {code}

            Internally, the exact virtial implementations of strnncollsp_nchars() would do the same with what strnncollsp() do in the same collation, but with an extra limit on "nchars".


            This new function should also help to fix a bug in the similar code in InnoDB: seeMDEV-25440 for details.


            {{Field_string::cmp()}} seems to do unnecessary work trimming trailing spaces:

            {code:cpp}
            int Field_string::cmp(const uchar *a_ptr, const uchar *b_ptr) const
            {
              size_t a_len, b_len;

              if (mbmaxlen() != 1)
              {
                size_t char_len= Field_string::char_length();
                a_len= field_charset()->charpos(a_ptr, a_ptr + field_length, char_len);
                b_len= field_charset()->charpos(b_ptr, b_ptr + field_length, char_len);
              }
              else
                a_len= b_len= field_length;
              /*
                We have to remove end space to be able to compare multi-byte-characters
                like in latin_de 'ae' and 0xe4
              */
              return field_charset()->strnncollsp(a_ptr, a_len,
                                                  b_ptr, b_len);
            }
            {code}

            In absolute majority cases, the difference between strings is found in the very beginning of the compared strings. So doing charpos() on the two arguments, before passing them to the actual comparison function, looks like an inefficient waste of CPU.

            A better approach would be to implement a new comparison function with this tentative API:

            {code:cpp}
            int strnncollsp_nchars(CHARSET_INFO *cs,
                                   const char *s1, size_t len1,
                                   const char *s2, size_t len2,
                                   size_t nchars);
            {code}

            Internally, the exact virtial implementations of strnncollsp_nchars() would do the same with what strnncollsp() do in the same collation, but with an extra limit on "nchars".


            This new function should also help to fix a bug in the similar code in InnoDB: seeMDEV-25440 for details.


            bar Alexander Barkov made changes -
            Description {{Field_string::cmp()}} seems to do unnecessary work trimming trailing spaces:

            {code:cpp}
            int Field_string::cmp(const uchar *a_ptr, const uchar *b_ptr) const
            {
              size_t a_len, b_len;

              if (mbmaxlen() != 1)
              {
                size_t char_len= Field_string::char_length();
                a_len= field_charset()->charpos(a_ptr, a_ptr + field_length, char_len);
                b_len= field_charset()->charpos(b_ptr, b_ptr + field_length, char_len);
              }
              else
                a_len= b_len= field_length;
              /*
                We have to remove end space to be able to compare multi-byte-characters
                like in latin_de 'ae' and 0xe4
              */
              return field_charset()->strnncollsp(a_ptr, a_len,
                                                  b_ptr, b_len);
            }
            {code}

            In absolute majority cases, the difference between strings is found in the very beginning of the compared strings. So doing charpos() on the two arguments, before passing them to the actual comparison function, looks like an inefficient waste of CPU.

            A better approach would be to implement a new comparison function with this tentative API:

            {code:cpp}
            int strnncollsp_nchars(CHARSET_INFO *cs,
                                   const char *s1, size_t len1,
                                   const char *s2, size_t len2,
                                   size_t nchars);
            {code}

            Internally, the exact virtial implementations of strnncollsp_nchars() would do the same with what strnncollsp() do in the same collation, but with an extra limit on "nchars".


            This new function should also help to fix a bug in the similar code in InnoDB: seeMDEV-25440 for details.


            {{Field_string::cmp()}} seems to do unnecessary work trimming trailing spaces:

            {code:cpp}
            int Field_string::cmp(const uchar *a_ptr, const uchar *b_ptr) const
            {
              size_t a_len, b_len;

              if (mbmaxlen() != 1)
              {
                size_t char_len= Field_string::char_length();
                a_len= field_charset()->charpos(a_ptr, a_ptr + field_length, char_len);
                b_len= field_charset()->charpos(b_ptr, b_ptr + field_length, char_len);
              }
              else
                a_len= b_len= field_length;
              /*
                We have to remove end space to be able to compare multi-byte-characters
                like in latin_de 'ae' and 0xe4
              */
              return field_charset()->strnncollsp(a_ptr, a_len,
                                                  b_ptr, b_len);
            }
            {code}

            In absolute majority cases, the difference between strings is found in the very beginning of the compared strings. So doing charpos() on the two arguments, before passing them to the actual comparison function, looks like an inefficient waste of CPU.

            A better approach would be to implement a new comparison function with this tentative API:

            {code:cpp}
            int strnncollsp_nchars(CHARSET_INFO *cs,
                                   const char *s1, size_t len1,
                                   const char *s2, size_t len2,
                                   size_t nchars);
            {code}

            Internally, the exact virtial implementations of strnncollsp_nchars() would do the same with what strnncollsp() do in the same collation, but with an extra limit on "nchars".


            This new function should also help to fix a bug in the similar code in InnoDB: see MDEV-25440 for details.


            marko Marko Mäkelä made changes -
            Issue Type Task [ 3 ] Bug [ 1 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Affects Version/s 10.5.0 [ 23709 ]
            Affects Version/s 10.4.0 [ 23115 ]
            Affects Version/s 10.3.0 [ 22127 ]
            Affects Version/s 10.2.2 [ 22013 ]
            Affects Version/s 10.6.0 [ 24431 ]
            Labels corruption regression-10.2 tech_debt
            bar Alexander Barkov made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            bar Alexander Barkov made changes -
            Assignee Alexander Barkov [ bar ] Marko Mäkelä [ marko ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Eugene Kosov [ kevg ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.6 [ 24028 ]
            bar Alexander Barkov made changes -
            serg Sergei Golubchik made changes -
            Assignee Eugene Kosov [ kevg ] Sergei Golubchik [ serg ]
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Alexander Barkov [ bar ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 121147 ] MariaDB v4 [ 143686 ]
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            bar Alexander Barkov made changes -
            Description {{Field_string::cmp()}} seems to do unnecessary work trimming trailing spaces:

            {code:cpp}
            int Field_string::cmp(const uchar *a_ptr, const uchar *b_ptr) const
            {
              size_t a_len, b_len;

              if (mbmaxlen() != 1)
              {
                size_t char_len= Field_string::char_length();
                a_len= field_charset()->charpos(a_ptr, a_ptr + field_length, char_len);
                b_len= field_charset()->charpos(b_ptr, b_ptr + field_length, char_len);
              }
              else
                a_len= b_len= field_length;
              /*
                We have to remove end space to be able to compare multi-byte-characters
                like in latin_de 'ae' and 0xe4
              */
              return field_charset()->strnncollsp(a_ptr, a_len,
                                                  b_ptr, b_len);
            }
            {code}

            In absolute majority cases, the difference between strings is found in the very beginning of the compared strings. So doing charpos() on the two arguments, before passing them to the actual comparison function, looks like an inefficient waste of CPU.

            A better approach would be to implement a new comparison function with this tentative API:

            {code:cpp}
            int strnncollsp_nchars(CHARSET_INFO *cs,
                                   const char *s1, size_t len1,
                                   const char *s2, size_t len2,
                                   size_t nchars);
            {code}

            Internally, the exact virtial implementations of strnncollsp_nchars() would do the same with what strnncollsp() do in the same collation, but with an extra limit on "nchars".


            This new function should also help to fix a bug in the similar code in InnoDB: see MDEV-25440 for details.


            h2. 2022-01-21 Update

            This problem was solved under terms of MDEV-25904

            h2. Old description

            {{Field_string::cmp()}} seems to do unnecessary work trimming trailing spaces:

            {code:cpp}
            int Field_string::cmp(const uchar *a_ptr, const uchar *b_ptr) const
            {
              size_t a_len, b_len;

              if (mbmaxlen() != 1)
              {
                size_t char_len= Field_string::char_length();
                a_len= field_charset()->charpos(a_ptr, a_ptr + field_length, char_len);
                b_len= field_charset()->charpos(b_ptr, b_ptr + field_length, char_len);
              }
              else
                a_len= b_len= field_length;
              /*
                We have to remove end space to be able to compare multi-byte-characters
                like in latin_de 'ae' and 0xe4
              */
              return field_charset()->strnncollsp(a_ptr, a_len,
                                                  b_ptr, b_len);
            }
            {code}

            In absolute majority cases, the difference between strings is found in the very beginning of the compared strings. So doing charpos() on the two arguments, before passing them to the actual comparison function, looks like an inefficient waste of CPU.

            A better approach would be to implement a new comparison function with this tentative API:

            {code:cpp}
            int strnncollsp_nchars(CHARSET_INFO *cs,
                                   const char *s1, size_t len1,
                                   const char *s2, size_t len2,
                                   size_t nchars);
            {code}

            Internally, the exact virtial implementations of strnncollsp_nchars() would do the same with what strnncollsp() do in the same collation, but with an extra limit on "nchars".


            This new function should also help to fix a bug in the similar code in InnoDB: see MDEV-25440 for details.


            bar Alexander Barkov made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Resolution Duplicate [ 3 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -

            People

              bar Alexander Barkov
              bar Alexander Barkov
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.