MariaDB Server / MDEV-33817

Implement AVX512BW and VPCLMULQDQ based CRC-32 algorithms

Details

    Description

      MDEV-22749 introduced faster checksum calculation on AMD64 by means of the 128-bit carry-less multiplication instruction pclmulqdq. Some more recent ISA extensions include the wider vpclmulqdq as well as vmovdqu8, which allows unaligned loads of up to 64 bytes at a time. Making use of these instructions could greatly improve performance.

      Some reference implementations exist in NASM format in https://github.com/intel/intel-ipsec-mb/ and https://github.com/intel/isa-l/ under the BSD 3-clause license. In those libraries, CRC-32 with the ISO 3309 polynomial is called gzip, and CRC-32 with the Castagnoli polynomial is referred to as SCTP or iSCSI.
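
      For reference, the two reflected polynomials (0xEDB88320 for ISO 3309, 0x82F63B78 for Castagnoli) define the checksums via the minimal bitwise sketch below; any SIMD implementation merely has to reproduce this function faster. Function names here are illustrative, not the mysys ones:

      #include <stddef.h>
      #include <stdint.h>

      /* Reference only: bit-at-a-time CRC over a reflected polynomial,
         far too slow for production use. */
      static uint32_t crc32_bitwise(uint32_t crc, const void *buf,
                                    size_t size, uint32_t poly)
      {
        const unsigned char *b= (const unsigned char*) buf;
        crc= ~crc;
        while (size--)
        {
          crc^= *b++;
          for (int bit= 0; bit < 8; bit++)
            crc= (crc >> 1) ^ ((crc & 1) ? poly : 0);
        }
        return ~crc;
      }

      /* CRC-32 (ISO 3309, "gzip") and CRC-32C (Castagnoli) */
      static uint32_t crc32_iso3309(uint32_t crc, const void *buf, size_t size)
      { return crc32_bitwise(crc, buf, size, 0xEDB88320); }
      static uint32_t crc32c(uint32_t crc, const void *buf, size_t size)
      { return crc32_bitwise(crc, buf, size, 0x82F63B78); }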

          Activity

            Starting with GCC 8, clang 6, and MSVC 19.15, the following test program compiles into something that includes the vpclmulqdq instruction:

            #include <immintrin.h>
            #ifdef __GNUC__
            __attribute__((target("avx512f,vpclmulqdq")))
            #endif
            unsigned f()
            {
              __m512i a= _mm512_setzero_si512();
              return _mm512_reduce_add_epi32(_mm512_clmulepi64_epi128(a, a, 0x42));
            }
            

            marko Marko Mäkelä added a comment
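
            Besides compiler support, the feature also has to be detected at run time. A hypothetical detection routine (in the spirit of the have_vpclmulqdq() function mentioned in a later comment, but not identical to it) could combine cpuid with xgetbv, because the OS must also enable saving of the ZMM state:

            #include <cpuid.h>

            /* Hypothetical sketch for GCC/clang */
            static bool detect_vpclmulqdq()
            {
              unsigned eax, ebx, ecx, edx;
              if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                return false;
              /* EBX: AVX512F(16), AVX512DQ(17), AVX512BW(30), AVX512VL(31);
                 ECX: VPCLMULQDQ(10) */
              const unsigned need= 1U<<16 | 1U<<17 | 1U<<30 | 1U<<31;
              if ((ebx & need) != need || !(ecx & 1U<<10))
                return false;
              /* OSXSAVE must be set before xgetbv may be executed */
              if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & 1U<<27))
                return false;
              unsigned xcr0;
              __asm__("xgetbv" : "=a"(xcr0) : "c"(0) : "edx");
              /* XCR0 bits 1,2,5,6,7: SSE, AVX and the full ZMM state */
              return (xcr0 & 0xe6) == 0xe6;
            }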

            I believe that https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx512_t2/crc32_by16_vclmul_avx512.asm can compute any CRC up to CRC-32 on arbitrarily aligned and sized buffers. In the library there are multiple routines that pass different tables to this function, based on the GF(2) polynomial. https://en.wikipedia.org/wiki/Carry-less_product helps in understanding this. Thanks to the vmovdqu8 instruction, alignment to cache line boundaries is not a problem. The first loop processes 256 bytes at a time, and subsequent loops process any remaining part. All CRC-related computations are performed with 512-bit vpclmulqdq instructions.
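
            The heart of those loops is the carry-less "fold" step: the running remainder is multiplied by precomputed powers of x modulo the CRC polynomial and combined with the next block of input. A minimal sketch of one 512-bit fold over 64 bytes (the two 64-bit constants in fold_k depend on the polynomial, the bit order, and the folding distance, and are omitted here):

            #include <immintrin.h>

            #ifdef __GNUC__
            __attribute__((target("avx512f,vpclmulqdq")))
            #endif
            static __m512i fold512(__m512i crc, const void *buf, __m512i fold_k)
            {
              /* unaligned 64-byte load of the next input block */
              __m512i data= _mm512_loadu_si512(buf);
              /* multiply the low (0x00) and high (0x11) 64-bit halves of each
                 128-bit lane by the two folding constants */
              __m512i lo= _mm512_clmulepi64_epi128(crc, fold_k, 0x00);
              __m512i hi= _mm512_clmulepi64_epi128(crc, fold_k, 0x11);
              /* 3-way xor via vpternlogq (imm 0x96 == a^b^c) */
              return _mm512_ternarylogic_epi64(lo, hi, data, 0x96);
            }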

            I think that as a first cut, it should be easiest to implement this in inline assembler for GCC and clang and similar compilers. Later on, it could be ported to the intrinsic functions so that it would also work on MSVC.

            marko Marko Mäkelä added a comment

            I ended up converting the assembler code to something that uses the intrinsic functions. I am happy to see that this allows extensive compile-time type checking, as well as checking that the necessary target attributes (avx512bw,avx512vl,avx512f,avx512dq,vpclmulqdq) have been specified.

            While doing this, I learned that the 128-bit XMM registers alias the low 128 bits of the 256-bit YMM registers, which in turn alias the low 256 bits of the 512-bit ZMM registers. The assembler syntax was somewhat confusing. Also, the VEX-encoded form of the 128-bit pclmulqdq instruction is itself named vpclmulqdq, which adds to the confusion.
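
            The aliasing shows up in the intrinsics as zero-cost casts between the vector types; a small illustration (not from the patch):

            #include <immintrin.h>

            #ifdef __GNUC__
            __attribute__((target("avx512f")))
            #endif
            static __m128i alias_demo(__m512i z)
            {
              /* These casts emit no instructions: an XMM register is simply
                 the low 128 bits of the corresponding YMM/ZMM register. */
              __m256i y= _mm512_castsi512_si256(z);
              __m128i x= _mm256_castsi256_si128(y);
              /* The widening direction leaves the upper bits undefined,
                 unless _mm512_zextsi128_si512() is used where available. */
              __m512i z2= _mm512_castsi128_si512(x);
              return _mm512_castsi512_si128(z2);
            }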

            marko Marko Mäkelä added a comment

            _mm_loadu_epi64(), which wraps the unaligned load instruction vmovdqu64 (AVX512F, with AVX512VL for the 128-bit form), is only available starting with GCC 11 or clang 8.
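
            On older compilers, an equivalent unaligned load can be had from the plain SSE2 intrinsic, whose prototype merely has a stricter pointer type; a possible fallback (the helper name is mine):

            #include <immintrin.h>

            static inline __m128i load_u128(const void *b)
            {
            #if (defined __clang_major__ && __clang_major__ >= 8) || \
                (!defined __clang__ && defined __GNUC__ && __GNUC__ >= 11)
              return _mm_loadu_epi64(b);
            #else
              /* SSE2: the same 16-byte unaligned load */
              return _mm_loadu_si128((const __m128i*) b);
            #endif
            }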

            marko Marko Mäkelä added a comment

            Today, I got this working for message lengths up to 16 bytes with both CRC-32 polynomials. There are two variants of the algorithm: for reflected polynomials (crc32_refl_by16_vclmul_avx512) and a slightly more complex one for non-reflected ones (crc32_by16_vclmul_avx512). The reflected one matches our CRC-32 and CRC-32C functions.
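
            Short and partial blocks are where the AVX512BW byte masking helps: a trailing fragment can be fetched with a masked vmovdqu8 instead of a byte loop, without reading past the end of the buffer. A sketch of the idea (hypothetical helper, not necessarily how the patch does it):

            #include <immintrin.h>
            #include <stddef.h>

            /* Load the first len (len < 16) bytes of buf into the low bytes of
               an XMM register, zeroing the rest; masked-out bytes cannot fault. */
            #ifdef __GNUC__
            __attribute__((target("avx512bw,avx512vl")))
            #endif
            static __m128i load_partial16(const void *buf, size_t len)
            {
              return _mm_maskz_loadu_epi8((__mmask16)((1U << len) - 1), buf);
            }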

            marko Marko Mäkelä added a comment

            I finally started to get correct results today. The last bit was that when built with clang (any version between 8 and 18 that I tried), I would get wrong results for lengths above 64. In the end, I came up with the following tweak in order to avoid suboptimal code on GCC:

            static inline __m512i shrl512_384(__m512i a)
            {
            #if defined __GNUC__ && __GNUC__ >= 11
              /* While technically incorrect, this would seem to translate into a
                 vextracti32x4 instruction, which actually outputs a ZMM register
                 (anything above the XMM range is cleared). */
              return _mm512_castsi128_si512(_mm512_extracti64x2_epi64(a, 3));
            #else
              /* On clang, this is needed in order to get a correct result. */
              return _mm512_maskz_shuffle_i64x2(3, a, a, 3);
            #endif
            }
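
            The intended semantics of shrl512_384() is a logical right shift of the full 512-bit value by 384 bits: the topmost 128-bit lane moves to lane 0 and everything above it becomes zero. A quick self-check against a plain byte move (hypothetical test, not part of the patch):

            #include <immintrin.h>
            #include <assert.h>
            #include <string.h>

            #ifdef __GNUC__
            __attribute__((target("avx512f,avx512dq,avx512bw,avx512vl")))
            #endif
            static void check_shrl512_384()
            {
              unsigned char in[64], expect[64]= {0}, out[64];
              for (int i= 0; i < 64; i++) in[i]= (unsigned char) i;
              /* shifting right by 384 bits == moving the top 16 bytes down */
              memcpy(expect, in + 48, 16);
              _mm512_storeu_si512(out, shrl512_384(_mm512_loadu_si512(in)));
              assert(!memcmp(out, expect, 64));
            }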
            

            marko Marko Mäkelä added a comment

            I created a stand-alone version of this as well, at https://github.com/dr-m/crc32_simd/, and I tested it on both x86 (IA-32) and x86-64 (AMD64). The stand-alone version implements CRC-32 and CRC-32C for both reflected and non-reflected polynomials. MySQL and MariaDB always used the reflected polynomials, because in that way the calculations are a little simpler.

            marko Marko Mäkelä added a comment

            I tested this a little on the current top of the MariaDB Server 10.11 branch, on a 144-thread system with Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, running Ubuntu 20.04 with the 5.4.0-90-generic kernel, compiled with GCC 13.1.0.

            My test scenario was as follows:

            1. Start a Sysbench oltp_update_index workload with 144 concurrent client connections, on 32 tables of 10,000 rows each, with a 10 GiB buffer pool and log file, so that there will be no log checkpoint during the 60-second workload.
            2. About 40 seconds into the benchmark workload, start the backup; I had to specify --innodb-log-buffer-size=512m due to MDEV-34062.
            3. Shut down the server. The backup contains a 2.6 GiB ib_logfile0 and is 6.2 GiB in total.
            4. Prepare a copy of the backup (so that we can do this multiple times on the same data). I specified use_memory=1g and the maximum innodb_read_io_threads=64 and innodb_write_io_threads=64.

            When I disabled the AVX512 checksum code, preparing the backup would finish in 103 seconds using the crc32c_3way implementation. With the new crc32_avx512 it would complete in 110 seconds, both when run under perf record. In the perf report, the CRC-32C function accounted for the exact same share of samples in both runs: 1.57%.

            I reran this a few more times, without perf record.

            crc32c_3way                        crc32_avx512
            real/s     user/s     system/s    real/s     user/s     system/s
            93.485     168.489    71.217      89.479     164.622    74.979
            87.711     155.903    69.669      83.792     149.705    60.143
            87.903     161.782    68.669      104.265    203.968    99.505
            87.733     159.429    60.014      95.112     172.540    88.358

            There is quite a bit of fluctuation in the numbers, considerably more with AVX512. In the perf report we can see plenty of context-switching overhead and other bottlenecks; there definitely is room for improvement outside the CRC-32C calculation. If we take the minimum reported times, it does not look too bad: 83.792s vs. 87.711s is about 4.5% real time saved, and 149.705s vs. 155.903s about 4.0% user CPU time saved.

            I also ran a single-threaded test of computing a checksum on a 1 GiB buffer:

            diff --git a/mysys/crc32/crc32c_x86.cc b/mysys/crc32/crc32c_x86.cc
            index 86f0976492b..eaf21148320 100644
            --- a/mysys/crc32/crc32c_x86.cc
            +++ b/mysys/crc32/crc32c_x86.cc
            @@ -368,7 +368,7 @@ static unsigned crc32_avx512(unsigned crc, const char *buf, size_t size,
             }
             
             static ATTRIBUTE_NOINLINE int have_vpclmulqdq()
            -{
            +{return 0;
             # ifdef _MSC_VER
               int regs[4];
               __cpuidex(regs, 7, 0);
            diff --git a/unittest/mysys/crc32-t.c b/unittest/mysys/crc32-t.c
            index 7079aeb614a..a7a2d89a8f2 100644
            --- a/unittest/mysys/crc32-t.c
            +++ b/unittest/mysys/crc32-t.c
            @@ -95,6 +95,7 @@ static const char STR[]=
             int main(int argc __attribute__((unused)),char *argv[])
             {
               MY_INIT(argv[0]);
            +#if 0
               init_lookup(tab_3309, 0xedb88320);
               init_lookup(tab_castagnoli, 0x82f63b78);
             
            @@ -142,4 +143,7 @@ int main(int argc __attribute__((unused)),char *argv[])
             
               my_end(0);
               return exit_status();
            +#else
            +  return my_crc32c(0,mmap(0, 1<<30,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANONYMOUS,-1,0),1<<30);
            +#endif
             }
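
            For reproducing the measurement outside the test suite, a hypothetical stand-alone harness along these lines would do; the crc32_func type is an illustrative stand-in for whichever implementation is being measured, and mmap provides a zero-filled 1 GiB buffer as in the hack above:

            #include <stdio.h>
            #include <stddef.h>
            #include <time.h>
            #include <sys/mman.h>

            typedef unsigned (*crc32_func)(unsigned, const void*, size_t);

            /* Time one pass of fn over 1 GiB of zero-filled memory. */
            static double time_crc(crc32_func fn)
            {
              const size_t size= 1U << 30;
              void *buf= mmap(0, size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (buf == MAP_FAILED)
                return -1;
              struct timespec t0, t1;
              clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
              unsigned crc= fn(0, buf, size);
              clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
              munmap(buf, size);
              printf("crc=%08x\n", crc); /* defeat dead-code elimination */
              return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            }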
            

            For this, there was a more impressive improvement. The minimum reported user-space CPU times were 160 milliseconds (crc32c_3way) and 52 milliseconds (crc32_avx512), which works out to roughly 6.3 GiB/s versus 19.2 GiB/s, or about a 3× speedup.

            marko Marko Mäkelä added a comment

            People

              Assignee: Marko Mäkelä
              Reporter: Marko Mäkelä
              Votes: 0
              Watchers: 2
