MariaDB Server / MDEV-33817

Implement AVX512BW and VPCLMULQDQ based CRC-32 algorithms

Details

    Description

      MDEV-22749 introduced faster checksum calculation on AMD64 by means of the 128-bit carry-less multiplication instruction pclmulqdq. Some more recent ISA extensions include the wider vpclmulqdq as well as vmovdqu8, which allows unaligned loads of up to 64 bytes at a time. Making use of these instructions could greatly improve performance.

      Some reference implementations exist in NASM format in https://github.com/intel/intel-ipsec-mb/ and https://github.com/intel/isa-l/ under the BSD 3-clause license. In those libraries, CRC-32 with the ISO 3309 polynomial is called gzip, and CRC-32 with the Castagnoli polynomial is referred to as SCTP or iSCSI.
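
      For reference, the two reflected polynomials (0xEDB88320 for ISO 3309, 0x82F63B78 for Castagnoli) define the checksums via the minimal bitwise sketch below; any SIMD implementation merely has to reproduce this function faster. Function names here are illustrative, not the mysys ones:

      #include <stddef.h>
      #include <stdint.h>

      /* Reference only: bit-at-a-time CRC over a reflected polynomial,
         far too slow for production use. */
      static uint32_t crc32_bitwise(uint32_t crc, const void *buf,
                                    size_t size, uint32_t poly)
      {
        const unsigned char *b= (const unsigned char*) buf;
        crc= ~crc;
        while (size--)
        {
          crc^= *b++;
          for (int bit= 0; bit < 8; bit++)
            crc= (crc >> 1) ^ ((crc & 1) ? poly : 0);
        }
        return ~crc;
      }

      /* CRC-32 (ISO 3309, "gzip") and CRC-32C (Castagnoli) */
      static uint32_t crc32_iso3309(uint32_t crc, const void *buf, size_t size)
      { return crc32_bitwise(crc, buf, size, 0xEDB88320); }
      static uint32_t crc32c(uint32_t crc, const void *buf, size_t size)
      { return crc32_bitwise(crc, buf, size, 0x82F63B78); }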

          Activity

            Starting with GCC 8, clang 6, and MSVC 19.15, the following test program compiles into something that includes the vpclmulqdq instruction:

            #include <immintrin.h>
            #ifdef __GNUC__
            __attribute__((target("avx512f,vpclmulqdq")))
            #endif
            unsigned f()
            {
              __m512i a= _mm512_setzero_si512();
              return _mm512_reduce_add_epi32(_mm512_clmulepi64_epi128(a, a, 0x42));
            }
            

            marko Marko Mäkelä added a comment
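
            Besides compiler support, the feature also has to be detected at run time. A hypothetical detection routine (in the spirit of the have_vpclmulqdq() function mentioned in a later comment, but not identical to it) could combine cpuid with xgetbv, because the OS must also enable saving of the ZMM state:

            #include <cpuid.h>

            /* Hypothetical sketch for GCC/clang */
            static bool detect_vpclmulqdq()
            {
              unsigned eax, ebx, ecx, edx;
              if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                return false;
              /* EBX: AVX512F(16), AVX512DQ(17), AVX512BW(30), AVX512VL(31);
                 ECX: VPCLMULQDQ(10) */
              const unsigned need= 1U<<16 | 1U<<17 | 1U<<30 | 1U<<31;
              if ((ebx & need) != need || !(ecx & 1U<<10))
                return false;
              /* OSXSAVE must be set before xgetbv may be executed */
              if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & 1U<<27))
                return false;
              unsigned xcr0;
              __asm__("xgetbv" : "=a"(xcr0) : "c"(0) : "edx");
              /* XCR0 bits 1,2,5,6,7: SSE, AVX and the full ZMM state */
              return (xcr0 & 0xe6) == 0xe6;
            }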

            I believe that https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx512_t2/crc32_by16_vclmul_avx512.asm can compute any CRC up to CRC-32 on arbitrarily aligned and sized buffers. In the library there are multiple routines that pass different tables to this function, based on the GF(2) polynomial. https://en.wikipedia.org/wiki/Carry-less_product helps in understanding this. Thanks to the vmovdqu8 instruction, alignment to cache line boundaries is not a problem. The first loop processes 256 bytes at a time, and subsequent loops process any remaining part. All CRC-related computations are performed with 512-bit vpclmulqdq instructions.
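
            The heart of those loops is the carry-less "fold" step: the running remainder is multiplied by precomputed powers of x modulo the CRC polynomial and combined with the next block of input. A minimal sketch of one 512-bit fold over 64 bytes (the two 64-bit constants in fold_k depend on the polynomial, the bit order, and the folding distance, and are omitted here):

            #include <immintrin.h>

            #ifdef __GNUC__
            __attribute__((target("avx512f,vpclmulqdq")))
            #endif
            static __m512i fold512(__m512i crc, const void *buf, __m512i fold_k)
            {
              /* unaligned 64-byte load of the next input block */
              __m512i data= _mm512_loadu_si512(buf);
              /* multiply the low (0x00) and high (0x11) 64-bit halves of each
                 128-bit lane by the two folding constants */
              __m512i lo= _mm512_clmulepi64_epi128(crc, fold_k, 0x00);
              __m512i hi= _mm512_clmulepi64_epi128(crc, fold_k, 0x11);
              /* 3-way xor via vpternlogq (imm 0x96 == a^b^c) */
              return _mm512_ternarylogic_epi64(lo, hi, data, 0x96);
            }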

            I think that as a first cut, it should be easiest to implement this in inline assembler for GCC and clang and similar compilers. Later on, it could be ported to the intrinsic functions so that it would also work on MSVC.

            marko Marko Mäkelä added a comment

            I ended up converting the assembler code to something that uses the intrinsic functions. I am happy to see that this allows extensive compile-time type checking, as well as checking that the necessary target attributes (avx512bw,avx512vl,avx512f,avx512dq,vpclmulqdq) have been specified.

            While doing this, I learned that the 128-bit XMM registers alias the low 128 bits of the 256-bit YMM registers, which in turn alias the low 256 bits of the 512-bit ZMM registers. The assembler syntax was somewhat confusing. Also, the VEX-encoded form of the 128-bit pclmulqdq instruction is itself named vpclmulqdq, which adds to the confusion.
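
            The aliasing shows up in the intrinsics as zero-cost casts between the vector types; a small illustration (not from the patch):

            #include <immintrin.h>

            #ifdef __GNUC__
            __attribute__((target("avx512f")))
            #endif
            static __m128i alias_demo(__m512i z)
            {
              /* These casts emit no instructions: an XMM register is simply
                 the low 128 bits of the corresponding YMM/ZMM register. */
              __m256i y= _mm512_castsi512_si256(z);
              __m128i x= _mm256_castsi256_si128(y);
              /* The widening direction leaves the upper bits undefined,
                 unless _mm512_zextsi128_si512() is used where available. */
              __m512i z2= _mm512_castsi128_si512(x);
              return _mm512_castsi512_si128(z2);
            }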

            marko Marko Mäkelä added a comment

            _mm_loadu_epi64(), which wraps the unaligned load instruction vmovdqu64 (AVX512F, with AVX512VL for the 128-bit form), is only available starting with GCC 11 or clang 8.
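
            On older compilers, an equivalent unaligned load can be had from the plain SSE2 intrinsic, whose prototype merely has a stricter pointer type; a possible fallback (the helper name is mine):

            #include <immintrin.h>

            static inline __m128i load_u128(const void *b)
            {
            #if (defined __clang_major__ && __clang_major__ >= 8) || \
                (!defined __clang__ && defined __GNUC__ && __GNUC__ >= 11)
              return _mm_loadu_epi64(b);
            #else
              /* SSE2: the same 16-byte unaligned load */
              return _mm_loadu_si128((const __m128i*) b);
            #endif
            }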

            marko Marko Mäkelä added a comment

            Today, I got this working for message lengths up to 16 bytes with both CRC-32 polynomials. There are two variants of the algorithm: for reflected polynomials (crc32_refl_by16_vclmul_avx512) and a slightly more complex one for non-reflected ones (crc32_by16_vclmul_avx512). The reflected one matches our CRC-32 and CRC-32C functions.
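
            Short and partial blocks are where the AVX512BW byte masking helps: a trailing fragment can be fetched with a masked vmovdqu8 instead of a byte loop, without reading past the end of the buffer. A sketch of the idea (hypothetical helper, not necessarily how the patch does it):

            #include <immintrin.h>
            #include <stddef.h>

            /* Load the first len (len < 16) bytes of buf into the low bytes of
               an XMM register, zeroing the rest; masked-out bytes cannot fault. */
            #ifdef __GNUC__
            __attribute__((target("avx512bw,avx512vl")))
            #endif
            static __m128i load_partial16(const void *buf, size_t len)
            {
              return _mm_maskz_loadu_epi8((__mmask16)((1U << len) - 1), buf);
            }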

            marko Marko Mäkelä added a comment

            I finally started to get correct results today. The last bit was that when built with clang (any version between 8 and 18 that I tried), I would get wrong results for lengths above 64. In the end, I came up with the following tweak in order to avoid suboptimal code on GCC:

            static inline __m512i shrl512_384(__m512i a)
            {
            #if defined __GNUC__ && __GNUC__ >= 11
              /* While technically incorrect, this would seem to translate into a
                 vextracti32x4 instruction, which actually outputs a ZMM register
                 (anything above the XMM range is cleared). */
              return _mm512_castsi128_si512(_mm512_extracti64x2_epi64(a, 3));
            #else
              /* On clang, this is needed in order to get a correct result. */
              return _mm512_maskz_shuffle_i64x2(3, a, a, 3);
            #endif
            }
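
            The intended semantics of shrl512_384() is a logical right shift of the full 512-bit value by 384 bits: the topmost 128-bit lane moves to lane 0 and everything above it becomes zero. A quick self-check against a plain byte move (hypothetical test, not part of the patch):

            #include <immintrin.h>
            #include <assert.h>
            #include <string.h>

            #ifdef __GNUC__
            __attribute__((target("avx512f,avx512dq,avx512bw,avx512vl")))
            #endif
            static void check_shrl512_384()
            {
              unsigned char in[64], expect[64]= {0}, out[64];
              for (int i= 0; i < 64; i++) in[i]= (unsigned char) i;
              /* shifting right by 384 bits == moving the top 16 bytes down */
              memcpy(expect, in + 48, 16);
              _mm512_storeu_si512(out, shrl512_384(_mm512_loadu_si512(in)));
              assert(!memcmp(out, expect, 64));
            }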
            

            marko Marko Mäkelä added a comment

            I created a stand-alone version of this as well, at https://github.com/dr-m/crc32_simd/, and I tested it on both x86 (IA-32) and x86-64 (AMD64). The stand-alone version implements CRC-32 and CRC-32C for both reflected and non-reflected polynomials. MySQL and MariaDB always used the reflected polynomials, because in that way the calculations are a little simpler.

            marko Marko Mäkelä added a comment

            I tested this a little on the current top of the MariaDB Server 10.11 branch, on a 144-thread system with Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, running Ubuntu 20.04 with the 5.4.0-90-generic kernel, compiled with GCC 13.1.0.

            My test scenario was as follows:

            1. Start a Sysbench oltp_update_index workload with 144 concurrent client connections, on 32 tables of 10,000 rows each, with a 10 GiB buffer pool and log file, so that there will be no log checkpoint during the 60-second workload.
            2. About 40 seconds into the benchmark workload, start the backup; I had to specify --innodb-log-buffer-size=512m due to MDEV-34062.
            3. Shut down the server. The backup contains a 2.6 GiB ib_logfile0 and is 6.2 GiB in total.
            4. Prepare a copy of the backup (so that we can do this multiple times on the same data). I specified use_memory=1g and the maximum innodb_read_io_threads=64 and innodb_write_io_threads=64.

            When I disabled the AVX512 checksum code, preparing the backup would finish in 103 seconds using the crc32c_3way implementation. With the new crc32_avx512 it would complete in 110 seconds, both when run under perf record. In the perf report, the CRC-32C function accounted for the exact same share of samples in both runs: 1.57%.

            I reran this a few more times, without perf record.

            crc32c_3way                        crc32_avx512
            real/s     user/s     system/s    real/s     user/s     system/s
            93.485     168.489    71.217      89.479     164.622    74.979
            87.711     155.903    69.669      83.792     149.705    60.143
            87.903     161.782    68.669      104.265    203.968    99.505
            87.733     159.429    60.014      95.112     172.540    88.358

            There is quite a bit of fluctuation in the numbers, considerably more with AVX512. In the perf report we can see plenty of context-switching overhead and other bottlenecks; there definitely is room for improvement outside the CRC-32C calculation. If we take the minimum reported times, it does not look too bad: 83.792s vs. 87.711s is about 4.5% real time saved, and 149.705s vs. 155.903s about 4.0% user CPU time saved.

            I also ran a single-threaded test of computing a checksum on a 1 GiB buffer:

            diff --git a/mysys/crc32/crc32c_x86.cc b/mysys/crc32/crc32c_x86.cc
            index 86f0976492b..eaf21148320 100644
            --- a/mysys/crc32/crc32c_x86.cc
            +++ b/mysys/crc32/crc32c_x86.cc
            @@ -368,7 +368,7 @@ static unsigned crc32_avx512(unsigned crc, const char *buf, size_t size,
             }
             
             static ATTRIBUTE_NOINLINE int have_vpclmulqdq()
            -{
            +{return 0;
             # ifdef _MSC_VER
               int regs[4];
               __cpuidex(regs, 7, 0);
            diff --git a/unittest/mysys/crc32-t.c b/unittest/mysys/crc32-t.c
            index 7079aeb614a..a7a2d89a8f2 100644
            --- a/unittest/mysys/crc32-t.c
            +++ b/unittest/mysys/crc32-t.c
            @@ -95,6 +95,7 @@ static const char STR[]=
             int main(int argc __attribute__((unused)),char *argv[])
             {
               MY_INIT(argv[0]);
            +#if 0
               init_lookup(tab_3309, 0xedb88320);
               init_lookup(tab_castagnoli, 0x82f63b78);
             
            @@ -142,4 +143,7 @@ int main(int argc __attribute__((unused)),char *argv[])
             
               my_end(0);
               return exit_status();
            +#else
            +  return my_crc32c(0,mmap(0, 1<<30,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANONYMOUS,-1,0),1<<30);
            +#endif
             }
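
            For reproducing the measurement outside the test suite, a hypothetical stand-alone harness along these lines would do; the crc32_func type is an illustrative stand-in for whichever implementation is being measured, and mmap provides a zero-filled 1 GiB buffer as in the hack above:

            #include <stdio.h>
            #include <stddef.h>
            #include <time.h>
            #include <sys/mman.h>

            typedef unsigned (*crc32_func)(unsigned, const void*, size_t);

            /* Time one pass of fn over 1 GiB of zero-filled memory. */
            static double time_crc(crc32_func fn)
            {
              const size_t size= 1U << 30;
              void *buf= mmap(0, size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (buf == MAP_FAILED)
                return -1;
              struct timespec t0, t1;
              clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
              unsigned crc= fn(0, buf, size);
              clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
              munmap(buf, size);
              printf("crc=%08x\n", crc); /* defeat dead-code elimination */
              return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            }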
            

            For this, there was a more impressive improvement. The minimum reported user-space CPU times were 160 milliseconds (crc32c_3way) and 52 milliseconds (crc32_avx512), which works out to roughly 6.3 GiB/s versus 19.2 GiB/s, or about a 3× speedup.

            marko Marko Mäkelä added a comment

            People

              Assignee: Marko Mäkelä
              Reporter: Marko Mäkelä
              Votes: 0
              Watchers: 2
