I tested this a little on the current top of the MariaDB Server 10.11 branch, on a 144-thread system with an Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, running Ubuntu 20.04 with the 5.4.0-90-generic kernel, compiled with GCC 13.1.0.
My test scenario was as follows:
- Start a Sysbench oltp_update_index workload with 144 concurrent client connections, on 32 tables of 10000 rows each, with 10GiB of buffer pool and log file size, so that there will be no log checkpoint during the 60-second workload.
- About 40 seconds into the benchmark workload, start the backup; I had to specify --innodb-log-buffer-size=512m due to MDEV-34062.
- Shut down the server. The backup contains a 2.6 GiB ib_logfile0 and is 6.2 GiB in total.
- Prepare a copy of the backup (so that we can do this multiple times on the same data). I specified use_memory=1g and the maximum innodb_read_io_threads=64 and innodb_write_io_threads=64.
When I disabled the AVX512 checksum code, preparing the backup finished in 103 seconds using the crc32c_3way implementation. With the new crc32_avx512 it completed in 110 seconds, both runs under perf record. In the perf report, the CRC-32C function accounted for exactly the same share of samples in both runs: 1.57%.
I reran this a few more times, without perf record.
| crc32c_3way real time/s | user/s | system/s | crc32_avx512 real time/s | user/s | system/s |
| 93.485 | 168.489 | 71.217 | 89.479 | 164.622 | 74.979 |
| 87.711 | 155.903 | 69.669 | 83.792 | 149.705 | 60.143 |
| 87.903 | 161.782 | 68.669 | 104.265 | 203.968 | 99.505 |
| 87.733 | 159.429 | 60.014 | 95.112 | 172.540 | 88.358 |
There is quite a bit of fluctuation in the numbers, considerably more with AVX512. In the perf report we can see plenty of context-switching overhead and other bottlenecks; there definitely is room for improvement outside the CRC-32C calculation. If we take the minimum reported times, it does not look too bad: 83.792s vs. 87.711s is about 4.5% real time saved, and 149.705s vs. 155.903s is about 4.0% user CPU time saved.
I also ran a single-threaded test of computing a checksum on a 1 GiB buffer:
diff --git a/mysys/crc32/crc32c_x86.cc b/mysys/crc32/crc32c_x86.cc
index 86f0976492b..eaf21148320 100644
--- a/mysys/crc32/crc32c_x86.cc
+++ b/mysys/crc32/crc32c_x86.cc
@@ -368,7 +368,7 @@ static unsigned crc32_avx512(unsigned crc, const char *buf, size_t size,
 }

 static ATTRIBUTE_NOINLINE int have_vpclmulqdq()
-{
+{return 0;
 # ifdef _MSC_VER
   int regs[4];
   __cpuidex(regs, 7, 0);
diff --git a/unittest/mysys/crc32-t.c b/unittest/mysys/crc32-t.c
index 7079aeb614a..a7a2d89a8f2 100644
--- a/unittest/mysys/crc32-t.c
+++ b/unittest/mysys/crc32-t.c
@@ -95,6 +95,7 @@ static const char STR[]=
 int main(int argc __attribute__((unused)),char *argv[])
 {
   MY_INIT(argv[0]);
+#if 0
   init_lookup(tab_3309, 0xedb88320);
   init_lookup(tab_castagnoli, 0x82f63b78);

@@ -142,4 +143,7 @@ int main(int argc __attribute__((unused)),char *argv[])

   my_end(0);
   return exit_status();
+#else
+  return my_crc32c(0, mmap(0, 1<<30, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0), 1<<30);
+#endif
 }
For this, there was a more impressive improvement: the minimum reported user-space CPU times were 160 milliseconds with crc32c_3way and 52 milliseconds with crc32_avx512, roughly a 3× speedup.
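As a side note, anyone wanting to repeat this kind of measurement outside the MariaDB source tree could use a self-contained harness along the following lines. This is only a sketch: it times the plain SSE4.2 CRC-32C instruction in a simple 8-bytes-at-a-time loop rather than my_crc32c, so it measures the baseline that both crc32c_3way and crc32_avx512 are designed to beat.

/* Sketch only: time a serial SSE4.2 CRC-32C loop over a 1 GiB
   anonymous mapping. Build with: cc -O2 -msse4.2 */
#include <nmmintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
  const size_t size= (size_t) 1 << 30;
  /* MAP_ANONYMOUS memory is zero-filled; the contents do not matter
     for a pure throughput measurement. */
  const uint8_t *buf= mmap(0, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  struct timespec t0, t1;
  uint64_t crc= ~0U;
  if (buf == MAP_FAILED)
    return 1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (size_t i= 0; i < size; i+= 8)
    crc= _mm_crc32_u64(crc, *(const uint64_t *) (buf + i));
  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("crc=%08x, %.3f ms\n", (unsigned) ~crc,
         (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
  return 0;
}

The serial dependency on crc in this loop is precisely what the 3-way and VPCLMULQDQ-based implementations break up by processing several independent streams in parallel.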
Starting with GCC 8, clang 6, and MSVC 19.15, the following test program compiles into something that includes the vpclmulqdq instruction:
#include <immintrin.h>
#ifdef __GNUC__
__attribute__((target("avx512f,vpclmulqdq")))
#endif
unsigned f()
{
  __m512i a= _mm512_setzero_si512();
  a= _mm512_clmulepi64_epi128(a, a, 0);
  return _mm_cvtsi128_si32(_mm512_castsi512_si128(a));
}
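Compiler support is of course only half of the picture; at run time, have_vpclmulqdq() has to consult CPUID before crc32_avx512 can be selected. A minimal sketch of that check for GCC or clang follows; the helper name vpclmulqdq_supported is mine, and a complete check, like the real have_vpclmulqdq(), must also verify the AVX-512 feature bits and that the OS saves the ZMM register state:

/* Sketch only: CPUID leaf 7, subleaf 0 reports VPCLMULQDQ in ECX bit 10.
   MSVC would use __cpuidex(regs, 7, 0) as in the diff above. */
#include <cpuid.h>

static int vpclmulqdq_supported(void)
{
  unsigned eax, ebx, ecx, edx;
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return 0;
  return (ecx >> 10) & 1;
}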