I have no objections to pushing the feature as of c3f901d1a9 into the main branch and releasing it with 11.8.1.
However, given that it's meant as a performance improvement, proper performance tests could make sense here.
I ran concurrent random tests targeting stability, and different variations of MTR tests for functionality, on a cloud machine equipped with:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
The patch coverage is full in both concurrent tests and MTR, so the environment is apparently suitable. I don't have access to other aarch64 flavors, e.g. ones reporting the literal 'neon' flag, if such exist. The assumption is that the change is applicable to all ARMv8 and up, and that it is properly skipped for ARMv7 as it's not aarch64.
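For the record, the kind of compile-time dispatch I'd expect here (this is my assumption, not necessarily the exact check in the patch) is a guard on __aarch64__, which is only defined for 64-bit ARMv8 and up, so 32-bit ARMv7 builds fall through to the scalar path:

/* Hypothetical guard, shown only to illustrate the v7/v8 split;
   the actual patch may use a different condition */
#if defined(__aarch64__)          /* ARMv8+ in 64-bit mode */
#include <arm_neon.h>
/* ... NEON dot product ... */
#else
/* ... scalar fallback, also taken on 32-bit ARMv7 ... */
#endif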
Neither concurrent tests nor MTR revealed any patch-specific problems.
Neither the old nor the new buildbot shows anything specific to the feature branch either.
I also ran some MTR tests, mostly main.vector, with a high --repeat value on the patch vs. the baseline (release builds), as a poor man's performance check to see whether there is any difference in execution time, but didn't notice anything.
There are two possible implementations of dot_product(), vmull+vmlal_high and vmull+vmull_high (both shown below); there seems to be no performance difference between them.
/* vmull + vmlal_high: widening multiply of the low halves, then
   multiply-accumulate of the high halves, one horizontal add */
d+= vaddlvq_s32(vmlal_high_s16(vmull_s16(vget_low_s16(p1),
                                         vget_low_s16(p2)), p1, p2));

/* vmull + vmull_high: two widening multiplies, two horizontal adds */
d+= vaddlvq_s32(vmull_s16(vget_low_s16(p1), vget_low_s16(p2))) +
    vaddlvq_s32(vmull_high_s16(p1, p2));
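For completeness, here is a minimal self-contained sketch of how such a kernel fits into a full dot product loop; the function name, the int16_t element type, and the assumption that the length is a multiple of 8 are mine and not necessarily what the patch does:

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch only: the real code may use a different name, element type
   and tail handling; len is assumed to be a multiple of 8 */
static int64_t dot_product_i16_neon(const int16_t *v1, const int16_t *v2,
                                    size_t len)
{
  int64_t d= 0;
  for (size_t i= 0; i < len; i+= 8)
  {
    int16x8_t p1= vld1q_s16(v1 + i);
    int16x8_t p2= vld1q_s16(v2 + i);
    /* the vmull + vmlal_high variant from above */
    d+= vaddlvq_s32(vmlal_high_s16(vmull_s16(vget_low_s16(p1),
                                             vget_low_s16(p2)), p1, p2));
  }
  return d;
}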