[MCOL-4760] Unify hash implementation Created: 2021-06-15  Updated: 2022-10-05  Resolved: 2022-07-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr, PrimProc
Affects Version/s: 6.1.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Roman Assignee: Roman
Resolution: Won't Do Votes: 0
Labels: performance

Issue Links:
Relates
relates to MCOL-4759 DISTINCT and UNION performance degrad... Closed
relates to MCOL-4243 Consolidate hash algorithms into one ... Open

 Description   

Plz take a look at the first comment

There are multiple implementation of the same functionality that must be shared: RowStorage::hashData(), Row::hash(), Row::colUpdateMariaDBHasher(), colUpdateMariaDBHasherTypeless().
They all contains the same or almost the same code that must be merged so that changes in the common hashing code are shared.

The benchmark tests from xxHash project demonstrates that MDB hash_bin_sort is slow in terms of hash calculation speed for arguments of different lenghts > 1 byte and its hash lacks avalance properties. So hash_bin_sort looks like a function mapping.

benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27) 
MDBHSB ,   383,   381,   381,   380,   380,   381,   382,   382,   382,   382,   382,   382,   382,   382,   382,   381,   382,   382,   381
Throughput small inputs of fixed size (from 1 to 127 bytes): 
MDBHSB , 295024704, 221975236, 176417305, 145091235, 117256116,  96673542,  83958535,  73986830,  65592337,  57141176,  49911601,  44737604,  40951247,  37909058,  35393027,  32953202,  30992028,  28786576,  26987866,  25276406,  23892856,  22451620,  21232112,  20026773,  19072129,  18164413,  17314950,  16607676,  15879450,  15254507,  14668375,  14066414,  12932428,  13084167,  12695770,  12265419,  11952779,  10862721,  11256207,  10222532,  10624014,   9783401,  10056642,   9777877,   9571974,   9299894,   9110920,   8895325,   8690382,   8488004,   8308099,   8134366,   7926744,   7791090,   7626070,   7479251,   7317875,   7205873,   7065789,   6954038,   6784444,   6705094,   6595014,   6467644,   6357642,   6260518,   6152652,   6035240,   5948864,   5846997,   5733851,   5697956,   5628485,   5406450,   5353351,   5264790,   5206617,   5114699,   5053051,   4993530,   4935846,   4862844,   4801407,   4746994,   4690869,   4632266,   4572240,   4594217,   4539712,   4496971,   4434655,   4365478,   4309741,   4288916,   4134032,   4068177,   4022139,   3974391,   3963883,   3904032,   3869541,   3825133,   3787679,   3763346,   3723585,   3681054,   3661020,   3620390,   3578960,   3545808,   3509987,   3487164,   3449164,   3407913,   3387210,   3364107,   3319239,   3309640,   3274259,   3250296,   3214470,   3182862,   3163310,   3122483,   3119702,   3088232,   3066980

MDB's hash has a peculiar collision bench test output. All other hash functions that have avalance produce a way more collisions.

root@drrtuy-u20-3:/data/xxHash/tests/collisions# ./collisionsTest --nbh=100000000 MDBMHSB
 *** Collision tester for 64+ bit hashes ***  
 
Testing MDBMHSB algorithm (32-bit) 
This program will allocate a lot of memory,
generate 100000000 32-bit hashes from samples of 256 bytes, 
and attempt to produce 1164153 collisions. 
 
 Storing hash candidates (762 MB) 
 Stored 100000000 hash candidates in 1 mn 8s  
 Sorting candidates...  Completed in 14s  
 Looking for duplicates:  
Worst collision ratio at 0 high bits: x0.00 
 Completed in 1s  
 
 
*===>   Found 310 collisions (x0.00, 1164153.2 expected) in 1 mn 23s *
===>   MaxRSS(self) 0MB, MaxRSS(children) 0MB

The current MM3 works good for arguments less then 4 bytes. However it becomes slower comparing with xxh3 algo when the argument size rizes, e.g. 20% for 8 byte ints.

benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
untMM3 ,  2155,  2146,  2147,  2143,  2143,  2139,  2139,  2137,  2144,  2134,  2136,  2133,  2127,  2124,  2115,  2093,  2092,  2092,  2090
Throughput small inputs of fixed size (from 1 to 127 bytes): 
untMM3 , 204284868, 189941763, 187683043, 182659021, 161042711, 156535892, 150589280, 153351099, 137775956, 134420450, 130829023, 127511769, 116809979, 115851507, 112099070, 108798420,  98950005,  98777625,  95562293,  93433347,  86244149,  86587886,  84061656,  81636141,  77233807,  76733327,  76669764,  73881067,  69981605,  70070235,  69198969,  67117863,  64342817,  64408593,  62881163,  62322846,  59897757,  58350184,  56656276,  55414579,  53951981,  53166370,  51384075,  50595045,  46912379,  46872338,  45945803,  45131230,  42949705,  42113625,  41784170,  42067208,  40655509,  40212051,  38945872,  39705839,  38105405,  38196277,  36771063,  36819016,  35011767,  35868562,  36412883,  34272545,  33222025,  34771724,  34663263,  33860232,  31834301,  32748014,  32669688,  31576345,  30503543,  31189141,  31686580,  31044214,  29228665,  29938116,  30231256,  29125699,  28406851,  29013015,  29016415,  28254045,  27827256,  27593192,  28016709,  27246513,  26182600,  26799719,  26694069,  26033691,  25957675,  26168739,  25888978,  25351701,  25214387,  24911826,  25102338,  24546225,  24310765,  24368056,  24254358,  23994280,  23522953,  23664822,  23493898,  23232668,  23026452,  22849098,  22873393,  22430808,  22128842,  22069674,  21935954,  21855169,  21434595,  21546590,  21130468,  21189418,  20914317,  20681515,  20541955,  20506514,  19925455,  19937362,  19752999

Here is the stats for the proposed xxh3.

benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27) 
xxh3   , 12195, 13933, 14246, 14637, 14661, 14847, 14845, 14882, 14879, 14870, 14995, 14867, 14763, 14718, 13570, 11421,  7505,  7115,  6951
Throughput small inputs of fixed size (from 1 to 127 bytes): 
xxh3   , 166328851, 166047117, 164750104, 222359743, 220859434, 222076138, 220839877, 222124593, 222410937, 222153846, 222561873, 222417751, 221551744, 222360951, 221106940, 222137297, 148433684, 148402082, 147931004, 147603247, 148186608, 148506960, 147230193, 147981851, 148233496, 148699782, 147402912, 148129418u, 148103169, 146850057, 148423203, 148411703, 102949607, 102925862, 102418241, 102952443, 102703214, 102907196, 102680279, 102889139, 102771179, 102832211, 102810108, 102866670, 102589097, 102200641, 103008151, 102449118, 102728269, 102587431, 102773045, 102686431, 102727251 , 102723369, 102729421, 102691531, 102830333, 102712080, 102590246, 102598743, 102992448, 102928081, 102657147, 102861148,  88892164,  88829285,  89086126,  89052115,  88852884,  89043314,  88769220,  89014418,  88356691,  89123644,  88574436,  89034979,  88701565,  89158585,  89123412,  89034997,  89076156,  88761990,  89213032,  89087844,  89147630,  89110953,  89248370,  87508480,  88759711,  88953844,  88634257,  88771926,  88888889,  89030197,  89178675,  89160849,  70232638,  70094447,  70366298,  70294825,  70395110,  70282285,  70282717,  70225001,  70365661,  70368055,  70293172,  69595835,  69942559,  70188431,  70137656,  70233269,  70026864,  70295104,  70063321,  70262560,  69990198,  70338534,  69965830,  70379123,  70409426,  70261515,  70024664,  70353988,  70142783,  70294010,  70315257

The modified code for the tests is available here. I also included the simple test that demonstrates the overal latency calculating hashes with our current MurMur3, MDB's hash_bin_sort, xxh3. The test code calls for xxHash static or dynamic library.

#include <iostream>
#include <bits/stdc++.h>
#include <chrono>
 
using int128_t = __int128;
 
#include "hasher.h"
#include "/data/xxHash/xxhash.h"
 
static constexpr size_t ITERS = 100000000LL;
 
using namespace std;
 
int main(int argc, char** argv)
{
    volatile uint32_t seed = 0;
    {
        utils::Hasher_r h;
        auto start = chrono::high_resolution_clock::now(); 
        for (size_t i = 0; i < ITERS; ++i)
        {
            seed = h((const char*)&i, sizeof(size_t), seed);
            seed = h.finalize(seed, 32);
        }
        auto end = chrono::high_resolution_clock::now();
        double time_taken = chrono::duration_cast<chrono::nanoseconds>(end-start).count();
        cout << "Existing MurMur3 Loop time is : " << fixed << time_taken * 1e-9 << setprecision(9) << " sec" << endl;
    }
 
    {
        auto start = chrono::high_resolution_clock::now(); 
        for (size_t i = 0; i < ITERS; ++i)
        {
            seed = XXH3_64bits_withSeed((const void*)&i, sizeof(size_t), seed);
            //seed = XXH64((const void*)&i, sizeof(size_t), seed);
        }
        auto end = chrono::high_resolution_clock::now();
        double time_taken = chrono::duration_cast<chrono::nanoseconds>(end-start).count();
        cout << "XXH3_64 Loop time is : " << fixed << time_taken * 1e-9 << setprecision(9) << " sec" << endl;
    }
 
    {
        auto start = chrono::high_resolution_clock::now(); 
        volatile size_t nr1 = 1;
        volatile size_t nr2 = 4;
        for (size_t i = 0; i < ITERS; ++i)
        {
            nr1 = utils::my_hash_sort_bin((const unsigned char*)&i, sizeof(size_t), nr1, nr2);
        }
        auto end = chrono::high_resolution_clock::now();
        double time_taken = chrono::duration_cast<chrono::nanoseconds>(end-start).count();
        cout << "my_hash_sort_bin Loop time is : " << fixed << time_taken * 1e-9 << setprecision(9) << " sec" << endl;
    }
 
 
}



 Comments   
Comment by Roman [ 2021-07-01 ]

According with the additional tests I run our MM3 implementation is preferable to xxh3 now. I used this code to compare the timings of a streaming hash calculation for 4 8bytes columns. Here are the timings:

root@drrtuy-u20-3:/data/hashes# LD_PRELOAD=/data/xxHash/libxxhash.so.0 /tmp/h
Existing MurMur3 Loop time is : 2.477903 sec
XXH3_64 Loop time is : 4.677573284 sec
my_hash_sort_bin Loop time is : 11.130369354 sec

The binary was compiled like this:

c++ -Wall --std=c++11 -O3 -L /data/xxHash/  ./hashspeedcomparator.cpp -l xxhash -o /tmp/h

Here is the updated code:

#include <iostream>
#include <bits/stdc++.h>
#include <chrono>
 
using int128_t = __int128;
 
#include "hasher.h"
#include "/data/xxHash/xxhash.h"
 
static constexpr size_t ITERS = 100000000LL;
 
using namespace std;
 
int main(int argc, char** argv)
{
    volatile uint32_t seed = 0;
    {
        utils::Hasher_r h;
        auto start = chrono::high_resolution_clock::now(); 
        for (size_t i = 0; i < ITERS; ++i)
        {
            for (size_t it = 0; it < 4; ++it)
            {
                int64_t sum = i + it;
                seed = h((const char*)&sum, sizeof(size_t), seed);
            }
            seed = h.finalize(seed, 32);
        }
        auto end = chrono::high_resolution_clock::now();
        double time_taken = chrono::duration_cast<chrono::nanoseconds>(end-start).count();
        cout << "Existing MurMur3 Loop time is : " << fixed << time_taken * 1e-9 << setprecision(9) << " sec" << endl;
    }
 
    {
        auto start = chrono::high_resolution_clock::now();
        auto xxh3State = XXH3_createState();
        for (size_t i = 0; i < ITERS; ++i)
        {
            for (size_t it = 0; it < 4; ++it)
            {
                int64_t sum = i + it;
                //seed = XXH3_64bits_withSeed((const void*)&sum, sizeof(size_t), seed);
                XXH3_64bits_update(xxh3State, (const void*)&sum, sizeof(size_t));
            }
            seed = XXH3_64bits_digest(xxh3State);
            XXH3_64bits_reset_withSeed(xxh3State, seed);
        }
        auto end = chrono::high_resolution_clock::now();
        double time_taken = chrono::duration_cast<chrono::nanoseconds>(end-start).count();
        cout << "XXH3_64 Loop time is : " << fixed << time_taken * 1e-9 << setprecision(9) << " sec" << endl;
    }
 
    {
        auto start = chrono::high_resolution_clock::now(); 
        volatile size_t nr1 = 1;
        volatile size_t nr2 = 4;
        for (size_t i = 0; i < ITERS; ++i)
        {
            for (size_t it = 0; it < 4; ++it)
            {
                int64_t sum = i + it;
                nr1 = utils::my_hash_sort_bin((const unsigned char*)&sum, sizeof(size_t), nr1, nr2);
            }
        }
        auto end = chrono::high_resolution_clock::now();
        double time_taken = chrono::duration_cast<chrono::nanoseconds>(end-start).count();
        cout << "my_hash_sort_bin Loop time is : " << fixed << time_taken * 1e-9 << setprecision(9) << " sec" << endl;
    }
 
 
}

Generated at Thu Feb 08 02:52:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.