[MCOL-4923] Benchmark factoring out the constant string in dictionary filtering. Created: 2021-11-15  Updated: 2021-11-30  Resolved: 2021-11-30

Status: Closed
Project: MariaDB ColumnStore
Component/s: MariaDB Server
Affects Version/s: None
Fix Version/s: N/A

Type: New Feature Priority: Major
Reporter: Sergey Zefirov Assignee: Sergey Zefirov
Resolution: Done Votes: 0
Labels: None

Sprint: 2021-15

 Description   

All but two predicates for filtering operations over dictionaries' rows are implemented using the strnncollsp function from the MariaDB server code. Here's a link to an include file which implements such a function (it is included in several *.c files and is more or less duplicated).

Note that we initialize two scanners and call scanner_next twice in the loop. One such scanner_next function is implemented above in the same file, and it also sports a loop and fairly heavy logic both inside and outside of that loop.

Usually, these scanning operations continue until a difference is found or the end of both strings is reached.

But, for one contrived example, the TPCH benchmark suite sports a customer table whose c_name column contains "names" such as 'customer#000000010', i.e., one can expect the first 10-11 characters to be identical across rows, so a comparison has to scan through them. Also, the c_name column has a case-insensitive collation, which is lightweight yet still requires some processing. And, last but not least, TPCH queries appear to use a lot of filtering over concrete customer names.

It appears worthwhile to perform the scanning only once for a string that is constant within the filtering operation.

Right now there is no such functionality in the server. Alexander Barkov said that it may be beneficial for the server too.

The goal of this task is to implement that functionality for some collations (not all: two or three, a simple one like latin1_swedish_ci and a complex one like utf8_ci) and perform benchmarks. I think it is possible to just copy some code out of the server's strings directory to demonstrate what can be achieved.



 Comments   
Comment by Sergey Zefirov [ 2021-11-22 ]

Okay, first results:

------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
WideUTF8CollationProcessingBenchmarkNotFactored    4475399 ns      4492188 ns          160

It is on my notebook, which is not the fastest machine out there: 2.5 GHz or so, i7, in a WSL (Windows Subsystem for Linux) VM.

I compare two strings of the same length and same content, 452 bytes of mostly 3-byte wide utf-8 code points.

One such comparison takes about 28 microseconds.

Comment by Sergey Zefirov [ 2021-11-22 ]

Regarding these results:

------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
WideUTF8CollationProcessingBenchmarkNotFactored    4475399 ns      4492188 ns          160

The 4.48 milliseconds reported above is the average time for a single iteration. Each iteration performs 1024 comparisons, all of which result in equality. So a single comparison takes roughly a thousandth of that: about 4.4 microseconds for 452 bytes, or roughly 150 multibyte characters.

Comment by Sergey Zefirov [ 2021-11-23 ]

I've added the "-O3 -DNDEBUG" options to better optimize the program and got these results:

2021-11-23 15:50:14
Running ./cfmb
Run on (16 X 2304 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
WideUTF8CollationProcessingBenchmarkNotFactored     828603 ns       837054 ns          896
WideUTF8CollationProcessingBenchmarkFactored        484965 ns       474330 ns         1120

Thus, the factored-out weight computation saves about 44% of the run time in this best case.

Comment by Sergey Zefirov [ 2021-11-23 ]

The code lives here for a while.

Interesting bits are in main.cpp.

Here are two benchmarks:

static void WideUTF8CollationProcessingBenchmarkNotFactored(benchmark::State& state)
{
    int neqs = 0;
    SetupWideChars();
    CHARSET_INFO* cs = &my_charset_utf8mb3_general_ci;
    for (auto _ : state)
    {
        size_t len = sizeof(latin_chars) - 1;
        for (int i = 0; i < NUM_SAMPLES; i++)
        {
            if (cs->coll->strnncollsp(cs, pretend_block[i], len, latin_chars, len) != 0)
            {
                neqs++;
            }
        }
    }
    if (neqs)
    {
        printf("there are %d inequalities found, something wrong!\n", neqs);
    }
}

static void WideUTF8CollationProcessingBenchmarkFactored(benchmark::State& state)
{
    int neqs = 0;
    SetupWideChars();
    CHARSET_INFO* cs = &my_charset_utf8mb3_general_ci;
    for (auto _ : state)
    {
        size_t len = sizeof(latin_chars) - 1;
        size_t buf_len_bytes = cs->coll->strnncollsp_right_buf_size(cs, latin_chars, len);
        void* right_buf = malloc(buf_len_bytes);
        cs->coll->strnncollsp_right_precompute(cs, right_buf, latin_chars, len);
        for (int i = 0; i < NUM_SAMPLES; i++)
        {
            if (cs->coll->strnncollsp_right(cs, pretend_block[i], len, right_buf) != 0)
            {
                neqs++;
            }
        }
        free(right_buf);
    }
    if (neqs)
    {
        printf("there are %d inequalities found, something wrong!\n", neqs);
    }
}

We obtain the size of the buffer needed to store the weights, allocate it, and then precompute the weights into it. Then, for every string in the "batch", we perform the strnncollsp comparison, but only one string gets scanned.

Comment by Sergey Zefirov [ 2021-11-23 ]

The proposed addition in server/include/m_ctype.h:

struct my_collation_handler_st
{
  my_bool (*init)(struct charset_info_st *, MY_CHARSET_LOADER *);
  /* Collation routines */
  int     (*strnncoll)(CHARSET_INFO *,
                       const uchar *, size_t, const uchar *, size_t, my_bool);
  int     (*strnncollsp)(CHARSET_INFO *,
                         const uchar *, size_t, const uchar *, size_t);
  // This function computes the buffer size needed to preprocess the weights of
  // the "right" argument of strnncollsp. It returns the number of bytes to
  // allocate, which covers all weights including the terminating zero. The
  // weight size can vary: latin1-based collations can use a single unsigned
  // char, while utf-8 collations need 32-bit integers.
  size_t     (*strnncollsp_right_buf_size)(CHARSET_INFO*, const uchar*, size_t);
  // This function preprocesses the right argument of strnncollsp so that only
  // a weight fetch is needed afterwards. It uses an externally allocated
  // buffer of the size indicated by strnncollsp_right_buf_size.
  void       (*strnncollsp_right_precompute)(CHARSET_INFO*, void*, const uchar*, size_t);
  // And, finally, the strnncollsp variant whose right argument has been
  // preprocessed. As the right string buffer is zero-terminated, we do not
  // pass its size.
  int        (*strnncollsp_right)(CHARSET_INFO*, const uchar*, size_t, void*);

The first lines are for context. The last three "methods" are what would be added for right-string factoring.

I guess some other things need to be done in other parts of the server code: since we can only factor the right string, we need to transform operations like "constant_string OP field_value" into "field_value OP' constant_string", where OP' can differ from OP for operations like ">=" or "<".

Comment by Sergey Zefirov [ 2021-11-30 ]

We have proven that factoring out the processing of the query-constant string can be beneficial. In the larger scheme of things the speed-up will not be as dramatic, but it may be noticeable nonetheless.

We also proposed the charset-interface change needed for this optimization to work. That change lives in server code and warrants a separate task.

Thus, this task can be closed, as there's nothing more to do here.

Generated at Thu Feb 08 02:54:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.