Details
-
Task
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
None
Description
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in MDEV-23162.
The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str().
The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux().
Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux():
- We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source is ASCII compatible.
- There is no a need to call a virtual method to get a Field's repertoire per every row: field->dtcollation().repertoire. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII.
So for numeric and temporal data instead of calling this generic string method in store_string_aux():
bool needs_conversion(CHARSET_INFO *fromcs, |
my_repertoire_t from_repertoire,
|
CHARSET_INFO *tocs) const |
{
|
// 'tocs' is set 0 when client issues SET character_set_results=NULL |
return tocs && !my_charset_same(fromcs, tocs) && |
fromcs != &my_charset_bin &&
|
tocs != &my_charset_bin &&
|
(from_repertoire != MY_REPERTOIRE_ASCII ||
|
(fromcs->state & MY_CS_NONASCII) ||
|
(tocs->state & MY_CS_NONASCII));
|
}
|
|
the condition in store_numeric_string_aux() will be as simple as this:
{
|
if (tocs && (tocs->state & MY_CS_NONASCII)) |
conversion;
|
else |
no_conversion;
|
}
|
And this generic string call:
return store_string_aux(str.ptr(), str.length(), str.charset(), |
field->dtcollation().repertoire, tocs);
|
for numeric and temporal data will change to:
return store_numeric_string_aux(from, length); |
Notice, we remove a lot of parameters and one virtual call.
Bar proposed additional changes:
As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types
(e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following:
- Rename the method Field::send_binary() to just Field::send().
- Reuse Field::send() for all protocol types, including Protocol_text
This approach makes the code very symmetric:
- For all protocol types
- For Field vs Item, because Items also send themself using a single method Item::send(), which is used for all protocol types.
Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs.
Benchmarking
After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation.
Tools used:
- Server binaries compiled with DebWithRelInfo for four different commits (see below)
- A test table created by this script:
CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP;
DELIMITER $$
BEGIN NOT ATOMIC
FOR i IN 1..1000*1000 DO
INSERT INTO t1 VALUES (DEFAULT);
END FOR;
END;
$$
DELIMITER ;
- A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached.
The result of "time ./num" for various commits:
1. before - the commit before the repertoire-based optimization
|
be98036f25ac8cfb34fa5bb5066975d79f595aec
|
|
1m9.336s
|
1m9.290s
|
1m9.300s
|
2. repertoire - MDEV-23162 - the repertoire-based optimization
|
eb2eaba7fdbd13c9814ab4619cc23d9f140e5485
|
|
1m6.101s
|
1m5.988s
|
1m6.264s
|
3. revert - the commit reverting the repertoire-based optimization
|
f1a9700fec8312bdce3e7a7145389adede1722b2
|
|
1m8.895s
|
1m9.014s
|
1m8.883s
|
4. new - MDEV-23478 - new optimization based on a single Field::send()
|
c55f24cd99f3c6f001c210bc83f1f6b5b106bf83
|
|
1m2.150s
|
1m2.079s
|
1m2.099s
|
The new approach (#4) stably demonstrates better performance over the repertoire based one (#2).
Attachments
Issue Links
- relates to
-
MDEV-23162 Improve Protocol performance for numeric data
-
- Closed
-
Activity
Field | Original Value | New Value |
---|---|---|
Link |
This issue relates to |
Description |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in text protocols, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item: because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in text protocols, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item: because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Description |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in text protocols, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item: because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in text protocols, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item: because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Description |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in text protocols, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item: because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in text protocols, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item: because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Description |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in text protocols, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item: because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item: because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Description |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item: because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item, because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Attachment | INSERT.sql [ 53251 ] | |
Attachment | Makefile [ 53252 ] | |
Attachment | num.cc [ 53253 ] |
issue.field.resolutiondate | 2020-08-14 06:13:50.0 | 2020-08-14 06:13:50.901 |
Fix Version/s | 10.5.6 [ 24508 ] | |
Fix Version/s | 10.5 [ 23123 ] | |
Resolution | Fixed [ 1 ] | |
Status | Open [ 1 ] | Closed [ 6 ] |
Description |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item, because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec in 10.5 1m9.336s 1m9.290s 1m9.300s 2. repertoire - the commit with the repertoire-based optimization eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 in 10.5 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization, applied on top of the current 10.5 (in bb-10.5-bar) 1m8.895s 1m9.014s 1m8.883s 4. new - the commit adding optimization based on a single Field::send() applied on top of #3 (in bb-10.5-bar) 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item, because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec 1m9.336s 1m9.290s 1m9.300s 2. repertoire - eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization f1a9700fec8312bdce3e7a7145389adede1722b2 1m8.895s 1m9.014s 1m8.883s 4. new - c55f24cd99f3c6f001c210bc83f1f6b5b106bf83 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Description |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item, because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {noformat} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec 1m9.336s 1m9.290s 1m9.300s 2. repertoire - eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization f1a9700fec8312bdce3e7a7145389adede1722b2 1m8.895s 1m9.014s 1m8.883s 4. new - c55f24cd99f3c6f001c210bc83f1f6b5b106bf83 1m2.150s 1m2.079s 1m2.099s {noformat} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item, because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {code} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec 1m9.336s 1m9.290s 1m9.300s 2. repertoire - eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization f1a9700fec8312bdce3e7a7145389adede1722b2 1m8.895s 1m9.014s 1m8.883s 4. new - c55f24cd99f3c6f001c210bc83f1f6b5b106bf83 1m2.150s 1m2.079s 1m2.099s {code} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Description |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item, because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {code} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec 1m9.336s 1m9.290s 1m9.300s 2. repertoire - eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 1m6.101s 1m5.988s 1m6.264s 3. revert - the commit reverting the repertoire-based optimization f1a9700fec8312bdce3e7a7145389adede1722b2 1m8.895s 1m9.014s 1m8.883s 4. new - c55f24cd99f3c6f001c210bc83f1f6b5b106bf83 1m2.150s 1m2.079s 1m2.099s {code} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Monty proposed an alternative solution, instead of the repertoire-based improvement implemented in The idea is to make Field send itself to Protocol_text using data type specific Protocol methods (like in Protocol_binary) rather than field->val_str() followed by protocol_text->store_str(). The data type specific methods for numeric and temporal data types in Protocol_text can use a new method Protocol_text::store_numeric_string_aux() instead of the generic store_string_aux(). Additional performance improvement (comparing to the previous implementation with repertoire) should be achieved because the test condition in store_numeric_string_aux() detecting if a character set conversion is needed is much simpler than the one in store_string_aux(): - We need to test only if character_set_results is non-ASCII compatible (e.g. UCS2, UTF16, UTF32), but we don't need to check the source character set for &my_charset_bin or for ASCII compatibility: in case of numeric and temporal data we know that the source *is* ASCII compatible. - There is no a need to call a virtual method to get a Field's repertoire per every row: {{field->dtcollation().repertoire}}. For numeric and temporal data we know that the source data has repertoire MY_REPERTOIRE_ASCII. So for numeric and temporal data instead of calling this generic string method in store_string_aux(): {code:cpp} bool needs_conversion(CHARSET_INFO *fromcs, my_repertoire_t from_repertoire, CHARSET_INFO *tocs) const { // 'tocs' is set 0 when client issues SET character_set_results=NULL return tocs && !my_charset_same(fromcs, tocs) && fromcs != &my_charset_bin && tocs != &my_charset_bin && (from_repertoire != MY_REPERTOIRE_ASCII || (fromcs->state & MY_CS_NONASCII) || (tocs->state & MY_CS_NONASCII)); } {code} the condition in store_numeric_string_aux() will be as simple as this: {code:cpp} { if (tocs && (tocs->state & MY_CS_NONASCII)) conversion; else no_conversion; } {code} And this generic string call: {code:cpp} return store_string_aux(str.ptr(), str.length(), str.charset(), field->dtcollation().repertoire, tocs); {code} for numeric and temporal data will change to: {code:cpp} return store_numeric_string_aux(from, length); {code} Notice, we remove a lot of parameters and one virtual call. Bar proposed additional changes: As after introducing data type specific methods in Protocol_text, Field will send itself in a very similar way to all protocol types (e.g. Protocol_binary, Protocol_text, Protocol_local), let's do the following: - Rename the method Field::send_binary() to just Field::send(). - Reuse Field::send() for all protocol types, including Protocol_text This approach makes the code very symmetric: - For all protocol types - For Field vs Item, because Items also send themself using a single method Item::send(), which is used for *all* protocol types. Such unification will allow to reuse the code between protocols and Field vs Item, which means less potential bugs. h2. Benchmarking After implementing, the new approach demonstrated improvement comparing to the repertoire based implementation. Tools used: - Server binaries compiled with DebWithRelInfo for four different commits (see below) - A test table created by this script: {code:sql} CREATE OR REPLACE TABLE t1 (a BIGINT DEFAULT 100000000000) ENGINE=HEAP; DELIMITER $$ BEGIN NOT ATOMIC FOR i IN 1..1000*1000 DO INSERT INTO t1 VALUES (DEFAULT); END FOR; END; $$ DELIMITER ; {code} - A client program doing "SET NAMES utf8" on connect, then doing (many times) "SELECT * FROM t1". See attached. The result of "time ./num" for various commits: {code} 1. before - the commit before the repertoire-based optimization be98036f25ac8cfb34fa5bb5066975d79f595aec 1m9.336s 1m9.290s 1m9.300s {code} {code} 2. repertoire - eb2eaba7fdbd13c9814ab4619cc23d9f140e5485 1m6.101s 1m5.988s 1m6.264s {code} {code} 3. revert - the commit reverting the repertoire-based optimization f1a9700fec8312bdce3e7a7145389adede1722b2 1m8.895s 1m9.014s 1m8.883s {code} {code} 4. new - c55f24cd99f3c6f001c210bc83f1f6b5b106bf83 1m2.150s 1m2.079s 1m2.099s {code} The new approach (#4) stably demonstrates better performance over the repertoire based one (#2). |
Fix Version/s | 10.5.7 [ 25019 ] |
Fix Version/s | 10.5.6 [ 24508 ] |
Workflow | MariaDB v3 [ 112416 ] | MariaDB v4 [ 134320 ] |
Remote Link | This issue links to "Page (MariaDB Confluence)" [ 36710 ] |
Remote Link | This issue links to "Page (MariaDB Confluence)" [ 36710 ] |