[MDEV-22760] Bulk INSERT...ON DUPLICATE KEY UPDATE updates only a fraction of rows Created: 2020-05-30  Updated: 2023-07-12

Status: Confirmed
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.4.13, 10.4, 10.5
Fix Version/s: 10.4, 10.5

Type: Bug Priority: Major
Reporter: Serge Rogatch Assignee: Sergei Golubchik
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Windows 10 64-bit


Attachments: Text File MariaDBBugInsertUpdate.cpp     Text File QueryLog.txt    
Issue Links:
Problem/Incident
is caused by MDEV-371 Unique indexes for blobs Closed

 Description   

Consider the following table:

CREATE TABLE `demo_upsert` (
    `ID` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `Url` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `LastListID` BIGINT(20) NULL DEFAULT NULL,
    `Statistics` BIGINT(20) NOT NULL DEFAULT '0',
    PRIMARY KEY (`ID`) USING BTREE,
    UNIQUE INDEX `Url` (`Url`) USING HASH
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
;

When running the following program:

#include <cassert>
#include <string>
#include <random>
#include <iostream>
#include "mariadb/mysql.h"
#include "Secret.h" // for database credentials
using namespace std;
 
string randStr(const int len, mt19937_64& rng) {
  string ans;
  for (int i = 0; i < len; i++) {
    ans += ('a' + rng() % 26);
  }
  return ans;
}
 
int main() {
  MYSQL mysql;
  mysql_library_init(0, nullptr, nullptr);
  mysql_thread_init();
  MYSQL* pmsRet = mysql_init(&mysql);
  assert(pmsRet);
  pmsRet = mysql_real_connect(&mysql, ConnProp::_cpHost, ConnProp::_cpUser, ConnProp::_cpPasswd,
    ConnProp::_cpDB, ConnProp::_cpPort, nullptr, CLIENT_COMPRESS | CLIENT_LOCAL_FILES | CLIENT_MULTI_STATEMENTS
    | CLIENT_MULTI_RESULTS);
  assert(pmsRet);
  const int lastListID = 10;
  mt19937_64 rng;
  for (int i = 0; i < 3; i++) {
    string query;
    query = "INSERT INTO demo_upsert (Url, LastListID) VALUES ";
    for (int j = 0; j < 15000; j++) {
      if (j != 0) {
        query += ", ";
      }
      query += "('" + randStr(3, rng) + "', " + to_string(lastListID) + ")";
    }
    query += " ON DUPLICATE KEY UPDATE LastListID=" + to_string(lastListID);
    int iRet = mysql_real_query(&mysql, query.c_str(), query.size());
    assert(iRet == 0);
    int64_t nAffected = mysql_affected_rows(&mysql);
    cout << nAffected << ' ';
    query = "UPDATE demo_upsert SET LastListID=NULL";
    iRet = mysql_real_query(&mysql, query.c_str(), query.size());
    assert(iRet == 0);
    nAffected = mysql_affected_rows(&mysql);
    cout << nAffected << endl;
  }
  return 0;
}

I get for example the following output:
10130 10130
11627 5814
11620 5811

It indicates the following:
Initially, when the table is empty, bulk INSERT...ON DUPLICATE KEY UPDATE seems to insert correctly 10130 distinct values (out of 15000 non-distinct) because the following UPDATE query sets exactly that amount to NULL.
However, the next bulk INSERT...ON DUPLICATE KEY UPDATE sets LastListID to non-null for only 5814 distinct rows out of ~10K expected distinct rows out of 15K non-distinct. We infer this because the following UPDATE reports affected only 5814 rows, meaning that the previous INSERT...ON DUPLICATE KEY UPDATE called UPDATE only for that number of rows.

I cannot reproduce the problem from HeidiSQL 11.0.0.5989. I could only reproduce it when using Connector/C versions 3.1.7 and 3.1.8 (I didn't try the older versions). The problem appears for both a direct and prepared statement.

MariaDB server versions tried: 10.4.13 and 10.4.12.



 Comments   
Comment by Georg Richter [ 2020-05-31 ]

Thanks for your bug report.

Can you please also report the numbers HeidiSQL returns and in which sql mode HeidiSQL is running (select @@sql_mode) ?

Comment by Serge Rogatch [ 2020-05-31 ]

Thank you for starting to look into this quickly. I'm going to do the above experiments right now and will post here afterward.

Comment by Serge Rogatch [ 2020-05-31 ]

When I just dump the queries into a SQL file and then copy-paste the file into HeidiSQL and run the 6 queries, I get the following output:
/* Affected rows: 55,132 Found rows: 0 Warnings: 0 Duration for 6 queries: 1.140 sec. */

The updated program and the script are attached.
MariaDBBugInsertUpdate.cpp QueryLog.txt

Below is the output for select @@sql_mode in HeidiSQL:
STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION

When I run the queries one by one, I get:
INSERT #1: /* Affected rows: 10,130 Found rows: 0 Warnings: 0 Duration for 1 query: 0.594 sec. */
UPDATE #1: /* Affected rows: 10,130 Found rows: 0 Warnings: 0 Duration for 1 query: 0.110 sec. */
INSERT #2: /* Affected rows: 11,627 Found rows: 0 Warnings: 0 Duration for 1 query: 0.235 sec. */
UPDATE #2: /* Affected rows: 5,814 Found rows: 0 Warnings: 0 Duration for 1 query: 0.093 sec. */
INSERT #3: /* Affected rows: 11,620 Found rows: 0 Warnings: 0 Duration for 1 query: 0.188 sec. */
UPDATE #3: /* Affected rows: 5,811 Found rows: 0 Warnings: 0 Duration for 1 query: 0.062 sec. */

So apparently HeidiSQL behaves the same way when the queries are run one by one. Sorry for misaddressing the bug report. Now I tend to think that the error is in the core DB server. Is there a way to migrate the issue to the proper project?

Comment by Georg Richter [ 2020-05-31 ]

changed to server bug (MDEV)

Comment by Elena Stepanova [ 2020-07-08 ]

Thanks for the report and test case.

The culprit here is the long unique index on blob (text) column.
For a workaround, you could use a prefixed unique key instead of the long blob, e.g.

CREATE TABLE `demo_upsert` (
    `ID` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `Url` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `LastListID` BIGINT(20) NULL DEFAULT NULL,
    `Statistics` BIGINT(20) NOT NULL DEFAULT '0',
    PRIMARY KEY (`ID`) USING BTREE,
    UNIQUE INDEX `Url` (`Url`(20))
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
;

(or whatever reasonable length suits your actual data).

The actual problem here appears to be not as much that existing values don't get updated, but mainly that new values don't get inserted. It is not easy to infer from the behavior of UPDATE.

Let's take the 1st and 2nd INSERT/UPDATE from your SQL an insert them into separate identical tables. demo_upsert1 below contains results of the 1st INSERT, and demo_upsert2 – results of the 2nd INSERT.

MariaDB [test]> select count(*) from demo_upsert1;
+----------+
| count(*) |
+----------+
|    10130 |
+----------+
1 row in set (0.032 sec)
 
MariaDB [test]> select count(*) from demo_upsert2;
+----------+
| count(*) |
+----------+
|    10014 |
+----------+
1 row in set (0.032 sec)
 
MariaDB [test]> select count(*) from (select Url from demo_upsert1 intersect select Url from demo_upsert2) sq;
+----------+
| count(*) |
+----------+
|     5817 |
+----------+
1 row in set (1.157 sec)
 
MariaDB [test]> select count(*) from (select Url from demo_upsert2 except select Url from demo_upsert1) sq;
+----------+
| count(*) |
+----------+
|     4197 |
+----------+
1 row in set (1.119 sec)

So, the 1st INSERT is supposed to insert 10130 rows, which it does.
Then the 1st UPDATE is supposed to update all 10130 rows from 10 to NULL, which it also does.
The 2nd INSERT is supposed to update 5817 duplicate rows, and insert 4197 ones.

For new 4197 rows the value of LastListID would be inserted as 10, for 5817 duplicates it will be updated from NULL to 10, but there would also be 10130-5817=4313 rows left intact with LastListID NULL. Thus, just from the behavior of UPDATE you can't say whether it updates a fraction of values because the previous upsert didn't update existing ones or didn't insert new ones.

Here is a simplified test case where we can see the actual values. The structures and values are taken from the original one, I just reduced the number of rows.

--source include/have_innodb.inc
 
DROP TABLE IF EXISTS demo_upsert;
 
CREATE TABLE `demo_upsert` (
    `ID` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `Url` TEXT,
    `LastListID` BIGINT(20) NULL DEFAULT NULL,
    PRIMARY KEY (`ID`) USING BTREE,
    UNIQUE INDEX `Url` (`Url`)
) ENGINE=InnoDB;
 
INSERT INTO demo_upsert (Url) VALUES ('iaa'),('qlf');
 
SELECT * FROM demo_upsert;
 
INSERT INTO demo_upsert (Url) VALUES
  ('knx'),
  ('qlf'), # This is a duplicate value
  ('bdu')
ON DUPLICATE KEY UPDATE LastListID=10;
 
SELECT * FROM demo_upsert;
 
# Cleanup
DROP TABLE demo_upsert;

The first SELECT returns the expected two rows which were just inserted:

MariaDB [test]> SELECT * FROM demo_upsert;
+----+------+------------+
| ID | Url  | LastListID |
+----+------+------------+
|  1 | iaa  |       NULL |
|  2 | qlf  |       NULL |
+----+------+------------+
2 rows in set (0.000 sec)

The second INSERT is supposed to update one row (qlf) and insert two new ones, knx and bdu.

MariaDB [test]> INSERT INTO demo_upsert (Url) VALUES
    ->   ('knx'),
    ->   ('qlf'), # This is a duplicate value
    ->   ('bdu')
    -> ON DUPLICATE KEY UPDATE LastListID=10;
Query OK, 5 rows affected (0.048 sec)
Records: 3  Duplicates: 2  Warnings: 0

Stats are already wrong. There is only 1 duplicate, not 2, and the total number of affected rows should be 4 (because upserted rows are counted twice), but not 5.
And the following SELECT shows that the contents is wrong as well:

MariaDB [test]> SELECT * FROM demo_upsert;
+----+------+------------+
| ID | Url  | LastListID |
+----+------+------------+
|  1 | iaa  |       NULL |
|  2 | qlf  |         10 |
|  3 | knx  |         10 |
+----+------+------------+
3 rows in set (0.000 sec)

The duplicate row did get updated, but out of 2 new ones only one got inserted.

Comment by Serge Rogatch [ 2020-07-08 ]

Thank you for looking into our issue.

Comment by steven chang [ 2020-10-15 ]

Guys,

To my understanding, INSERT...ON DUPLICATE KEY UPDATE can only be applied tables with only one unique key (either pk or uk) . Your case have 2 .

Generated at Thu Feb 08 09:17:15 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.