MariaDB Server
MDEV-33188

Enhance mariadb-dump and mariadb-import capabilities similar to MyDumper

Details

    • Type: Epic
    • Status: Stalled
    • Priority: Critical
    • Resolution: Unresolved

    Description

      Enhance mariadb-dump with features similar to MyDumper's

      Backup

      • Increase the parallelism (MDEV-32216). With the --tab option we can back up only one database at a time.
      • With mariadb-dump --tab, the binlog position is not stored in a separate file; it is only printed to stdout.
      • Split big tables into multiple data files, which would allow a faster restore.
        • This should be done once bulk-loading over multiple connections becomes faster.
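As a stopgap, the one-database-at-a-time limitation of --tab could be worked around client-side by running one dump process per database concurrently. A minimal sketch; the helper name is hypothetical and the command line is illustrative, while real parallelism is the subject of MDEV-32216:

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def parallel_tab_dump(databases, outdir, run=subprocess.run, workers=4):
    """Run one `mariadb-dump --tab` process per database, concurrently.

    `run` is injectable so the sketch can be exercised without a server.
    """
    def dump_one(db):
        # Each database gets its own output directory, as with --tab.
        return run(["mariadb-dump", "--tab", f"{outdir}/{db}", db]).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(dump_one, databases))
```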

      Other features that could be added to mariadb-dump that would make it more useful:

      • Compression of the dump (especially with --tab). This also implies that we would implement decompression of the files on the server (MDEV-28395).
      • Dump tables not alphabetically but in inverse size order (biggest tables first). This helps ensure that the user does not have to wait for the big tables to finish at the end of the backup (in most cases). See benchmarks in MDEV-32216.
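Until server-side compression (MDEV-28395) exists, the effect can be approximated by compressing the --tab files after the dump. A client-side sketch only, not the proposed implementation:

```python
import gzip
import shutil

def compress_dump_file(path):
    """Gzip a --tab data file after the server has written it.

    SELECT ... INTO OUTFILE writes plain text, so compression can only
    happen as a post-processing step until MDEV-28395 is implemented.
    """
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```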

      Restore:

      • mariadb-import can restore only one database at a time, and the user needs to restore the table structure and the data files separately. We should enhance mariadb-import with an option to re-create the table structure first and then start the data import.
      • For customers with 100+ databases, mariadb-dump (with --tab) and mariadb-import are not user-friendly. (Not hard to fix.)
      • Have mariadb-import in parallel mode import the tables in inverse size order (biggest tables first). This helps ensure that the user does not have to wait for the big tables to finish at the end of the restore (in most cases).
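The inverse-size ordering proposed for both dump and import is cheap to compute on the client side. A sketch, assuming the per-table data files already exist:

```python
import os

def tables_biggest_first(paths):
    """Order per-table data files so the biggest tables are handled first,
    reducing the chance that one huge table runs alone at the end."""
    return sorted(paths, key=os.path.getsize, reverse=True)
```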

      We need to improve the functionality of mariadb-dump and mariadb-import to allow for the backup and restoration of multiple databases using a single command.

      Description of the intended design

      • Allow multiple databases with a new switch --dir (suggestions for a better name are welcome, as I have no good idea at the moment).
        With this set, mariadb-dump creates a directory tree in the given path: <path>/dbname/ for each database, containing <tablename>.txt (tab-separated data, created using SELECT ... INTO OUTFILE) and <tablename>.sql (DDL) for all tables of the corresponding database.
        The files are almost exactly the same as before with --tab, except for the directory tree.
      • mariadb-import also takes the new --dir parameter, pointing to the directory tree created by the dump; it executes all *.sql files to create the tables (and the databases, if they do not exist), then loads the data from the .txt files using "LOAD DATA INFILE".
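The import side of this design can be sketched as a walk over the dump tree that yields the statements to execute. The layout follows the proposal above, but the function itself is illustrative, not the actual mariadb-import implementation:

```python
import os

def restore_statements(dump_dir):
    """Walk <dump_dir>/<db>/ and return the statements a --dir import would
    run: CREATE DATABASE, then each table's DDL, then LOAD DATA INFILE."""
    statements = []
    for db in sorted(os.listdir(dump_dir)):
        db_dir = os.path.join(dump_dir, db)
        if not os.path.isdir(db_dir):
            continue
        statements.append(f"CREATE DATABASE IF NOT EXISTS `{db}`")
        for name in sorted(os.listdir(db_dir)):
            if name.endswith(".sql"):           # table DDL first
                with open(os.path.join(db_dir, name)) as f:
                    statements.append(f.read())
        for name in sorted(os.listdir(db_dir)):
            if name.endswith(".txt"):           # then the data
                table = name[:-len(".txt")]
                path = os.path.join(db_dir, name)
                statements.append(
                    f"LOAD DATA INFILE '{path}' INTO TABLE `{db}`.`{table}`")
    return statements
```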

      Big tables won't be split into multiple files for now, and I do not think they should be in the future. If required, threading should be handled transparently by LOAD DATA INFILE, i.e. by the server, rather than by loading a single table from two different connections. LOAD DATA is our official bulk-loading interface, and we had better improve it in the server rather than build workarounds.

Attachments

Issue Links

Activity

            serg Sergei Golubchik added a comment - - edited

            MyDumper uses database.table.sql(.gz|.zst) (according to https://github.com/mydumper/mydumper/blob/master/docs/files.rst#table-data). Let's use what users might already know? I mean database.table.txt, not the .sql(.gz|.zst) part.


            wlad Vladislav Vaintroub added a comment -

            serg, could you be more precise about "use what users already know"? Is this about naming, or about 100% compatible output?

            1. Yes, mydumper uses database.table.sql to store the data and database.table-schema.sql for the DDL.
            mariadb-dump --tab uses table.sql to store the DDL and table.txt to store the data, in tab-separated form.

            To me this means we can't possibly make the names 100% compatible, unless we want to change the logic and give a tab-separated-values file the extension .sql.

            2. If the idea is to store files with a "database." prefix, and everything else as proposed, I have no strong objections, although I think a directory tree will be clearer than potentially thousands of files in a flat directory.

            3. If the idea is not just to use the same naming schema but also the very same logic as mydumper/myloader (i.e. generate INSERTs rather than SELECT ... INTO OUTFILE), it is quite a different thing and a large change.
            serg Sergei Golubchik added a comment - - edited

            No, not 100% compatible. I mentioned that I don't think we should use .sql(.gz|.zst) filename suffix, I'd prefer we use .txt.

            Also, mydumper uses database.table-schema.sql(.gz|.zst) for what they call "table schema", I suspect it's a very bad name for table definition and we definitely should not use that. "Schema" is something very different. Your db/table.sql or db.table.sql naming is much better.

            #3 might be useful, I thought about it, we might want it some day, but it's a different task, definitely not part of this one.

            So, above I only meant #2, that is, just the naming. I don't have a strong preference either way, so I checked what other tools do.

            pandi.gurusamy, do you have an opinion? You've requested this feature, after all. If dumping many databases at once, should files be named db/table.txt (that is, one directory per database) or db.table.txt (potentially thousands of files in a flat directory)?


            pandi.gurusamy Pandikrishnan Gurusamy added a comment -

            It's a good idea to have db/table.txt and db/table.sql. While restoring, mariadb-import should automatically restore the tables into the correct databases when there are hundreds of databases.
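The two naming schemes debated above differ only in path layout; a trivial sketch of the choice (the function name is hypothetical):

```python
def dump_path(base, db, table, flat=False):
    """db/table.txt (one directory per database) vs db.table.txt (flat)."""
    if flat:
        return f"{base}/{db}.{table}.txt"
    return f"{base}/{db}/{table}.txt"
```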

            wlad Vladislav Vaintroub added a comment -

            Subtask MDEV-33627 is in progress.
            jreklund Johan Eklund added a comment - - edited

            Thanks for improving the classic export/import utility. Would it be possible to add support for --hex-blob with --tab/--parallel/--dir in these enhancements? It would be great to be able to speed up binary exports/imports as well; right now the option is silently ignored. We mainly use InnoDB, so it would be great if it could be used together with --single-transaction.


            serg Sergei Golubchik added a comment -

            --single-transaction can be emulated with a global read lock, I suppose.

            --hex-blob is only defined for SQL dumps, and --tab uses SELECT ... INTO OUTFILE, so this would need to be implemented in the server. And for LOAD DATA too.
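For context, in SQL-format dumps --hex-blob renders binary values as hexadecimal literals, which is what a SELECT ... INTO OUTFILE path would have to reproduce server-side. A sketch of that encoding (the helper name is hypothetical):

```python
def hex_blob_literal(data):
    """Render binary data as a 0x... hexadecimal SQL literal, as --hex-blob
    does in SQL-format dumps; an empty blob becomes the literal ''."""
    if not data:
        return "''"
    return "0x" + data.hex().upper()
```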

            wlad Vladislav Vaintroub added a comment -

            --single-transaction works out of the box with --parallel.
            jreklund Johan Eklund added a comment -

            Thanks for the updated information. I have no idea how complex it would be to add support for --hex-blob; I knew it wasn't just one line of code!
            I have just looked at performance charts for mydumper and other alternatives, and the difference is quite massive compared to plain old SQL dumps.


            marko Marko Mäkelä added a comment -

            MDEV-34832 points out that AUTO_INCREMENT attributes had better be added after loading the data.
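The MDEV-34832 suggestion could be sketched as a DDL rewrite on the import side: strip the attribute before LOAD DATA, re-add it afterwards. Illustrative only, assuming one auto-increment column and simple single-statement DDL:

```python
import re

def defer_auto_increment(ddl):
    """Split a CREATE TABLE into (DDL without AUTO_INCREMENT, ALTER TABLE
    that re-adds the attribute once the data is loaded). Returns
    (ddl, None) if no auto-increment column is found."""
    m = re.search(
        r"CREATE TABLE `?(\w+)`?.*?`(\w+)`\s+([^,\n]*?)\s+AUTO_INCREMENT",
        ddl, re.S | re.I)
    if not m:
        return ddl, None
    table, column, column_type = m.groups()
    stripped = ddl.replace(" AUTO_INCREMENT", "", 1)
    alter = (f"ALTER TABLE `{table}` MODIFY `{column}` "
             f"{column_type} AUTO_INCREMENT")
    return stripped, alter
```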
            wlad Vladislav Vaintroub added a comment - - edited

            I'm not convinced that AUTO_INCREMENT is better added after loading the data. Why so? When we load the data, we define all fields; we do not calculate the auto-increment value. If there is still overhead even when the auto-increment field is given, it needs to be solved in the storage engines.


            People

              Assignee: wlad Vladislav Vaintroub
              Reporter: pandi.gurusamy Pandikrishnan Gurusamy
              Votes: 0
              Watchers: 13

