Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- 2.5.20, 6.3.1
- None
Description
When the duplicate checks are enabled, the cost of performing them grows rapidly as the number of tables increases. With around 50000 tables and the default duplicate checks, the checks take on average 25 seconds. With ignore_tables_regex=.*, the time drops to around 500 milliseconds, a large part of which is network latency. A minimal service definition of the kind used for such a measurement is sketched below.
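The following is only an illustrative sketch, assuming the service in question is a schemarouter service; the server names and credentials are made up, and only ignore_tables_regex=.* is taken from the description above.

{code}
# Sketch of a service with the duplicate check effectively disabled
# via ignore_tables_regex (server and credential names are invented).
[Sharded-Service]
type=service
router=schemarouter
servers=shard1,shard2
user=maxuser
password=maxpwd
ignore_tables_regex=.*
{code}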
The check is slow because, for each visible table, a table location lookup is done while the result is being iterated. Since the location lookup processes all tables (a somewhat naive approach), the full table list ends up being traversed once per row, which results in roughly quadratic complexity. By first inserting all the elements into the resulting container, the duplicate check can then be done in a single pass over the whole container. This reduces the complexity to linear, which scales far better.
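For illustration only, here is a minimal C++ sketch of the before/after shape of the check. The struct and function names are invented and do not correspond to the actual MaxScale code; the sketch only shows why the per-row lookup is quadratic and why collecting everything first allows a single-pass check.

{code:cpp}
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical representation of one result row: a table name and the
// server it was found on.
struct TableLocation
{
    std::string table;
    std::string server;
};

// Old approach, roughly quadratic: while iterating the result, look up the
// location of each table by scanning the whole list again.
std::unordered_set<std::string> duplicates_per_row_lookup(const std::vector<TableLocation>& rows)
{
    std::unordered_set<std::string> duplicates;

    for (const auto& row : rows)
    {
        for (const auto& other : rows)      // full scan for every row
        {
            if (row.table == other.table && row.server != other.server)
            {
                duplicates.insert(row.table);
            }
        }
    }

    return duplicates;
}

// New approach, linear on average: first insert everything into a container
// keyed by table name, then do the duplicate check in one pass over it.
std::unordered_set<std::string> duplicates_single_pass(const std::vector<TableLocation>& rows)
{
    std::unordered_map<std::string, std::unordered_set<std::string>> servers_by_table;

    for (const auto& row : rows)            // one pass to build the container
    {
        servers_by_table[row.table].insert(row.server);
    }

    std::unordered_set<std::string> duplicates;

    for (const auto& [table, servers] : servers_by_table)   // one pass to check
    {
        if (servers.size() > 1)
        {
            duplicates.insert(table);
        }
    }

    return duplicates;
}
{code}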