Hi Guys, we have just switched some of our production databases to 10.4.6 and basically everything is working very slowly because of this "feature" called "Conversion of Big IN Predicates Into Subqueries". It's not like 2x slower but tens-of-times slower, basically everything was close to being dead because of cpu utilisation and locking! The problem was so huge, we had to revert whole 2 weeks of migration, because whole cluster we're using was close to a crash on non-working hours, when load is not even there...
Problem is with WHERE ...IN. We were able to force just some queries into using hash joins by increasing join level setting and it mitigated the issue a little. But with this optimizaition, the server is unusable ahttps://jira.mariadb.org/browse/MDEV-20083#nd unresponsive because queries which took like 2-3 seconds now are taking 5-10 minutes, sometimes "forever" and seems to be using some slow as hell, not-indexed temporary tables on badly optimized subqueries(!!), so in fact something which was perfectly optimized, because of these additional optimizations in 10.4 is turned into nested full table scan. I mean there's probably no worse way of implementing such "optimization" and im worried it even got into production code without an optimizer switch...
Sorry for being so critical about this addition, it's normal every update we had some small performance regressions but overal everything worked much better. It was also faster. Just for some stuff it was sometimes like 20-50% slower and it could be "fixed" with some basic reconfiguration, but we never came across something that bad, which makes queries to run tens of times slower, kills CPU and can't be turned off. On the old maria typical CPU usage was 400-500% and it worked fast, for 10.4 it was over 1200-1500%, with 500-1000 locked connections... and simple queries took minutes to complete! So nothing was being served, CPU was overloaded and queries took forever.
In general subqueries are horrible to work with and very badly optimized, so not sure why even such way of "optimizing" memory usage was considered. Maybe im not getting something but this temporaty table which is created is missing a simple index even for simplest WHERE ...IN query, when using single predicate on simplest WHERE. If that's a good feature at least it could be using join which could at least be optimized properly. Not to mention we cannot disable this feature, we can not even force it to do hash joins. As join_cache_level is only forcing some queries to use hash joins which helps a little, but the queries are still much slower than before as the optimizer, not sure why is forcing level 2 joins for some tables, also when using simple WHERE ... IN!
I think the previous way of using some in-memory optimized hash or lookup tree was the right way because it doesn't have all this overhead of subquery, crash-safe temp table and unoptimized join. And the old way is crash-safe, thread-safe and don't need locking because it's per-query and in-memory. The new way isn't even economically viable, CPU is expensive, memory is cheap. And this thing is killing CPU and utilizing I/O for 5 minutes, just to save couple MBs of RAM which is needed for 1-3 seconds? It'll in fact increase memory usage too, because of time it takes...
So i think this "feature" should be considered to be removed altogether as it makes no sense and it would be a huge issue for non-technical users who can't diagnose the problem. We should at least be able to turn it off in my.cnf, but probably it should never be on by default taking its current characteristics, resource usage and how horribly bad it works. Taking couple minutes for selecting some rows from 150-400k rows table which has 20-50mb of data... is seriously broken. It would take couple seconds to do full table select, send the data by network and filter it on the clientside!
Before 10.4 IN was known optimization for BETWEEN ranges which couldn't use indexes most of the time, and we used it so extensively it's now everywhere, because otherwise queries can't use date-based indexes.
In was also known optimization for complicated multi-level joins where you can use it as a quick filter, because it was much faster than excluding rows even by optimized, properly indexed JOIN. And we use it in many places too.
So this optimization also breaks known basic optimizations.
Thanks for any advice. Hopefully this could be removed from the server or at least be disabled by default... below some screens to show how huge this problem really is. We're seeing over 100 TIMES decreased performance when we compare unused test server to production server with same specs.
Selecting 100k rows out of 300k rows table
35MB table size
Takes 270 seconds 100% CPU core usage
Mariadb 10.4 ... join_cache_level=3, takes around 3.5s BUT it's working only for some tables, for others optimizer is using BNL join without index even with cache_level set to 8...
Takes 2-3s on production server with other traffic