[MDEV-218] "SEMI JOIN" support as a "real" JOIN Created: 2012-04-10 Updated: 2014-04-15 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Minor |
| Reporter: | Jaimie Sirovich | Assignee: | Sergei Petrunia |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Description |
|
First, I apologize if I'm not submitting this to the right section. It's not a bug, but I don't see "Feature Request." I'll try to keep this, um, short, not epic. OVERVIEW – Well, first, up until now the best you could do is use an index-subquery, which was a very limited approach for an exact type of query. Now we have options. But before that in the stone age, what would lots of people do? They'd do a LEFT or INNER JOIN and add on a 'hacky' GROUP BY at the end. In fact that's one of your execution approaches for semi-joins. But it's not always the best one. It depends. But ... The nice part about this awful hack is that you get the columns from the table you're 'semi-joining' to. Sometimes you don't care which row the extra columns are from, and you just need to ensure you get 1 row. Also, in the case that the planner knows it can use an INNER JOIN strategy and rewrite the query for you, it would be 100% valid and loss-less. In other cases, you might just know that what you want is the same across the many rows it would be encountering anyway (yes, I know that's denormalized, but real life isn't always normal). Lots of people summarizing data end up with schemas like this. I know this would not pass muster with the SQL normalization overlords, but it would be darn useful and very much in the spirit of MySQL. Thoughts? |
| Comments |
| Comment by Sergei Petrunia [ 2012-05-16 ] |
|
Do I understand it correctly: you're requesting that MariaDB's SQL dialect has support for SEMI JOIN construct? So that it allows queries like SELECT ... FROM table1 SEMI JOIN table2 WHERE .... ? |
| Comment by Jaimie Sirovich [ 2012-05-16 ] |
|
Yes. The difference would be that with "SEMI JOIN," the columns from both table1 and table2 would be materialized. The result would be undefined (in the same way as a GROUP BY is) in certain cases, but sometimes you either: 1. Don't care. Using a subquery will not permit to express this, and using GROUP BY to remove the duplicate rows is only 1 execution strategy for certain SEMI JOINs. But it's the only one, as it stands, that allows for materialization from both sides of the JOIN. Make sense? |
| Comment by Jaimie Sirovich [ 2012-05-18 ] |
|
Sighted here: http://archive.cloudera.com/cdh/3/hive/language_manual/joins.html Possibly because they don't support subqueries. I'm not sure why referencing the right-hand table must be prohibited, either. It can be allowed. Yes, in some cases it's undefined, but only sometimes. "LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient way. Since Hive currently does not support IN/EXISTS subqueries, you can rewrite your queries using LEFT SEMI JOIN. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc. This type of query SELECT a.key, a.value SELECT a.key, a.val |
| Comment by Sergei Petrunia [ 2012-05-18 ] |
|
Interesting... Internally, MariaDB 5.3+ supports semi-joins, one can see it by running EXPLAIN As for references to semi-join-inner tables from other parts of the query, the There is a certain difference from LEFT JOINs. With LEFT JOIN, the WHERE condition With semi-join (in current MariaDB, and in MySQL 5.6) the WHERE clause affects the SELECT * FROM country And, consider a semi-join, which has reference to the subquery table: SELECT ... Current semi-join code will evaluate this with this semantics (SEM1): for each country } One could expect the following (SEM2) : for each country { find some city $C that belongs to that country; if ($C.has_an_airport) return {country, city} pair; } With (SEM1), one can freely move conditions between WHERE and ON expression, Another possible concern is semi-join on the inner side of an outer join. Its
|
| Comment by Sergei Petrunia [ 2012-05-18 ] |
|
The main question is, do you expect (SEM1), or (SEM2) ? |
| Comment by Jaimie Sirovich [ 2012-05-19 ] |
|
I'm pretty sure you want SEM1, so that's good news. SEM2 is only correct when the multiple rows are 100% duplicate with respect to what you care about. I think it's always wrong, then, actually. If you expect SEM2, you'd expect totally random results based on how the optimizer decides to walk over the data, no? SEM1 is consistent. As far as OUTERs . . . I'd vote for only implementing INNER JOIN for now, since it looks like that's mostly an issue of syntax. If it's adopted (i.e. anyone actually cares about it other than we), outer joins could be implemented at a later date. I can see uses for both, actually, but SQL only had inner joins at the beginning of the SQL universe and relational algebra, so it wouldn't be strange at all to have a limit that SEMI joins are INNER JOINs. What I find strange to begin with is that this syntax doesn't exist in pretty much any implementation, and that it must be represented as a subquery just to be then turned back into a (semi) join. Why? The "semi join" is a concept you can find as a bona fide join in academic textbooks. It has its own funny symbol and everything. Using subquery semantics seems pained because in relational algebra you're supposed to think of sets, not loops and conditions, which is what subqueries tend to 'think' about. Usually, when you see subqueries everywhere, it means "bad programmer." In the old days of MySQL, everyone said "use joins stupid!" ... except that resulted in using a join with a 'hacky group by,' which is roughly "DUPLICATE WEED OUT." That's very limiting and kind of the worst plan in most cases, though So I don't see any huge problems if it's only SEMI INNER JOINs. In certain cases you could still apply TABLE PULLOUT (well, it's no longer a pull out) to make it a true INNER JOIN based on schema. MySQL previously implemented INDEX_SUBQUERY as a non-generic semi join for very limited cases (it looks like this is FIRST_MATCH now). However, it still had to be represented as a subquery and suffers from the same painful syntax and lack of ability to select from. I guess materialization is like joining to a non-correlated subquery. Loosescan ... I'm not sure there's a way to express that at all. Either way, you would have the ability to query something (that albeit could be denormalized) and 'see everything.' That's a (very nice) side effect, though. Sometimes you know it's OK, so why not? If you care, then you can set something like sqlmode_only_full_group_by, but only purists would care. Really I think it's odd that everyone uses the pained subquery syntax to express something that IS a bona fide join. Hope this is clear. I don't write optimizers, so . . . I could be totally wrong on some of these things. |
| Comment by Jaimie Sirovich [ 2012-12-03 ] |
|
I've been idling on this. Our of curiousity — and don't get me misconstrued we're a small development firm — what would be the real cost to sponsor something like this? If I had experience with the MySQL source tree I'd think about tinkering with it myself. It doesn't sound that difficult as long as it only works for INNERs. Anything else sounds much harder. |