[MDEV-29959] UUID Sorting Created: 2022-11-06 Updated: 2023-12-19 Resolved: 2023-07-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server |
| Affects Version/s: | 11.0.0, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0 |
| Fix Version/s: | 10.9.8, 10.10.6, 10.11.5, 11.0.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Rick Strafy | Assignee: | Sergei Golubchik |
| Resolution: | Fixed | Votes: | 4 |
| Labels: | uuid | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Description |
|
Ricky's original entry: Hi, is it possible to revert rearranging bits when inserting new uuid? Monty; It seams that the main confusion is that when doing an ORDER BY I have addded a little bit of background to allow us to better MariaDB has a few different uuid's (all unique in the server and uuid()
sys_guid()
uuid_short()
UUID data type in 10.7.0
Other things
For storage engine performance there are two things that one would 1) When doing bulk insert, having UUID in order makes inserts of the In other words, it depends on the usage of UUID keys how they should When using UUID type or uuid_short() one gets benefit 1) After reading the comment on this Jira entry, the conclusion with current
|
| Comments |
| Comment by Sergei Golubchik [ 2022-11-06 ] | |||||||||||||||||
|
At of now MariaDB can only generate UUID v1. For externally generated UUIDs, rfc4122 specifies only UUID versions up to v5 (where v3 to v5 are unordered, so bit rearranging doesn't hurt them). v6 to v8 are specified in a draft that was created on 23 June 2022 and that expires on 25 December 2022. You're right that if there will be new standard UUID versions where bit rearranging decreases locality then MariaDB should be changed to look into the version field and rearrange bits conditionally. | |||||||||||||||||
| Comment by Rick Strafy [ 2022-11-06 ] | |||||||||||||||||
|
Hm, but lot of the main uuid libraries (I checked php, ruby, python, rust, go) have already implemented at least v6 and v7 uuid, even if they are still a draft, it's clear that they will be time based and time must be within the first part to make it sortable. People are starting to experiment with that new versions, and it was really surprising how sorting works, so I'm switching back to binary(16) or postgres, I think the best time for that change is now when there is barely no one using that new datetype, since last LTS is 10.6, because it may be a BC break, I suggest not to do such magic at all, since there are people who may use ULIDs with UUID string representation, and that place where should be version specification is a random number 0-f, so if there is condition only for v1 uuid, every ~16. generated ulid will be inserted randomly and not at the end of the table. | |||||||||||||||||
| Comment by Rick Strafy [ 2022-11-06 ] | |||||||||||||||||
|
And people are switching from BINARY(16) or VARCHAR(36) to UUID, and when they figure out that sorting is totally different, they need to stay on binary, or they need to rearrange bits, and if they do that and you will revert the change, well. | |||||||||||||||||
| Comment by Jan [ 2022-11-08 ] | |||||||||||||||||
|
I think taht good enough solution would by types UUID(1) and UUID(0). One is reanranged, second one is not. I dont understand how someone could think, that forced rearranging is good idea. | |||||||||||||||||
| Comment by Rick Strafy [ 2022-11-11 ] | |||||||||||||||||
|
Will it be at least opt-out? | |||||||||||||||||
| Comment by Sergei Golubchik [ 2022-11-18 ] | |||||||||||||||||
|
I think the server should just unconditionally bit-swap UUID versions 5 and below, and not bit-swap versions 6 and above. There's hardly a reason to make this optional with opt-in or opt-out. Technically, there is no reason to bit-swap versions 3 to 5, but it's likely too late, we might have users who store those UUIDs in a table. And v6+ are new enough. Also, perhaps, CHECK TABLE FOR UPGRADE will be able to detect it. | |||||||||||||||||
| Comment by Jan [ 2022-11-18 ] | |||||||||||||||||
|
Mako no sense to me. Wouldn't be better just let current stored UUID let as is and when you resave than server choose swap or not? Same problem is for v6+, there may be late too and i'm not sure if it's valid argument. | |||||||||||||||||
| Comment by Rick Strafy [ 2022-11-18 ] | |||||||||||||||||
|
I won't argue with that, because for now I will be thankful for any fix of versions above 5, because I stayed on uuid since I believe this will be resolved, and I'm getting full memory consumption when selecting and migrating data, it was doing ok with binary and ULID. Is there any eta for fixing this at least for v6 and v7? Because I want to release new version of my app at the end of the month, and I want to have optimal db performance, currently I'm using MariaDB 10.10. The only problem I see is that if someone will use uuid as his custom integer representation, or ULID (https://github.com/ulid/spec), migration from binary to uuid will be painful when they realize that "v1" uuid is saved differently and the sorting will skip every X. uuid. It is a good idea to rearrange bits in v1 so it can be sortable, but I think it should be optional, but I'm happy if it will be fixed at least for versions > 5. Btw, will 10.11 be LTS? | |||||||||||||||||
| Comment by Rick Strafy [ 2022-11-28 ] | |||||||||||||||||
|
Hi, any progress on this? | |||||||||||||||||
| Comment by Alexander Barkov [ 2022-12-01 ] | |||||||||||||||||
|
The UUID data type was designed to store the result of the function UUID(), which returns UUID version 1 variant 1, which is timestamp based UUID with timestamp components stored in big-endian order. Therefore ordering is optimized for this particular combination of version and variant to achieve best index performance: The earlier in time UUID() was generated - the earlier it's sorted. This scripts demonstrates ordering:
The underlying code could be extended to analyse UUID value fragments responsible for version/variant and apply different byte shuffling (or no shuffling at all) for different version/variant combinations. However, there is no a need to switch back to BINARY(16) to sovle the ordering problem.
| |||||||||||||||||
| Comment by Rick Strafy [ 2022-12-01 ] | |||||||||||||||||
|
I understand, but I still think this magical byte shuffling should be optional, like UUID(something), because it's very unexpected and I think no other DB system supporting UUID will do that, so people can easily store any 128 bit value as uuid (ULIDs for example). UUID types are designed as db-independent, I believe there are more users that are generating UUID somewhere else, and inserting it into DB, this is very important advantage od that type, it could be generated anywhere and not rely on auto increment in db, so created "entity" is valid even before insertion to database and you know its id. I tried using CAST(id AS BINARY), but it's 10 000x slower than normal sort in table with 6 000 000 records, 3.1481s vs 0.0003s, so I'm using datetime column for now, It's not so accurate but so far I can live with that. I also tried migrating to another DB platform that sorts uuid like binary, but there were much slower simple selects, so it wasn't worth it. So I will wait for a solution (I'm using UUID v7 now, so even version checking will help me), I see more people are starting switching from binary to uuid and have the same problem - https://dba.stackexchange.com/questions/315494/mariadb-uuid-datatype-not-orders-correctly | |||||||||||||||||
| Comment by Sergei Golubchik [ 2022-12-02 ] | |||||||||||||||||
|
It looks like we can implement a solution that automatically swaps bits in UUID versions 5 and below, and does not swap in versions 6 and above. But as far as I can see it will make it totally impossible to store UUID v8 or above with variant 0. It doesn't look like a combination people would use though, variant 0 is from 1988 and was obsoleted long before version 8 was introduced, so may be it's a limitation we can live with. Alternatively, it could be a different type where the version is specified explicitly (for example, UUID(7) or UUIDv7) which enforces the correct version, but this looks less usable to me | |||||||||||||||||
| Comment by Anton [ 2022-12-18 ] | |||||||||||||||||
|
I can confirm that this is a big problem if I want to use modern uuid v6 or v7. Earlier such behavior was ok, but nowadays since we have modern lexicographic sortable uuids it's absolutely needs to be at least optional behavior. | |||||||||||||||||
| Comment by Trevor Gross [ 2022-12-21 ] | |||||||||||||||||
|
What would a solution that automatically swaps <=v5 look like? As in, some rows would be reordered and others not (this sounds scary) or the `uuid()` function produces reordered v1 (i.e., v6) UUIDs? I personally am +1 on a different datatype rather than any sort of automatic reordering, something like `UUID(1)` or `UUID_FIXED`. Or renaming the current type to `UUID_REORDER_V1` and re-adding a `UUID` type that doesn't reorder, if that's possible in some way. The reordering thing was awesome when it came out and definitely fit a need, it's really too bad that it seems like a decision that came back to bite a bit. It will be interesting to see if ISO SQL develops a spec for this after v6-8 UUIDs are finalized, if so it would be good to be compliant (I'm sure some of the MariaDB team would be aware of this) | |||||||||||||||||
| Comment by Anton [ 2022-12-21 ] | |||||||||||||||||
|
Different datatypes is a wrong way because different uuids versions should be compatible with each others. UUID datatype should accept all version uuid at the same time. | |||||||||||||||||
| Comment by Rick Strafy [ 2022-12-30 ] | |||||||||||||||||
|
I think the best move will be to remove that behavior completely and make BC break for the sake of the majority of users. Also it will be expected and compatible with PostgreSQL and other DB platforms. Because I doubt there are people using latest MariaDB 10.7+ which isn't even LTS, together with oldest UUID version, in the time MariaDB 10.7 came out, there were already lexicographic options like ULID for primary keys. Another thing is as I mentioned - custom UUID versions and ULIDs will not work, if you want to reassemble bits based on version position in UUID string. We are humans, we all make mistakes or rash decisions, but there is still option to take it back, it's not too late - since there is no LTS yet, and most people are using only LTS versions. Version 10.11 is only in RC state, there's still a time. Also I can see there is 11.0.0 alpha, maybe the best time to do it there. That reassembling feature is today completely obsolete, when people have the option to use newer UUID versions instead of v1, and legacy apps with v1 UUID did not have ti sorted anyway, so if they were using it as PK and faced performance issue, maybe they were reordering bits before insertion/read on their backend. | |||||||||||||||||
| Comment by Rick Strafy [ 2023-01-16 ] | |||||||||||||||||
|
Any progress on this issue? | |||||||||||||||||
| Comment by Rick Strafy [ 2023-01-30 ] | |||||||||||||||||
|
I would appreciate some response or eta of resolving this bug, I have new version of web app with UUIDv7 migration ready and I'm waiting for months, I'm considering now doing rewrite to bigint as pk, because I want to be app as fast as possible without db fragmentation, since there are tables with a >10 mil. rows and there are intensive api calls. | |||||||||||||||||
| Comment by Daniel Black [ 2023-01-31 ] | |||||||||||||||||
|
serg - I did a first attempt covering the storage - https://github.com/MariaDB/server/pull/2468 Additional guidance on where and how to implement a migration appreciated. | |||||||||||||||||
| Comment by Illia Somov [ 2023-04-27 ] | |||||||||||||||||
|
Please get this resolved. This is ridiculous behavior, we cannot use native MariaDB UUIDs in our project because of it. I see how you can solve it in a way without BC break: just add some sort of parameter to UUID type e.g. `UUID(rearrange=true)`, dunno, which defaults to `false` in new UUIDs, but for existing databases it's `true` to prevent breaking existing data. | |||||||||||||||||
| Comment by Sergei Golubchik [ 2023-05-17 ] | |||||||||||||||||
|
introducing UUID type that only reorders bytes for version <= 5 and using UUID "old" logic (that always reorders) for already existing tables — this is easy, I've attached commits that do it. But to teach MariaDB how to convert old UUID to the new one (which is needed for joins, UNION, even ALTER TABLE) — this requires non-trivial refactoring of the data type plugin framework. It'll need some time. | |||||||||||||||||
| Comment by Trevor Gross [ 2023-05-17 ] | |||||||||||||||||
|
Is the plan to have a new UUID type (`uuid2` or similar) that detects version upon insert and reorders <=v5, then does the reverse upon select? Meaning that if you insert some v1 and some v7 UUIDs into the same column, some will be reordered before store and others won't? Auto-rearranging would work nicely for anyone who a combination of v1 with v6 or v7 - but it seems like unneeded work for all other cases, and could have some unintuitive results (e.g. columns that have a mix of any other types for whatever reason). Especially since the isn't a point in reordering v3/v4/v5 either. I would just prefer a new type that doesn't ever reorder (`binary(16)` with hex/unhex and formatting) instead of one that tries to automatically rearrange some types of UUIDs. This way is just easier to understand, and it makes sense for everyone not using v1 - which will probably eventually become the norm, since v7/v7 are meant to supercede v1. | |||||||||||||||||
| Comment by Sergei Golubchik [ 2023-05-18 ] | |||||||||||||||||
|
The plan is to fix existing UUID, not to add a new type. I'd prefer to have one type that always works in all cases, instead of forcing user to choose which UUID type to use in every specific case. | |||||||||||||||||
| Comment by Sergei Golubchik [ 2023-06-27 ] | |||||||||||||||||
|
see commits 83d97632bb91..68ba39d0b0b (in the bb-11.1-serg branch) | |||||||||||||||||
| Comment by Illia Somov [ 2023-06-28 ] | |||||||||||||||||
|
Hi, Sergei. Just want to make sure there is no misunderstanding. In the commits you've pushed and which were automatically mentioned above, you change the behavior such that UUID bytes are not swapped only for version 8. Personally I don't understand such choice. The original issue description specifically mentions UUIDs version 6 and version 7 which are broken by the current MariaDB behavior, but those seem to not be targeted by your fix. UUIDv8 is a version reserved for custom implementation of a UUID which can depend on a specific implementation, whereas UUIDs v6 and v7 are time-based and already have their bytes swapped for database storage. It's v6 and v7 which are the main source of the problem, not v8. More to that, the only UUID version for which it makes sense to reorder the bytes is UUIDv1 (UPD: and v2), which is the only "old" time-based UUID (by "old" here I mean defined in RFC 4122, prior to this RFC draft: https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html), while the other versions ( | |||||||||||||||||
| Comment by Sergei Golubchik [ 2023-06-28 ] | |||||||||||||||||
|
Sorry, it was a typo in the comment (the commit went through a lot of iterations over the last month).
And from[6] < 0x60 condition means "version < 6" Thanks for spotting this, I'll update the comment when I recommit | |||||||||||||||||
| Comment by Illia Somov [ 2023-06-28 ] | |||||||||||||||||
|
Oh, good to know, thank you! What do you think about not swapping the bytes for other versions ( | |||||||||||||||||
| Comment by Illia Somov [ 2023-06-28 ] | |||||||||||||||||
|
UPD: I am sorry, it seems UUIDv2 is also time-based. It's not actively used (if used at all) in the wild and I got a bit confused. But v3, v4 and v5 are certainly not time-based. | |||||||||||||||||
| Comment by Sergei Golubchik [ 2023-06-28 ] | |||||||||||||||||
|
It's for backward compatibility. Supposedly v3/v4/v5 are used for a long time now. And it doesn't hurt to swap bytes there, so better to keep them as they were. If I'd do a new implementation from scratch, then, perhaps, I'd only swap bytes in v1 and v2. | |||||||||||||||||
| Comment by Evgenii [ 2023-07-05 ] | |||||||||||||||||
|
@serg | |||||||||||||||||
| Comment by Sergei Golubchik [ 2023-07-05 ] | |||||||||||||||||
|
Yes | |||||||||||||||||
| Comment by Alexander Barkov [ 2023-07-06 ] | |||||||||||||||||
|
This patch is OK to push: https://github.com/MariaDB/server/commit/ef84f8137b7c150635c23e1493f8620a4c8527ef | |||||||||||||||||
| Comment by Evgenii [ 2023-07-09 ] | |||||||||||||||||
|
@serg | |||||||||||||||||
| Comment by Sergei Golubchik [ 2023-07-09 ] | |||||||||||||||||
|
Yes. See the home page of https://jira.mariadb.org | |||||||||||||||||
| Comment by Evgenii [ 2023-08-15 ] | |||||||||||||||||
|
@serg Good afternoon, I tried the latest release version 10.11.5 which was released yesterday on August 14th. This bug is present and incorrectly processes queries like "where id < uuid_v7". How to reproduce: id;model_value If i run query "SELECT id, model_value FROM test ORDER BY ID DESC" - everything works correctly and i got all this rows. But if i run query "SELECT id, model_value FROM test where ID < '0189f81c-c795-71d0-bc3f-8b7b74abe044' ORDER BY ID DESC", then i got the result id;model_value The "model_value" column shows that some rows are missing - 19, 18, 16, 13, 11, 8, 7, 4, 2, 1 @serg Is there a quick fix for this bug? I have been waiting for this feature since the end of 2022 | |||||||||||||||||
| Comment by Sergei Golubchik [ 2023-08-15 ] | |||||||||||||||||
|
xpun, this is a new bug, next time (if there will be one) please, report it separately, otherwise it'll likely be missed. This time, as I've noticed it, I've reported it as |