[CONJ-988] UTF-16 surrogates are incorrectly computed Created: 2022-07-05  Updated: 2022-07-26  Resolved: 2022-07-25

Status: Closed
Project: MariaDB Connector/J
Component/s: Other
Affects Version/s: 3.0.6
Fix Version/s: N/A

Type: Bug
Priority: Blocker
Reporter: Axel Dörfler
Assignee: Diego Dupin
Resolution: Not a Bug
Votes: 0
Labels: None


 Description   

The code to compute the surrogate pair looks like this (in org.mariadb.jdbc.client.socket.impl.PacketWriter):

              int surrogatePairs =
                  ((currChar << 10) + nextChar) + (0x010000 - (0xD800 << 10) - 0xDC00);

According to the Unicode standard, however, it should look like this (https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates):

    code = 0x10000;
    code += (units[0] & 0x03FF) << 10;
    code += (units[1] & 0x03FF);

Not too surprisingly, the two computations do not produce the same result.
Example: \udbc0\udd89

public class MyClass {
    public static void main(String args[]) {
        char current=0xdbc0;
        char next=0xdd89;
        int c=10000;
        c+=(current & 0x3ff) << 10;
        c+=(next & 0x3ff);

        int surrogatePairs =
            ((current << 10) + next) + (0x010000 - (0xD800 << 10) - 0xDC00);

        System.out.println(c+" VS. "+surrogatePairs);
    }
}



 Comments   
Comment by Diego Dupin [ 2022-07-25 ]

Hmm. "surrogatePairs" is badly named; "codePoint" would have been more appropriate. In fact:

    int surrogatePairs = ((current << 10) + next) + (0x010000 - (0xD800 << 10) - 0xDC00);

would be better replaced by:

  int codePoint = Character.toCodePoint(current , next);

Never mind, I'll change that.
I think it is probably just your example that is wrong: replace int c=10000; with int c=0x10000; and the two results should match.
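
For illustration, here is a minimal sketch of the reporter's example with the base written as 0x10000. The class and method names (SurrogateExample, decodeMasked, decodeOffset) are hypothetical and not part of Connector/J; only the two formulas are taken from this report.

```java
final class SurrogateExample {
    // Formula from the Unicode reference, with the base in hexadecimal.
    static int decodeMasked(char high, char low) {
        int code = 0x10000;
        code += (high & 0x03FF) << 10;
        code += (low & 0x03FF);
        return code;
    }

    // Formula as quoted from PacketWriter in this report.
    static int decodeOffset(char high, char low) {
        return ((high << 10) + low) + (0x010000 - (0xD800 << 10) - 0xDC00);
    }

    public static void main(String[] args) {
        char current = 0xdbc0;
        char next = 0xdd89;
        // Both values should be U+100189 for the pair \udbc0\udd89.
        System.out.println(decodeMasked(current, next) + " VS. " + decodeOffset(current, next));
        // The JDK helper for the same conversion.
        System.out.println(Character.toCodePoint(current, next));
    }
}
```

For the pair \udbc0\udd89, all three computations should give 1048969 (0x100189).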

Comment by Axel Dörfler [ 2022-07-25 ]

I can confirm that Character.toCodePoint() is exactly the same code as your version. However, it does not produce the same results as the computation in the standard. Well, it does for 0x10000, as you mention, but that is just one case; it does not give the same result for pretty much any other value.

Comment by Diego Dupin [ 2022-07-25 ]

I mean, the two are equal in all cases:

currChar is in the range U+D800 to U+DBFF inclusive => currChar & 0x3ff strictly equals currChar - 0xD800
nextChar is in the range U+DC00 to U+DFFF inclusive => nextChar & 0x3ff strictly equals nextChar - 0xDC00

((currChar << 10) + nextChar) + (0x010000 - (0xD800 << 10) - 0xDC00)
= 0x10000 + ((currChar - 0xD800) << 10) + (nextChar - 0xDC00)
= 0x10000 + ((currChar & 0x3ff) << 10) + (nextChar & 0x3ff)
if you prefer.
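
The identity above can also be checked by brute force. The following is a sketch only: the class and method names (SurrogateEquivalenceCheck, countMismatches) are hypothetical. It loops over every valid high/low surrogate pair (1024 x 1024 combinations) and compares the offset form, the mask form, and Character.toCodePoint().

```java
final class SurrogateEquivalenceCheck {
    // Counts the surrogate pairs for which the three computations disagree.
    static int countMismatches() {
        int mismatches = 0;
        for (int high = 0xD800; high <= 0xDBFF; high++) {
            for (int low = 0xDC00; low <= 0xDFFF; low++) {
                // Offset form, as quoted from PacketWriter in this report.
                int offsetForm = ((high << 10) + low) + (0x010000 - (0xD800 << 10) - 0xDC00);
                // Mask form, as given in the Unicode reference.
                int maskForm = 0x10000 + ((high & 0x3ff) << 10) + (low & 0x3ff);
                // JDK helper.
                int jdkForm = Character.toCodePoint((char) high, (char) low);
                if (offsetForm != maskForm || maskForm != jdkForm) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

If the derivation holds, the count is 0, meaning the three forms agree for every valid pair.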

Comment by Axel Dörfler [ 2022-07-26 ]

You're absolutely right! Damn, I'm not sure what I was thinking yesterday; to test, I used values for 'c' other than 0x10000 (in the range from 0x10000 to 0x10ffff), instead of using different surrogate pairs with the correct base. Sorry for the noise and my stupidity; at least the code got a bit cleaner as a result! Thanks for your patience!

Comment by Diego Dupin [ 2022-07-26 ]

No problem. Issues are always possible, and double-checking is always a good idea!

Generated at Thu Feb 08 03:19:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.