[MDEV-4929] Add Myanmar (Burmese) collation Created: 2013-08-20  Updated: 2014-01-20  Resolved: 2014-01-15

Status: Closed
Project: MariaDB Server
Component/s: None
Fix Version/s: 10.0.7

Type: Task Priority: Minor
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 2
Labels: None

Attachments: File Index.xml.gz     File Index.xml.gz     JPEG File great_sa.jpg     Text File my100.txt     File sort.sql     PNG File sorting.png     Zip Archive sorting.zip     Zip Archive sorting.zip     JPEG File vowel O exception.jpg    
Issue Links:
Blocks
is blocked by MDEV-4928 Merge collation customization improve... Closed

 Description   

There is a long standing feature request from Myanmar users:

http://bugs.mysql.com/bug.php?id=22008

The users are still waiting for the collation.

This task needs MDEV-4928 to be merged from MySQL-5.6 first.



 Comments   
Comment by Sithu Thwin (Inactive) [ 2013-10-06 ]

I want to help. But I don't know how. I've already checkout latest source code from launchpad. Can anyone guide me how to and where to start ?

Comment by Alexander Barkov [ 2013-10-14 ]

Hi Sithu. Sorry for a late reply.

We're releasing 10.0.5 soon which will include "MDEV-4928 Collation customization improvements",
which is prerequisite for the Myanmar collation.
After the release I can give you instructions how to configure mysqld to support Myanmar
as a dynamically loadable collation, so you can test it. If everything is fine with it, then
we can include it into the next release (10.0.6) as a built-in collation, so it will work out
of the box.

Another option is not to wait for 10.0.5. You can download the 10.0.5 sources
from Launchpad lp:~maria-captains/maria/10.0 (see https://launchpad.net/maria/10.0 for details),
so we can start testing right now.

Please let me know when you're ready.
Thanks.

Comment by Sithu Thwin (Inactive) [ 2013-10-14 ]

I've already downloaded latest sources code from maria-captains launchpad
repo. Where and How do I start testing ?
Thanks
Sithu

On Mon, Oct 14, 2013 at 2:21 PM, Alexander Barkov (JIRA) <

Comment by Alexander Barkov [ 2013-10-14 ]

Index.xml with the Myanmar collation definition.

Comment by Alexander Barkov [ 2013-10-14 ]

Please do the following steps:

  • Download Index.xml.gz attached in this issue.
  • gunzip Index.xml.gz
  • Go to /share/charsets directory of your MariaDB-10.0.5 installation
  • Save the existing Index.xml: mv Index.xml Index.xml.orig
  • Put the downloaded Index.xml instead of the old one
  • Restart MariaDB server
  • Run this query:
    SHOW COLLATION LIKE 'utf8_m%';
    It should report the new collation utf8_myanmar_ci.
  • Try to create tables, insert some data and try sorting order.

Thanks.

Comment by Sithu Thwin (Inactive) [ 2013-10-14 ]

I've downloaded source codes using instruction on this link
https://mariadb.com/kb/en/Getting_the_MariaDB_Source_Code/
I got maria source tree. I found VERSION file in it and it shows 10.0.4.
I branched with command bzr branch lp:maria/10.0 . I got new dir 10.0 and
it's VERSION file still 10.0.4.
I build mariadb server in 10.0 branch and installed. MySql version still
show 10.0.4.
I replaced index.xml and when I run SHOW COLLATION LIKE 'utf8_m%';
I got

Empty set (0.00 sec)

in mysql prompt.
What am I doing wrong?
Thanks,
Sithu

On Mon, Oct 14, 2013 at 3:55 PM, Alexander Barkov (JIRA) <

Comment by Alexander Barkov [ 2013-10-15 ]

Please try this SQL script:

drop table if exists t1;
create table t1 (a varchar(10) character set utf8 collate utf8_myanmar_ci);
show warnings;

What does "show warnings" return?

Thanks.

Comment by Sithu Thwin (Inactive) [ 2013-10-15 ]

Which database I should use? mysql or information_scheme ?
Thanks

On Tue, Oct 15, 2013 at 12:59 PM, Alexander Barkov (JIRA) <

Comment by Alexander Barkov [ 2013-10-15 ]

Non of them. Please use "test" or some other non-system database you have access to.

Comment by Alexander Barkov [ 2013-10-15 ]

Sorry, the "SHOW WARNINGS" output in the above comment does not look right.

I guess something happened during copy-and-paste.
Can you please paste again?

Comment by Alexander Barkov [ 2013-10-15 ]

Did you restart the server after replacing Index.xml ?

Comment by Sithu Thwin (Inactive) [ 2013-10-15 ]

Sorry first output is not the right one. It is before replacing
Index.xml.

Here is after replacing Index.xml

MariaDB [mysql]> drop table if exists t1;
Query OK, 0 rows affected, 1 warning (0.00 sec)

MariaDB [mysql]> create table t1 (a varchar(10) character set utf8 collate
utf8_myanmar_ci);
Query OK, 0 rows affected (0.35 sec)

MariaDB [mysql]> show warnings;Empty set (0.00 sec)

MariaDB [mysql]>

Comment by Sithu Thwin (Inactive) [ 2013-10-15 ]

I run killall mysqld before starting mysql.
Here is output on test database.

MariaDB [test]> drop table if exists t1;
Query OK, 0 rows affected (0.18 sec)

MariaDB [test]> create table t1 (a varchar(10) character set utf8 collate utf8_myanmar_ci);
Query OK, 0 rows affected (0.38 sec)

MariaDB [test]> show warnings;
Empty set (0.00 sec)

MariaDB [test]>

Comment by Alexander Barkov [ 2013-10-15 ]

So it worked fine. The table has been created.
You can now insert some data into it and try sorting.

I guess "SHOW COLLATION LIKE 'utf8_m%';" now will also report utf8_myanmar_ci
without problems.

Comment by Sithu Thwin (Inactive) [ 2013-10-15 ]

Ah, It's working now. Here is the output for SHOW COLLATION LIKE 'utf8_m%';

MariaDB [test]> SHOW COLLATION LIKE 'utf8_m%';
-------------------------------------------------+

Collation Charset Id Default Compiled Sortlen

-------------------------------------------------+

utf8_myanmar_ci utf8 220     8

-------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [test]>

I will start testing and checking sorting will work or not. Will report the result.

Comment by Sithu Thwin (Inactive) [ 2013-10-15 ]

Some words not correctly sorted. I will check and make change.
BTW, in myanmar, we have more than 4 ethnic language with different sorting or encoding using same characters. I want know wheather those language can be added in collation.

Thanks

Comment by Alexander Barkov [ 2013-10-15 ]

> Some words not correctly sorted.

Can you please give an example of what is not sorted correctly?
Please dump your table and attach the dump here into the issue,
so I can reproduce it on my machine.

> I will check and make change.

Are you going to change Index.xml ?
It might be tricky. The Myanmar collation definition is quite complex.
Can you please send the dump first?

> BTW, in myanmar, we have more than 4 ethnic language with different sorting or encoding using same characters.
> I want know wheather those language can be added in collation.

I need to check. What are these languages?

Comment by Sithu Thwin (Inactive) [ 2013-10-15 ]

See attached screenshot. Top word ၎င်း (u104e + u1004 + u103a + u1038)
should not be sorted as first word. That word has at least 3 form and
sometime spell even with Myanmar Digit 4. I've attacted my testing script
and sql dump file. Sql dump's content is taken from hunspell my_MM.dic.

It might also have some other issues, I will check with Myanmar DIctionary
Sorting guide and report if found something incorrectly sorting.

Other languages are Shan, Mon, Kayin, Kayah(Kayini).

Thanks.

On Tue, Oct 15, 2013 at 4:21 PM, Alexander Barkov (JIRA) <

Comment by Alexander Barkov [ 2013-10-15 ]

The collation engine is currently limited to 6 code points in a single collation element.
The first word is not sorted correctly because it should be sorted near:

u+101C u+100A u+103A u+1038 u+1000 u+1031 u+102C u+1004 u+103A u+1038

which makes 10 code points. I'll check how to fix this.

Is anything else sorted in a wrong way?

Comment by Sithu Thwin (Inactive) [ 2013-10-17 ]

Images in sorting.zip are sorting guide which is scanned from Myanmar
Dictionary 5 Books series printed/published around 1970. More than 30,000
people participated in collecting words for that books. The best reference
book in Myanmar.

Great SA (ဿ u+103F) must be sorted exactly after SA ( သ u+101E) group.
(ဌ္ဋ u+101C u+1039 u+100B) must be sorted exactly after ( ဌ u+101C) group.
It is just one letter with the combination of 3 glymps.

In sql dump file, consonants, syllabus, medial tables are written and
ordered as in the images from sorting.zip.

ဧ U+1027 has two different ways to sort how the word sound.

for example in the word ဧချင်း (u+1027 u+1001 u+103B u+1004 u+103A u+1038)
it treated as အေး (u+1021 u+1031 u+1038) ။
in ဧရာ it treated as အေ (u+1021 u+1031)
I think this cannot be sorted without wordlist dictionary.

On Tue, Oct 15, 2013 at 5:46 PM, Alexander Barkov (JIRA) <

Comment by Alexander Barkov [ 2013-10-17 ]

Thanks for the information.

I have a question about the order in the table syllables:

From my understanding the record with sid=27 must be greater than the record sid=108.

Record with sid=108 is consonant u+1021 followed by vovel "u+102C u+101A u+103A".
Record with sid=27 is consonant u+1021 followed by vovel "u+102D u+102F u+101A u+103A".

The collation defition was taken from the Common Locale Data Repository:
http://unicode.org/repos/cldr/tags/release-23/common/collation/my.xml
(Please download this file for reference).

According to this collation definition, the above two vowels
are defined in the same big group, relatively to u+1034,
in this order:

<reset>\u1034</reset> (Index.xml:461, my.xml:37)
...
<s>\u102C\u101A\u103A</s> Index.xml:927, my.xml:503)
...
<s>\u102D\u102F\u101A\u103A</s> Index.xml:941, my.xml:517)

which means \u102C\u101A\u103A is smaller than \u102D\u102F\u101A\u103A,
which means the record sid=27 is bigger than the record sid=108.

Can you please confirm this?
It seems "sid" is not in alphabetic order in the table "syllables".

Also, I found that sid 28, 29,30,31, 32 are also not in the alphabetic order:

The records should be in this ascending order:

sid Code points where defined
109 1021 + 101B 103A my.xml:519
110 1021 + 102C 101B 103A my.xml:521
111 1021 + 1031 101B 103A my.ml:529
28 1021 + 102D 102F 101B 103A my.xml:535
112 1021 + 101C 103A my.xml:537
29 1021 + 102D 102F 101C 103A my.xml:553
113 1021 + 101E 103A my.xml:563
30 1021 + 102D 102F 101E 103A my.xml:580
114 1021 + 101F 103A my.xml:582
31 1021 + 102D 102F 101F 103A my.xml:598
32 1021 + 102D 102F 1020 103A my.xml:607

Is that correct?

Comment by Sithu Thwin (Inactive) [ 2013-10-18 ]

See the attachment image, I've written unicode code point reference in
image.
Your understanding is right according to unicode.org collation definition.
But some part might be wrong with that collation definition.
For example u+1034 is never used in Myanmar Language. It is Mon vowel O. So
everything in that group might be related with Mon Language.

Myanmar Language sorting is mostly base on phonetic order. Sometime
phonetic order and alphabetic order are not the same.

regards,
Sithu

On Thu, Oct 17, 2013 at 6:31 PM, Alexander Barkov (JIRA) <

Comment by Alexander Barkov [ 2013-10-18 ]

So are you Ok with this sorting order:

sid Code points where defined
109 1021 + 101B 103A my.xml:519
110 1021 + 102C 101B 103A my.xml:521
111 1021 + 1031 101B 103A my.ml:529
28 1021 + 102D 102F 101B 103A my.xml:535
112 1021 + 101C 103A my.xml:537
29 1021 + 102D 102F 101C 103A my.xml:553
113 1021 + 101E 103A my.xml:563
30 1021 + 102D 102F 101E 103A my.xml:580
114 1021 + 101F 103A my.xml:582
31 1021 + 102D 102F 101F 103A my.xml:598
32 1021 + 102D 102F 1020 103A my.xml:607

If we fix the other problems and keep these rules for 103A,
will such collation work fine for you?

Comment by Sithu Thwin (Inactive) [ 2013-10-21 ]

I'm inviting language and IT professionals to discuss in this ticket. So
wait for their response or I will reply with more detail information about
Myanmar sorting. I'm trying to contact with Myanmar language professors or
tutors from university.

regards,
Sithu

On Sat, Oct 19, 2013 at 12:39 AM, Alexander Barkov (JIRA) <

Comment by Alexander Barkov [ 2013-10-30 ]

select words from km_alphabet where id <100 order by id

Comment by Alexander Barkov [ 2013-10-30 ]

Hello Sithu,
Any news about the correct Myanmar order?

In the meanwhile I made an experiment:
1. I created a file with the results of this SQL query:

select words from km_alphabet where id <100 order by id;
(see attached).

2. Opened this link (ICU locale explorer for the Myanmar collation):
http://demo.icu-project.org/icu-bin/locexp?_=my_MM&d_=en&x=col

3. Copied the file into clipboard and pasted it into the "source" section.
4. Checked the "Hide Collation Key" (for simpler results output) checkbox.
5. Pressed the "Sort" button
6. Checked the results in the "Original" and "Collated" fields.

The results in the "Collated" field appeared in this order:

02: က
04: ကကတိုး
03: ကကတစ်
05: ကကူရံ
09: ကကြိုးတန်ဆာ
10: ကကြိုးတန်ဆာဆင်
11: ကခုန်
14: ကချေသည်
15: ကချော်ကချင်
16: ကချော်ကချွတ်
12: ကချင်
13: ကချင်ပြည်နယ်
64: ကစား
65: ကစားဒိုင်
66: ကစားဝိုင်း
67: ကစားသမား
68: ကစီ
69: ကစီတင်
70: ကစီရည်
71: ကစော်
61: ကစစ်
63: ကစဉ်ကလျား
62: ကစဉ့်ကလျား
74: ကစွဲကစောင်း
72: ကစွန်း
73: ကစွန်းဥ
75: ကဆုန်
76: ကဆွဲ့ကနွဲ့
85: ကညာ
77: ကညင်
78: ကညင်ဆီ
79: ကညင်တိုင်
80: ကညင်နီ
81: ကညင်ပျံ
82: ကညင်ဖြူ
83: ကညစ်
84: ကညစ်သွား
86: ကညွတ်
87: ကညှပ်
94: ကတိ
95: ကတိကဝတ်
96: ကတိခံ
97: ကတိစောင့်
98: ကတိတည်
99: ကတိထား
91: ကတင်
92: ကတည်းက
93: ကတန်းကရမ်;
01: ကို
08: ကက်
06: ကက္ကုကမည်းပွင့်
07: ကက္ကော်တကန်
17: ကင်
60: ကင်္ကာ
18: ကင်ညှပ်
19: ကင်ပလစ်
20: ကင်ပျစ်
21: ကင်ပွန်း
22: ကင်း
23: ကင်းကစီ
24: ကင်းကိုင်
25: ကင်းကွာ
26: ကင်းခိုး
27: ကင်းချုပ်
28: ကင်းခြေများ
31: ကင်းခွေး
29: ကင်းခွန်
30: ကင်းခွန်ကင်းခ
32: ကင်းစ
33: ကင်းစား
34: ကင်းစီး
36: ကင်းစောင့်
37: ကင်းစောင့်ထား
35: ကင်းစုန်း
40: ကင်းတဲ
38: ကင်းတပ်
39: ကင်းတပ်ဥပဒေ
41: ကင်းထား
42: ကင်းထိုး
43: ကင်းထိုးမြင်းစီး
44: ကင်းထောက်
45: ကင်းနားသန်
46: ကင်းပုစွန်
47: ကင်းဖလောင်ကောင်
48: ကင်းဘူ
49: ကင်းမလက်မည်း
50: ကင်းမြီးကောက်
51: ကင်းရုံ
53: ကင်းလုလင်
52: ကင်းလိပ်ချော
54: ကင်းဝန်
55: ကင်းဝန်းကင်းပတ်လှည့်
58: ကင်းသား
56: ကင်းသင်း
57: ကင်းသန်း
59: ကင်းအုပ်
88: ကဏ္ဍ
89: ကဏ္ဍဇာ
90: ကဏ္ဏမူ

Can you please check if this order is correct? Thanks.

We can reproduce the same order in MariaDB.
But if this order does not look correct, then the collation definition in CLDR is wrong.
We cannot add a collation without having a correct definition.

Comment by Sithu Thwin (Inactive) [ 2013-11-02 ]

Sorry for delay. My son is in hospital about 1 week because Dengue Fever
infected.

Above 100 words sorting order is correct. Because CLDR definition is
generally correct.
But some exception words might be wrong in some case.
I'm still looking for someone more professional about Myanmar Language.
Thanks

2013/10/30 Alexander Barkov (JIRA) <jira@mariadb.atlassian.net>

Comment by Alexander Barkov [ 2013-11-05 ]

Sithu, I wish your son get well soon.

Please tell me when you are ready to do some more testing.
I made a few enhancements in the collation code, and in
the collation definition.

Comment by Sithu Thwin (Inactive) [ 2013-11-05 ]

Thank you, my son is fine now.

I'm ready to test. Do I need to update bzr source code? I will update
source code in my computer tomorrow.

On Tue, Nov 5, 2013 at 3:52 PM, Alexander Barkov (JIRA) <

Comment by Alexander Barkov [ 2013-11-12 ]

Index.xml, version 2.

Comment by Alexander Barkov [ 2013-11-12 ]

Hi Sithu,

Please build MariaDB from the latest revision from:
https://launchpad.net/maria/10.0
and put the new version of Index.xml which I've just attached.
Then create your tables and test sorting order again.
Thanks.

Comment by Alexander Barkov [ 2013-11-18 ]

Hi Sithu, did you have a chance to try the latest version? Thanks.

Comment by Sithu Thwin (Inactive) [ 2013-11-22 ]

Internet connection in my country is very slow. I'm trying to update
bzr source code. It never complete. So I can't test till now.
I will report as soon as I've updated my source code and send test
result. I'm trying to figure out to fix internet connection.

Thanks.

Comment by Sithu Thwin (Inactive) [ 2013-12-03 ]

sorry for long time waiting. I've already tested new version. I don't see any improvement. MariaDB version is 10.0.7 .
Which parts did you change in Index.xml? ၎င်း (u104e + u1004 + u103a + u1038) is still at the first place in sorting.

Comment by Alexander Barkov [ 2013-12-16 ]

Make sure to use the version uploaded on 2013-11-12.

There are two differences:
1. This line makes it use Unicode-5.2.0, which has more Myanmar characters defined for sorting.
<collation name="utf8_myanmar_ci" id="220" shift-after-method="expand" version="5.2.0">

2. This line is not commented anymore:
<reset>\u101C\u100A\u103A\u1038\u1000\u1031\u102C\u1004\u103A\u1038</reset><s>\u104E\u1004\u103A\u1038</s>

Please try this query:
mysql> SELECT HEX,HEX(CONVERT(x USING utf8)) as utf8,HEX(WEIGHT_STRING(CONVERT(t1.x USING utf8) COLLATE utf8_myanmar_ci)) AS weight FROM (SELECT _ucs2 X'101C100A103A103810001031102C1004103A1038' AS x UNION SELECT _ucs2 X'104E1004103A1038') AS t1;

The expected result is:
----------------------------------------------------------------------------------------------------------------------------------------------

HEX utf8 weight

----------------------------------------------------------------------------------------------------------------------------------------------

101C100A103A103810001031102C1004103A1038 E1809CE1808AE180BAE180B8E18080E180B1E180ACE18084E180BAE180B8 220D22483B1322593ACC21CD22483AEE22593ACC
104E1004103A1038 E1818EE18084E180BAE180B8 220D22483B1322593ACC21CD22483AEE22593ACC

----------------------------------------------------------------------------------------------------------------------------------------------

Notice, the "weight" value is the same in the two records.
If you get something else, you're probably still using the old version of Index.xml.

Comment by Sithu Thwin (Inactive) [ 2013-12-17 ]

Today, I've tested again. Everything work correctly. Sorting rules are
correct according to Unicode Collation Standard.
Thanks
Sithu

On Mon, Dec 16, 2013 at 6:11 PM, Alexander Barkov (JIRA) <

Comment by Sithu Thwin (Inactive) [ 2014-01-09 ]

I hope Myanmar collation will be included in mariadb 10.* final release

On Tue, Dec 17, 2013 at 2:00 PM, Sithu Thwin (JIRA) <

Comment by Alexander Barkov [ 2014-01-10 ]

Hi Sithu.

It was included into 10.0.7.

See the full changelog here:

https://mariadb.com/kb/en/mariadb-1007-changelog/

Thanks for your help to make this happen!

Comment by Sithu Thwin (Inactive) [ 2014-01-10 ]

I'm very happy and all Myanmar web developers will be happy
to hear this news.
Can we use MariaDB 10.0.7 on normal WordPress production site? Or do we
need to wait for final release ?
Thank you very much,
Sithu

On Fri, Jan 10, 2014 at 4:17 PM, Alexander Barkov (JIRA) <

Comment by Alexander Barkov [ 2014-01-20 ]

Maria-10.0 is now in beta, which means it can still have some critical
bugs in the new 10.0 code, as well as in the code merged from MySQL-5.6.

10.0 is targeted to be GA in summer 2014.
If you don't use any other new 10.0 features (other than the Myanmar collation)
then it should be safe to use 10.0 beta.

Generated at Thu Feb 08 07:00:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.