Details
-
Task
-
Status: Closed (View Workflow)
-
Minor
-
Resolution: Fixed
-
None
-
None
Description
There is a long standing feature request from Myanmar users:
http://bugs.mysql.com/bug.php?id=22008
The users are still waiting for the collation.
This task needs MDEV-4928 to be merged from MySQL-5.6 first.
Attachments
- great_sa.jpg
- 1.80 MB
- Index.xml.gz
- 7 kB
- Index.xml.gz
- 7 kB
- my100.txt
- 2 kB
- sort.sql
- 777 kB
- sorting.png
- 15 kB
- sorting.zip
- 4.11 MB
- sorting.zip
- 132 kB
- vowel O exception.jpg
- 1.28 MB
Issue Links
- is blocked by
-
MDEV-4928 Merge collation customization improvements
-
- Closed
-
Activity
Hi Sithu. Sorry for a late reply.
We're releasing 10.0.5 soon which will include "MDEV-4928 Collation customization improvements",
which is prerequisite for the Myanmar collation.
After the release I can give you instructions how to configure mysqld to support Myanmar
as a dynamically loadable collation, so you can test it. If everything is fine with it, then
we can include it into the next release (10.0.6) as a built-in collation, so it will work out
of the box.
Another option is not to wait for 10.0.5. You can download the 10.0.5 sources
from Launchpad lp:~maria-captains/maria/10.0 (see https://launchpad.net/maria/10.0 for details),
so we can start testing right now.
Please let me know when you're ready.
Thanks.
I've already downloaded latest sources code from maria-captains launchpad
repo. Where and How do I start testing ?
Thanks
Sithu
On Mon, Oct 14, 2013 at 2:21 PM, Alexander Barkov (JIRA) <
–
Please do the following steps:
- Download Index.xml.gz attached in this issue.
- gunzip Index.xml.gz
- Go to /share/charsets directory of your MariaDB-10.0.5 installation
- Save the existing Index.xml: mv Index.xml Index.xml.orig
- Put the downloaded Index.xml instead of the old one
- Restart MariaDB server
- Run this query:
SHOW COLLATION LIKE 'utf8_m%';
It should report the new collation utf8_myanmar_ci. - Try to create tables, insert some data and try sorting order.
Thanks.
I've downloaded source codes using instruction on this link
https://mariadb.com/kb/en/Getting_the_MariaDB_Source_Code/
I got maria source tree. I found VERSION file in it and it shows 10.0.4.
I branched with command bzr branch lp:maria/10.0 . I got new dir 10.0 and
it's VERSION file still 10.0.4.
I build mariadb server in 10.0 branch and installed. MySql version still
show 10.0.4.
I replaced index.xml and when I run SHOW COLLATION LIKE 'utf8_m%';
I got
Empty set (0.00 sec)
in mysql prompt.
What am I doing wrong?
Thanks,
Sithu
On Mon, Oct 14, 2013 at 3:55 PM, Alexander Barkov (JIRA) <
–
Please try this SQL script:
drop table if exists t1;
create table t1 (a varchar(10) character set utf8 collate utf8_myanmar_ci);
show warnings;
What does "show warnings" return?
Thanks.
Which database I should use? mysql or information_scheme ?
Thanks
On Tue, Oct 15, 2013 at 12:59 PM, Alexander Barkov (JIRA) <
–
Non of them. Please use "test" or some other non-system database you have access to.
Sorry, the "SHOW WARNINGS" output in the above comment does not look right.
I guess something happened during copy-and-paste.
Can you please paste again?
Sorry first output is not the right one. It is before replacing
Index.xml.
Here is after replacing Index.xml
MariaDB [mysql]> drop table if exists t1;
Query OK, 0 rows affected, 1 warning (0.00 sec)
MariaDB [mysql]> create table t1 (a varchar(10) character set utf8 collate
utf8_myanmar_ci);
Query OK, 0 rows affected (0.35 sec)
MariaDB [mysql]> show warnings;Empty set (0.00 sec)
MariaDB [mysql]>
–
I run killall mysqld before starting mysql.
Here is output on test database.
MariaDB [test]> drop table if exists t1;
Query OK, 0 rows affected (0.18 sec)
MariaDB [test]> create table t1 (a varchar(10) character set utf8 collate utf8_myanmar_ci);
Query OK, 0 rows affected (0.38 sec)
MariaDB [test]> show warnings;
Empty set (0.00 sec)
MariaDB [test]>
So it worked fine. The table has been created.
You can now insert some data into it and try sorting.
I guess "SHOW COLLATION LIKE 'utf8_m%';" now will also report utf8_myanmar_ci
without problems.
Ah, It's working now. Here is the output for SHOW COLLATION LIKE 'utf8_m%';
MariaDB [test]> SHOW COLLATION LIKE 'utf8_m%';
-------------------------------------------------+
Collation | Charset | Id | Default | Compiled | Sortlen |
-------------------------------------------------+
utf8_myanmar_ci | utf8 | 220 | 8 |
-------------------------------------------------+
1 row in set (0.00 sec)
MariaDB [test]>
I will start testing and checking sorting will work or not. Will report the result.
Some words not correctly sorted. I will check and make change.
BTW, in myanmar, we have more than 4 ethnic language with different sorting or encoding using same characters. I want know wheather those language can be added in collation.
Thanks
> Some words not correctly sorted.
Can you please give an example of what is not sorted correctly?
Please dump your table and attach the dump here into the issue,
so I can reproduce it on my machine.
> I will check and make change.
Are you going to change Index.xml ?
It might be tricky. The Myanmar collation definition is quite complex.
Can you please send the dump first?
> BTW, in myanmar, we have more than 4 ethnic language with different sorting or encoding using same characters.
> I want know wheather those language can be added in collation.
I need to check. What are these languages?
See attached screenshot. Top word ၎င်း (u104e + u1004 + u103a + u1038)
should not be sorted as first word. That word has at least 3 form and
sometime spell even with Myanmar Digit 4. I've attacted my testing script
and sql dump file. Sql dump's content is taken from hunspell my_MM.dic.
It might also have some other issues, I will check with Myanmar DIctionary
Sorting guide and report if found something incorrectly sorting.
Other languages are Shan, Mon, Kayin, Kayah(Kayini).
Thanks.
On Tue, Oct 15, 2013 at 4:21 PM, Alexander Barkov (JIRA) <
–
The collation engine is currently limited to 6 code points in a single collation element.
The first word is not sorted correctly because it should be sorted near:
u+101C u+100A u+103A u+1038 u+1000 u+1031 u+102C u+1004 u+103A u+1038
which makes 10 code points. I'll check how to fix this.
Is anything else sorted in a wrong way?
Images in sorting.zip are sorting guide which is scanned from Myanmar
Dictionary 5 Books series printed/published around 1970. More than 30,000
people participated in collecting words for that books. The best reference
book in Myanmar.
Great SA (ဿ u+103F) must be sorted exactly after SA ( သ u+101E) group.
(ဌ္ဋ u+101C u+1039 u+100B) must be sorted exactly after ( ဌ u+101C) group.
It is just one letter with the combination of 3 glymps.
In sql dump file, consonants, syllabus, medial tables are written and
ordered as in the images from sorting.zip.
ဧ U+1027 has two different ways to sort how the word sound.
for example in the word ဧချင်း (u+1027 u+1001 u+103B u+1004 u+103A u+1038)
it treated as အေး (u+1021 u+1031 u+1038) ။
in ဧရာ it treated as အေ (u+1021 u+1031)
I think this cannot be sorted without wordlist dictionary.
On Tue, Oct 15, 2013 at 5:46 PM, Alexander Barkov (JIRA) <
–
Thanks for the information.
I have a question about the order in the table syllables:
From my understanding the record with sid=27 must be greater than the record sid=108.
Record with sid=108 is consonant u+1021 followed by vovel "u+102C u+101A u+103A".
Record with sid=27 is consonant u+1021 followed by vovel "u+102D u+102F u+101A u+103A".
The collation defition was taken from the Common Locale Data Repository:
http://unicode.org/repos/cldr/tags/release-23/common/collation/my.xml
(Please download this file for reference).
According to this collation definition, the above two vowels
are defined in the same big group, relatively to u+1034,
in this order:
<reset>\u1034</reset> (Index.xml:461, my.xml:37)
...
<s>\u102C\u101A\u103A</s> Index.xml:927, my.xml:503)
...
<s>\u102D\u102F\u101A\u103A</s> Index.xml:941, my.xml:517)
which means \u102C\u101A\u103A is smaller than \u102D\u102F\u101A\u103A,
which means the record sid=27 is bigger than the record sid=108.
Can you please confirm this?
It seems "sid" is not in alphabetic order in the table "syllables".
Also, I found that sid 28, 29,30,31, 32 are also not in the alphabetic order:
The records should be in this ascending order:
sid | Code points | where defined |
109 | 1021 + 101B 103A | my.xml:519 |
110 | 1021 + 102C 101B 103A | my.xml:521 |
111 | 1021 + 1031 101B 103A | my.ml:529 |
28 | 1021 + 102D 102F 101B 103A | my.xml:535 |
112 | 1021 + 101C 103A | my.xml:537 |
29 | 1021 + 102D 102F 101C 103A | my.xml:553 |
113 | 1021 + 101E 103A | my.xml:563 |
30 | 1021 + 102D 102F 101E 103A | my.xml:580 |
114 | 1021 + 101F 103A | my.xml:582 |
31 | 1021 + 102D 102F 101F 103A | my.xml:598 |
32 | 1021 + 102D 102F 1020 103A | my.xml:607 |
Is that correct?
See the attachment image, I've written unicode code point reference in
image.
Your understanding is right according to unicode.org collation definition.
But some part might be wrong with that collation definition.
For example u+1034 is never used in Myanmar Language. It is Mon vowel O. So
everything in that group might be related with Mon Language.
Myanmar Language sorting is mostly base on phonetic order. Sometime
phonetic order and alphabetic order are not the same.
regards,
Sithu
On Thu, Oct 17, 2013 at 6:31 PM, Alexander Barkov (JIRA) <
–
So are you Ok with this sorting order:
sid | Code points | where defined |
109 | 1021 + 101B 103A | my.xml:519 |
110 | 1021 + 102C 101B 103A | my.xml:521 |
111 | 1021 + 1031 101B 103A | my.ml:529 |
28 | 1021 + 102D 102F 101B 103A | my.xml:535 |
112 | 1021 + 101C 103A | my.xml:537 |
29 | 1021 + 102D 102F 101C 103A | my.xml:553 |
113 | 1021 + 101E 103A | my.xml:563 |
30 | 1021 + 102D 102F 101E 103A | my.xml:580 |
114 | 1021 + 101F 103A | my.xml:582 |
31 | 1021 + 102D 102F 101F 103A | my.xml:598 |
32 | 1021 + 102D 102F 1020 103A | my.xml:607 |
If we fix the other problems and keep these rules for 103A,
will such collation work fine for you?
I'm inviting language and IT professionals to discuss in this ticket. So
wait for their response or I will reply with more detail information about
Myanmar sorting. I'm trying to contact with Myanmar language professors or
tutors from university.
regards,
Sithu
On Sat, Oct 19, 2013 at 12:39 AM, Alexander Barkov (JIRA) <
–
Hello Sithu,
Any news about the correct Myanmar order?
In the meanwhile I made an experiment:
1. I created a file with the results of this SQL query:
select words from km_alphabet where id <100 order by id;
(see attached).
2. Opened this link (ICU locale explorer for the Myanmar collation):
http://demo.icu-project.org/icu-bin/locexp?_=my_MM&d_=en&x=col
3. Copied the file into clipboard and pasted it into the "source" section.
4. Checked the "Hide Collation Key" (for simpler results output) checkbox.
5. Pressed the "Sort" button
6. Checked the results in the "Original" and "Collated" fields.
The results in the "Collated" field appeared in this order:
02: က
04: ကကတိုး
03: ကကတစ်
05: ကကူရံ
09: ကကြိုးတန်ဆာ
10: ကကြိုးတန်ဆာဆင်
11: ကခုန်
14: ကချေသည်
15: ကချော်ကချင်
16: ကချော်ကချွတ်
12: ကချင်
13: ကချင်ပြည်နယ်
64: ကစား
65: ကစားဒိုင်
66: ကစားဝိုင်း
67: ကစားသမား
68: ကစီ
69: ကစီတင်
70: ကစီရည်
71: ကစော်
61: ကစစ်
63: ကစဉ်ကလျား
62: ကစဉ့်ကလျား
74: ကစွဲကစောင်း
72: ကစွန်း
73: ကစွန်းဥ
75: ကဆုန်
76: ကဆွဲ့ကနွဲ့
85: ကညာ
77: ကညင်
78: ကညင်ဆီ
79: ကညင်တိုင်
80: ကညင်နီ
81: ကညင်ပျံ
82: ကညင်ဖြူ
83: ကညစ်
84: ကညစ်သွား
86: ကညွတ်
87: ကညှပ်
94: ကတိ
95: ကတိကဝတ်
96: ကတိခံ
97: ကတိစောင့်
98: ကတိတည်
99: ကတိထား
91: ကတင်
92: ကတည်းက
93: ကတန်းကရမ်;
01: ကို
08: ကက်
06: ကက္ကုကမည်းပွင့်
07: ကက္ကော်တကန်
17: ကင်
60: ကင်္ကာ
18: ကင်ညှပ်
19: ကင်ပလစ်
20: ကင်ပျစ်
21: ကင်ပွန်း
22: ကင်း
23: ကင်းကစီ
24: ကင်းကိုင်
25: ကင်းကွာ
26: ကင်းခိုး
27: ကင်းချုပ်
28: ကင်းခြေများ
31: ကင်းခွေး
29: ကင်းခွန်
30: ကင်းခွန်ကင်းခ
32: ကင်းစ
33: ကင်းစား
34: ကင်းစီး
36: ကင်းစောင့်
37: ကင်းစောင့်ထား
35: ကင်းစုန်း
40: ကင်းတဲ
38: ကင်းတပ်
39: ကင်းတပ်ဥပဒေ
41: ကင်းထား
42: ကင်းထိုး
43: ကင်းထိုးမြင်းစီး
44: ကင်းထောက်
45: ကင်းနားသန်
46: ကင်းပုစွန်
47: ကင်းဖလောင်ကောင်
48: ကင်းဘူ
49: ကင်းမလက်မည်း
50: ကင်းမြီးကောက်
51: ကင်းရုံ
53: ကင်းလုလင်
52: ကင်းလိပ်ချော
54: ကင်းဝန်
55: ကင်းဝန်းကင်းပတ်လှည့်
58: ကင်းသား
56: ကင်းသင်း
57: ကင်းသန်း
59: ကင်းအုပ်
88: ကဏ္ဍ
89: ကဏ္ဍဇာ
90: ကဏ္ဏမူ
Can you please check if this order is correct? Thanks.
We can reproduce the same order in MariaDB.
But if this order does not look correct, then the collation definition in CLDR is wrong.
We cannot add a collation without having a correct definition.
Sorry for delay. My son is in hospital about 1 week because Dengue Fever
infected.
Above 100 words sorting order is correct. Because CLDR definition is
generally correct.
But some exception words might be wrong in some case.
I'm still looking for someone more professional about Myanmar Language.
Thanks
2013/10/30 Alexander Barkov (JIRA) <jira@mariadb.atlassian.net>
–
Sithu, I wish your son get well soon.
Please tell me when you are ready to do some more testing.
I made a few enhancements in the collation code, and in
the collation definition.
Thank you, my son is fine now.
I'm ready to test. Do I need to update bzr source code? I will update
source code in my computer tomorrow.
On Tue, Nov 5, 2013 at 3:52 PM, Alexander Barkov (JIRA) <
–
Hi Sithu,
Please build MariaDB from the latest revision from:
https://launchpad.net/maria/10.0
and put the new version of Index.xml which I've just attached.
Then create your tables and test sorting order again.
Thanks.
Hi Sithu, did you have a chance to try the latest version? Thanks.
Internet connection in my country is very slow. I'm trying to update
bzr source code. It never complete. So I can't test till now.
I will report as soon as I've updated my source code and send test
result. I'm trying to figure out to fix internet connection.
Thanks.
sorry for long time waiting. I've already tested new version. I don't see any improvement. MariaDB version is 10.0.7 .
Which parts did you change in Index.xml? ၎င်း (u104e + u1004 + u103a + u1038) is still at the first place in sorting.
Make sure to use the version uploaded on 2013-11-12.
There are two differences:
1. This line makes it use Unicode-5.2.0, which has more Myanmar characters defined for sorting.
<collation name="utf8_myanmar_ci" id="220" shift-after-method="expand" version="5.2.0">
2. This line is not commented anymore:
<reset>\u101C\u100A\u103A\u1038\u1000\u1031\u102C\u1004\u103A\u1038</reset><s>\u104E\u1004\u103A\u1038</s>
Please try this query:
mysql> SELECT HEX,HEX(CONVERT(x USING utf8)) as utf8,HEX(WEIGHT_STRING(CONVERT(t1.x USING utf8) COLLATE utf8_myanmar_ci)) AS weight FROM (SELECT _ucs2 X'101C100A103A103810001031102C1004103A1038' AS x UNION SELECT _ucs2 X'104E1004103A1038') AS t1;
The expected result is:
----------------------------------------------------------------------------------------------------------------------------------------------
HEX![]() |
utf8 | weight |
----------------------------------------------------------------------------------------------------------------------------------------------
101C100A103A103810001031102C1004103A1038 | E1809CE1808AE180BAE180B8E18080E180B1E180ACE18084E180BAE180B8 | 220D22483B1322593ACC21CD22483AEE22593ACC |
104E1004103A1038 | E1818EE18084E180BAE180B8 | 220D22483B1322593ACC21CD22483AEE22593ACC |
----------------------------------------------------------------------------------------------------------------------------------------------
Notice, the "weight" value is the same in the two records.
If you get something else, you're probably still using the old version of Index.xml.
Today, I've tested again. Everything work correctly. Sorting rules are
correct according to Unicode Collation Standard.
Thanks
Sithu
On Mon, Dec 16, 2013 at 6:11 PM, Alexander Barkov (JIRA) <
–
I hope Myanmar collation will be included in mariadb 10.* final release
On Tue, Dec 17, 2013 at 2:00 PM, Sithu Thwin (JIRA) <
–
Hi Sithu.
It was included into 10.0.7.
See the full changelog here:
https://mariadb.com/kb/en/mariadb-1007-changelog/
Thanks for your help to make this happen!
I'm very happy and all Myanmar web developers will be happy
to hear this news.
Can we use MariaDB 10.0.7 on normal WordPress production site? Or do we
need to wait for final release ?
Thank you very much,
Sithu
On Fri, Jan 10, 2014 at 4:17 PM, Alexander Barkov (JIRA) <
–
Maria-10.0 is now in beta, which means it can still have some critical
bugs in the new 10.0 code, as well as in the code merged from MySQL-5.6.
10.0 is targeted to be GA in summer 2014.
If you don't use any other new 10.0 features (other than the Myanmar collation)
then it should be safe to use 10.0 beta.
I want to help. But I don't know how. I've already checkout latest source code from launchpad. Can anyone guide me how to and where to start ?