Add mappings for Ext. E/F/G #12

Open: JLHwung wants to merge 3 commits into master

JLHwung commented Mar 19, 2021

Fixes #5

This PR adds mappings for the characters in the following blocks (as of Unicode 13):

  [0x2b820, 0x2cea1], // CJK Ideographs Extension E
  [0x2ceb0, 0x2ebe0], // CJK Ideographs Extension F
  [0x30000, 0x3134a] // CJK Ideographs Extension G
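
As a rough illustration of how these ranges could be used, the sketch below reports which of the three blocks a character falls in. The helper name `blockOf` is hypothetical and not part of this PR; the bounds are copied from the list above and treated as inclusive.

```javascript
// Block ranges copied from the PR description (inclusive bounds).
const BLOCKS = [
  [0x2b820, 0x2cea1, "CJK Ideographs Extension E"],
  [0x2ceb0, 0x2ebe0, "CJK Ideographs Extension F"],
  [0x30000, 0x3134a, "CJK Ideographs Extension G"],
];

// Hypothetical helper: returns the block name for the first code point
// of `char`, or null if it lies outside the three ranges.
function blockOf(char) {
  const cp = char.codePointAt(0);
  for (const [start, end, name] of BLOCKS) {
    if (cp >= start && cp <= end) return name;
  }
  return null;
}
```

`codePointAt(0)` is used rather than `charCodeAt(0)` because all of these characters lie outside the Basic Multilingual Plane and are stored as surrogate pairs in JavaScript strings.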

The mappings are copied from https://github.com/Jackchows/Cangjie5 with additional fixes in Jackchows/Cangjie5#209. Some mappings are deliberately discarded because they do not fit within the current scope, specifically:

  • mappings starting with z, used for CJK Compatibility Ideographs and CJK Compatibility Ideographs Supplement
  • mappings starting with x (we don't have x mappings for CJK Ext. B)
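
The two exclusions above can be sketched as a simple filter. This is only an illustration, not the script used for this PR: the function names are hypothetical, and it assumes each entry is a `<character><TAB><code>` line, the layout used for the mappings quoted in this thread.

```javascript
// Hypothetical filter: keep an entry unless its code starts with "z" or "x".
// Assumes each entry is a "<character>\t<code>" line.
function keepEntry(line) {
  const code = line.split("\t")[1] || "";
  return !/^[zx]/.test(code);
}

function filterEntries(lines) {
  return lines.filter(keepEntry);
}
```

Only the first letter of the code is tested, so codes that contain z in a later position are kept.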

The first commit fixes ordering issues in the current mappings. It is an editorial fix with no observable behaviour change.
The second and third commits add new mappings ordered by Cangjie code. The new mappings are appended to the current mappings, so the existing character frequency order is not affected.
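
A minimal sketch of the "sort by code, then append" step described above, assuming `<character><TAB><code>` lines. The function name `appendSortedByCode` is hypothetical, and the handling of entries with equal codes (input order is kept) is an assumption, since the PR does not state it.

```javascript
// Sort the new "<character>\t<code>" lines by code, then append them after
// the existing lines so that the existing order is left untouched.
function appendSortedByCode(existing, added) {
  const codeOf = (line) => line.split("\t")[1] || "";
  const sorted = [...added].sort((a, b) => {
    const ca = codeOf(a);
    const cb = codeOf(b);
    if (ca < cb) return -1;
    if (ca > cb) return 1;
    return 0; // equal codes: stable sort keeps input order (assumption)
  });
  return [...existing, ...sorted];
}
```

Appending instead of merging into one sorted list is what leaves the order of the existing entries unchanged.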

While authoring this PR, I wrote two scripts; feel free to reuse them, as Ext. H will hopefully be targeted for 2022. (link of scripts)

Current known issues:

Use of the rotation operator z in specific characters:

𮗙	buhuz
𰒥	izi
𫸪	nnz
𰨇	ozmmf
𰲞	yniz
𬢆	yzbuu

The author of https://github.com/Jackchows/Cangjie5 deliberately used Z (defined in Cangjie 6 as a rotation operator, see Section 14 for the rationale) to encode these 6 characters. However, this is inconsistent with what we already have for such characters in Ext. B:

𠄏	ilv
𠄔	ilvv
𣀨	iiye
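
To find entries affected by this issue, one could scan for codes that contain z after the first position. This is a sketch under assumptions: `usesRotationZ` is a hypothetical name, entries are assumed to be `<character><TAB><code>` lines, and a leading z is treated as a different kind of mapping rather than a rotation operator.

```javascript
// Hypothetical check: true when the code contains "z" anywhere after its
// first letter. A leading "z" is deliberately not counted.
function usesRotationZ(line) {
  const code = line.split("\t")[1] || "";
  return code.indexOf("z", 1) !== -1;
}
```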

There are three ways to address this inconsistency:

  1. Reach consensus on using z for specific new characters and add new mappings for the existing Ext. B characters:
𠄏	nnz
𠄔	ninz
𣀨	izye

The old mappings for 𠄏𠄔𣀨 will be preserved as compatibility mappings. The new mappings for 𮗙𰒥𫸪𰨇𰲞𬢆 are kept as they are. Both @LEOYoon-Tsaw and I are OK with using z for 𮗙𰒥𫸪𰨇𰲞𬢆, but I am open to different opinions from the community.

  2. Stay with the Cangjie5 code scheme and come up with our own mappings for 𮗙𰒥𫸪𰨇𰲞𬢆. I can revise this PR with the new mappings.

  3. Remove 𮗙𰒥𫸪𰨇𰲞𬢆 from the mappings and postpone them until we have consensus on how to encode them.

My preference on these 3 solutions is 1 > 2 > 3.


Un1Gfn commented Dec 24, 2021

https://github.com/rime/rime-cangjie/blob/8dfad9e537f18821b71ba28773315d9c670ae245/cangjie5.dict.yaml?raw=1

line 16
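# For single characters with an enclosing structure, the code of the enclosed part comes after the ' symbol, from which the tail code can be obtained.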

Is this for the dict feature?
How many chars introduced in this PR are actually involved in any dict that we provide?
it'd better be zero ;P

@JLHwung JLHwung marked this pull request as ready for review December 24, 2021 15:45

JLHwung commented Dec 24, 2021

How many chars introduced in this PR are actually involved in any dict that we provide?

I am not familiar with the dict feature. Can you point me to some references?

Disclaimer: I use the mapping a lot and mostly query only Ext. A - G characters. I would say the mapping is quite good in general. Supporting new Ext. blocks is hard, and I think it is fine to just merge the PR and move forward. We can always iterate when we find errors or if we can do more for the dict feature.

Successfully merging this pull request may close these issues.

Encode the CJK Extension E characters