Improve "Auto detect text encoding for Chinese in Gedit"

Bug #1074572 reported by Ma Hsiao-chun
This bug affects 1 person
Affects: Ubuntu Tweak
Status: Fix Released
Importance: Medium
Assigned to: Ding Zhou

Bug Description

I'm using Ubuntu Tweak 0.8.1 on Ubuntu 12.04.

After checking the code behind "Auto detect text encoding for Chinese in Gedit", I think there are several issues.
http://bazaar.launchpad.net/~vcs-imports/ubuntu-tweak/master/view/head:/ubuntutweak/tweaks/workarounds.py

1. You may want to change "Chinese" to "Chinese Simplified" to clearly indicate what this feature applies to. The GB family of encodings (including GB18030) is mainly used in Mainland China, while Hong Kong and Taiwan use the Big5 family of encodings. (Big5-HKSCS may be the superset of the Big5 family.)

2. I wonder whether putting GB18030 before UTF-8 is good practice.
http://manpages.ubuntu.com/manpages/precise/en/man7/utf-8.7.html
ftp://ftp.software.ibm.com/software/globalization/documents/gb18030m.pdf
As you can see from the above documents, a GB18030 file can hardly pass as valid UTF-8, since UTF-8 imposes a strict scheme on the bytes that make up a single character. On the other hand, a UTF-8 file has a good chance of passing as valid GB18030, since GB18030's two-byte scheme covers quite a large range.
Therefore, I believe that placing UTF-8 first would be better, and the detection may then only fail on very short file content.
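
To illustrate point 2, here is a minimal Python sketch. It assumes the detection simply tries each candidate encoding in order and accepts the first one that decodes without error; that is my assumption about the mechanism, not something taken from the Gedit or Ubuntu Tweak code. The helper name first_decodable and the sample strings are made up for this example.

```python
def first_decodable(data, candidates):
    """Return the first encoding in `candidates` that decodes `data`
    without error, or None if none of them does."""
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample contents, one per encoding family.
gb_bytes = "中文".encode("gb18030")   # b'\xd6\xd0\xce\xc4'
utf8_bytes = "你好".encode("utf-8")   # b'\xe4\xbd\xa0\xe5\xa5\xbd'

for order in (["UTF-8", "GB18030"], ["GB18030", "UTF-8"]):
    print(order,
          "GB file ->", first_decodable(gb_bytes, order),
          "| UTF-8 file ->", first_decodable(utf8_bytes, order))
```

With UTF-8 first, both samples are attributed to their real encoding. With GB18030 first, the UTF-8 sample is also accepted as GB18030, because its six bytes happen to form three valid two-byte GB18030 sequences. Real files can of course behave differently from these two short samples.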

3. Are you planning to support other encodings? It may seem like a very big project. However, the number of commonly used encodings is much smaller than the number of languages. A good reference would be Google Chrome's encoding menu, which is not that long.
http://src.chromium.org/viewvc/chrome/trunk/src/chrome/browser/character_encoding.cc?view=markup
(Firefox has a longer list, but I find that this is just because Firefox also includes some rarely used encodings, for example HZ for Chinese Simplified and the IBM-8XX code pages for several language groups.)

Ding Zhou (tualatrix)
Changed in ubuntu-tweak:
milestone: none → 0.8.3
assignee: nobody → Ding Zhou (tualatrix)
importance: Undecided → Medium
status: New → Confirmed
Ding Zhou (tualatrix) wrote :

Thanks for your report.

1. Yes, you're right, it should be "Chinese Simplified". I will fix it in a future release.

2. This sounds right too. But I think I tried putting UTF-8 before GB18030 and then opened some GB* encoded files, and it couldn't detect the file encoding correctly. Anyway, I will try again.

3. I'm not going to support other encodings; just like you say, it's a very big project. I don't even know if there are people using this feature.

Ma Hsiao-chun (mahsiaochun) wrote :

For 2, it would be nice if you could share some sample files for study purposes.

Ding Zhou (tualatrix)
Changed in ubuntu-tweak:
status: Confirmed → Fix Committed
Ding Zhou (tualatrix)
Changed in ubuntu-tweak:
status: Fix Committed → Fix Released