Ubuntu Tweak

Improve "Auto detect text encoding for Chinese in Gedit"

Bug #1074572 reported by Ma Hsiao-chun on 2012-11-03

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Ubuntu Tweak	Fix Released	Medium	Ding Zhou	Ubuntu Tweak 0.8.3

Bug Description

I'm using Ubuntu Tweak 0.8.1 on Ubuntu 12.04

Having checked the related code for "Auto detect text encoding for Chinese in Gedit", I think there are several issues.
http://bazaar.launchpad.net/~vcs-imports/ubuntu-tweak/master/view/head:/ubuntutweak/tweaks/workarounds.py

1. You may need to change "Chinese" to "Chinese Simplified" to make clear where this feature applies. The GB family of encodings (including GB18030) is mainly used in Mainland China, while Hong Kong and Taiwan use the Big5 family of encodings. (Big5-HKSCS may be the superset of the Big5 family.)

2. I wonder whether putting GB18030 before UTF-8 is good practice.
http://manpages.ubuntu.com/manpages/precise/en/man7/utf-8.7.html
ftp://ftp.software.ibm.com/software/globalization/documents/gb18030m.pdf
As the documents above show, a GB18030 file can hardly pass as UTF-8, since UTF-8 imposes a strict scheme on the bytes that make up a single character. On the other hand, a UTF-8 file has a good chance of passing as GB18030, since GB18030's two-byte scheme covers quite a large range.
Therefore, I believe that placing UTF-8 first would be better; detection might then fail only on very short file content.
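
The ordering argument can be illustrated with a minimal sketch. This is not the code used by Ubuntu Tweak or Gedit; the function names (`looks_like_utf8`, `looks_like_gb18030`, `detect`) are hypothetical, and the GB18030 check is an assumption-laden simplification that only tests the 1/2/4-byte layout, not whether each code is assigned.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def looks_like_gb18030(data: bytes) -> bool:
    """Structural check only: do the bytes follow GB18030's 1/2/4-byte layout?"""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                                  # single byte (ASCII)
            i += 1
        elif 0x81 <= b <= 0xFE and i + 1 < n:
            b2 = data[i + 1]
            if 0x40 <= b2 <= 0xFE and b2 != 0x7F:      # two-byte form
                i += 2
            elif (0x30 <= b2 <= 0x39 and i + 3 < n
                  and 0x81 <= data[i + 2] <= 0xFE
                  and 0x30 <= data[i + 3] <= 0x39):    # four-byte form
                i += 4
            else:
                return False
        else:
            return False
    return True


# Hypothetical registry of checks, keyed by encoding name.
CHECKS = {"UTF-8": looks_like_utf8, "GB18030": looks_like_gb18030}


def detect(data: bytes, order):
    """Return the first encoding in `order` whose check accepts the bytes."""
    for name in order:
        if CHECKS[name](data):
            return name
    return None


utf8_sample = "中文".encode("utf-8")   # b'\xe4\xb8\xad\xe6\x96\x87'
gb_sample = b"\xd6\xd0\xce\xc4"        # "中文" in GB2312/GBK/GB18030

# GB18030 first: the UTF-8 bytes pair up as valid two-byte codes.
print(detect(utf8_sample, ("GB18030", "UTF-8")))   # GB18030 (misdetected)
# UTF-8 first: both samples are identified correctly.
print(detect(utf8_sample, ("UTF-8", "GB18030")))   # UTF-8
print(detect(gb_sample, ("UTF-8", "GB18030")))     # GB18030
```

With GB18030 listed first, the six UTF-8 bytes of the sample split into three structurally valid two-byte codes, so the stricter UTF-8 check is never reached.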

3. Do you plan to support other encodings? It may seem like a very big project. However, the number of commonly used encodings is much smaller than the number of languages. A good reference would be Google Chrome's encoding menu, which is not that long.
http://src.chromium.org/viewvc/chrome/trunk/src/chrome/browser/character_encoding.cc?view=markup
(Firefox has a longer list, but I find that this is just because Firefox also includes some rarely used encodings, for example HZ for Chinese Simplified and the IBM-8XX code pages for several language groups.)

Ding Zhou (tualatrix) on 2012-11-04

Changed in ubuntu-tweak:
milestone:	none → 0.8.3
assignee:	nobody → Ding Zhou (tualatrix)
importance:	Undecided → Medium
status:	New → Confirmed

Ding Zhou (tualatrix) wrote on 2012-11-04:

Thanks for your report.

1. Yes, you're right, it should be "Chinese Simplified". I will fix it in a future release.

2. This sounds right too. But I think I tried putting UTF-8 before GB18030 and then opened some GB*-encoded files, and it couldn't detect the file encoding correctly. Anyway, I will try again.

3. I'm not going to support other encodings; as you say, it's a very big project. I don't even know whether anyone is using this feature.

Ma Hsiao-chun (mahsiaochun) wrote on 2012-11-04:

For 2, it would be nice if you could share some sample files for study purposes.

Ding Zhou (tualatrix) on 2012-12-13

Changed in ubuntu-tweak:
status:	Confirmed → Fix Committed

Ding Zhou (tualatrix) on 2012-12-21

Changed in ubuntu-tweak:
status:	Fix Committed → Fix Released
