diff headers should contain non-ascii filenames in user_encoding, not in utf-8

Bug #382699 reported by Alexander Belchenko
This bug affects 2 people
Affects: Bazaar
Status: Fix Released
Importance: Medium
Assigned to: Alexander Belchenko

Bug Description

Currently bzr can produce a diff as the result of five different operations:

bzr diff
bzr commit --show-diff
bzr log -p
bzr merge --preview
bzr send

In most of these commands a non-ASCII filename is always shown as a UTF-8 string. This is bad for Windows users, because their locale is never UTF-8 (at least by default), and it is not always possible to switch the console locale to UTF-8 (`chcp 65001` doesn't work on some Windows versions).
Furthermore, bzr itself does not understand the cp65001 codepage because Python does not recognize it as UTF-8; see http://bugs.python.org/issue6058.
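
A minimal sketch of one possible workaround for the cp65001 problem, assuming the codepage name is mapped to `utf-8` before it reaches Python's codec lookup. The helper names `normalize_codepage` and `safe_lookup` are hypothetical and are not taken from bzr.

```python
import codecs


def normalize_codepage(name):
    """Map the Windows codepage name 'cp65001' to 'utf-8'.

    Any other name is returned unchanged. (Hypothetical helper.)
    """
    if name.lower().replace('-', '').replace('_', '') == 'cp65001':
        return 'utf-8'
    return name


def safe_lookup(name, fallback='ascii'):
    """Return the canonical codec name, or `fallback` if Python has no such codec."""
    try:
        return codecs.lookup(normalize_codepage(name)).name
    except LookupError:
        return fallback
```

With this, `safe_lookup('cp65001')` resolves to the UTF-8 codec even on a Python that does not know the name `cp65001`.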

I think all commands except `send` should always print filenames in user_encoding, or AT LEAST show them in user_encoding in the === line (as bzr ci --show-diff currently does), because all these commands are intended to produce output for humans.

Also, I should note that GNU diff (from http://gnuwin32.sf.net) always prints filenames in user_encoding.

See attached files with output of various commands.

It would be nice to fix this before 2.0. Some guidance is needed from the core devs, especially about writing tests for these changes.

Alexander Belchenko (bialix) wrote :
Changed in bzr:
status: New → Confirmed
Casufi (vladimirkotulskiy) wrote :

D:\Develop>bzr init test
Created a standalone tree (format: pack-0.92)

D:\Develop>cd test
D:\Develop\test>

D:\Develop\test>echo one>Файл.txt

D:\Develop\test>bzr status
unknown:
  Файл.txt

D:\Develop\test>bzr add
adding "Файл.txt"

D:\Develop\test>

D:\Develop\test>bzr diff
=== added file 'Файл.txt'
--- Файл.txt 1970-01-01 00:00:00 +0000
+++ Файл.txt 2009-06-02 11:36:44 +0000
@@ -0,0 +1,1 @@
+one

D:\Develop\test>

D:\Develop\test>bzr diff
=== added file '╨д╨░╨╣╨╗.txt'
--- ╨д╨░╨╣╨╗.txt 1970-01-01 00:00:00 +0000
+++ ╨д╨░╨╣╨╗.txt 2009-06-02 11:36:44 +0000
@@ -0,0 +1,1 @@
+one
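
The garbled second output is what appears when the UTF-8 bytes of the filename are displayed by a console that decodes them with a different codepage. A small sketch reproducing it, assuming the console codepage is cp866 (the transcript does not say which one is in use); the function name `as_seen_on_console` is made up for illustration.

```python
def as_seen_on_console(text, output_encoding, console_encoding):
    """Encode `text` the way the program writes it, then decode it
    the way the console displays it."""
    return text.encode(output_encoding).decode(console_encoding, 'replace')


name = 'Файл.txt'

# Program writes UTF-8, console assumes cp866: mojibake.
garbled = as_seen_on_console(name, 'utf-8', 'cp866')

# Program writes in the console's own encoding: readable.
readable = as_seen_on_console(name, 'cp866', 'cp866')
```

Here `garbled` is `'╨д╨░╨╣╨╗.txt'`, matching the second `bzr diff` output above, and `readable` is the original name.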

Alexander Belchenko (bialix) wrote :

Another use case: problem with bzr diff + bzr patch roundtrip. See attached.

methane (songofacandy) wrote :

I think filenames should be encoded in UTF-8 in a patch file and in user_encoding on the console.
(When versioned properties are implemented, I think bzr should encode file contents to user_encoding for console output too.)

description: updated
description: updated
Changed in bzr:
importance: Undecided → Medium
status: Confirmed → In Progress
assignee: nobody → Alexander Belchenko (bialix)
tags: added: diff unicode
Alexander Belchenko (bialix) wrote :

My branch fixes the `diff` and `commit --show-diff` commands.

Alexander Belchenko (bialix) wrote :

For `merge --preview` I think we should do the same as for `diff`, i.e. use mbcs encoding on Windows.

For `log -p` I'm inclined to use terminal_encoding instead of mbcs, because there is other unicode output encoded in terminal_encoding.

The hardest thing is `send`. I think the default there should be `utf-8`, because its output is supposed to be bzr (machine) readable.

For `diff`, `log -p`, `merge --preview` and `send` I would like to introduce a new command-line option, --path-encoding, to control the encoding of filenames in diff headers.

It seems my patch won't be accepted into 2.1, so introducing a new option is fine for bzr.dev, IMO.
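
To illustrate what a path-encoding setting would control, here is a rough sketch of building the `---`/`+++` header lines as bytes with a configurable encoding for the paths only. The function `format_diff_header` and its signature are assumptions for illustration, not bzr's implementation; using the `replace` error handler for unencodable characters is also an assumption.

```python
def format_diff_header(old_path, new_path, old_date, new_date,
                       path_encoding='utf-8'):
    """Return the two unified-diff header lines as bytes.

    Only the paths are affected by `path_encoding`; characters that
    cannot be represented are replaced with '?'.
    """
    old = old_path.encode(path_encoding, 'replace')
    new = new_path.encode(path_encoding, 'replace')
    return (b'--- ' + old + b'\t' + old_date.encode('ascii') + b'\n' +
            b'+++ ' + new + b'\t' + new_date.encode('ascii') + b'\n')


header_cp866 = format_diff_header(
    'Файл.txt', 'Файл.txt',
    '1970-01-01 00:00:00 +0000', '2009-06-02 11:36:44 +0000',
    path_encoding='cp866')
```

Decoding `header_cp866` as cp866 gives back the readable filename, while the default `utf-8` keeps the output machine readable, which is the behaviour suggested above for `send`.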

Alexander Belchenko (bialix) wrote :

One more place with diff headers: bzr shelve. For this one I'd like to use terminal_encoding.

tags: added: win32
Changed in bzr:
status: In Progress → Fix Committed
milestone: none → 2.2b4
Alexander Belchenko (bialix) wrote :

The final version of my patch (hopefully ready for review and landing) uses terminal encoding everywhere, instead of mbcs or user_encoding. Thus it's optimized for the "show me the diff on a terminal" use case.
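
Why the choice between mbcs/user_encoding and terminal encoding matters: on Windows the two can be different codepages. The sketch below assumes a Russian Windows setup where the ANSI codepage (mbcs) is cp1251 and the console (OEM) codepage is cp866; those values and the helper `current_encodings` are assumptions for illustration, and the helper only approximates how bzr determines its encodings.

```python
import locale
import sys


def current_encodings(stream=None):
    """Return (user_encoding, terminal_encoding) as seen by plain Python.

    Falls back to the user encoding when the stream reports no encoding.
    """
    stream = sys.stdout if stream is None else stream
    user_encoding = locale.getpreferredencoding(False)
    terminal_encoding = getattr(stream, 'encoding', None) or user_encoding
    return user_encoding, terminal_encoding


name = 'Файл.txt'
ansi_bytes = name.encode('cp1251')  # what an mbcs-encoded header would contain
oem_bytes = name.encode('cp866')    # what the console expects to receive

# The console would show the ANSI bytes as a different, wrong string.
shown_on_console = ansi_bytes.decode('cp866')
```

Because `ansi_bytes` and `oem_bytes` differ, a header encoded with mbcs is still unreadable in the console, which is the case the terminal-encoding approach covers.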

Alexander Belchenko (bialix) wrote :

My branch merged to bzr.dev.

Changed in bzr:
status: Fix Committed → Fix Released
Martin Pool (mbp) wrote : Re: [Bug 382699] Re: diff headers should contain non-ascii filenames in user_encoding, not in utf-8

On 1 June 2010 14:58, Alexander Belchenko <email address hidden> wrote:
> The final version of my patch (ready to review and land hopefully) using
> terminal encoding everywhere, instead of mbcs or user_encoding. Thus
> it's optimized to "show me diff on terminal" use case.

Thanks. I think later we should introduce a concept of "the right
encoding for stdout" or "for this uifactory" and then that can detect
whether it's on a terminal or not.

--
Martin
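
One way the "right encoding for stdout" idea could look, as a sketch only: pick the terminal encoding when the output stream is a terminal, and UTF-8 when it is redirected to a file or pipe. The function `choose_path_encoding` and the `_FakeTerminal` class are hypothetical and exist just for this illustration.

```python
import io


def choose_path_encoding(stream, terminal_encoding, file_encoding='utf-8'):
    """Return the encoding to use for filenames written to `stream`."""
    isatty = getattr(stream, 'isatty', None)
    if isatty is not None and isatty():
        return terminal_encoding  # a human is reading this on a console
    return file_encoding          # redirected output, e.g. a patch file


class _FakeTerminal(io.StringIO):
    """Stand-in for a console stream; reports itself as a tty."""

    def isatty(self):
        return True


for_console = choose_path_encoding(_FakeTerminal(), 'cp866')
for_file = choose_path_encoding(io.StringIO(), 'cp866')
```

Here `for_console` is `'cp866'` and `for_file` is `'utf-8'`.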
