Better Encoding and Newline Support In The Diff Algorithms

[NOTE: This is work-in-progress.]

Introduction
============

Currently, the diff handling routines in libsvn_diff know nothing
about character encodings or eol markers.  They assume an ASCII-based
encoding and LF as the line separator.  This leads to several
problems:

* Diff output will be inconsistently encoded.
* Files with different line endings cause unexpected results (e.g. CR
  line endings).
* Diff output gets inconsistent line endings.
* Non-ASCII-based encodings, such as UTF-16, aren't supported at all
  by Subversion out-of-the-box.

Solving this situation seems to be a lot of work.  The motivation for
starting this was issue #1533 'diff output doesn't use correct
encoding'.  That issue has been resolved by making the diff code
assume the locale encoding for file contents rather than UTF-8, but
the problems discussed in this file are still present.

Header Encoding
===============

Currently, the headers are written using the locale encoding, which
is not always what's wanted.  If the encoding of the files is known
(via svn:mime-type, for example), the headers should probably be
written using that encoding.

Note that this applies to property change information and property
values in the svn: namespace as well.  For other properties, we can't
do anything but treat them as opaque.

Newlines
========

According to the GNU diff documentation, on systems with newline
separators other than just LF, the newlines are normalized to the
system markers, except when --binary is used.

Currently, our diff library understands nothing but LF as newline.
Making it accept CRLF and CR as well is not hard.
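
As a rough illustration (in Python rather than C, and not the actual
libsvn_diff tokenizer), splitting byte-oriented content on all three
markers could look like the sketch below.  The name tokenize_lines is
hypothetical.

```python
import re

# "\r\n" is listed first so that CRLF is taken as one marker rather
# than as a CR followed by an LF.
_EOL = re.compile(rb"\r\n|\r|\n")


def tokenize_lines(data):
    """Split bytes into lines, keeping each line's eol marker."""
    lines = []
    start = 0
    for match in _EOL.finditer(data):
        lines.append(data[start:match.end()])
        start = match.end()
    if start < len(data):
        # Last line has no trailing eol marker.
        lines.append(data[start:])
    return lines
```

Keeping the marker attached to each token means two lines that differ
only in their eol marker still compare as different, which matches the
"output newlines as-is" behaviour discussed in this section.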

Since we know the newline marker used in the file via the
svn:eol-style property, we can handle this quite well.  If
svn:eol-style is not set, I suggest we output newlines as-is, and use
APR_EOL_STR to output newlines in headers.  That's consistent with how
GNU diff behaves with the --binary option.

When svn:eol-style is set, we should use that style for the headers.
The values might be different for the original and the new file; it
seems logical to use the value from the modified file.  Note that in
this case, newlines will be inconsistent anyway.

Also, libsvn_client should make sure the files are translated into
their newline style before comparing them (this is necessary since
working files don't have their newlines normalized if svn:eol-style is
changed in the working revision).  In the usual case, when
svn:eol-style is not changed, this will give consistent newlines for
the whole diff.  If svn:eol-style is changed, the diff will show every
line in the file as changed, since every eol marker differs.  This is
what happens currently if you do a repos_to_repos diff with
svn:eol-style changed.

If svn:eol-style is set to native, then APR_EOL_STR should be used, as
usual.

This requires that the svn_client_diff* functions read the
svn:eol-style property of the modified file and pass that information
to svn_diff_file_output_unified.  svn_diff_file_output_unified needs
an eolstr argument, giving the newline marker to use for headers.
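
The selection rule described in this section can be sketched as
follows.  This is an illustration in Python, not the proposed C API:
header_eol and unified_headers are hypothetical names, os.linesep
stands in for APR_EOL_STR, and the header layout is only an
approximation of what svn diff prints.

```python
import os

# Explicit svn:eol-style values map to fixed markers; "native" maps to
# the platform marker (APR_EOL_STR in the real library).
_EOL_STYLES = {"LF": "\n", "CRLF": "\r\n", "CR": "\r", "native": os.linesep}


def header_eol(eol_style):
    """Pick the eol marker for diff headers.

    eol_style is the svn:eol-style value of the modified file, or None
    if the property is not set.
    """
    if eol_style is None:
        return os.linesep
    return _EOL_STYLES[eol_style]


def unified_headers(path, old_label, new_label, eol):
    """Format the '---' and '+++' header lines using the given marker."""
    return "--- %s\t%s%s+++ %s\t%s%s" % (
        path, old_label, eol, path, new_label, eol)
```

A caller in the role of svn_client_diff would read the property of the
modified file once and pass the result down, for example
unified_headers("a.txt", "(revision 1)", "(working copy)",
header_eol("CRLF")).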

Content Encoding
================

To support encodings that aren't ASCII-based (ASCII-based meaning that
byte values 0-127 always mean the same as in ASCII), Subversion needs
to know the encodings of the files being diffed.  We don't currently
have a canonical way of detecting the encoding.  It has been suggested
to use the charset parameter of svn:mime-type for this purpose.
Whatever method we choose, we need to cope with the fact that not all
files have this information available.  In this case, we might assume
the locale/console encoding.
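
If the charset parameter of svn:mime-type were used, the lookup with
its fallback might be sketched like this.  The function name
encoding_for and its fallback argument are hypothetical, and the
parsing is deliberately simple rather than a full MIME parameter
parser.

```python
import locale


def encoding_for(mime_type, fallback=None):
    """Return the charset named in a svn:mime-type value.

    mime_type is the property value (or None if unset).  When no
    charset parameter is present, return fallback, or the locale
    encoding if no fallback is given.
    """
    if mime_type:
        # Parameters follow the media type, separated by ';'.
        for param in mime_type.split(";")[1:]:
            name, _, value = param.partition("=")
            value = value.strip().strip('"')
            if name.strip().lower() == "charset" and value:
                return value
    return fallback or locale.getpreferredencoding(False)
```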

When the encodings of the files are known, the diff tokenizer should
use them to decide which newline separators it expects.  A simple
solution is to just recode "\n", "\r\n" and "\r" into the file
encodings and search for those byte sequences.  Beware that to support
UTF-16 and other forms of Unicode, we need to support null bytes in
these strings.
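
A minimal sketch of that idea, in Python rather than C, is shown
below.  The names are hypothetical.  It assumes the encoding name is
a BOM-less one such as "utf-16-le" (Python's plain "utf-16" codec
prepends a BOM on every encode), and it adds one detail the text above
doesn't mention: for fixed-width encodings the search has to stay
aligned to code units, because the byte pattern of an encoded newline
can also appear straddling two other characters.

```python
def encoded_eols(encoding):
    """Return the eol markers recoded into encoding, CRLF first."""
    return [eol.encode(encoding) for eol in ("\r\n", "\n", "\r")]


def split_lines(data, encoding):
    """Split encoded bytes into lines, keeping the eol markers."""
    markers = encoded_eols(encoding)
    # Width of one code unit: 1 for ASCII-based encodings and UTF-8,
    # 2 for UTF-16, 4 for UTF-32.
    unit = len("\n".encode(encoding))
    lines = []
    start = pos = 0
    while pos < len(data):
        for marker in markers:
            if data.startswith(marker, pos):
                pos += len(marker)
                lines.append(data[start:pos])
                start = pos
                break
        else:
            # No marker here; advance by one whole code unit so that
            # we never test a position in the middle of a character.
            pos += unit
    if start < len(data):
        lines.append(data[start:])
    return lines
```

Note that in UTF-16LE the encoded "\n" is the two bytes 0A 00, so the
marker strings do contain null bytes and cannot be handled as C
strings.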

NOTE: Supporting non-byte-oriented encodings such as UTF-16 will
require work in other parts of the client libraries as well.  I'm
discussing it here so that we don't design a solution that rules out
supporting them in the future.

To support this, svn_diff_file_diff will need arguments for the
encodings of the original and modified files.

Merge
=====

Merging (i.e. diff3) can be handled in similar ways to diff.  The
eol-style of the .mine file should be used for the conflict markers
and the files should be translated to their newline styles if needed.

The encoding part is a bit trickier.  If the encoding of all three
files is the same, then conflict markers should use that encoding as
well.
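
For the case where all three files share one encoding, writing a
conflict region could be sketched as below.  This is only an
illustration: conflict_region is a hypothetical name, the marker
labels are placeholders, and the encoding is assumed to be a BOM-less
name such as "utf-16-le" so that encoding each marker doesn't emit a
BOM.

```python
def conflict_region(mine, theirs, eol, encoding):
    """Wrap two conflicting chunks in conflict markers.

    mine and theirs are already-encoded bytes.  eol is the newline
    marker of the .mine file (as text); encoding is the encoding
    shared by all three files.
    """
    def marker(text):
        # Markers get the same encoding and eol style as the content.
        return (text + eol).encode(encoding)

    return (marker("<<<<<<< .mine") + mine +
            marker("=======") + theirs +
            marker(">>>>>>> .theirs"))
```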

NOTE: For UTF-16 and UTF-32, the BOM might be problematic.  We need to
be careful not to add extra BOMs inside the file.  One idea is to
strip the BOMs before merging and ensure that the resulting file has a
BOM after the merge.  I'm not sure how much encoding-specific code we
want to add to our diff library.  Maybe UTF-16 would be considered
common enough to deserve special handling rather than being treated
as "just another encoding".  For UTF-8, we may need to handle the BOM
as well, since that's allowed.  We need to be careful not to add BOMs
that aren't in the files, since that will break applications (and we
don't want to silently change the contents of users' files!)
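
The strip-and-restore idea could be sketched as follows, in Python for
brevity.  The names split_bom and merge_preserving_bom are
hypothetical, and the merge argument stands in for whatever diff3
routine is used.  The choice to restore a BOM only when the .mine file
had one is my assumption; it is one way to avoid adding a BOM the user
never had.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).  A UTF-16LE file whose first
# character is U+0000 is therefore ambiguous; knowing the encoding in
# advance avoids the guess.
_BOMS = (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE, codecs.BOM_UTF8,
         codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def split_bom(data):
    """Return (bom, rest); bom is b"" when data has no leading BOM."""
    for bom in _BOMS:
        if data.startswith(bom):
            return bom, data[len(bom):]
    return b"", data


def merge_preserving_bom(mine, older, yours, merge):
    """Merge three files without duplicating or inventing BOMs.

    merge is a callable taking the three BOM-less byte strings and
    returning the merged bytes.
    """
    bom, mine_body = split_bom(mine)
    _, older_body = split_bom(older)
    _, yours_body = split_bom(yours)
    # Re-attach a BOM only if the .mine file started with one.
    return bom + merge(mine_body, older_body, yours_body)
```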

If the encodings are different for the three files, merging could
easily lead to an inconsistent mess, unless the encodings share some
subset (like when changing from US-ASCII to UTF-8).  I think we should
leave those rare cases to the user, who can recode and merge by hand
or use some other tool.