Skip to content
This repository has been archived by the owner on Aug 5, 2024. It is now read-only.

JS: Handle surrogate pairs correctly #69

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

judofyr
Copy link

@judofyr judofyr commented Jun 20, 2019

The following PR makes the JS algorithm aware of surrogate pairs and will avoid splitting a pair into two different Diff objects.

I think there might be more places which needs to be surrogate-pair-aware, but I haven't been able to find test cases to confirm this.

Fixes #59, #68, #10

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@josephrocca
Copy link

josephrocca commented Aug 16, 2019

Hey @judofyr, thanks for your work on this! I was playing around with it and found a weird case that still throws a URI error:

let oldText = `h
  🅰
\t🅱\n`;

let newText = `hello
  🅰
\t🅱\n`;

dmp.patch_toText(dmp.patch_make(oldText, newText));

https://jsfiddle.net/g7jsa4me/

I haven't dug into what's causing this yet.

@judofyr
Copy link
Author

judofyr commented Aug 16, 2019

Oooo, very nice! Thanks for the reduced test case. I haven't had time to look more into this issue, and will probably not have time in the next few weeks. I'll report here if I end up picking it up again.

@josephrocca
Copy link

👍 eagerly waiting!

@KpjComp
Copy link

KpjComp commented Aug 28, 2019

Hi, this bug is a tad worrying.

So I thought I'd see if I could see were the problem is.

For @josephrocca issue the following function is the cause ->
Function diff_match_patch.prototype.patch_addContext_ = function(patch, text) {

// Add the suffix.
var suffix = text.substring(patch.start2 + patch.length1,
      patch.start2 + patch.length1 + padding);

Above the suffix is getting split on a surrogate.
Using @judofyr fix and applying here seems to fix it..

eg.

// Add the suffix.
var pointermid = patch.start2 + patch.length1 + padding;
  if (insideSurrogate(text, pointermid)) {
    pointermid--;
}
var suffix = text.substring(patch.start2 + patch.length1,  pointermid);

Not sure, but I would also expect it might be possible for the prefix to also suffer with this issue, so applying the above code for prefix might make sense.

But doing the above seems to makes @josephrocca code work.

One slight concern is that doing pointermid-- could effect padding (Patch_Margin), but not sure if this is a big issue, the default Patch_Margin is 4, so still valid. But if this is an issue I suppose you could use pointermid++..

Hope this helps..

@josephrocca
Copy link

@KpjComp Can you put your fix into a pull request so I can run some tests on it? I would be great if this issue were finally fixed! Might be able to attract the attention of a maintainer and actually get this merged. I'm very ignorant when it comes to Unicode stuff, so at this point I can't verify any of your code by reading it, but I can put it through some significant testing, and hopefully @judofyr will be able to have a quick read over it at some point to provide feedback.

@KpjComp
Copy link

KpjComp commented Aug 28, 2019

@josephrocca , I was kind of hoping @judofyr might be able to apply the patch when he see's this message. Looks like he just needs to reply with I signed it! or something.

What you could do, is do what I have done as a temp measure,.. Just open the index.js file in the diff-match-patch in your node_modules and edit this file directly and apply mine and @judofyr fix. I could maybe attach my modified version here if you want.

Also, just in case there are any more issues with unicode surrogate pairs and patching what I have done in my application is when I create a diff, I also then apply the diff to the original and verify it it's the same, so if there any any more problems I will know about it.. :)

@josephrocca
Copy link

josephrocca commented Aug 28, 2019

I've replaced this:

  // Add the suffix.
  var suffix = text.substring(patch.start2 + patch.length1,
                              patch.start2 + patch.length1 + padding);

with this:

  // Add the suffix.
  var pointermid = patch.start2 + patch.length1 + padding;
  if (insideSurrogate(text, pointermid)) {
    pointermid--;
  }
  var suffix = text.substring(patch.start2 + patch.length1,  pointermid);

And it did fix the original error I reported! But unfortunately when I ran this test on it:

// random emoji + text copied from wikipedia:
const originalText = `U+1F17x	🅰️	🅱️		🅾️	🅿️ safhawifhkw
U+1F18x															🆎	
0	1	2	3	4	5	6	7	8	9	A	B	C	D	E	F
U+1F19x		🆑	🆒	🆓	🆔	🆕	🆖	🆗	🆘	🆙	🆚					
U+1F20x		🈁	🈂️							sfss.,_||saavvvbbds						
U+1F21x	🈚					
U+1F22x			🈯
U+1F23x			🈲	🈳	🈴	🈵	🈶	🈷️	🈸	🈹	🈺					
U+1F25x	🉐	🉑		
U+1F30x	🌀	🌁	🌂	🌃	🌄	🌅	🌆	🌇	🌈	🌉	🌊	🌋	🌌	🌍	🌎	🌏
U+1F31x	🌐	🌑	🌒	🌓	🌔	🌕	🌖	🌗	🌘	🌙	🌚	🌛	🌜	🌝	🌞	`;

// applies some random edits to string and returns new, edited string
function applyRandomTextEdit(text) {
  let textArr = [...text];
  let r = Math.random();
  if(r < 1/3) { // swap
	let swapCount = Math.floor(Math.random()*5);
    for(let i = 0; i < swapCount; i++) {
	  let swapPos1 = Math.floor(Math.random()*textArr.length);
      let swapPos2 = Math.floor(Math.random()*textArr.length);
      let char1 = textArr[swapPos1];
      let char2 = textArr[swapPos2];
      textArr[swapPos1] = char2;
      textArr[swapPos2] = char1;
    }
  } else if(r < 2/3) { // remove
    let removeCount = Math.floor(Math.random()*5);
    for(let i = 0; i < removeCount; i++) {
      let removePos = Math.floor(Math.random()*textArr.length);
      textArr[removePos] = "";
    }
  } else { // add
    let addCount = Math.floor(Math.random()*5);
    for(let i = 0; i < addCount; i++) {
      let addPos = Math.floor(Math.random()*textArr.length);
      let addFromPos = Math.floor(Math.random()*textArr.length);
      textArr[addPos] = textArr[addPos] + textArr[addFromPos];
    }
  }
  return textArr.join("");
}

let dmp = new diff_match_patch();

for(let i = 0; i < 1000; i++) {
  newText = applyRandomTextEdit(originalText);
  dmp.patch_toText(dmp.patch_make(originalText, newText));
}

It produced the "URI malformed" error again: https://jsfiddle.net/78fw5cjv/1/ :(

I didn't know how to apply your fix to the prefix stuff so I wasn't able to test whether that might be required, as you suggested.

@judofyr
Copy link
Author

judofyr commented Aug 28, 2019

Wow, that's a very neat way of finding issues! And great work with finding another fix.

The status for me is the same though: I don't have time to look at this at the moment. Feel free to bring my changes into another branch/fork and continue there.

@KpjComp
Copy link

KpjComp commented Aug 29, 2019

@josephrocca Ok, digging a bit deeper, I've found a much simpler fix..

Basically the diffing algorithm actually has no problems with surrogate pairs, but what does have a problem is the encodeURI that's used for the patch_toText function , if the prefix or suffix has half a surrogate pair it will error.

So another fix is to create a simple encodeURI that removes part surrogate pairs.

Basically its only ever going to be the first & last character that will ever have a part surrogate pair, so by checking the first character and if it's an ending surrogate pair, remove it, and then check the last character and if its a starting surrogate pair remove that.

Here is a function I've created that does this ->

function encodeURISafe(s) {
  var st = 0, en = s.length;
  var stcode = s.charCodeAt(0);
  var encode = s.charCodeAt(en - 1);
  if (stcode >= 0xDC00 && stcode <= 0xDFFF) {
    st += 1;
  }
  if (encode >= 0xd800 && encode <= 0xdbff) {
    en -= 1;
  }
  return encodeURI(s.substring(st, en));
}

Now the only place we need to use this encodeURISafe function is in the diff_match_patch.patch_obj.prototype.toString function.

Basically replace the encodeURI with encodeURISafe, and this will work. And because prefix & suffix has a default padding of 4, removing an invalid character I don't think should cause any issues.

Also you will want to now remove @judofyr fix, as it actually does cause other issues with stack overflows using your demo.

@judofyr
Copy link
Author

judofyr commented Aug 29, 2019

So another fix is to create a simple encodeURI that removes part surrogate pairs.

Have you tried applying the patches to the original text? It seems to me that if you remove the erroneous surrogate character you will end up with a patch that doesn't include that character. This returns false in my console with your fix:

p = dmp.patch_make('\u{1F30D}', '\u{1F308}');
strp = dmp.patch_toText(p);
p2 = dmp.patch_fromText(strp);
console.log(dmp.patch_apply(p2, '\u{1F30D}')[0] == '\u{1F308}')

(Your fix also causes 5 test failures in this branch.)

@judofyr
Copy link
Author

judofyr commented Aug 29, 2019

However, the approach your taking might actually work. Just instead of removing that character, we want to add the missing part of the surrogate pair.

@judofyr
Copy link
Author

judofyr commented Aug 29, 2019

Another realization I just had: the coords/numbers in the patches refers to UTF-16 units, not Unicode code point units. This is probably not what the other implementations do. It seems to not cause a problem in practice because DMP is good a recovering context, but it's certainly something to have in mind if you're using DMP across languages.

@KpjComp
Copy link

KpjComp commented Aug 29, 2019

Have you tried applying the patches to the original text? It seems to me that if you remove the erroneous surrogate character you will end up with a patch that doesn't include that character. This returns false in my console with your fix:

Unfortunately it was late when I got this working, and never really had chance to give a good test in this regards. I was kind of hoping this would only be applying to prefix & suffix context, and I don't believe lengths matter here, but thinking about it there is nothing stopping a change breaking on a surrogate either.. :(

Another realization I just had: the coords/numbers in the patches refers to UTF-16 units, not Unicode code point units.

Yes, I did wonder about that.
Also does Diff patches work out the box with utf-8 anyway. If so, there would be no reason even to encodeURI Unicode strings. And does a standard Diff even understand URI encoding for diffs?

Also there is unicode's ZERO WIDTH JOINER: to maybe think about. Not sure this would be an issue, as encodeURI has no issues with a lone joiner, and inside strings will just be invisible. But would make human readable Diffs impractical.

Taking all this into consideration, it would appear the Javascript version here has many issues with unicode. Apart from the ZERO WIDTH JOINER issue most issues could be fixed by re-writing the code to use Array.from / [...string] manipulation, but this would be a big task. :(

So this is what I'm going to do, and in my case should fix any problems.

Basically encodeURI the whole input text.
And decodeURI any results, and also remember to encodeURI any source when applying patches before decodeURI.
IOW: keep all Diff changes in encoded format, this will fix any surrogate & offset problems.

@dmsnell
Copy link

dmsnell commented Nov 8, 2019

Thanks for the work on this. We discovered this problem today while debugging a problem where certain edits crash Simplenote. The Objective-C library at least also demonstrates the same root problem whereby commonPrefix splits at code units but then toDelta tries to construct URL-encoded substrings.

Another thought on post-processing

If instead of creating a URIEncodeSafe function could we analyze the previous diff string and look for surrogate breaks, then rewrite the diffs to secure them? It seems like those groups are safe to rewrite without having to maintain indices or other constraints.

diff_match_patch.prototype.diff_toDelta = function(diffs) {
  var text = [];
  var lastEnd = null;
  for (var x = 0; x < diffs.length; x++) {
    
    var thisDiff = diffs[x];
    var thisTop = thisDiff[1][0];
    var thisEnd = thisDiff[1][ thisDiff[1].length - 1 ];

    // don't end mid-surrogate-pair
    if ( isHighSurrogate( thisEnd ) ) {
      thisDiff[1] = thisDiff[1].slice(0, -1);
    }

    // don't start mid-surrogate-pair
    if ( isHighSurrogate( lastEnd ) && isLowSurrogate( thisStart ) ) {
      thisDiff[1] = lastEnd + thisDiff[1];
    }

    lastEnd = thisEnd;

    switch (diffs[x][0]) {
      case DIFF_INSERT:
        text[x] = '+' + encodeURI(diffs[x][1]);
        break;
      case DIFF_DELETE:
        text[x] = '-' + diffs[x][1].length;
        break;
      case DIFF_EQUAL:
        text[x] = '=' + diffs[x][1].length;
        break;
    }
  }
  return text.join('\t').replace(/%20/g, ' ');
};

Obviously some things here would need bounds checking but that was the idea I had. I'll try and test it out to see if I can recreate/resolve the tests cases in this thread.

dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Nov 9, 2019
Resolves google#69 for JavaScript

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
@dmsnell
Copy link

dmsnell commented Nov 9, 2019

The above link to #80 is my attempt to demonstrate working code with tests. My patch only changes the diffing where it becomes a problem - when attempting to build a valid UTF-8 string in URIEncode().

The method appears to work and I think all it needs for totality is to handle cases where our diff groups are too short to chop off the front or back of them.

@dmsnell
Copy link

dmsnell commented Dec 11, 2019

@judofyr today I came back to your patch and I think the problem with the extended tests stems from the way the patch conflates high surrogates and low surrogates.

in general, we can end a group on a low surrogate and we can start a group on a high surrogate. the reverse is problematic: ending on a high surrogate and starting on a low surrogate.

that is, instead of using insideSurrogate(), if you use isHighSurrogate() in diff_commonPrefix() and isLowSurrogate() in diff_commonSuffix() then the patch should work again.

diff --git a/javascript/diff_match_patch_uncompressed.js b/javascript/diff_match_patch_uncompressed.js
index 88a702c..c731b8c 100644
--- a/javascript/diff_match_patch_uncompressed.js
+++ b/javascript/diff_match_patch_uncompressed.js
@@ -88,6 +88,15 @@ diff_match_patch.Diff.prototype.toString = function() {
   return this[0] + ',' + this[1];
 };
 
+diff_match_patch.prototype.is_high_surrogate = function(c) {
+  var v = c.charCodeAt(0);
+  return v >= 0xD800 && v <= 0xDBFF;
+};
+
+diff_match_patch.prototype.is_low_surrogate = function(c) {
+  var v = c.charCodeAt(0);
+  return v >= 0xDC00 && v <= 0xDFFF;
+};
 
 /**
  * Find the differences between two texts.  Simplifies the problem by stripping
@@ -439,6 +448,12 @@ diff_match_patch.prototype.diff_bisect_ = function(text1, text2, deadline) {
  */
 diff_match_patch.prototype.diff_bisectSplit_ = function(text1, text2, x, y,
     deadline) {
+  if (this.is_low_surrogate(text1, x)) {
+    x--;
+  }
+  if (this.is_low_surrogate(text2, y)) {
+    y--;
+  }
   var text1a = text1.substring(0, x);
   var text2a = text2.substring(0, y);
   var text1b = text1.substring(x);
@@ -569,6 +584,14 @@ diff_match_patch.prototype.diff_commonPrefix = function(text1, text2) {
     }
     pointermid = Math.floor((pointermax - pointermin) / 2 + pointermin);
   }
+
+  if (
+    this.is_high_surrogate(text1[pointermid - 1]) ||
+    this.is_high_surrogate(text2[pointermid - 1])
+  ) {
+    return pointermid - 1;
+  }
+
   return pointermid;
 };
 
@@ -601,6 +624,14 @@ diff_match_patch.prototype.diff_commonSuffix = function(text1, text2) {
     }
     pointermid = Math.floor((pointermax - pointermin) / 2 + pointermin);
   }
+
+  if (
+    this.is_low_surrogate(text1[text1.length - pointermid]) ||
+    this.is_low_surrogate(text2[text2.length - pointermid])
+  ) {
+    return pointermid - 1;
+  }
+
   return pointermid;
 };
 
diff --git a/javascript/tests/diff_match_patch_test.js b/javascript/tests/diff_match_patch_test.js
index 109e56a..ede410c 100644
--- a/javascript/tests/diff_match_patch_test.js
+++ b/javascript/tests/diff_match_patch_test.js
@@ -89,6 +89,17 @@ function testDiffCommonPrefix() {
 
   // Whole case.
   assertEquals(4, dmp.diff_commonPrefix('1234', '1234xyz'));
+
+  // surrogate pairs
+  assertEquals(0, dmp.diff_commonPrefix(
+    '\ud83c\udd70', // 🅰
+    '\ud83c\udd71', // 🅱
+  ));
+
+  assertEquals(2, dmp.diff_commonPrefix(
+    '\ud83c\udd70\ud83c\udd70', // 🅰🅰
+    '\ud83c\udd70\ud83c\udd71', // 🅰🅱
+  ));
 }
 
 function testDiffCommonSuffix() {
@@ -101,6 +112,17 @@ function testDiffCommonSuffix() {
 
   // Whole case.
   assertEquals(4, dmp.diff_commonSuffix('1234', 'xyz1234'));
+
+  // surrogate pairs
+  assertEquals(0, dmp.diff_commonSuffix(
+    '\ud83c\udd70', // 🅰🅰
+    '\ud83d\udd70', // 🅰🕰
+  ));
+
+  assertEquals(2, dmp.diff_commonSuffix(
+    '\ud83c\udd70\ud83c\udd70', // 🅰🅰🅰
+    '\ud83d\udd70\ud83c\udd70', // 🅰🕰🅰
+  ));
 }
 
 function testDiffCommonOverlap() {

All in all I think the approach in #80 is stronger because the assumption of working
on UTF-16 code units goes deep in here and some of the cleanup and merging and
bisecting code is at least hard for me to reason about when thinking of adjusting
indices.

In another test case we came up with one of the semantic cleanup functions is returning
empty patch groups when using the optimized diff method too but I couldn't find out
why. It seems like it must be related to the same Unicode encoding issue.

dmp.diff_main(
  '😃🖖🏻🖖🏻🖖🏿\n🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉\n\n.👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽s👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿.\n🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉\n.👉🏽👈🏿👈🏿👈🏿🥶🥶🥶🥶🥶👈🏼🖖🏻👈🏼s🖖🏻👈🏼😋\n\ndasdas\ndasvafksdldasfadsxc vzx',
  '😃🖖🏻🖖🏻🖖🏿\n🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉\nfds\n.👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽s👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿.\n🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉fdsvadsfewdsfdsafsdafsd'
) // produces an empty `[ DELETE, "" ]` group in the middle of the string

@judofyr
Copy link
Author

judofyr commented Dec 12, 2019

in general, we can end a group on a low surrogate and we can start a group on a high surrogate. the reverse is problematic: ending on a high surrogate and starting on a low surrogate.

Aah. Maybe. I always confuse what "low" and "high" means.

All in all I think the approach in #80 is stronger because the assumption of working
on UTF-16 code units goes deep in here and some of the cleanup and merging and
bisecting code is at least hard for me to reason about when thinking of adjusting
indices.

I feel like the approach in this PR is probably the "correct" way of doing it, but it requires detailed knowledge about how the algorithm works and I don't feel that comfortable with it. From a practical point of view #80 is much simpler and likelier to handle all cases. I like it! They are also not mutually exclusive. We can introduce #80 and then also try to make the core algorithm work properly on UTF16.

In another test case we came up with one of the semantic cleanup functions is returning
empty patch groups when using the optimized diff method too but I couldn't find out
why. It seems like it must be related to the same Unicode encoding issue.

Ouch. That's good to know.

The biggest problem here is the maintenance of the project. There doesn't seem to be any progress here. We at @sanity-io are considering forking the JS version and incorporating the fixes in here and/or #80, but currently it's not been a priority.

@dmsnell
Copy link

dmsnell commented Dec 12, 2019

Thanks @judofyr for your quick response!

They are also not mutually exclusive.

That's been our finding too. It was that empty DIFF_DELETE group that tossed a thorn into our approach in #80 (though the fix for that should be quite simple and I hope to post an update today).

My reasoning for #80 is purely because nothing else in the code depends on or assumes valid Unicode data, so I frame this PR as "we don't have to do this but we are because it prevents a number of confusing situations" - I am quite fond of the fact that we have the liberty to make these semantically-equivalent diff changes.

Where I ran into trouble here were the places you mentioned - I don't know the algorithm well enough given the code and in some strange sense I suppose it might be easier to rewrite the entire file with a tighter Unicode assumption/constraint. _bisect and _cleanupMerge are where I had the most difficulty.

Well, my point is that I think that the approach in this PR is incomplete until we make the same surrogate checks everywhere, or at least it's hard to reason about the fix until that has been done or we find more comprehensive test cases.

We at @sanity-io are considering forking the JS version and incorporating the fixes

I've been thinking of this too. @NeilFraser what can we do to prevent forking? How can we help you with this? It's amazing that the repository has been moved to GitHub and that you've updated so much of it. We're happy to help steward what we can since Simplenote/Simperium depends so fundamentally on diff-match-patch.

dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Mar 31, 2020
See google#69

In this patch I'm trying to see if we can scan through strings in
the library while preserving Unicode semantics if we stay close to
native JavaScript strings. That is, by only accounting for things
like surrogates when we have to.

As is obvious from inspection, this work is incomplete and it was
the attempt that has led me to pursue concverting the intput strings
up-front into an already-split array of Unicode scalar values.
@dmsnell
Copy link

dmsnell commented Mar 31, 2020

@judofyr are you using this patch at @sanity-io? I finally got around to digesting the Myers algorithm and the things that @NeilFraser has written about this library over the years (super helpful, thanks Neil!) and I'm more doubtful of this approach now.

Tonight I attempted to alter the library so that it would properly handle surrogate pairs, from the ground up if you will, and that broke down much harder than I anticipated.

This library relies heavily on array indexing which gets destroyed if the characters are variable-length, which indeed they are. The original Myers algorithm too depends on it, and possibly even more so. When I got to y1 = x1 - k1 I hit a particularly conundrum. The concept is that the index into text2 can be implicitly stored just with the knowledge of the index into text1, but we have to traverse the string differently to find the previous or successive code points and so everything gets hairy. I think we can do it if we double all the variables and track x1 and x1_u (the code-unit index into the string + the code-point index into the string) but that leads me to the next problem…

This library relies on being able to primitively scan forward and backward in a string by adding or subtracting from an index. Each time we do this and are aware of the Unicode rules we have to iterate - meaning that we turn constant operations into ones that have an unspecified number of iterations, and we do that every time we perform indexing.


My main concerns revolve around finding the middle snake and then the shortest edit script. Maybe we can get away with some pin-pointed index adjustments but I feel like it was getting out of hand as I actually started to do it.


That makes #80 more convenient and robust than I first realized. I found a Rust implementation of this library that performs Unicode boundary cleanup in the diffing process alongside the semantic and lossless cleanups. Apart from that and the fact that they are using string slices the fix was almost identical.


Independently I'd like to try splitting the text when it enters the library and then re-joining it when it leaves. That would be a manual version of [...str] for compatibility. Moving from strings to arrays carries big implications as well though, but at least our diff "alphabet" remains fixed-width.

I'm afraid that no matter what we do to fix it the performance will suffer. In other words, it's really fast because it gives the wrong answer for cases where it would otherwise be slow 🙃

dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Apr 2, 2020
See google#69

In this patch I'm trying to follow the approach taken by @judofyr
but starting from the top of the script and going through, auditing
every place that perform string operations that split, index, or
otherwise operate on a character level so that we can make sure that
we don't split surrogate pairs.

This contrasts with [attempt one] where I created a custom iterator
for strings. Surprisingly I found this more "ad-hoc" approach easier
to manage since it doesn't create a split universe of string/Unicode.

As of this commit I haven't audited the cleanup functions but my own
tests are passing so I'm given to believe that they might be safe.
I have my own doubts that this is sound work and that the middle-snake
algorithm might find the wrong snake when presented with variable-width
characters.
dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Apr 2, 2020
See google#69

In this patch I'm trying to follow the approach taken by @judofyr
but starting from the top of the script and going through, auditing
every place that perform string operations that split, index, or
otherwise operate on a character level so that we can make sure that
we don't split surrogate pairs.

This contrasts with [attempt one] where I created a custom iterator
for strings. Surprisingly I found this more "ad-hoc" approach easier
to manage since it doesn't create a split universe of string/Unicode.

As of this commit I haven't audited the cleanup functions but my own
tests are passing so I'm given to believe that they might be safe.
I have my own doubts that this is sound work and that the middle-snake
algorithm might find the wrong snake when presented with variable-width
characters.
@@ -601,6 +619,9 @@ diff_match_patch.prototype.diff_commonSuffix = function(text1, text2) {
}
pointermid = Math.floor((pointermax - pointermin) / 2 + pointermin);
}
if (insideSurrogate(text1, text1.length - pointermid)) {
pointermid--;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came up with this instead, but I don't have a test that fails in either case so obviously I'm missing a test case

if (pointer mid < length - 1 && this.isLowSurrogate(text1[poitnermid])) {
  pointermid++;
}

my reasoning was that if the common suffix starts on a low surrogate then we have to bump it to the right so that the surrogate half doesn't end up in the common part - backwards from commonPrefix where we want to shift to the left in order to avoid the same for high surrogates

dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Apr 2, 2020
See google#69

In this patch I'm trying to follow the approach taken by @judofyr
but starting from the top of the script and going through, auditing
every place that perform string operations that split, index, or
otherwise operate on a character level so that we can make sure that
we don't split surrogate pairs.

This contrasts with [attempt one] where I created a custom iterator
for strings. Surprisingly I found this more "ad-hoc" approach easier
to manage since it doesn't create a split universe of string/Unicode.

As of this commit I haven't audited the cleanup functions but my own
tests are passing so I'm given to believe that they might be safe.
I have my own doubts that this is sound work and that the middle-snake
algorithm might find the wrong snake when presented with variable-width
characters.
@dmsnell
Copy link

dmsnell commented Apr 2, 2020

So this may not be the best place to hold this discussion but I've had a hard time coming to…terms…with how the algorithm handles variable-width letters/alphabets. What I'm nervous about with in-place adjustments like pointermid-- is that we might run into situations where we shift the surrogate-splitting into different parts of the edit graph and come to an invalid sub-diff in the process. I have given a name to this method as "Unicode rewind" since it involves backing up one character when we detect an invalid Unicode boundary.

Creating the iterator was itself a bad idea because it created the two-worlds divide between JS strings and Unicode Scalar Values. It turned out to be much easier to reason about the code and "backup" when necessary (or advance), as I have been doing in 00260dd. So far it looks better than I expected it to be with the backup/advance logic. Put in different words I don't consider it my work so much as my attempt at reviewing this PR, which I had to do by starting over from principle and seeing how close our patches came. I haven't reviewed the bisect or cleanup functions yet but so far most of my tests are passing which seems like a good sign.

Where did we differ?

I found that we need to perform surrogate checks in compute_ as well because otherwise we might cut out surrogates from the middle of a string. In review I think your code is right and those aren't needed because we'd have to start with an invalid string to find a whole match that ends on half a surrogate (in cases where the shorter string is entirely inside the longer string).

However, I think it may still be necessary inside halfMatch because there I'm pretty sure this case demonstrates where it might otherwise find a substring at least as long as half the length of the longer text which splits surrogate halves. In the example below h stands for a high surrogate and 1, 2, 3, etc… stand for different low surrogates.

ah0h1h2h3b
xh0h1h2h4h5x

Here the substring h0h1h2h from the shorter string exists entirely within the longer string but we can't take that whole length. When added as a test this case does indeed fail, but it appears to fail inside bisect so I'll have to review that usage on another day when I can get my focus into it. I came to this hypothesis last night while trying to understand the implications of what I'm calling "Unicode rewind" and had to draw a picture to help. I think that when we backup inside bisectSplit we may just replay the same problems in a different sub-diff.

MiddleSnakeUnicodeRewindA
The red line is the actual edit script produced by the existing code. Rewinding in order to split at a different point only protects the first side of the split and the surrogate is split when recursing to diff the second half of the split; this is why my test case failed. The blue line is what I believe to be the shortest edit script.

MiddleSnakeC
What's really happening as a result of these index-inconsistencies is that we're identifying an invalid diagonal edge and then treating it as the middle snake when in fact it's not a valid snake at all.

I'm worried again though that we're going to have to start re-scanning the string inside bisect if we're going to do this right and that will become quadratic or worse, quickly.

Another idea I have been pondering is that we could scan the input string at the start of the program and for each character relate the index into the string with the index into the equivalent Unicode Scalar Value array

// how many USVs are we in?
'abc\ud83c\udd70\ud83c\udd71xyz'
[0123344567]
// convert up-front inside bisect
t1 = [ ...text1 ]
t2 = [ ...text2 ]

…

diff_bisectSplit_(t1.join(''), t2.join(''), utf16length(t1.slice(0, x1), utf16length(t2,slice(0, y1), deadline);

Right now what I see as the big problems are:

  • delta has to be Unicode Scalar Values, not UTF-16 code units otherwise we might stop at the wrong point when finding the "middle snake"
  • x1 = v1[k1_offset + 1]; and x1 = v1[k1_offset - 1] + 1; do not move over or down, they might move into the middle of a surrogate pair
  • var y1 = x1 - k1 may give us a y-value in the middle of a surrogate pair
  • text1.charAt(x1) == text2.charAt(y1)) only accounts for code units, not code points
  • x1++ and y1++ have obvious problems

Maybe if I can chart out every counter and label which coordinates they are using this will become clearer. In any case, to find the middle snake I think we have to perform at minimum one translation into USVs if for no other reason than to figure out delta.

That's it for me for tonight 😴

@judofyr
Copy link
Author

judofyr commented Apr 5, 2020

@dmsnell: Very impressive work! I'll have to admit that the cases where I added Unicode rewinding was mainly determined "experimentally": I added various test cases and tried to figure out where the patch was created. What you've done is far more extensive.

At Sanity.io we're still using the current version which is buggy. We were hoping that the work in #80 would be accepted and merged in.

@gamedevsam
Copy link

I've tried all 3 solutions (#13, #69, #80) to this problem and #13 is the only one that solved the encodingURI exception for me. I created a minimal repo which clearly reproduces the issue: https://github.com/gamedevsam/diff-match-patch-emoji-issue

Hope it's helpful to make this solution more robust.

dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Jan 30, 2024
Resolves google#69 for JavaScript

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Jan 30, 2024
Resolves google#69 for Java

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Jan 30, 2024
Resolves google#69 for Objective-C

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Jan 30, 2024
Resolves google#69 for Python2

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Jan 30, 2024
Resolves google#69 for Python3

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
dmsnell added a commit to dmsnell/diff-match-patch that referenced this pull request Jan 31, 2024
Stop breaking surrogate pairs in toDelta()/fromDelta()

Resolves google#69 for the following languages:

 - Objective-C
 - Java
 - JavaScript
 - Python2
 - Python3

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.

Alternative approaches:
=========

 - The [`dissimilar`](https://docs.rs/dissimilar/latest/dissimilar/)
   library in Rust takes a more comprehensive approach with its
   `cleanup_char_boundary()` method. Since that approach resolves the
   issue everywhere and not just in to/from Delta, it's worth
   exploring as a replacement for this patch.

Remaining work to do:
========

 -[ ] Fix CPP or verify not a problem
 -[ ] Fix CSharp or verify not a problem
 -[ ] Fix Dart or verify not a problem
 -[ ] Fix Lua or verify not a problem
 -[x] Refactor to use cleanupSplitSurrogates in JavaScript
 -[x] Refactor to use cleanupSplitSurrogates in Java
 -[ ] Refactor to use cleanupSplitSurrogates in Objective C
 -[ ] Refactor to use cleanupSplitSurrogates in Python2
 -[ ] Refactor to use cleanupSplitSurrogates in Python3
 -[ ] Refactor to use cleanupSplitSurrogates in CPP
 -[ ] Refactor to use cleanupSplitSurrogates in CSharp
 -[ ] Refactor to use cleanupSplitSurrogates in Dart
 -[ ] Refactor to use cleanupSplitSurrogates in Lua
 -[x] Fix patch_toText in JavaScript
 -[ ] Fix patch_toText in Java
 -[ ] Fix patch_toText in Objective C
 -[ ] Fix patch_toText in Python2
 -[ ] Fix patch_toText in Python3
 -[ ] Fix patch_toText in CPP
 -[ ] Fix patch_toText in CSharp
 -[ ] Fix patch_toText in Dart
 -[ ] Fix patch_toText in Lua
 -[ ] Figure out a "minimal" set of unit tests so we can get rid of the big
      chunk currently in the PR, then carry it around to all the libraries.
      The triggers are well understood, so we can write targeted tests
      instead of broad ones.
samrayner pushed a commit to samrayner/diff-match-patch that referenced this pull request May 6, 2024
Resolves google/diff-match-patch#69 for Objective-C

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
samrayner pushed a commit to samrayner/diff-match-patch that referenced this pull request May 6, 2024
Stop breaking surrogate pairs in toDelta()/fromDelta()

Resolves google/diff-match-patch#69 for the following languages:

 - Objective-C
 - Java
 - JavaScript
 - Python2
 - Python3

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.

Alternative approaches:
=========

 - The [`dissimilar`](https://docs.rs/dissimilar/latest/dissimilar/)
   library in Rust takes a more comprehensive approach with its
   `cleanup_char_boundary()` method. Since that approach resolves the
   issue everywhere and not just in to/from Delta, it's worth
   exploring as a replacement for this patch.

Remaining work to do:
========

 -[ ] Fix CPP or verify not a problem
 -[ ] Fix CSharp or verify not a problem
 -[ ] Fix Dart or verify not a problem
 -[ ] Fix Lua or verify not a problem
 -[x] Refactor to use cleanupSplitSurrogates in JavaScript
 -[x] Refactor to use cleanupSplitSurrogates in Java
 -[ ] Refactor to use cleanupSplitSurrogates in Objective C
 -[ ] Refactor to use cleanupSplitSurrogates in Python2
 -[ ] Refactor to use cleanupSplitSurrogates in Python3
 -[ ] Refactor to use cleanupSplitSurrogates in CPP
 -[ ] Refactor to use cleanupSplitSurrogates in CSharp
 -[ ] Refactor to use cleanupSplitSurrogates in Dart
 -[ ] Refactor to use cleanupSplitSurrogates in Lua
 -[x] Fix patch_toText in JavaScript
 -[ ] Fix patch_toText in Java
 -[ ] Fix patch_toText in Objective C
 -[ ] Fix patch_toText in Python2
 -[ ] Fix patch_toText in Python3
 -[ ] Fix patch_toText in CPP
 -[ ] Fix patch_toText in CSharp
 -[ ] Fix patch_toText in Dart
 -[ ] Fix patch_toText in Lua
 -[ ] Figure out a "minimal" set of unit tests so we can get rid of the big
      chunk currently in the PR, then carry it around to all the libraries.
      The triggers are well understood, so we can write targeted tests
      instead of broad ones.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Diff breaks unicode characters for emojis
6 participants