-
Notifications
You must be signed in to change notification settings - Fork 1.1k
JS: Handle surrogate pairs correctly #69
base: master
Are you sure you want to change the base?
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Hey @judofyr, thanks for your work on this! I was playing around with it and found a weird case that still throws a URI error: let oldText = `h
🅰
\t🅱\n`;
let newText = `hello
🅰
\t🅱\n`;
dmp.patch_toText(dmp.patch_make(oldText, newText)); https://jsfiddle.net/g7jsa4me/ I haven't dug into what's causing this yet. |
Oooo, very nice! Thanks for the reduced test case. I haven't had time to look more into this issue, and will probably not have time in the next few weeks. I'll report here if I end up picking it up again. |
👍 eagerly waiting! |
Hi, this bug is a tad worrying. So I thought I'd see if I could see were the problem is. For @josephrocca issue the following function is the cause ->
Above the suffix is getting split on a surrogate. eg.
Not sure, but I would also expect it might be possible for the prefix to also suffer with this issue, so applying the above code for prefix might make sense. But doing the above seems to makes @josephrocca code work. One slight concern is that doing Hope this helps.. |
@KpjComp Can you put your fix into a pull request so I can run some tests on it? I would be great if this issue were finally fixed! Might be able to attract the attention of a maintainer and actually get this merged. I'm very ignorant when it comes to Unicode stuff, so at this point I can't verify any of your code by reading it, but I can put it through some significant testing, and hopefully @judofyr will be able to have a quick read over it at some point to provide feedback. |
@josephrocca , I was kind of hoping @judofyr might be able to apply the patch when he see's this message. Looks like he just needs to reply with What you could do, is do what I have done as a temp measure,.. Just open the Also, just in case there are any more issues with unicode surrogate pairs and patching what I have done in my application is when I create a diff, I also then apply the diff to the original and verify it it's the same, so if there any any more problems I will know about it.. :) |
I've replaced this: // Add the suffix.
var suffix = text.substring(patch.start2 + patch.length1,
patch.start2 + patch.length1 + padding); with this: // Add the suffix.
var pointermid = patch.start2 + patch.length1 + padding;
if (insideSurrogate(text, pointermid)) {
pointermid--;
}
var suffix = text.substring(patch.start2 + patch.length1, pointermid); And it did fix the original error I reported! But unfortunately when I ran this test on it: // random emoji + text copied from wikipedia:
const originalText = `U+1F17x 🅰️ 🅱️ 🅾️ 🅿️ safhawifhkw
U+1F18x 🆎
0 1 2 3 4 5 6 7 8 9 A B C D E F
U+1F19x 🆑 🆒 🆓 🆔 🆕 🆖 🆗 🆘 🆙 🆚
U+1F20x 🈁 🈂️ sfss.,_||saavvvbbds
U+1F21x 🈚
U+1F22x 🈯
U+1F23x 🈲 🈳 🈴 🈵 🈶 🈷️ 🈸 🈹 🈺
U+1F25x 🉐 🉑
U+1F30x 🌀 🌁 🌂 🌃 🌄 🌅 🌆 🌇 🌈 🌉 🌊 🌋 🌌 🌍 🌎 🌏
U+1F31x 🌐 🌑 🌒 🌓 🌔 🌕 🌖 🌗 🌘 🌙 🌚 🌛 🌜 🌝 🌞 `;
// applies some random edits to string and returns new, edited string
function applyRandomTextEdit(text) {
let textArr = [...text];
let r = Math.random();
if(r < 1/3) { // swap
let swapCount = Math.floor(Math.random()*5);
for(let i = 0; i < swapCount; i++) {
let swapPos1 = Math.floor(Math.random()*textArr.length);
let swapPos2 = Math.floor(Math.random()*textArr.length);
let char1 = textArr[swapPos1];
let char2 = textArr[swapPos2];
textArr[swapPos1] = char2;
textArr[swapPos2] = char1;
}
} else if(r < 2/3) { // remove
let removeCount = Math.floor(Math.random()*5);
for(let i = 0; i < removeCount; i++) {
let removePos = Math.floor(Math.random()*textArr.length);
textArr[removePos] = "";
}
} else { // add
let addCount = Math.floor(Math.random()*5);
for(let i = 0; i < addCount; i++) {
let addPos = Math.floor(Math.random()*textArr.length);
let addFromPos = Math.floor(Math.random()*textArr.length);
textArr[addPos] = textArr[addPos] + textArr[addFromPos];
}
}
return textArr.join("");
}
let dmp = new diff_match_patch();
for(let i = 0; i < 1000; i++) {
newText = applyRandomTextEdit(originalText);
dmp.patch_toText(dmp.patch_make(originalText, newText));
} It produced the "URI malformed" error again: https://jsfiddle.net/78fw5cjv/1/ :( I didn't know how to apply your fix to the prefix stuff so I wasn't able to test whether that might be required, as you suggested. |
Wow, that's a very neat way of finding issues! And great work with finding another fix. The status for me is the same though: I don't have time to look at this at the moment. Feel free to bring my changes into another branch/fork and continue there. |
@josephrocca Ok, digging a bit deeper, I've found a much simpler fix.. Basically the diffing algorithm actually has no problems with surrogate pairs, but what does have a problem is the So another fix is to create a simple Basically its only ever going to be the first & last character that will ever have a part surrogate pair, so by checking the first character and if it's an ending surrogate pair, remove it, and then check the last character and if its a starting surrogate pair remove that. Here is a function I've created that does this ->
Now the only place we need to use this Basically replace the Also you will want to now remove @judofyr fix, as it actually does cause other issues with stack overflows using your demo. |
Have you tried applying the patches to the original text? It seems to me that if you remove the erroneous surrogate character you will end up with a patch that doesn't include that character. This returns false in my console with your fix: p = dmp.patch_make('\u{1F30D}', '\u{1F308}');
strp = dmp.patch_toText(p);
p2 = dmp.patch_fromText(strp);
console.log(dmp.patch_apply(p2, '\u{1F30D}')[0] == '\u{1F308}') (Your fix also causes 5 test failures in this branch.) |
However, the approach your taking might actually work. Just instead of removing that character, we want to add the missing part of the surrogate pair. |
Another realization I just had: the coords/numbers in the patches refers to UTF-16 units, not Unicode code point units. This is probably not what the other implementations do. It seems to not cause a problem in practice because DMP is good a recovering context, but it's certainly something to have in mind if you're using DMP across languages. |
Unfortunately it was late when I got this working, and never really had chance to give a good test in this regards. I was kind of hoping this would only be applying to prefix & suffix context, and I don't believe lengths matter here, but thinking about it there is nothing stopping a change breaking on a surrogate either.. :(
Yes, I did wonder about that. Also there is unicode's ZERO WIDTH JOINER: to maybe think about. Not sure this would be an issue, as Taking all this into consideration, it would appear the Javascript version here has many issues with unicode. Apart from the ZERO WIDTH JOINER issue most issues could be fixed by re-writing the code to use So this is what I'm going to do, and in my case should fix any problems. Basically |
Thanks for the work on this. We discovered this problem today while debugging a problem where certain edits crash Simplenote. The Objective-C library at least also demonstrates the same root problem whereby Another thought on post-processing… If instead of creating a diff_match_patch.prototype.diff_toDelta = function(diffs) {
var text = [];
var lastEnd = null;
for (var x = 0; x < diffs.length; x++) {
var thisDiff = diffs[x];
var thisTop = thisDiff[1][0];
var thisEnd = thisDiff[1][ thisDiff[1].length - 1 ];
// don't end mid-surrogate-pair
if ( isHighSurrogate( thisEnd ) ) {
thisDiff[1] = thisDiff[1].slice(0, -1);
}
// don't start mid-surrogate-pair
if ( isHighSurrogate( lastEnd ) && isLowSurrogate( thisStart ) ) {
thisDiff[1] = lastEnd + thisDiff[1];
}
lastEnd = thisEnd;
switch (diffs[x][0]) {
case DIFF_INSERT:
text[x] = '+' + encodeURI(diffs[x][1]);
break;
case DIFF_DELETE:
text[x] = '-' + diffs[x][1].length;
break;
case DIFF_EQUAL:
text[x] = '=' + diffs[x][1].length;
break;
}
}
return text.join('\t').replace(/%20/g, ' ');
}; Obviously some things here would need bounds checking but that was the idea I had. I'll try and test it out to see if I can recreate/resolve the tests cases in this thread. |
Resolves google#69 for JavaScript Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
The above link to #80 is my attempt to demonstrate working code with tests. My patch only changes the diffing where it becomes a problem - when attempting to build a valid UTF-8 string in The method appears to work and I think all it needs for totality is to handle cases where our diff groups are too short to chop off the front or back of them. |
@judofyr today I came back to your patch and I think the problem with the extended tests stems from the way the patch conflates high surrogates and low surrogates. in general, we can end a group on a low surrogate and we can start a group on a high surrogate. the reverse is problematic: ending on a high surrogate and starting on a low surrogate. that is, instead of using diff --git a/javascript/diff_match_patch_uncompressed.js b/javascript/diff_match_patch_uncompressed.js
index 88a702c..c731b8c 100644
--- a/javascript/diff_match_patch_uncompressed.js
+++ b/javascript/diff_match_patch_uncompressed.js
@@ -88,6 +88,15 @@ diff_match_patch.Diff.prototype.toString = function() {
return this[0] + ',' + this[1];
};
+diff_match_patch.prototype.is_high_surrogate = function(c) {
+ var v = c.charCodeAt(0);
+ return v >= 0xD800 && v <= 0xDBFF;
+};
+
+diff_match_patch.prototype.is_low_surrogate = function(c) {
+ var v = c.charCodeAt(0);
+ return v >= 0xDC00 && v <= 0xDFFF;
+};
/**
* Find the differences between two texts. Simplifies the problem by stripping
@@ -439,6 +448,12 @@ diff_match_patch.prototype.diff_bisect_ = function(text1, text2, deadline) {
*/
diff_match_patch.prototype.diff_bisectSplit_ = function(text1, text2, x, y,
deadline) {
+ if (this.is_low_surrogate(text1, x)) {
+ x--;
+ }
+ if (this.is_low_surrogate(text2, y)) {
+ y--;
+ }
var text1a = text1.substring(0, x);
var text2a = text2.substring(0, y);
var text1b = text1.substring(x);
@@ -569,6 +584,14 @@ diff_match_patch.prototype.diff_commonPrefix = function(text1, text2) {
}
pointermid = Math.floor((pointermax - pointermin) / 2 + pointermin);
}
+
+ if (
+ this.is_high_surrogate(text1[pointermid - 1]) ||
+ this.is_high_surrogate(text2[pointermid - 1])
+ ) {
+ return pointermid - 1;
+ }
+
return pointermid;
};
@@ -601,6 +624,14 @@ diff_match_patch.prototype.diff_commonSuffix = function(text1, text2) {
}
pointermid = Math.floor((pointermax - pointermin) / 2 + pointermin);
}
+
+ if (
+ this.is_low_surrogate(text1[text1.length - pointermid]) ||
+ this.is_low_surrogate(text2[text2.length - pointermid])
+ ) {
+ return pointermid - 1;
+ }
+
return pointermid;
};
diff --git a/javascript/tests/diff_match_patch_test.js b/javascript/tests/diff_match_patch_test.js
index 109e56a..ede410c 100644
--- a/javascript/tests/diff_match_patch_test.js
+++ b/javascript/tests/diff_match_patch_test.js
@@ -89,6 +89,17 @@ function testDiffCommonPrefix() {
// Whole case.
assertEquals(4, dmp.diff_commonPrefix('1234', '1234xyz'));
+
+ // surrogate pairs
+ assertEquals(0, dmp.diff_commonPrefix(
+ '\ud83c\udd70', // 🅰
+ '\ud83c\udd71', // 🅱
+ ));
+
+ assertEquals(2, dmp.diff_commonPrefix(
+ '\ud83c\udd70\ud83c\udd70', // 🅰🅰
+ '\ud83c\udd70\ud83c\udd71', // 🅰🅱
+ ));
}
function testDiffCommonSuffix() {
@@ -101,6 +112,17 @@ function testDiffCommonSuffix() {
// Whole case.
assertEquals(4, dmp.diff_commonSuffix('1234', 'xyz1234'));
+
+ // surrogate pairs
+ assertEquals(0, dmp.diff_commonSuffix(
+ '\ud83c\udd70', // 🅰🅰
+ '\ud83d\udd70', // 🅰🕰
+ ));
+
+ assertEquals(2, dmp.diff_commonSuffix(
+ '\ud83c\udd70\ud83c\udd70', // 🅰🅰🅰
+ '\ud83d\udd70\ud83c\udd70', // 🅰🕰🅰
+ ));
}
function testDiffCommonOverlap() { All in all I think the approach in #80 is stronger because the assumption of working In another test case we came up with one of the semantic cleanup functions is returning dmp.diff_main(
'😃🖖🏻🖖🏻🖖🏿\n🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉\n\n.👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽s👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿.\n🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉\n.👉🏽👈🏿👈🏿👈🏿🥶🥶🥶🥶🥶👈🏼🖖🏻👈🏼s🖖🏻👈🏼😋\n\ndasdas\ndasvafksdldasfadsxc vzx',
'😃🖖🏻🖖🏻🖖🏿\n🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉\nfds\n.👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽👉🏽s👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿👈🏿.\n🥶🥶🥶👈🏼🖖🏻😉🖖🏽🖖🏿😉fdsvadsfewdsfdsafsdafsd'
) // produces an empty `[ DELETE, "" ]` group in the middle of the string |
Aah. Maybe. I always confuse what "low" and "high" means.
I feel like the approach in this PR is probably the "correct" way of doing it, but it requires detailed knowledge about how the algorithm works and I don't feel that comfortable with it. From a practical point of view #80 is much simpler and likelier to handle all cases. I like it! They are also not mutually exclusive. We can introduce #80 and then also try to make the core algorithm work properly on UTF16.
Ouch. That's good to know. The biggest problem here is the maintenance of the project. There doesn't seem to be any progress here. We at @sanity-io are considering forking the JS version and incorporating the fixes in here and/or #80, but currently it's not been a priority. |
Thanks @judofyr for your quick response!
That's been our finding too. It was that empty My reasoning for #80 is purely because nothing else in the code depends on or assumes valid Unicode data, so I frame this PR as "we don't have to do this but we are because it prevents a number of confusing situations" - I am quite fond of the fact that we have the liberty to make these semantically-equivalent diff changes. Where I ran into trouble here were the places you mentioned - I don't know the algorithm well enough given the code and in some strange sense I suppose it might be easier to rewrite the entire file with a tighter Unicode assumption/constraint. Well, my point is that I think that the approach in this PR is incomplete until we make the same surrogate checks everywhere, or at least it's hard to reason about the fix until that has been done or we find more comprehensive test cases.
I've been thinking of this too. @NeilFraser what can we do to prevent forking? How can we help you with this? It's amazing that the repository has been moved to GitHub and that you've updated so much of it. We're happy to help steward what we can since Simplenote/Simperium depends so fundamentally on |
See google#69 In this patch I'm trying to see if we can scan through strings in the library while preserving Unicode semantics if we stay close to native JavaScript strings. That is, by only accounting for things like surrogates when we have to. As is obvious from inspection, this work is incomplete and it was the attempt that has led me to pursue concverting the intput strings up-front into an already-split array of Unicode scalar values.
@judofyr are you using this patch at @sanity-io? I finally got around to digesting the Myers algorithm and the things that @NeilFraser has written about this library over the years (super helpful, thanks Neil!) and I'm more doubtful of this approach now. Tonight I attempted to alter the library so that it would properly handle surrogate pairs, from the ground up if you will, and that broke down much harder than I anticipated. This library relies heavily on array indexing which gets destroyed if the characters are variable-length, which indeed they are. The original Myers algorithm too depends on it, and possibly even more so. When I got to This library relies on being able to primitively scan forward and backward in a string by adding or subtracting from an index. Each time we do this and are aware of the Unicode rules we have to iterate - meaning that we turn constant operations into ones that have an unspecified number of iterations, and we do that every time we perform indexing. My main concerns revolve around finding the middle snake and then the shortest edit script. Maybe we can get away with some pin-pointed index adjustments but I feel like it was getting out of hand as I actually started to do it. That makes #80 more convenient and robust than I first realized. I found a Rust implementation of this library that performs Unicode boundary cleanup in the diffing process alongside the semantic and lossless cleanups. Apart from that and the fact that they are using string slices the fix was almost identical. Independently I'd like to try splitting the text when it enters the library and then re-joining it when it leaves. That would be a manual version of I'm afraid that no matter what we do to fix it the performance will suffer. In other words, it's really fast because it gives the wrong answer for cases where it would otherwise be slow 🙃 |
See google#69 In this patch I'm trying to follow the approach taken by @judofyr but starting from the top of the script and going through, auditing every place that perform string operations that split, index, or otherwise operate on a character level so that we can make sure that we don't split surrogate pairs. This contrasts with [attempt one] where I created a custom iterator for strings. Surprisingly I found this more "ad-hoc" approach easier to manage since it doesn't create a split universe of string/Unicode. As of this commit I haven't audited the cleanup functions but my own tests are passing so I'm given to believe that they might be safe. I have my own doubts that this is sound work and that the middle-snake algorithm might find the wrong snake when presented with variable-width characters.
See google#69 In this patch I'm trying to follow the approach taken by @judofyr but starting from the top of the script and going through, auditing every place that perform string operations that split, index, or otherwise operate on a character level so that we can make sure that we don't split surrogate pairs. This contrasts with [attempt one] where I created a custom iterator for strings. Surprisingly I found this more "ad-hoc" approach easier to manage since it doesn't create a split universe of string/Unicode. As of this commit I haven't audited the cleanup functions but my own tests are passing so I'm given to believe that they might be safe. I have my own doubts that this is sound work and that the middle-snake algorithm might find the wrong snake when presented with variable-width characters.
@@ -601,6 +619,9 @@ diff_match_patch.prototype.diff_commonSuffix = function(text1, text2) { | |||
} | |||
pointermid = Math.floor((pointermax - pointermin) / 2 + pointermin); | |||
} | |||
if (insideSurrogate(text1, text1.length - pointermid)) { | |||
pointermid--; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came up with this instead, but I don't have a test that fails in either case so obviously I'm missing a test case
if (pointer mid < length - 1 && this.isLowSurrogate(text1[poitnermid])) {
pointermid++;
}
my reasoning was that if the common suffix starts on a low surrogate then we have to bump it to the right so that the surrogate half doesn't end up in the common part - backwards from commonPrefix
where we want to shift to the left in order to avoid the same for high surrogates
See google#69 In this patch I'm trying to follow the approach taken by @judofyr but starting from the top of the script and going through, auditing every place that perform string operations that split, index, or otherwise operate on a character level so that we can make sure that we don't split surrogate pairs. This contrasts with [attempt one] where I created a custom iterator for strings. Surprisingly I found this more "ad-hoc" approach easier to manage since it doesn't create a split universe of string/Unicode. As of this commit I haven't audited the cleanup functions but my own tests are passing so I'm given to believe that they might be safe. I have my own doubts that this is sound work and that the middle-snake algorithm might find the wrong snake when presented with variable-width characters.
So this may not be the best place to hold this discussion but I've had a hard time coming to…terms…with how the algorithm handles variable-width letters/alphabets. What I'm nervous about with in-place adjustments like Creating the iterator was itself a bad idea because it created the two-worlds divide between JS strings and Unicode Scalar Values. It turned out to be much easier to reason about the code and "backup" when necessary (or advance), as I have been doing in 00260dd. So far it looks better than I expected it to be with the backup/advance logic. Put in different words I don't consider it my work so much as my attempt at reviewing this PR, which I had to do by starting over from principle and seeing how close our patches came. I haven't reviewed the Where did we differ? I found that we need to perform surrogate checks in However, I think it may still be necessary inside
Here the substring
I'm worried again though that we're going to have to start re-scanning the string inside Another idea I have been pondering is that we could scan the input string at the start of the program and for each character relate the index into the string with the index into the equivalent Unicode Scalar Value array // how many USVs are we in?
'abc\ud83c\udd70\ud83c\udd71xyz'
[0123344567]
Right now what I see as the big problems are:
Maybe if I can chart out every counter and label which coordinates they are using this will become clearer. In any case, to find the middle snake I think we have to perform at minimum one translation into USVs if for no other reason than to figure out That's it for me for tonight 😴 |
@dmsnell: Very impressive work! I'll have to admit that the cases where I added Unicode rewinding was mainly determined "experimentally": I added various test cases and tried to figure out where the patch was created. What you've done is far more extensive. At Sanity.io we're still using the current version which is buggy. We were hoping that the work in #80 would be accepted and merged in. |
I've tried all 3 solutions (#13, #69, #80) to this problem and #13 is the only one that solved the encodingURI exception for me. I created a minimal repo which clearly reproduces the issue: https://github.com/gamedevsam/diff-match-patch-emoji-issue Hope it's helpful to make this solution more robust. |
Resolves google#69 for JavaScript Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Resolves google#69 for Java Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Resolves google#69 for Objective-C Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Resolves google#69 for Python2 Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Resolves google#69 for Python3 Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Stop breaking surrogate pairs in toDelta()/fromDelta() Resolves google#69 for the following languages: - Objective-C - Java - JavaScript - Python2 - Python3 Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries. Alternative approaches: ========= - The [`dissimilar`](https://docs.rs/dissimilar/latest/dissimilar/) library in Rust takes a more comprehensive approach with its `cleanup_char_boundary()` method. Since that approach resolves the issue everywhere and not just in to/from Delta, it's worth exploring as a replacement for this patch. Remaining work to do: ======== -[ ] Fix CPP or verify not a problem -[ ] Fix CSharp or verify not a problem -[ ] Fix Dart or verify not a problem -[ ] Fix Lua or verify not a problem -[x] Refactor to use cleanupSplitSurrogates in JavaScript -[x] Refactor to use cleanupSplitSurrogates in Java -[ ] Refactor to use cleanupSplitSurrogates in Objective C -[ ] Refactor to use cleanupSplitSurrogates in Python2 -[ ] Refactor to use cleanupSplitSurrogates in Python3 -[ ] Refactor to use cleanupSplitSurrogates in CPP -[ ] Refactor to use cleanupSplitSurrogates in CSharp -[ ] Refactor to use cleanupSplitSurrogates in Dart -[ ] Refactor to use cleanupSplitSurrogates in Lua -[x] Fix patch_toText in JavaScript -[ ] Fix patch_toText in Java -[ ] Fix patch_toText in Objective C -[ ] Fix patch_toText in Python2 -[ ] Fix patch_toText in Python3 -[ ] Fix patch_toText in CPP -[ ] Fix patch_toText in CSharp -[ ] Fix patch_toText in Dart -[ ] Fix patch_toText in Lua -[ ] Figure out a "minimal" set of unit tests so we can get rid of the big chunk currently in the PR, then carry it around to all the libraries. The triggers are well understood, so we can write targeted tests instead of broad ones.
Resolves google/diff-match-patch#69 for Objective-C Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Stop breaking surrogate pairs in toDelta()/fromDelta() Resolves google/diff-match-patch#69 for the following languages: - Objective-C - Java - JavaScript - Python2 - Python3 Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries. Alternative approaches: ========= - The [`dissimilar`](https://docs.rs/dissimilar/latest/dissimilar/) library in Rust takes a more comprehensive approach with its `cleanup_char_boundary()` method. Since that approach resolves the issue everywhere and not just in to/from Delta, it's worth exploring as a replacement for this patch. Remaining work to do: ======== -[ ] Fix CPP or verify not a problem -[ ] Fix CSharp or verify not a problem -[ ] Fix Dart or verify not a problem -[ ] Fix Lua or verify not a problem -[x] Refactor to use cleanupSplitSurrogates in JavaScript -[x] Refactor to use cleanupSplitSurrogates in Java -[ ] Refactor to use cleanupSplitSurrogates in Objective C -[ ] Refactor to use cleanupSplitSurrogates in Python2 -[ ] Refactor to use cleanupSplitSurrogates in Python3 -[ ] Refactor to use cleanupSplitSurrogates in CPP -[ ] Refactor to use cleanupSplitSurrogates in CSharp -[ ] Refactor to use cleanupSplitSurrogates in Dart -[ ] Refactor to use cleanupSplitSurrogates in Lua -[x] Fix patch_toText in JavaScript -[ ] Fix patch_toText in Java -[ ] Fix patch_toText in Objective C -[ ] Fix patch_toText in Python2 -[ ] Fix patch_toText in Python3 -[ ] Fix patch_toText in CPP -[ ] Fix patch_toText in CSharp -[ ] Fix patch_toText in Dart -[ ] Fix patch_toText in Lua -[ ] Figure out a "minimal" set of unit tests so we can get rid of the big chunk currently in the PR, then carry it around to all the libraries. The triggers are well understood, so we can write targeted tests instead of broad ones.
The following PR makes the JS algorithm aware of surrogate pairs and will avoid splitting a pair into two different Diff objects.
I think there might be more places which needs to be surrogate-pair-aware, but I haven't been able to find test cases to confirm this.
Fixes #59, #68, #10