Skip to content
This repository has been archived by the owner on Aug 5, 2024. It is now read-only.

Patch margin index splits Unicode surrogate pairs unexpectedly #149

Open
kkshinkai opened this issue Oct 27, 2023 · 5 comments
Open

Patch margin index splits Unicode surrogate pairs unexpectedly #149

kkshinkai opened this issue Oct 27, 2023 · 5 comments

Comments

@kkshinkai
Copy link

When the function diff_match_patch.prototype.patch_addContext_ adds context to a patch, it increments/decreases the index by a constant, Patch_Margin = 4. However, since JavaScript's substring function operates with UTF-16 code unit indexing, there's a chance that Patch_Margin may split a Unicode surrogate pair.

Consider the following example:

import diff_match_patch from "diff-match-patch";

console.log(
  JSON.stringify(
    new diff_match_patch().patch_make("🧮 **a", "🧮 **")[0].diffs[0][1],
  )
);

The output is "\uddee **" (🧮 corresponds to "\ud83e\uddee").

If you attempt to use diff_match_patch.patch_obj.prototype.toString on this patch, it leads to a crash. encodeURI will throw a URIError if URI contains a lone surrogate.

import diff_match_patch from "diff-match-patch";

const diff = new diff_match_patch();

console.log(
  JSON.stringify(
    diff.patch_toText(diff.patch_make("🧮 **a", "🧮 **")) // URIError: URI malformed
  )
);

A straightforward solution might involve adding a verification step after applying Patch_Margin to ensure the indices remain valid. I can start a PR, but I've noticed that Patch_Margin is used in many places, and I'm unsure about the best way to make changes.

@dmsnell
Copy link

dmsnell commented Oct 28, 2023

this is a good catch, @kkshinkai - I think I've seen this reported elsewhere too but I can't find it now. there should be a relatively easy operation here, which is to backup and include a previous code unit if we're on a split surrogate.

if ( isLowSurrogate( margin[0] ) ) {
	margin.prepend( document[ startOfContext - 1 ] );
}

this oversimplifies it because we could start a document on a low surrogate, and then we'd need to cut it out, but I think all that has to happen is include up to one additional character. it's not changing the contract of making a margin either because that additional code unit is still part of a character we've already expanded into.

happy to review any patch you want. I haven't looked at Patch_Margin yet, but maybe we could handle this better by creating a helper function to return a margin length.

padding += this.get_margin_before(path2.start, this.Patch_Margin);
pattern = text.substring(patch.start2 - padding,
                         patch.start2 + patch.length1 + padding);

in other words, a get_margin_before( offset, desired_margin ) and get_margin_after( offset, desired_margin ) function so that everything which currently gets a static margin can get one that won't break URIEncode.

@jcubic
Copy link

jcubic commented Dec 20, 2023

Does it mean that the library is not yet safe to use with Emoji?

@keizo
Copy link

keizo commented Jan 4, 2024

I think I've just run into this issue. Not sure yet, but seems to make sense. Will report back if I learn anything on closer inspection.

@dmsnell
Copy link

dmsnell commented Jan 4, 2024

Does it mean that the library is not yet safe to use with Emoji?

Yes and no. It works for many inputs, but we need to merge in the fixes to eliminate splitting surrogate pairs when generating these outputs.

diff-match-patch doesn't particularly care about Unicode, which is where some of the issues creep in. It's the URL-encoding that cares and crashes or produces corruption, such as the Unicode replacement character when it should be an Emoji that was split in half.

It's not only Emoji that are affected, of course, but Emoji are the most notable code points. Everything outside the Basic Multilingual Plane will experience this problem.

There's another sneaky problem for deleted spans of text when using versions of diff-match-patch from different programming languages (and in some cases, builds of a programming language, like Python2 compiled with and without wide character support). They may report different lengths because each library represents the basic meaning of character length as returned from functions like String.length, and those lengths differ.

In #80 I've fixed the delta format we use in Simplenote, but that's not merged into the main code. I also keep saying I'm going to fork this repo, but without any official announcement in the README that this library is unmaintained it seems silly to maintain a second copy that nobody will find 🤷‍♂️. You are free to use that PR's branch and if you propose a new PR to fix the add_context functions I'll be happy to review the work and provide feedback.

@keizo
Copy link

keizo commented Jan 4, 2024

Thanks for your work dmsnell, you went deep on this! I'm not sure I've wrapped my head around all that's being discussed, but I used your version here and it seems to fix my problem. So thank you. I'll keep an eye out for further updates.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants